GPT-4V(ision) is a Generalist Web Agent, if Grounded
Authors: Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
Summary: Recent progress on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capabilities of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to the standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V has great potential for web agents: it can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 and smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turn out to be ineffective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct