Personal assistants that interact with open-domain websites can help humans with arbitrary tasks, such as booking flights or searching for information. Language-guided assistants offer a natural interface for users, enabling assistive technologies, browser automation tools, and web navigation in situations where people cannot use standard interfaces (e.g., while cooking or while browsing a website in an unfamiliar language).
Most existing web navigation assistants rely on text-only pretrained representations, which do not take advantage of structural information in web pages. However, rich structured representations of webpages are needed to more effectively ground natural language requests into webpage elements.
Consider asking an assistant to search for fall courses on the EECS department website, shown below.
The search box is just an empty input element, and there may be other input boxes further down the page. How can an assistant find the right place to enter the search query? One clue is the presence of a button next to the box. Even without considering the visual features of the “looking-glass” icon, the icon may carry an alternative text description of “Search” (so-called “alt-text”, targeted at screen readers and other accessibility tools). Another clue is the position of the search box on the page, which resembles that of search boxes on many other pages on the web. To interpret even this simple instruction, it is not sufficient to look at the single element in isolation; the navigator must consider the element’s context on the page. To create more effective web assistants, we propose to (a) design an architecture for constructing context-dependent representations of webpages and their text, (b) train this architecture in a self-supervised manner, using only raw webpages collected from the Internet, and (c) demonstrate that the resulting representations improve performance on tasks involving webpages.
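To make the search-box example concrete, the context-dependent reasoning above can be sketched as a simple scoring heuristic. This is a hypothetical illustration, not the project's actual architecture: webpage elements are represented here as plain nested dictionaries, and each input element is scored by how strongly the text clues of its sibling elements (e.g., a neighboring button's alt-text) match the user's query word.

```python
# Hypothetical sketch of scoring <input> elements by their DOM context.
# Elements are dicts with a "tag", optional "attrs", "text", and "children".

def collect_text(node):
    """Gather textual clues (alt-text and visible text) under a node."""
    clues = [node.get("attrs", {}).get("alt", ""), node.get("text", "")]
    for child in node.get("children", []):
        clues.append(collect_text(child))
    return " ".join(c for c in clues if c)

def score_inputs(root, query_word):
    """Return (score, id) pairs for every input element, scored by how
    often the query word appears in the text of the input's siblings."""
    results = []
    def walk(node):
        children = node.get("children", [])
        for i, child in enumerate(children):
            if child.get("tag") == "input":
                siblings = children[:i] + children[i + 1:]
                context = " ".join(collect_text(s) for s in siblings)
                score = context.lower().count(query_word.lower())
                results.append((score, child["attrs"]["id"]))
            walk(child)
    walk(root)
    return sorted(results, reverse=True)

# Toy page: a search form (empty input plus a looking-glass button with
# alt="Search") and a newsletter signup form further down the page.
page = {"tag": "body", "children": [
    {"tag": "form", "children": [
        {"tag": "input", "attrs": {"id": "q"}},
        {"tag": "button", "children": [
            {"tag": "img", "attrs": {"alt": "Search"}}]},
    ]},
    {"tag": "form", "children": [
        {"tag": "input", "attrs": {"id": "email"}},
        {"tag": "button", "text": "Subscribe"},
    ]},
]}

print(score_inputs(page, "search")[0][1])  # the search box "q", not "email"
```

A learned model would replace the hand-written string match with context-dependent representations of each element, but the example shows why sibling context, rather than the empty input element alone, carries the signal needed to ground the instruction.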
This project is generously supported by compute resources from Microsoft.