Imagine you are a robot browsing the web // how an LLM can be used as a browser pilot

sbagency
May 1, 2024


What is an LLM? A model trained on a large dataset of language patterns (it can handle many situations; just use it wisely). The power is not in a single request-response exchange but in continuous micro-interactions (chains of thought driven by feedback).

https://github.com/githubpradeep/notebooks/blob/main/playwright-llama3.ipynb

This video describes a method called “set-of-marks prompting” for enabling web-browsing capabilities with a large language model such as Llama-3 that has no vision capabilities. The key steps are:

1. Annotate all interactive elements (buttons, text boxes, links) on a web page with bounding boxes.
2. Send the annotated web page content to the large language model.
3. The model analyzes the content and provides the next action to take, like clicking an element, typing text, scrolling, waiting, going back, or searching.
4. Execute the suggested action on the web page.
5. Repeat steps 2–4 in a chain-of-thought loop until the model provides a final answer to the query (a minimal code sketch of this loop follows below).
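
Here is a minimal sketch of the annotate-and-act pieces, assuming Playwright for browser automation. This is not the notebook’s exact code: the CSS selector, the mark format, and the CLICK/TYPE/SCROLL/BACK action vocabulary are illustrative choices, and annotate_page/execute are hypothetical helper names.

```python
# Sketch of set-of-marks annotation and action execution with Playwright
# (`pip install playwright`). Illustrative, not the notebook's exact code.

def annotate_page(page):
    """Tag each interactive element with a numeric mark and describe it as text."""
    elements = page.query_selector_all("a, button, input, textarea, select")
    marks, lines = {}, []
    for i, el in enumerate(elements):
        if el.bounding_box() is None:  # skip elements that are not rendered
            continue
        tag = el.evaluate("e => e.tagName").lower()
        text = (el.inner_text() or "").strip()[:60]
        marks[i] = el
        lines.append(f"[{i}] <{tag}> {text}")
    return marks, "\n".join(lines)

def execute(page, marks, action):
    """Run one model-suggested action string, e.g. 'CLICK 3' or 'TYPE 5 iphone 15'."""
    op, *rest = action.split(maxsplit=2)
    if op == "CLICK":
        marks[int(rest[0])].click()
    elif op == "TYPE":
        marks[int(rest[0])].fill(rest[1])
        page.keyboard.press("Enter")
    elif op == "SCROLL":
        page.mouse.wheel(0, 800)  # scroll down roughly one screen
    elif op == "BACK":
        page.go_back()
```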

The video provides Python code that implements this approach with the Llama-3 70B model served by Groq. It demonstrates using the system to get a joke, find the price of an iPhone on Amazon, and play a song on YouTube, all through natural-language queries and without any vision capability. The code handles annotating pages, executing actions, and prompting the model.
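
One way a driver loop could tie these helpers to the model, using the Groq Python SDK for the completion call. The model identifier, prompt wording, and ANSWER convention are assumptions here; check the notebook and Groq’s docs for the exact values.

```python
# Hypothetical driver loop; model name and prompt format are assumptions.
import os
from groq import Groq  # pip install groq
from playwright.sync_api import sync_playwright

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def ask_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def browse(query: str, start_url: str, max_steps: int = 10):
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            marks, listing = annotate_page(page)
            action = ask_llm(
                f"Task: {query}\nInteractive elements:\n{listing}\n"
                "Reply with one action: CLICK <n>, TYPE <n> <text>, "
                "SCROLL, BACK, or ANSWER <final answer>."
            )
            if action.startswith("ANSWER"):
                return action.removeprefix("ANSWER").strip()
            execute(page, marks, action)

# e.g. browse("find the price of an iPhone", "https://www.amazon.com")
```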

What if the same technique were used with multimodal vision models, or the two approaches were combined?

https://arxiv.org/pdf/2310.04444

Prompt engineering is crucial for deploying LLMs but is poorly understood mathematically. We formalize LLM systems as a class of discrete stochastic dynamical systems to explore prompt engineering through the lens of control theory. We investigate the reachable set of output token sequences R_y(x_0) for which there exists a control input sequence u for each y ∈ R_y(x_0) that steers the LLM to output y from initial state sequence x_0. We offer analytic analysis of the limitations on the controllability of self-attention in terms of the reachable set, where we prove an upper bound on the reachable set of outputs R_y(x_0) as a function of the singular values of the parameter matrices. We present complementary empirical analysis on the controllability of a panel of LLMs, including Falcon-7b, Llama-7b, and Falcon-40b. Our results demonstrate a lower bound on the reachable set of outputs R_y(x_0) w.r.t. initial state sequences x_0 sampled from the Wikitext dataset. We find that the correct next Wikitext token following sequence x_0 is reachable over 97% of the time with prompts of k ≤ 10 tokens. We also establish that the top 75 most likely next tokens, as estimated by the LLM itself, are reachable at least 85% of the time with prompts of k ≤ 10 tokens. Intriguingly, short prompt sequences can dramatically alter the likelihood of specific outputs, even making the least likely tokens become the most likely ones. This control-centric analysis of LLMs demonstrates the significant and poorly understood role of input sequences in steering output probabilities, offering a foundational perspective for enhancing language model system capabilities.
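
In the abstract’s notation, the reachable set can be written out roughly as follows (a paraphrase, not the paper’s exact definition; the empirical tests additionally restrict the control prompt to k ≤ 10 tokens):

```latex
% y is reachable from state x_0 if some control prompt u, prepended
% to x_0, steers the LLM to output y:
R_y(\mathbf{x}_0) = \{\, y \;\mid\; \exists\, \mathbf{u} : \mathrm{LLM}(\mathbf{u} \oplus \mathbf{x}_0) = y \,\}
```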

The video discusses a project that develops a control theory framework for understanding and controlling large language models (LLMs). The key points are:

1. LLMs exhibit the “zero-shot learning miracle” where they can perform tasks without explicit training, but how they are prompted heavily impacts performance. Understanding LLMs as systems and developing control theory for them could lead to safer and more effective systems.

2. LLMs are formalized as control systems with inputs (prompts), states (token sequences), and outputs (generated text). Concepts like reachability (whether a desired output can be reached from a given state) and controllability are defined.

3. A theorem is proved bounding when a desired output is unreachable for a self-attention layer based on the model’s parameters and inputs.

4. The controllability of LLMs is measured via prompt optimization on datasets to steer models toward desired next tokens. Results show reasonable controllability for ground-truth next tokens, but less for random tokens (a toy sketch of this kind of search follows the list).

5. Future directions include distributional control beyond greedy decoding, understanding effects of techniques like chain-of-thought reasoning, composing multiple LLMs, and controlling high-level attributes like emotion.
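
To make point 4 concrete, here is a toy reachability probe that greedily prepends control tokens until a target becomes the model’s argmax next token. This is a drastic simplification of the paper’s prompt-optimization methods: gpt2 is a stand-in model (the paper uses Falcon-7b/40b and Llama-7b), and the search runs over a small candidate pool rather than the full vocabulary.

```python
# Toy reachability probe, NOT the paper's algorithm: greedily prepend up to k
# tokens so that `target` becomes the argmax next token after u + x0.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def greedy_prompt_search(x0: str, target: str, k: int = 10, pool: int = 500):
    target_id = tok.encode(target)[0]  # assumes a single-token target
    x0_ids = tok.encode(x0)
    u_ids: list[int] = []
    for _ in range(k):
        best_cand, best_logp = None, -float("inf")
        for cand in range(pool):  # candidate pool, not the full vocab
            ids = torch.tensor([[cand, *u_ids, *x0_ids]])
            with torch.no_grad():
                logp = torch.log_softmax(model(ids).logits[0, -1], -1)[target_id]
            if logp.item() > best_logp:
                best_cand, best_logp = cand, logp.item()
        u_ids.insert(0, best_cand)
        ids = torch.tensor([[*u_ids, *x0_ids]])
        with torch.no_grad():
            if model(ids).logits[0, -1].argmax().item() == target_id:
                return tok.decode(u_ids), True  # target is now the argmax
    return tok.decode(u_ids), False

# e.g. greedy_prompt_search("The capital of France is", " Paris")
```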

The work aims to develop a systematic understanding of controlling LLM behavior through a control-theoretic lens, with theoretical results complemented by empirical measurements of controllability.
