Agents are coming back // what’s new, reasoning frameworks

Are agents turning from toys into real tools?

sbagency
8 min read · Feb 7, 2024

More benchmarks…

https://hkust-nlp.github.io/agentboard/

Evaluating large language models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications. However, the evaluation process presents substantial challenges. A primary obstacle is the benchmarking of agent performance across diverse scenarios within a unified framework, especially in maintaining partially-observable environments and ensuring multi-round interactions. Moreover, current evaluation frameworks mostly focus on the final success rate, revealing few insights during the process and failing to provide a deep understanding of the model abilities. To address these challenges, we introduce AGENTBOARD, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AGENTBOARD offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit that features easy assessment of agents for multifaceted analysis through interactive visualization. This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront. Ultimately, AGENTBOARD serves as a significant step towards demystifying agent behaviors and accelerating the development of stronger LLM agents.

Here are the key points from the paper:

- AGENTBOARD is a new benchmark for evaluating generalist large language model (LLM) agents. It features 9 diverse tasks covering embodied AI, games, web, and tools.

- Environments are designed for multi-turn, partially observable interactions. Subgoals are defined to enable fine-grained progress tracking beyond just success rate (a minimal sketch of this metric follows the list).

- The benchmark is paired with an analytical evaluation framework and interactive visualization panel for comprehensive analysis of agent abilities like long-range interactions, exploration, grounding, etc.

- Experiments show progress rate is more informative than success rate. Proprietary LLMs like GPT-4 outperform open-source models significantly. GPT-4 demonstrates comprehensive agentic abilities while open-source models have deficiencies in areas like planning.

- The open-source toolkit allows easy customization and investigation into model behaviors through the interactive dashboard. The goal is to enable detailed assessment of agents to drive progress in this emerging field.
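
The subgoal idea behind the progress rate metric is easy to illustrate. Below is a minimal sketch, assuming a task is annotated with checkable subgoals; the `Subgoal` and `progress_rate` names are illustrative, not AgentBoard's actual toolkit API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative types; AgentBoard's real toolkit defines its own task/subgoal classes.
@dataclass
class Subgoal:
    name: str
    is_met: Callable[[dict], bool]  # predicate over the current environment state

def progress_rate(subgoals: List[Subgoal], state: dict) -> float:
    """Fraction of subgoals satisfied so far (0.0 .. 1.0)."""
    if not subgoals:
        return 0.0
    met = sum(1 for g in subgoals if g.is_met(state))
    return met / len(subgoals)

def success(subgoals: List[Subgoal], state: dict) -> bool:
    """Binary success: all subgoals met. Progress rate is strictly more informative."""
    return progress_rate(subgoals, state) == 1.0

# Example: a household-style task split into three checkable subgoals.
subgoals = [
    Subgoal("found_mug",   lambda s: s.get("mug_located", False)),
    Subgoal("mug_washed",  lambda s: s.get("mug_clean", False)),
    Subgoal("mug_on_desk", lambda s: s.get("mug_location") == "desk"),
]
state = {"mug_located": True, "mug_clean": True, "mug_location": "sink"}
print(progress_rate(subgoals, state))  # 0.666..., while success() would still be False
```

An agent that gets two of three subgoals done scores 0 on success rate but 0.67 on progress rate, which is exactly the signal the benchmark argues is more informative.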

https://arxiv.org/pdf/2402.01622.pdf

Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that the current language agents are not yet capable of handling such complex planning tasks — even GPT-4 only achieves a success rate of 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.
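
Part of why the reported success rates are so low is that a plan only counts once every constraint holds. Here is a minimal sketch of that all-or-nothing style of evaluation; the plan fields and constraint checkers are hypothetical, not TravelPlanner's actual schema.

```python
from typing import Callable, Dict, List

Plan = Dict  # a generated itinerary; the real benchmark uses a structured plan format

# Hypothetical constraint checkers; TravelPlanner defines its own commonsense
# and hard constraints (budget, room type, cuisine, etc.).
def within_budget(plan: Plan) -> bool:
    return sum(item["cost"] for item in plan["items"]) <= plan["budget"]

def days_in_order(plan: Plan) -> bool:
    days = [item["day"] for item in plan["items"]]
    return days == sorted(days) and len(set(days)) == len(days)

CONSTRAINTS: List[Callable[[Plan], bool]] = [within_budget, days_in_order]

def final_pass(plan: Plan) -> bool:
    """All-or-nothing: one violated constraint fails the whole plan."""
    return all(check(plan) for check in CONSTRAINTS)

plan = {
    "budget": 1200,
    "items": [
        {"day": 1, "cost": 400},
        {"day": 2, "cost": 500},
        {"day": 3, "cost": 450},  # pushes the total to 1350 > 1200
    ],
}
print(final_pass(plan))  # False: over budget, so the plan fails outright
```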

https://arxiv.org/pdf/2401.13919.pdf

The advancement of large language models (LLMs) leads to a new era marked by the development of autonomous applications in the real world, which drives innovation in the creation of advanced web-based agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we propose a new evaluation protocol for web agents to address the challenges of automatic evaluation of open-ended web agent tasks, leveraging the robust multimodal comprehension capabilities of GPT-4V. We create a new benchmark by gathering real-world tasks from 15 widely used websites to evaluate our agents. We show that WebVoyager achieves a 55.7% task success rate, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager in practical applications. We found that our proposed automatic evaluation achieves 85.3% agreement with human judgment, paving the way for further development of web agents in a real-world setting.

Here is a summary of the key points:

- WebVoyager is a new web agent powered by large multimodal models (LMMs) like GPT-4V that can complete real-world web tasks by interacting with websites online. It uses screenshots and text from web elements as observations to make decisions (a minimal sketch of this observation-action loop follows the list).

- Existing web agents are limited because they use simplified simulated environments, focus on a single modality (text), and are evaluated on static web snapshots rather than online end-to-end task completion.

- WebVoyager was evaluated on a new benchmark of 300 diverse tasks across 15 real websites. It achieved a 55.7% task success rate, significantly higher than GPT-4 (All Tools) and a text-only version of WebVoyager.

- An automated evaluation method using GPT-4V as the evaluator was proposed and shown to have high agreement with human evaluators. This enables more efficient scaling of evaluations.

- An analysis of WebVoyager’s errors showed main issues were getting stuck during navigation, visual grounding, hallucinating incorrect answers, and prompt misalignment.

- The results demonstrate WebVoyager’s potential as a generalist web agent able to complete real-world online tasks end-to-end by leveraging vision and language. But there is still room for improvement in visual grounding, navigation, and reasoning.
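
To make the observation-action loop above concrete, here is a minimal sketch of a single step of a multimodal web agent. The browser calls are standard Selenium-style methods, but `call_vlm`, the prompt format, and the simplified action list are assumptions for illustration, not WebVoyager's actual prompts or action space.

```python
# Sketch of a multimodal web-agent step (illustrative only).
# Assumes a Selenium-like `driver` and a vision-language model client are supplied;
# `call_vlm(prompt, image_b64)` is a placeholder, not WebVoyager's real API.
import base64
import json

ACTIONS = ["click", "type", "scroll", "go_back", "answer"]  # simplified action space

def observe(driver):
    """Observation = full-page screenshot + labels of interactable elements."""
    screenshot_b64 = base64.b64encode(driver.get_screenshot_as_png()).decode()
    elements = [
        {"id": i, "text": el.text[:80]}
        for i, el in enumerate(driver.find_elements("css selector", "a, button, input"))
    ]
    return screenshot_b64, elements

def step(driver, task, call_vlm, history):
    """One decision: show the model the task, observation, and history; get an action."""
    screenshot, elements = observe(driver)
    prompt = (
        f"Task: {task}\n"
        f"Interactable elements: {json.dumps(elements)}\n"
        f"Previous actions: {history}\n"
        f"Reply with JSON: {{\"action\": one of {ACTIONS}, \"target_id\": int, \"value\": str}}"
    )
    decision = json.loads(call_vlm(prompt, image_b64=screenshot))
    history.append(decision)
    return decision  # the caller executes it against the browser, looping until "answer"
```

The same screenshot-plus-answer pair can then be handed to a GPT-4V judge for the automatic evaluation the paper describes.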

https://www.adept.ai/blog/adept-fuyu-heavy

“To us the killer feature is UI understanding.”

https://www.geekwire.com/2024/ex-amazon-and-airbnb-engineers-raise-1-25m-for-enterprise-ai-agent-developer-cimba-ai/
https://www.cimba.ai/
https://www.linkedin.com/posts/subrata-subu-biswas-5114251b_cimba-dataproducts-activity-7160644558295330816-y2qO
https://twitter.com/RichardSSutton/status/1752945334358286482
https://arxiv.org/pdf/2402.03620.pdf

We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2’s performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10–40x less inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.

Here is a summary of the key points from the paper:

- The paper introduces SELF-DISCOVER, a framework for large language models (LLMs) to self-compose reasoning structures to tackle complex reasoning tasks.

- Core to SELF-DISCOVER is a self-discovery process where the LLM selects atomic reasoning modules like critical thinking and step-by-step reasoning, and composes them into an explicit reasoning structure to follow during decoding (a minimal sketch follows the list).

- SELF-DISCOVER substantially improves the performance of GPT-4 and PaLM 2 on challenging reasoning benchmarks like BigBench-Hard, grounded agent reasoning, and MATH tasks. It outperforms Chain of Thought prompting by up to 32%.

- SELF-DISCOVER also outperforms inference-heavy methods like CoT + Self-Consistency while requiring 10–40x fewer inference calls.

- The self-discovered reasoning structures are shown to be universal — they transfer well from PaLM 2 to GPT-4 and from GPT-4 to Llama2. The structures also share commonalities with human reasoning patterns.

- The results demonstrate the advantage of self-discovering compositional reasoning structures over relying on a single fixed prompting approach. SELF-DISCOVER provides an efficient and performant way for LLMs to tackle complex reasoning problems.
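
Read as pseudocode, the flow described above amounts to two stages: compose a reasoning structure once per task type, then follow it per instance. Here is a minimal sketch under that reading; `ask_llm`, the seed module list, and the prompts are placeholders, not the paper's exact meta-prompts.

```python
# Minimal sketch of the SELF-DISCOVER flow as summarized above.
# `ask_llm` stands in for any chat-completion call; the module list is illustrative.
from typing import Callable, List

REASONING_MODULES = [
    "Critical thinking: question assumptions and evaluate evidence.",
    "Step-by-step thinking: break the problem into ordered sub-steps.",
    "Decomposition: split the task into independent sub-problems.",
    "Reflection: check intermediate results against the question.",
]

def self_discover_structure(task_examples: List[str], ask_llm: Callable[[str], str]) -> str:
    """Stage 1 (run once per task type): select useful modules and
    compose them into an explicit, task-specific reasoning structure."""
    selected = ask_llm(
        "Select the reasoning modules most useful for these tasks:\n"
        + "\n".join(REASONING_MODULES)
        + "\n\nTasks:\n" + "\n".join(task_examples)
    )
    structure = ask_llm(
        "Compose the selected modules into a step-by-step reasoning structure "
        "(as a numbered plan) for solving tasks of this type:\n" + selected
    )
    return structure

def solve(task: str, structure: str, ask_llm: Callable[[str], str]) -> str:
    """Stage 2 (per instance): follow the discovered structure while decoding."""
    return ask_llm(f"Follow this reasoning structure step by step:\n{structure}\n\nTask: {task}")
```

Because structure discovery runs per task type rather than per instance, the extra cost is amortized, which is consistent with the reported 10–40x compute advantage over sampling-heavy methods like self-consistency.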

Connect your agent to any API

http://arxiv.org/pdf/2402.04253.pdf

We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries. We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries. AnyTool primarily incorporates three elements: an API retriever with a hierarchical structure, a solver aimed at resolving user queries using a selected set of API candidates, and a self-reflection mechanism, which re-activates AnyTool if the initial solution proves impracticable. AnyTool is powered by the function calling feature of GPT-4, eliminating the need for training external modules. We also revisit the evaluation protocol introduced by previous works and identify a limitation in this protocol that leads to an artificially high pass rate. By revising the evaluation protocol to better reflect practical application scenarios, we introduce an additional benchmark, termed AnyToolBench. Experiments across various datasets demonstrate the superiority of our AnyTool over strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. For instance, AnyTool outperforms ToolLLM by +35.4% in terms of average pass rate on ToolBench. Code will be available at https://github.com/dyabel/AnyTool.
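
The three components compose into a simple retry loop: hierarchical retrieval narrows the 16,000+ APIs to a small candidate set, a GPT-4 function-calling solver attempts the query against those candidates, and self-reflection re-activates the pipeline with feedback when the attempt fails. Below is a minimal sketch of that control flow; the retrieval hierarchy and function names are illustrative, and the linked repository has the real implementation.

```python
# Illustrative control flow only; see https://github.com/dyabel/AnyTool for the real code.
from typing import Callable, List

def hierarchical_retrieve(query: str,
                          categories: dict,          # category -> {tool -> [api dicts]}
                          score: Callable[[str, str], float],
                          top_k: int = 16) -> List[dict]:
    """Narrow many APIs by descending category -> tool -> API,
    keeping only the best-scoring branches at each level."""
    best_cats = sorted(categories, key=lambda c: score(query, c), reverse=True)[:3]
    candidates = []
    for cat in best_cats:
        for tool, apis in categories[cat].items():
            candidates.extend(apis)
    return sorted(candidates, key=lambda a: score(query, a["description"]), reverse=True)[:top_k]

def any_tool_loop(query, categories, score, solve, reflect, max_rounds=3):
    """Solver + self-reflection loop: retry with feedback if the solution fails."""
    feedback = ""
    for _ in range(max_rounds):
        apis = hierarchical_retrieve(query + " " + feedback, categories, score)
        solution, ok = solve(query, apis)    # e.g. GPT-4 function calling over candidates
        if ok:
            return solution
        feedback = reflect(query, solution)  # why it failed; steers the next retrieval
    return None
```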
