Superhuman AI Reasoning // multi-modal

Computers can analyze vast numbers of patterns per second across multi-modal domains

sbagency
3 min read · Mar 19, 2024
https://www.forbes.com/sites/forbestechcouncil/2024/03/18/the-dawn-of-superhuman-ai-reasoning/

The article discusses the emerging field of visual world models, which represent the next leap in foundation models for artificial intelligence. Unlike current language models that focus on linguistic data, visual world models aim to interpret and derive insights from visual data across space and time, mirroring human capabilities to derive meaning from sensory inputs.

The passage highlights that incorporating visual data into AI systems is not just an expansion of their skill set but a gateway to uncharted territories of knowledge. It suggests that vision is the foundation upon which much new human knowledge is created, and that by tapping into the vast reservoirs of “hidden data” from sources like social media and satellite imagery, AI could discover fundamental new laws of the natural world.

However, the post also acknowledges the implementation challenges and ethical implications of such visually empowered AI systems. It emphasizes the need for thorough backtesting, constraining use cases to well-defined and tested tasks, and involving stakeholders, ethicists, and policymakers in the development process.

Overall, the author presents visual world models as a transformative era in AI, promising to redefine our understanding of intelligence itself and to augment human vision and cognition through a symbiosis between human and machine intelligence.

https://arxiv.org/pdf/2403.09333.pdf

Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. This limitation further restricts the model’s potential to achieve nuanced visual and language referring in domains such as GUI agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight downsampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves the complete contexts and fine details, and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, code, and models will be released at https://github.com/jefferyZhan/Griffon
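To make the “downsampling projector” idea concrete, here is a minimal sketch of how such a module could compress a high-resolution grid of visual tokens before they reach the LLM. This is not the paper’s actual implementation: the strided depthwise convolution, the module name, the dimensions, and the stride are all illustrative assumptions.

```python
# Hypothetical sketch of a lightweight downsampling projector (not Griffon v2's code).
# Assumption: neighbouring ViT patch features are pooled with a strided depthwise
# convolution (cutting the token count by stride**2), then linearly projected into
# the language model's embedding space.
import torch
import torch.nn as nn


class DownsamplingProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, stride=2):
        super().__init__()
        # Depthwise conv: cheap, keeps per-channel detail while merging neighbours.
        self.downsample = nn.Conv2d(
            vision_dim, vision_dim, kernel_size=stride, stride=stride,
            groups=vision_dim,
        )
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_feats, grid_size):
        # patch_feats: (batch, num_patches, vision_dim) from a vision encoder.
        b, n, c = patch_feats.shape
        x = patch_feats.transpose(1, 2).reshape(b, c, grid_size, grid_size)
        x = self.downsample(x)              # (b, c, grid/stride, grid/stride)
        x = x.flatten(2).transpose(1, 2)    # (b, num_patches / stride**2, c)
        return self.proj(x)                 # (b, fewer_tokens, llm_dim)


# Example: a 32x32 patch grid (1024 visual tokens) shrinks to 256 LLM inputs.
feats = torch.randn(1, 32 * 32, 1024)
tokens_for_llm = DownsamplingProjector()(feats, grid_size=32)
print(tokens_for_llm.shape)  # torch.Size([1, 256, 4096])
```

The point of a design like this is that resolution can grow without the token budget of the LLM growing with it: pooling happens in the projector, so fine detail is aggregated locally rather than dropped by cropping or naive resizing.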
