We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not only understand complex modality-interleaved instructions and in-context examples, but also autoregressively generate grounded and coherent multimodal outputs in the continuous feature space. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot capabilities for multimodal generation, such as in-context learning, reasoning, and compositionality of any-to-any modality generation through multi-round interactive conversation. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing. CoDi-2 signifies a substantial breakthrough in developing a comprehensive multimodal foundation model adept at interpreting in-context language-vision-audio interleaved instructions and producing multimodal outputs.
The key ideas:
- CoDi-2 is a versatile multimodal large language model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning, reason, chat, edit, etc. in an any-to-any input-output modality setting.
- It aligns modalities like images, audio, and text to a shared representation space. This allows it to understand instructions mixing multiple modalities and generate outputs across modalities.
- CoDi-2 uses a large language model as its core “brain” to leverage LLMs’ strengths in reasoning, instruction following, chatting, etc. It maps other modalities like images and audio into the LLM’s input space.
- The authors create training datasets for multimodal in-context learning by transforming existing datasets and proposing new methods to build text-only datasets.
- Experiments show that CoDi-2 has versatile zero-shot and few-shot capabilities on tasks such as image editing, audio manipulation, visual reasoning, and video understanding, all specified through complex interleaved multimodal instructions.
Overall, CoDi-2 introduces a novel MLLM architecture aimed at in-context understanding and generation across modalities.
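The idea of mapping images and audio into the LLM's input space can be illustrated with a minimal sketch. It is not CoDi-2's implementation: the feature widths, the random weights standing in for learned projections, and the names `PROJECTORS`, `embed_text`, and `build_interleaved_sequence` are all assumptions made for illustration. It covers only the input side, where each non-text feature is linearly projected to the LLM embedding width and placed in sequence with the text token embeddings.

```python
import random

D_MODEL = 8  # hypothetical LLM embedding width


def make_linear(d_in, d_out, seed):
    """Random d_out x d_in matrix, a stand-in for a learned projection."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def project(weight, feature):
    """Apply the linear map: (d_out x d_in) matrix times a d_in vector."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


# Hypothetical encoder output widths: 16 for image features, 12 for audio.
PROJECTORS = {
    "image": make_linear(16, D_MODEL, seed=0),
    "audio": make_linear(12, D_MODEL, seed=1),
}


def embed_text(tokens):
    """Stand-in for an embedding lookup: one deterministic vector per token."""
    vectors = []
    for token in tokens:
        rng = random.Random(token)  # seeding by token keeps the vector stable
        vectors.append([rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)])
    return vectors


def build_interleaved_sequence(segments):
    """Turn (modality, payload) segments into one sequence of D_MODEL vectors.

    Text payloads are token lists; image and audio payloads are lists of
    encoder feature vectors. Segment order is preserved, so the sequence
    keeps the interleaving of the original instruction.
    """
    sequence = []
    for modality, payload in segments:
        if modality == "text":
            sequence.extend(embed_text(payload))
        else:
            sequence.extend(project(PROJECTORS[modality], f) for f in payload)
    return sequence
```

For example, an instruction made of two text tokens, two image features, one audio feature, and one more text token yields six vectors of width `D_MODEL`, in that order, which a language model could then consume as a single input sequence.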
Empowering models to dynamically accomplish tasks specified through natural language instructions represents a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data. InstructSeq employs a multimodal transformer architecture encompassing visual, language, and sequential modeling. We utilize a visual encoder to extract image features and a text encoder to encode instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions for specifying visual tasks. This provides an intuitive interface for directing capabilities using flexible natural instructions. Without any task-specific tuning, InstructSeq achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning. The flexible control and multi-task unification empower the model with more human-like versatility and generalizability for computer vision. The code will be released soon at https://github.com/rongyaofang/InstructSeq.
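The encode, fuse, and generate flow described in this abstract can be sketched as a greedy autoregressive decoding loop. This is a toy under stated assumptions, not the released InstructSeq code: `fuse`, `greedy_decode`, and `toy_scorer` are hypothetical names, the "features" are plain strings, and the scorer replays canned outputs where a trained transformer would produce real scores. The output tokens, including the coordinate-style placeholders, are invented for illustration.

```python
EOS = "<eos>"


def fuse(image_features, instruction_tokens):
    """Concatenate visual and instruction representations into one prefix."""
    return list(image_features) + list(instruction_tokens)


def greedy_decode(prefix, score_next, max_len=16):
    """Generate tokens one at a time, always taking the highest-scoring one."""
    generated = []
    for _ in range(max_len):
        scores = score_next(prefix, generated)
        token = max(scores, key=scores.get)
        if token == EOS:
            break
        generated.append(token)
    return generated


# Hypothetical canned outputs, keyed by a task keyword in the instruction.
CANNED = {
    "caption": ["a", "dog", "on", "grass"],
    "segment": ["<x1>", "<y1>", "<x2>", "<y2>"],
}


def toy_scorer(prefix, generated):
    """Stand-in for the transformer: scores 1.0 for the next canned token."""
    target = next((out for key, out in CANNED.items() if key in prefix), [])
    step = len(generated)
    expected = target[step] if step < len(target) else EOS
    vocab = {tok for out in CANNED.values() for tok in out} | {EOS}
    return {tok: (1.0 if tok == expected else 0.0) for tok in vocab}


prefix = fuse(["<img0>"], ["caption", "this", "image"])
caption = greedy_decode(prefix, toy_scorer)
```

The point of the sketch is the control flow: the same decoding loop serves every task, and only the instruction in the prefix changes which output sequence is produced.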