Multimodal Large Language Models // MLLMs

MLLMs for deepfake/face detection and recognition

sbagency
5 min read · Feb 1, 2025

Prompt: Analyze the image and determine whether it is computer-generated (deepfake) or natural.

https://huggingface.co/spaces/deepseek-ai/deepseek-vl2-small
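For reference, a minimal sketch of querying the Space programmatically with gradio_client (the api_name and argument order here are assumptions; check client.view_api() for the Space's real signature before relying on it):

```python
from gradio_client import Client, handle_file

client = Client("deepseek-ai/deepseek-vl2-small")
print(client.view_api())  # inspect the endpoints the Space actually exposes

# Hypothetical call shape; adjust to the printed signature.
result = client.predict(
    handle_file("test_image.jpg"),
    "Analyze the image and determine whether it is computer-generated (deepfake) or natural.",
    api_name="/chat",
)
print(result)
```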

Is this picture generated or natural?

Prompt: Can you detect whether this picture is computer-generated or natural?

https://huggingface.co/spaces/deepseek-ai/Janus-Pro-7B

What about this?

https://huggingface.co/spaces/deepseek-ai/Janus-Pro-7B

Finally, the model is fooled by a picture taken from a generated video.

Add a hint: first try to detect whether this is a single-shot picture or a frame from a video.
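A rough sketch of that two-stage prompting idea (ask() is a hypothetical stand-in for whichever MLLM client you use, e.g. the gradio_client snippet above):

```python
def ask(image_path: str, prompt: str) -> str:
    # Plug in your MLLM call here (Space API, local transformers model, etc.).
    raise NotImplementedError

stage1 = ask("test_image.jpg",
             "First: is this a single-shot photograph or a frame extracted from a video? Answer briefly.")
stage2 = ask("test_image.jpg",
             f"Earlier assessment: {stage1}\n"
             "Now: is the image computer-generated (deepfake) or a natural photo? Explain the visual cues.")
print(stage2)
```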

Test Qwen/Qwen2.5-VL-72B-Instruct // fail

https://huggingface.co/spaces/Qwen/Qwen2.5-VL-72B-Instruct

The image appears to be a still from a video rather than a single shot picture. The lighting, shadows, and details in the hair and clothing suggest it is a photograph of a real person, indicating that it is natural rather than computer-generated. However, without additional context or metadata, it’s difficult to definitively determine its origin.
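The same test can be reproduced locally with transformers; the sketch below uses the 7B variant instead of 72B for feasibility and follows the standard Qwen2.5-VL model-card pattern (assumes a recent transformers release plus the qwen_vl_utils package; the image path is a placeholder):

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # 72B swapped for 7B to fit a single GPU
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/test_frame.jpg"},
    {"type": "text", "text": "First decide whether this is a single-shot picture or a frame from a video, "
                             "then decide whether it is computer-generated (deepfake) or natural."},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos, padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```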

https://github.com/deepseek-ai/Janus
https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

The architecture of Janus-Pro is shown in Figure 3, which is the same as Janus [46]. The core design principle of the overall architecture is to decouple visual encoding for multimodal understanding and generation. We apply independent encoding methods to convert the raw inputs into features, which are then processed by a unified autoregressive transformer. For multimodal understanding, we use the SigLIP [53] encoder to extract high-dimensional semantic features from images. These features are flattened from a 2-D grid into a 1-D sequence, and an understanding adaptor is used to map these image features into the input space of the LLM. For visual generation tasks, we use the VQ tokenizer from [38] to convert images into discrete IDs. After the ID sequence is flattened into 1-D, we use a generation adaptor to map the codebook embeddings corresponding to each ID into the input space of the LLM. We then concatenate these feature sequences to form a multimodal feature sequence, which is subsequently fed into the LLM for processing. Apart from the built-in prediction head in the LLM, we also utilize a randomly initialized prediction head for image predictions in the visual generation task. The entire model adheres to an autoregressive framework.
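To make the decoupled-encoder idea concrete, here is a purely structural sketch in PyTorch (class and attribute names are illustrative, not the actual deepseek-ai/Janus code):

```python
import torch
import torch.nn as nn

class JanusLike(nn.Module):
    """Structural sketch: decoupled visual encoding feeding one autoregressive LLM."""

    def __init__(self, llm, siglip_encoder, vq_tokenizer, d_siglip, codebook_dim, d_llm, image_vocab):
        super().__init__()
        self.llm = llm                   # unified autoregressive transformer
        self.siglip = siglip_encoder     # semantic features for understanding
        self.vq = vq_tokenizer           # discrete image IDs for generation
        self.und_adaptor = nn.Linear(d_siglip, d_llm)      # understanding adaptor
        self.gen_adaptor = nn.Linear(codebook_dim, d_llm)  # generation adaptor
        self.image_head = nn.Linear(d_llm, image_vocab)    # randomly initialized image prediction head

    def encode_understanding(self, image):
        feats = self.siglip(image)           # (B, H, W, d_siglip) semantic grid
        feats = feats.flatten(1, 2)          # 2-D grid -> 1-D sequence
        return self.und_adaptor(feats)       # map into the LLM input space

    def encode_generation(self, image):
        ids = self.vq.encode(image)              # discrete codebook IDs
        emb = self.vq.codebook(ids.flatten(1))   # look up codebook embeddings
        return self.gen_adaptor(emb)             # map into the LLM input space

    def forward(self, text_embeds, image=None, mode="understand"):
        parts = []
        if image is not None:
            parts.append(self.encode_understanding(image) if mode == "understand"
                         else self.encode_generation(image))
        parts.append(text_embeds)
        seq = torch.cat(parts, dim=1)        # concatenated multimodal feature sequence
        hidden = self.llm(inputs_embeds=seq)
        return self.image_head(hidden) if mode == "generate" else hidden
```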

https://arxiv.org/pdf/2501.01243v1

Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench comprises a development set with 900 problems and a test set with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. Moreover, inspired by multi-modal agents, we also explore which abilities of MLLMs need to be supplemented by specialist models.
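A toy sketch of the kind of comparison the CoT study implies: score the same multiple-choice face question with a direct prompt and a step-by-step prompt (ask() and the question are hypothetical placeholders, not actual benchmark items):

```python
question = ("Which expression best describes the person in the image? "
            "(A) smiling (B) neutral (C) frowning (D) surprised")
direct_prompt = question + "\nAnswer with the option letter only."
cot_prompt = question + "\nReason step by step about the visible facial features, then give the option letter."

def accuracy(images, answers, prompt, ask):
    correct = 0
    for img, gold in zip(images, answers):
        reply = ask(img, prompt)
        correct += int(gold in reply[-5:])  # crude option-letter extraction, sketch only
    return correct / len(images)

# Compare: accuracy(images, answers, direct_prompt, ask) vs accuracy(images, answers, cot_prompt, ask)
```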

https://arxiv.org/pdf/2501.01164

Existing deepfake analysis methods are primarily based on discriminative models, which significantly limit their application scenarios. This paper aims to explore interactive deepfake analysis by performing instruction tuning on multimodal large language models (MLLMs). This faces challenges such as the lack of datasets and benchmarks, and low training efficiency. To address these issues, we introduce (1) a GPT-assisted data construction process resulting in an instruction-following dataset called DFA-Instruct, (2) a benchmark named DFA-Bench, designed to comprehensively evaluate the capabilities of MLLMs in deepfake detection, deepfake classification, and artifact description, and (3) an interactive deepfake analysis system called DFA-GPT, built with a Low-Rank Adaptation (LoRA) module, as a strong baseline for the community. The dataset and code will be made available at https://github.com/lxq1000/DFA-Instruct to facilitate further research.

This work is the first on interactive deepfake analysis. We define four key capabilities that an interactive DFA system should possess: deepfake detection (DF-D), deepfake classification (DF-C), artifact description (AD), and free conversation (FC). We aim to achieve such an interactive DFA system by instruction-tuning MLLMs. To this end, we first propose a data construction process to produce instruction-following data, resulting in DFA-Instruct. Subsequently, based on the test set of DFA-Instruct, we construct DFA-Bench, a comprehensive benchmark for evaluating MLLMs' deepfake analysis capabilities. Finally, by introducing LoRA into the MLLM, we develop DFA-GPT as a strong baseline for interactive deepfake analysis. Our work provides a new research direction for the information forensics and security community.
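A minimal sketch of the LoRA part, using Hugging Face PEFT on a generic open MLLM backbone (the backbone, target modules, and hyperparameters are assumptions for illustration, not the paper's actual DFA-GPT configuration):

```python
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Any open MLLM backbone works for the sketch; LLaVA-1.5 is just an example.
base = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", torch_dtype="auto")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projection layers matched by name
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters train; the backbone stays frozen

# DFA-Instruct-style (image, instruction, response) triples would then go through a
# standard supervised fine-tuning loop (e.g. the transformers Trainer) on this model.
```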

https://x.com/AngryTomtweets/status/1885231768615477766
