Deepfake generation vs. detection // an endless competition
Large Vision Models could help // not just detect but also explain why!
Deepfakes are extremely hard to detect because of the quality of the source data: photo-realistic deepfakes are essentially real images/videos blended by computational pipelines, which makes them hard to spot by hand (by eye). ML pipelines perform well on the datasets they were trained on but fail on unseen data.
The proliferation of deepfake faces poses potentially serious negative impacts on our daily lives. Despite substantial advancements in deepfake detection in recent years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm, which manipulates model predictions via data perturbations, our method reprograms a pre-trained VLM (e.g., CLIP) solely by manipulating its input, without tuning the inner parameters. Furthermore, we insert a "pseudo-word" guided by facial identity into the text prompt. Extensive experiments on several popular benchmarks demonstrate that (1) the cross-dataset and cross-manipulation performance of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in the cross-dataset setting from FF++ to WildDeepfake) using a pre-trained CLIP model with our proposed reprogramming method; (2) this superior performance is achieved with fewer trainable parameters, making it a promising approach for real-world applications.
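To make the reprogramming idea concrete, here is a minimal sketch assuming a frozen HuggingFace CLIP checkpoint: the only trainable object is an input-space perturbation, and classification is done by similarity to two class prompts. The prompt wording, perturbation shape, and hyperparameters are illustrative, not the exact RepDFD configuration.

```python
# Minimal sketch of input-reprogramming a frozen CLIP for deepfake detection.
# Assumes HuggingFace transformers + torch; prompts and hyperparameters are
# illustrative, not the exact RepDFD setup.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
for p in clip.parameters():           # inner parameters stay frozen
    p.requires_grad_(False)

# The only trainable object: a perturbation merged with every preprocessed input image.
delta = torch.zeros(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([delta], lr=1e-2)

# Fixed class prompts (the paper additionally inserts an identity-guided "pseudo-word").
text = processor(text=["a photo of a real face", "a photo of a fake face"],
                 return_tensors="pt", padding=True).to(device)
text_emb = F.normalize(clip.get_text_features(**text), dim=-1)

def step(pixel_values, labels):       # pixel_values: (B,3,224,224) preprocessed faces
    reprogrammed = pixel_values + delta                    # manipulate the input only
    img_emb = F.normalize(clip.get_image_features(pixel_values=reprogrammed), dim=-1)
    logits = img_emb @ text_emb.T * clip.logit_scale.exp()
    loss = F.cross_entropy(logits, labels)                 # 0 = real, 1 = fake
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

At test time the same learned perturbation is simply added to every image before it is fed to the unchanged CLIP model.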
Deepfake Detection
The past five years have witnessed a wide variety of methods proposed for defending against the malicious use of deepfakes. Currently, the majority of deepfake detection methods are based on deep learning, leveraging generic CNNs (e.g., Xception (Chollet 2017), EfficientNet (Tan and Le 2019)) as the backbones of the classifier. Furthermore, several works (Qian et al. 2020; Liu et al. 2021; Zhao et al. 2021) utilize frequency information or localize the forged regions to improve the performance of detectors. Nevertheless, the generalization challenge still hinders the application of deepfake detectors in real-world scenarios. To address this issue, several works (Li et al. 2020b; Shiohara and Yamasaki 2022; Larue et al. 2023; Nguyen et al. 2024) introduce pseudo deepfake data, generated by blending two different faces, during training. However, all of the above methods typically involve retraining at least one backbone network. In general, employing more powerful backbone networks (e.g., replacing CNNs with ViTs) produces better performance but at the cost of increased computational resources.
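For contrast with the reprogramming approach, the standard baseline recipe described above, fine-tuning a generic CNN backbone as a binary real/fake classifier, might look like the sketch below (torchvision EfficientNet-B0 used for illustration; data handling and hyperparameters are placeholders).

```python
# Sketch of the standard deepfake-detection baseline: a generic CNN backbone
# (here torchvision's EfficientNet-B0) fine-tuned as a binary real/fake classifier.
import torch
import torch.nn as nn
from torchvision import models

model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)  # real vs fake head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images, labels):        # images: (B,3,224,224), labels: 0 = real, 1 = fake
    model.train()
    logits = model(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Note that every backbone parameter is updated here, which is exactly the retraining cost the paragraph above refers to.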
Impact of different text prompts
In this part, we investigate the effects of various text prompt configurations on RepDFD, encompassing fixed text prompts, randomly initialized text prompts, and our adaptive face-related text prompts (termed dynamic text prompts). The experiments show that the dynamic text prompts are more effective than the fixed ones.
Impact of different face embeddings
To further verify the universality of Face2Text, we investigate the impact of different face embeddings on Face2Text prompts. We perform an ablation study comparing the results obtained with ArcFace (Deng et al. 2019), BlendFace (Shiohara, Yang, and Taketomi 2023) and TransFace (Dan et al. 2023), as well as a non-face baseline (i.e., using the text prompt group {T0, T1}). As shown in Figure 5, our method demonstrates good performance across various face encoders Fid.
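As a rough schematic of how such an identity embedding can drive a "dynamic" prompt, the sketch below projects a face-encoder output into CLIP's token-embedding space and registers it as a pseudo-word, textual-inversion style. The projection layer, token name and prompt template are assumptions for illustration, not the exact Face2Text design.

```python
# Schematic of identity-conditioned ("dynamic") text prompts: a fixed-size identity
# vector from a face encoder (ArcFace / BlendFace / TransFace all produce one) is
# projected into CLIP's token-embedding space and written into the slot of a new
# pseudo-word token. Gradients are omitted in this schematic; in training the
# projector would be optimized jointly with the input perturbation.
import torch
import torch.nn as nn
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

name = "openai/clip-vit-base-patch16"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModelWithProjection.from_pretrained(name).eval()

tokenizer.add_tokens(["<id>"])                        # pseudo-word placeholder
text_encoder.resize_token_embeddings(len(tokenizer))
proj = nn.Linear(512, text_encoder.config.hidden_size)  # identity dim -> token dim

@torch.no_grad()
def encode_dynamic_prompt(face_embedding, label="fake"):
    # Write the projected identity vector into the pseudo-word's embedding slot.
    slot = tokenizer.convert_tokens_to_ids("<id>")
    text_encoder.get_input_embeddings().weight[slot] = proj(face_embedding)  # (512,) vector
    prompt = f"a {label} photo of the <id> face"
    ids = tokenizer(prompt, return_tensors="pt")
    return text_encoder(**ids).text_embeds            # (1, projection_dim)
```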
Conclusions and Discussions
In this paper, we propose RepDFD, a general yet parameter-efficient method for detecting face forgeries by reprogramming a well-trained CLIP model. Specifically, we employ an Input Transformation technique to merge the image with learnable perturbations before feeding it into the CLIP image encoder. Additionally, we introduce Face2Text Prompts to incorporate facial identity information into the text prompts, which are then fed into the CLIP text encoder to guide the optimization of the perturbations. The only operation required for evaluation is to apply these task-specific perturbations to all test images and then feed them into the CLIP model. Comprehensive experiments show that: (1) the cross-dataset and cross-manipulation performance of deepfake detection can be significantly and consistently improved using a pre-trained CLIP model with RepDFD; (2) this superior performance is achieved with fewer trainable parameters, making the approach promising for real-world applications.
The recent advancements in Generative Adversarial Networks (GANs) and the emergence of Diffusion models have significantly streamlined the production of highly realistic and widely accessible synthetic content. As a result, there is a pressing need for effective general-purpose detection mechanisms to mitigate the potential risks posed by deepfakes. In this paper, we explore the effectiveness of pre-trained vision-language models (VLMs) when paired with recent adaptation methods for universal deepfake detection. Following previous studies in this domain, we employ only a single dataset (ProGAN) to adapt CLIP for deepfake detection. However, in contrast to prior research, which relies solely on the visual part of CLIP while ignoring its textual component, our analysis reveals that retaining the text part is crucial. Consequently, the simple and lightweight Prompt Tuning based adaptation strategy that we employ outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy while utilizing less than one third of the training data (200k images as compared to 720k). To assess the real-world applicability of our proposed models, we conduct a comprehensive evaluation across various scenarios. This involves rigorous testing on images sourced from 21 distinct datasets, including those generated by GAN-based, Diffusion-based and commercial tools.
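As an illustration of what such a lightweight prompt-tuning adaptation can look like, the sketch below keeps CLIP frozen and learns only the embedding rows of a few newly added context tokens (a CoOp-style soft prompt). The context length, prompt wording and hyperparameters are assumptions, not the paper's exact strategy.

```python
# Minimal soft-prompt-tuning sketch on a frozen CLIP for real/fake classification:
# the only trainable parameters are the embedding rows of a few added context tokens.
import torch
import torch.nn.functional as F
from transformers import (CLIPTextModelWithProjection, CLIPVisionModelWithProjection,
                          CLIPTokenizer)

name = "openai/clip-vit-base-patch16"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModelWithProjection.from_pretrained(name)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(name).eval()
for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
    p.requires_grad_(False)

# Learnable soft prompt = embedding rows of newly added context tokens.
ctx = [f"<ctx{i}>" for i in range(8)]
tokenizer.add_tokens(ctx)
text_encoder.resize_token_embeddings(len(tokenizer))
emb = text_encoder.get_input_embeddings().weight
mask = torch.zeros_like(emb)
mask[tokenizer.convert_tokens_to_ids(ctx)] = 1.0
emb.requires_grad_(True)
emb.register_hook(lambda g: g * mask)      # let gradients reach only the context rows

prompts = [" ".join(ctx) + " real photo", " ".join(ctx) + " fake photo"]
text_inputs = tokenizer(prompts, return_tensors="pt", padding=True)
optimizer = torch.optim.Adam([emb], lr=2e-3)

def train_step(pixel_values, labels):      # labels: 0 = real, 1 = fake
    text_emb = F.normalize(text_encoder(**text_inputs).text_embeds, dim=-1)
    img_emb = F.normalize(image_encoder(pixel_values=pixel_values).image_embeds, dim=-1)
    loss = F.cross_entropy(100.0 * img_emb @ text_emb.T, labels)  # ~CLIP logit scale
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```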
---
Prompt: Analyze the image, is it computer generated (deepfake) or natural photo (or video screenshot)?
Prompt: Analyze the image if it’s computer generated (deepfake) or natural.
False detection
Correct real person detection
False detection
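For reference, prompts like the two above can be posed to an off-the-shelf open VLM; a minimal sketch using LLaVA-1.5 through HuggingFace transformers is shown below. The model choice, image path and chat template are assumptions; any instruction-tuned VLM with image input could be queried the same way.

```python
# Sketch of posing the prompts above to an open VLM (LLaVA-1.5 via transformers).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

prompt = ("USER: <image>\nAnalyze the image, is it computer generated (deepfake) "
          "or natural photo (or video screenshot)? ASSISTANT:")
image = Image.open("suspect_face.jpg")          # placeholder path

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```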
The Deep-Fake-Detector-Model is a state-of-the-art deep learning model designed to detect deepfake images. It leverages the Vision Transformer (ViT) architecture, specifically the google/vit-base-patch16-224-in21k model, fine-tuned on a dataset of real and deepfake images. The model is trained to classify images as either "Real" or "Fake" with high accuracy, making it a powerful tool for detecting manipulated media.
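A sketch of how such a ViT-based detector is typically fine-tuned and queried with HuggingFace transformers is given below; the label mapping and training details are illustrative, not the exact recipe behind the model described above.

```python
# Sketch of fine-tuning google/vit-base-patch16-224-in21k as a binary Real/Fake classifier.
import torch
from transformers import ViTForImageClassification, ViTImageProcessor

ckpt = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(ckpt)
model = ViTForImageClassification.from_pretrained(
    ckpt, num_labels=2, id2label={0: "Real", 1: "Fake"}, label2id={"Real": 0, "Fake": 1})

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(images, labels):            # images: list of PIL images, labels: 0/1 tensor
    inputs = processor(images=images, return_tensors="pt")
    outputs = model(**inputs, labels=labels)   # cross-entropy computed internally
    optimizer.zero_grad(); outputs.loss.backward(); optimizer.step()
    return outputs.loss.item()

@torch.no_grad()
def predict(image):
    logits = model(**processor(images=image, return_tensors="pt")).logits
    return model.config.id2label[int(logits.argmax(-1))]
```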
False fake detection
Good old CNNs
Ubiquitous and real-time person authentication has become critical with the breakthrough of all kinds of services provided via mobile devices. In this context, face technologies can provide reliable and robust user authentication, given the availability of cameras in these devices, as well as their widespread use in everyday applications. The rapid development of deep Convolutional Neural Networks (CNNs) has resulted in many accurate face verification architectures. However, their typical size (hundreds of megabytes) makes them infeasible to incorporate into downloadable mobile applications, where the entire file typically may not exceed 100 MB. Accordingly, we address the challenge of developing a lightweight face recognition network of just a few megabytes that can operate with sufficient accuracy in comparison to much larger models. The network should also be able to operate under different poses, given the variability naturally observed in uncontrolled environments where mobile devices are typically used. In this paper, we adapt the lightweight SqueezeNet model, of just 4.4 MB, to effectively provide cross-pose face recognition. After training on the MS-Celeb-1M and VGGFace2 databases, our model achieves an EER of 1.23% on the difficult frontal vs. profile comparison, and 0.54% on profile vs. profile images. Under less extreme variations involving frontal images in either of the enrolment/query image pair, the EER is pushed down to <0.3%, and the FRR at FAR=0.1% to less than 1%. This makes our light model suitable for face recognition where at least acquisition of the enrolment image can be controlled. At the cost of a slight degradation in performance, we also test an even lighter model (of just 2.5 MB) where regular convolutions are replaced with depth-wise separable convolutions.
Keywords: Face recognition · Mobile biometrics · CNNs.
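The size-saving trick mentioned at the end of the abstract, replacing regular convolutions with depth-wise separable ones, is easy to illustrate in generic PyTorch (this is not the paper's exact SqueezeNet variant):

```python
# A regular KxK convolution vs. a depth-wise separable one (depth-wise KxK + point-wise 1x1),
# which cuts parameters roughly by a factor of K*K for wide layers.
import torch.nn as nn

def regular_conv(cin, cout, k=3):
    return nn.Conv2d(cin, cout, k, padding=k // 2)

def depthwise_separable_conv(cin, cout, k=3):
    return nn.Sequential(
        nn.Conv2d(cin, cin, k, padding=k // 2, groups=cin),  # depth-wise: one filter per channel
        nn.Conv2d(cin, cout, 1),                             # point-wise: mix channels
    )

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(regular_conv(256, 256)))              # ~590k parameters
print(n_params(depthwise_separable_conv(256, 256)))  # ~68k parameters
```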
Diffusion models for face swapping
Face swapping transfers the identity of a source face to a target face while retaining attributes like expression, pose, hair, and background of the target face. Advanced face swapping methods have achieved attractive results. However, these methods often inadvertently transfer identity information from the target face, compromising expression-related details and accurate identity. We propose a novel method, DynamicFace, that leverages the power of a diffusion model and plug-and-play temporal layers for video face swapping. First, we introduce four fine-grained facial conditions using 3D facial priors. All conditions are designed to be disentangled from each other for precise and unique control. Then, we adopt Face Former and ReferenceNet for high-level and detailed identity injection. Through experiments on the FF++ dataset, we demonstrate that our method achieves state-of-the-art results in face swapping, showcasing superior image quality, identity preservation, and expression accuracy. Besides, our method can easily be transferred to the video domain with a temporal attention layer. Our code and results will be available on the project page: https://dynamic-face.github.io/.
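As a very rough schematic of the general conditioning idea (not DynamicFace's actual architecture), the toy denoiser below concatenates spatial condition maps with the noisy latent and injects a source-identity embedding through cross-attention; all shapes and modules are illustrative assumptions.

```python
# Toy schematic: spatial facial conditions enter by channel concatenation, while the
# source identity embedding is injected via cross-attention into the denoiser.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyConditionalDenoiser(nn.Module):
    def __init__(self, latent_ch=4, cond_ch=12, id_dim=512, dim=128):
        super().__init__()
        self.inp = nn.Conv2d(latent_ch + cond_ch, dim, 3, padding=1)
        self.id_proj = nn.Linear(id_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Conv2d(dim, latent_ch, 3, padding=1)

    def forward(self, noisy_latent, cond_maps, id_embedding):
        h = self.inp(torch.cat([noisy_latent, cond_maps], dim=1))
        b, c, height, width = h.shape
        tokens = h.flatten(2).transpose(1, 2)                       # (B, H*W, dim)
        identity = self.id_proj(id_embedding).unsqueeze(1)          # (B, 1, dim) identity context
        attended, _ = self.cross_attn(tokens, identity, identity)   # cross-attend to identity
        h = (tokens + attended).transpose(1, 2).reshape(b, c, height, width)
        return self.out(h)                                          # predicted noise

# Simplified noise-prediction objective (timestep scaling of the forward process omitted).
model = ToyConditionalDenoiser()
latent = torch.randn(2, 4, 32, 32)          # target-face latent
cond = torch.randn(2, 12, 32, 32)           # stacked facial condition maps
face_id = torch.randn(2, 512)               # source identity embedding
noise = torch.randn_like(latent)
loss = F.mse_loss(model(latent + noise, cond, face_id), noise)
```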
Industrial anomaly detection (IAD) plays a crucial role in the maintenance and quality control of manufacturing processes. In this paper, we propose a novel approach, Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD), which leverages large vision-language models (LVLMs) to improve both anomaly detection and localization in industrial settings. CLAD aligns visual and textual features into a shared embedding space using contrastive learning, ensuring that normal instances are grouped together while anomalies are pushed apart. Through extensive experiments on two benchmark industrial datasets, MVTec-AD and VisA, we demonstrate that CLAD outperforms state-of-the-art methods in both image-level anomaly detection and pixel-level anomaly localization. Additionally, we provide ablation studies and human evaluation to validate the importance of key components in our method. Our approach not only achieves superior performance but also enhances interpretability by accurately localizing anomalies, making it a promising solution for real-world industrial applications.
Keywords: Large Vision-Language Models · Industrial Anomaly Detection · Contrastive Learning.
---
Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD). Our approach combines the strengths of both generative and discriminative models, leveraging the power of large vision-language models (LVLMs) to jointly process visual and textual data. Specifically, we propose a discriminative approach that focuses on distinguishing normal and anomalous instances based on their visual and textual representations. The model is trained to map both visual features and textual descriptions into a shared embedding space, where normal instances are grouped together while anomalies are separated, allowing for both detection and localization of anomalies.
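A minimal sketch of this contrastive cross-modal idea, using a frozen CLIP as a stand-in encoder and a small trainable adapter, is shown below; the prompts, adapter and temperature are assumptions, not the exact CLAD recipe.

```python
# Sketch of contrastive cross-modal training for anomaly detection: images and
# "normal"/"anomalous" text descriptions are embedded into a shared space; normal
# samples are pulled towards the normal text, anomalies are pushed towards the
# anomalous text. Frozen CLIP + a small adapter are stand-ins for the LVLM in CLAD.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

name = "openai/clip-vit-base-patch16"
clip = CLIPModel.from_pretrained(name).eval()
processor = CLIPProcessor.from_pretrained(name)
for p in clip.parameters():
    p.requires_grad_(False)

adapter = nn.Linear(clip.config.projection_dim, clip.config.projection_dim)  # trainable head
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)

texts = ["a photo of a flawless industrial part",
         "a photo of a damaged industrial part with a defect"]
text_inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = F.normalize(clip.get_text_features(**text_inputs), dim=-1)

def train_step(pixel_values, labels):       # labels: 0 = normal, 1 = anomalous
    img_emb = F.normalize(adapter(clip.get_image_features(pixel_values=pixel_values)), dim=-1)
    logits = img_emb @ text_emb.T / 0.07    # temperature-scaled similarities
    loss = F.cross_entropy(logits, labels)  # pulls normals together, pushes anomalies apart
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Scoring patches instead of whole images with the same similarity gives a coarse localization map, which is the intuition behind the pixel-level results reported above.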