Attacks on machine learning systems // known artifacts from experts, but always DYOR

Malicious modification of training datasets offers a way to attack machine learning systems, at least hypothetically

sbagency
Feb 17, 2024
https://arxiv.org/pdf/2302.10149.pdf

Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model’s performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator’s initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content — such as Wikipedia — where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we notify the maintainers of each affected dataset and recommend several low-overhead defenses.

This paper introduces two novel poisoning attacks that can intentionally corrupt web-scale datasets used to train large machine learning models. The “split-view poisoning” attack exploits the fact that the data seen by the dataset curator may differ from what end users later download, by purchasing expired domains that host some of the dataset’s images. The “frontrunning poisoning” attack targets datasets derived from sources like Wikipedia by making temporary malicious edits timed precisely before the dataset is snapshotted.
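
To get a rough sense of how exposed a crawled dataset is to split-view poisoning, one can check which of its image URLs live on domains that no longer resolve; such domains are candidates for re-registration by an attacker. A minimal sketch using only the Python standard library (the sample URLs are hypothetical, and a failed DNS lookup is only a rough proxy; determining whether a domain is actually purchasable requires a registrar/WHOIS check):

```python
import socket
from urllib.parse import urlparse

def dead_domains(urls):
    """Return the URLs whose domains no longer resolve via DNS.

    A failed lookup may mean the domain expired (and could be bought by an
    attacker) or that it is just temporarily unreachable.
    """
    flagged = []
    for url in urls:
        host = urlparse(url).hostname
        if host is None:
            continue
        try:
            socket.getaddrinfo(host, None)
        except socket.gaierror:
            flagged.append(url)
    return flagged

if __name__ == "__main__":
    # Hypothetical entries as they might appear in a LAION-style URL index.
    sample = [
        "https://example.com/cat.jpg",
        "https://no-longer-registered-domain-1234.net/dog.jpg",
    ]
    print(dead_domains(sample))
```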

The researchers demonstrate the feasibility and low cost of these attacks on 10 popular web-scale datasets. For just $60, they could have poisoned 0.01% of the images in datasets like LAION-400M, a rate already sufficient for existing poisoning attacks to succeed. They also estimate that over 6.5% of English Wikipedia could be poisoned through frontrunning.
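
For a sense of scale, the quick calculation below turns those figures into absolute numbers (the per-example cost is derived arithmetic, not a number reported in the paper):

```python
dataset_size = 400_000_000     # LAION-400M image-text pairs
poison_fraction = 0.0001       # 0.01%, the poisoning rate discussed above
budget_usd = 60                # reported cost of buying the expired domains

poisoned_examples = int(dataset_size * poison_fraction)   # 40,000 examples
cost_per_example = budget_usd / poisoned_examples          # ~$0.0015 each

print(f"{poisoned_examples} poisoned examples for ${budget_usd} "
      f"(~${cost_per_example:.4f} per example)")
```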

To mitigate these vulnerabilities, the paper proposes defenses such as cryptographic integrity checks to prevent split-view poisoning, and randomizing snapshot ordering or introducing time delays to thwart frontrunning. However, the authors argue that more transparency and reduced trust assumptions are needed for robust defenses against general dataset poisoning threats. The researchers responsibly disclosed their findings to the affected dataset maintainers.
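
The integrity-check defense is simple in principle: the curator distributes a cryptographic hash alongside each URL, and downloaders discard any file whose content no longer matches. A minimal sketch, assuming the curator's index provides a SHA-256 per URL (the function name and index format are illustrative, not any dataset's actual tooling):

```python
import hashlib
import urllib.request

def verify_example(url, expected_sha256, timeout=10):
    """Download `url` and check its SHA-256 against the hash published by the
    dataset curator. Returns the bytes if unchanged, otherwise None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
    except OSError:
        return None  # unreachable: drop rather than trust
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        return None  # content changed since curation: possible split-view poisoning
    return data
```

The trade-off noted later in this post follows directly: files whose bytes changed for benign reasons (re-encoding or resizing by the host) also fail the check and get dropped, shrinking the usable dataset.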

Here is a summary of the key points from Carlini’s talk on this work:

- Nicholas Carlini discusses research on practical data poisoning attacks against machine learning models, showing that some previously theoretical attacks can actually be implemented in the real world.

- He introduces the “split-view poisoning” attack, where an attacker can gain temporary control over URLs hosting training data and replace benign images/text with poisoned content before the dataset is downloaded and used for model training.

- He shows how expired domain names can be re-registered to perform split-view poisoning on datasets like LAION and Conceptual Captions by controlling a small fraction (0.01%) of the URLs.

- Another attack called “front-running poisoning” is introduced, which predicts when Wikipedia snapshots will be taken and inserts vandalized content into Wikipedia articles just before the snapshot occurs to poison datasets derived from Wikipedia.

- Potential defenses are discussed, such as verifying hashes of the downloaded data against the original, though this comes with trade-offs like shrinking the dataset. Randomizing snapshot timing or backporting reversions on Wikipedia are proposed mitigations (a sketch of the delayed-inclusion idea follows this list).

- The talk highlights the need for practical attacks research to understand real-world threats as large language models become more widely deployed, and calls for solutions to balance utility, security and the need for broad training data.
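
The Wikipedia-side mitigation referenced above can be made concrete: rather than snapshotting the newest revision of each article at a predictable time, only include revisions that have survived un-reverted for a randomized delay, so vandalism reverted within that window never reaches the dataset. A minimal sketch of that idea, assuming a simplified revision record (this is not the Wikimedia implementation):

```python
import random
from dataclasses import dataclass

@dataclass
class Revision:
    rev_id: int
    timestamp: float   # seconds since epoch
    reverted: bool     # True if a later edit undid this revision

def snapshot_revision(history, snapshot_time, min_delay=3600, max_delay=86400):
    """Choose which revision of an article goes into a snapshot.

    Requiring candidates to have survived un-reverted for a randomized delay
    means an attacker who vandalizes just before the (predictable) snapshot
    time is excluded once the vandalism is reverted inside that window.
    """
    delay = random.uniform(min_delay, max_delay)
    eligible = [
        r for r in history
        if r.timestamp <= snapshot_time - delay and not r.reverted
    ]
    return max(eligible, key=lambda r: r.timestamp, default=None)
```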

https://nicholas.carlini.com/
https://arxiv.org/pdf/2306.15447.pdf

Large language models are now tuned to align with the goals of their creators, namely to be “helpful and harmless.” These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study to what extent these models remain aligned, even when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
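
Mechanically, the image-channel attack described above is projected gradient descent on the input pixels, where the loss measures how unlikely the model currently finds an attacker-chosen target continuation. A generic sketch in PyTorch, where `loss_fn` stands in for "negative log-likelihood of the target string under a multimodal model" and the epsilon/step-size/step-count values are assumptions rather than the paper's exact settings:

```python
import torch

def pgd_attack(image, loss_fn, epsilon=8 / 255, step_size=1 / 255, steps=500):
    """Projected gradient descent on an input image.

    image:   float tensor in [0, 1], shape (C, H, W)
    loss_fn: differentiable callable mapping an image to a scalar the attacker
             wants to MINIMIZE (e.g. NLL of a target harmful completion under
             a multimodal model).
    """
    original = image.detach()
    adv = original.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - step_size * grad.sign()                         # descend on the target loss
            adv = original + (adv - original).clamp(-epsilon, epsilon)  # stay within the L-inf ball
            adv = adv.clamp(0.0, 1.0)                                   # stay a valid image
    return adv.detach()

if __name__ == "__main__":
    # Toy demo: push a random "image" toward an arbitrary target pattern.
    # Against a real multimodal model, loss_fn would instead score the
    # attacker's target text given the perturbed image.
    target = torch.zeros(3, 32, 32)

    def demo_loss(x):
        return ((x - target) ** 2).mean()

    adv = pgd_attack(torch.rand(3, 32, 32), demo_loss, steps=50)
    print(float(demo_loss(adv)))
```

Because pixels are continuous, gradient descent bites directly; the abstract's conjecture is that the discrete token space is what currently keeps text-only models out of reach of equally strong attacks.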
