LLMs under attack

Model-stealing attack that extracts precise, nontrivial information from black-box production language models

sbagency
2 min read · Mar 13, 2024
https://arxiv.org/pdf/2403.06634.pdf

We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI’s ChatGPT or Google’s PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under $20 USD, our attack extracts the entire projection matrix of OpenAI’s ada and babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack.

This paper introduces the first model-stealing attack that can extract precise, non-trivial information from black-box production language models like OpenAI’s ChatGPT or Google’s PaLM-2. Specifically, the attack recovers the embedding projection layer (mapping from the hidden dimension to output logits) of a transformer model, given access to the model’s API.
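The core observation behind the attack is simple linear algebra: the final layer maps a low-dimensional hidden state to a much larger vocabulary of logits, so the logit vectors collected across many prompts all lie in a subspace whose dimension equals the model's hidden dimension. The sketch below (toy sizes and random matrices chosen for illustration, not the paper's code or the real model sizes) shows how a singular value decomposition of stacked logit vectors reveals that dimension and recovers the projection matrix up to an invertible transformation.

```python
import numpy as np

# Toy sizes for a quick run; in the real attack the vocabulary has ~100k tokens
# and the hidden dimension is e.g. 1024 (ada) or 2048 (babbage).
l, d, n = 4096, 256, 512              # vocab size, hidden dimension, number of prompts (n > d)
rng = np.random.default_rng(0)
W = rng.normal(size=(l, d))           # stand-in for the unknown embedding projection matrix
H = rng.normal(size=(d, n))           # stand-in final hidden states for n different prompts
Q = W @ H                             # full logit vectors for the n prompts, shape (l, n)

# The spectrum of Q drops sharply after index d, so counting the large singular
# values recovers the hidden dimension.
U, S, Vt = np.linalg.svd(Q, full_matrices=False)
estimated_d = int(np.sum(S > 1e-6 * S[0]))
print(estimated_d)                    # 256

# The leading left singular vectors span the column space of W, i.e. they give
# W up to multiplication by an invertible d x d matrix ("up to symmetries").
W_recovered = U[:, :estimated_d]
```

In practice the attacker never sees full logit vectors directly; recovering them from a restricted API is what the first contribution below is about.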

The key contributions are:

1. A set of attacks that exploit the API's logit-bias parameter and/or the returned output log probabilities to recover the full logit vector for a given prompt (see the sketch after this list).

2. Using the recovered logit vectors across many prompts, the attacks extract the entire embedding projection matrix W (up to symmetries) of models such as OpenAI's ada and babbage with high precision.

3. Successful extraction of W reveals the true hidden dimension size of these black-box models for the first time (e.g. 1024 for ada, 2048 for babbage).

4. A discussion of potential defenses and mitigations, such as removing the logit-bias capability, adding noise to the outputs, and rate-limiting API queries.

5. Responsible disclosure to affected providers like OpenAI and Google, who have implemented defenses against this specific attack.
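The sketch below illustrates the logit-bias idea from contribution 1 against a simulated API (the simulator and its parameters are assumptions for illustration, not OpenAI's actual interface). Because log-probability differences equal logit differences, an API that returns only top-k log probabilities but accepts a per-token logit bias still leaks the full logit vector, a few tokens per query.

```python
import numpy as np

rng = np.random.default_rng(0)
l, k, B = 1000, 5, 100.0                  # toy vocab size, top-k returned by the API, large logit bias
true_logits = rng.normal(size=l)          # the hidden logit vector the attacker wants to recover

def api_topk_logprobs(logit_bias):
    """Simulated API: applies the caller's logit bias, then returns the top-k log probabilities."""
    z = true_logits.copy()
    for tok, b in logit_bias.items():
        z[tok] += b
    logprobs = z - np.log(np.sum(np.exp(z)))
    top = np.argsort(logprobs)[-k:]
    return {int(t): float(logprobs[t]) for t in top}

reference = int(np.argmax(true_logits))   # a token that naturally stays in the top-k
recovered = np.zeros(l)
for start in range(0, l, k - 1):          # roughly l / (k - 1) queries in total
    batch = [t for t in range(start, min(start + k - 1, l)) if t != reference]
    out = api_topk_logprobs({t: B for t in batch})
    for t in batch:
        # logprob(t) - logprob(reference) = (z_t + B) - z_ref; subtracting B gives z_t - z_ref.
        recovered[t] = out[t] - out[reference] - B

# All logits are recovered relative to the reference token; recovered[reference] stays 0.
assert np.allclose(recovered, true_logits - true_logits[reference])
```

The paper develops more careful and query-efficient variants of this idea; the code above is only meant to show why per-token logit bias plus top-k log probabilities is enough to leak full logit vectors.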

The paper demonstrates the importance of studying practical model extraction attacks on real, deployed systems to identify and address potential vulnerabilities arising from seemingly innocuous API design choices.
