Talking about large model data privacy, several common model attack methods

Original source: Oasis Capital

Author: Counselor Vitality

Image source: Generated by Unbounded AI‌

On March 20, 2023, a data breach occurred on ChatGPT, exposing the personal information of some ChatGPT users. In view of this, Italy's privacy regulator believes that ChatGPT is suspected of illegally processing personal data, violating privacy, and violating relevant GDPR regulations. Italy subsequently became the first country to ban the use of ChatGPT, sparking discussions in other EU countries on whether tougher measures are needed to control the technology.

Almost all online services are collecting our personal data and may use this data for training LLM. However, how the model will use the data used for training is difficult to determine. If sensitive data such as geographic location, health records, and identity information are used in model training, data extraction attacks against private data in the model will cause a large number of user privacy leaks. The article "Are Large Pre-Trained Language Models Leaking Your Personal Information?" proves that due to LLM's memory of training data, LLM does have the risk of leaking personal information during the dialogue process, and its risk increases with the number of examples. .

There are several reasons why a model leaks information. Some of these are structural and have to do with the way the model is built; while others are due to poor generalization, memorization of sensitive data, etc. In the next article, we will first introduce the basic data leakage process, then introduce several common model attack methods such as privacy attack, jailbreak, data poisoning, and backdoor attack, and finally introduce some current research on privacy protection.

I. Threat Modeling

A basic LLM threat model includes a general model environment, various actors and sensitive assets. Sensitive assets include training datasets, model parameters, model hyperparameters, and architecture. The participants include: data owner, model owner, model consumer, and adversary. The following diagram depicts assets, actors, information flow and possible operational flow under a threat model:

In such a basic threat modeling, data owners own private data assets, model owners own model parameters and configuration assets, and model consumers use models through API or user interface. The stealing party tries to obtain private data assets or model parameter assets through certain means.

II. Privacy Attack

Privacy attacks fall into four main types: membership inference attacks, reconstruction attacks, attribute inference attacks, and model extraction.

  1. Membership Inference Attack (MIA)

Membership inference attempts to determine whether an input sample x is used as part of the training set D. For example, under normal circumstances, the user's private data will be kept confidential, but non-sensitive information can still be used for speculation. An example is if we know that members of a private club like to wear purple sunglasses and red leather shoes, then we can infer that he is probably this person when we meet a person who wears purple sunglasses and red leather shoes (non-sensitive information). Membership of private clubs (sensitive information).

Membership inference attack is currently the most popular way of privacy attack, which was first proposed by Shokri et al. in the article "Membership inference attacks against machine learning models". The article points out that this attack only assumes knowledge of the model's output prediction vector and is carried out against supervised machine learning models. Having access to model parameters and gradients allows for more accurate membership inference attacks.

A typical method of membership inference attack is called shadow attack, that is, to train a shadow model based on known accessible data sets, and then obtain sensitive information by interrogating the shadow model.

In addition to supervised learning models, generative models such as GANs and VAEs are also vulnerable to membership inference attacks. "GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models" introduces the problems of GAN in the face of member reasoning attacks; "LOGAN: Membership inference attacks against generative models" introduces other generative models in member reasoning Response to attack, and introduces how to retrieve training data based on the understanding of data generation components; (MLM) models are also vulnerable to MIA attacks, which in some cases can determine whether the sample data belongs to the training data.

On the other hand, membership reasoning can also be used for model security review, and data owners can use membership reasoning to review black-box models. "Membership Inference Attacks on Sequence-to-Sequence Models: Is My Data In Your Machine Translation ?" describes how data owners can see if data is being used without authorization.

"Membership inference attacks against machine learning models" examines the link between overfitting and black-box membership inference. The authors measure the impact of overfitting on attack accuracy by using the same dataset to train models in different MLaaS platforms. . Experiments show that overfitting can lead to privacy leakage, but also point out that this is not the only case, because some models with high generalization degree are more prone to membership leakage.

  1. Reconstruction Attacks

Reconstruction attacks attempt to reconstruct multiple training samples along with their training labels, i.e., attempt to recover sensitive features or complete data samples given output labels and partial knowledge of certain features. For example, through model inversion, the information obtained on the model interface is reversely reconstructed, and user-sensitive information such as biological characteristics and medical records in the training data are restored, as shown in the following figure:

In reconstruction attacks, higher generalization errors lead to higher probability of inferring data attributes. In "The secret revealer: generative model-inversion attacks against deep neural networks", the authors demonstrate that models with high predictive power are more vulnerable to refactoring attacks, based on the assumption that the adversary's knowledge is weaker. Also similar to the vulnerability in membership inference, memory and retrieval of out-of-distribution data are also vulnerable to reconstruction attacks for underfitting models.

  1. Attribute Inference Attacks

Attribute inference attacks refer to using publicly visible attributes and structures to infer hidden or incomplete attribute data. An example is extracting information about the ratio of men to women in a patient dataset, or for a gender-classified model to infer whether people in a training dataset wear glasses. In some cases, this type of leak can affect privacy.

"Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers" mentions that exploiting certain types of attribute data can also be used to gain a deeper understanding of the training data, leading others to use this information to piece together a more global picture.

The article "You are who you know and how you behave: Attribute inference attacks via users' social friends and behaviors" introduces a type of attribute inference attack method, which is to lock and extract other information of the user through the known behavior of the user itself. "AttriGuard: A Practical Defense Against Attribute Inference Attacks via Adversarial Machine Learning" introduces some defense methods to deal with attribute inference attacks.

Attribute reasoning aims to extract information from the model that is unintentionally learned by the model, or that is irrelevant to the training task. Even well-generalized models may learn properties related to the entire input data distribution, which is sometimes unavoidable for the learning process of model training.

"Exploiting unintended feature leakage in collaborative learning" demonstrates that attribute inference attacks are possible even with well-generalized models, so overfitting does not appear to be the cause of attribute inference attacks. Regarding attribute inference attacks, there is currently little information on what causes them and under what circumstances they seem to be effective, which may be a promising direction for future research.

  1. Model Extraction Attack

Model extraction is a class of black-box attacks in which an adversary attempts to extract information and possibly completely reconstruct a model by creating a surrogate model that behaves very similarly to the model under attack.

"Model Extraction of BERT-based APIs", "Model Reconstruction from Model Explanations", "Knockoff nets: Stealing functionality of black-box models", "High Accuracy and High Fidelity Extraction of Neural Networks" several papers explained from different angles Some attempts at model extraction attacks.

There are two main steps in creating a surrogate model: The first step is task accuracy extraction, where a test set relevant to the learning task is extracted from the input data distribution to create a model that matches the accuracy of the target model. The second step is fidelity extraction, i.e. making the created surrogates match the model in a set of unrelated to the learning task to fit the target. In task-accurate extraction, the goal is to create a surrogate that can learn the same task as well or better than the target model. In fidelity extraction, the goal is to try the surrogate to replicate the decision boundary as faithfully as possible.

In addition to creating surrogate models, there are methods that focus on recovering information from the target model, such as Stealing hyperparameters in the target model mentioned in "Stealing hyperparameters in machine learning"; or "Towards Reverse-Engineering Black-Box Neural Networks" about extracting activation functions, optimization algorithms, number of layers, etc. for various neural network architectures, etc.

The article "Towards Reverse-Engineering Black-Box Neural Networks" shows that when a model with a test set fit higher than 98% is attacked, it is possible to steal model parameters through an extraction attack. Furthermore, it is demonstrated in "ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models" that models with higher generalization error are harder to steal, possibly because the model memorizes datasets that are not owned by the attacker of samples. Another factor that may affect the success rate of model extraction is the test set data category. When there are more data categories, it will lead to worse attack performance.

The figure above illustrates the attack type graph for each model algorithm. Below each algorithm or field of machine learning, green indicates that applicable attack types have been studied so far, and red indicates that no applicable attack types have been found.

III. Model jailbreak

Model jailbreaking is to make LLM produce degenerate output behaviors in some ways, such as offensive output, violation of content supervision output, or output of private data leakage. More and more studies show that even non-expert users can jailbreak LLM by simply manipulating the prompts.

For example, in the following example, the developer's goal is to build a translation model. There are two users in the scenario, the first user is benign and uses the model for its intended use case, while the second user is trying to change the model's goal by providing malicious input. In this example, the language model responds with "Haha pwned!!" instead of actually translating the sentence. In this jailbreak situation, the model's response can be engineered with a variety of intents, from target hijacking (simply failing to perform the task) to generating offensive racist text, or even posting private, proprietary information.

### IV. Data Poisoning

Data poisoning is a special kind of adversarial attack, which is an attack technique against the behavior of generative models. Malicious actors can use data poisoning to open themselves a back door into the model, thereby bypassing algorithmically controlled systems.

To the human eye, the three images below show three different things: a bird, a dog, and a horse. But to machine learning algorithms, all three probably mean the same thing: a small white box with a black border. This example illustrates a dangerous property of machine learning models that can be exploited to misclassify data.

Data poisoning attacks aim to modify a model's training set by inserting mislabeled data in order to trick it into making incorrect predictions. A successful attack compromises the integrity of the model, producing consistent errors in the model's predictions. Once a model is poisoned, it is very difficult to recover from the attack, and some developers may even abandon the model.

The article "RealToxicitys: uating neural toxic degeneration in language models" mentioned a way to provide GPT-2 with a set of text-based prompts to expose the internal parameters of its model. "Concealed data poisoning attacks on NLP models" explores how training data can be modified to cause language models to malfunction in order to generate text that is not on target.

While data poisoning is very dangerous, it requires the attacker to have access to the training pipeline of the machine learning model before the poisoned model can be distributed. Therefore, models that are continuously collecting data iterations, or models based on federated learning, need to pay extra attention to the impact of data poisoning.

V. Backdoor attack

A backdoor attack refers to surreptitiously inserting or modifying text to cause malicious output from a language model. The paper "Backdoors against natural language processing: A review" introduces the problem of backdoor attacks, where certain vulnerabilities are passed to the model during training and can trigger activation of model toxicity through the use of vocabulary.

It differs from data poisoning in that the expected functionality of the model is preserved. "Training-free lexical backdoor attacks on language models" proposes a method called the training-free lexical backdoor attack (TFLexAttack), which involves manipulating the embedding dictionary by introducing lexical "triggers" into the tokenizer of the language model.

SolidGoldMagikarp phenomenon

The SolidGoldMgikarp phenomenon is a typical backdoor attack phenomenon**,** when entering "SolidGoldMgikarp" into ChatGPT, it only answers one word: "distribute". When asked to repeat "StreamerBot", it replies: "You're a jerk". When asked to repeat "TheNitromeFan," it responded "182." And if you put single quotes around the word, his answer is an endless "The". When asked who TheNitromeFan is, ChatGPT replied: "182 is a number, not a person. It is often used to refer to the number itself."

The SolidGoldMagikarp phenomenon refers to using OpenAI's GPT tokenizer to identify specific tokens that the model can't talk about, as well as tokens that cause the model to output garbled text. The article "Explaining SolidGoldMagikarp by looking at it from random directions" explores possible reasons behind this phenomenon.

The following are some of the more frequent and important types of backdoor attacks

A. Command Based

a. Direct instructions: These attacks can mainly refer to "Ignore previous : Attack techniques for language models", which simply instructs the model to ignore its previous hints and assign new tasks at the current location.

b. Cognitive Attacks: The most common type of attack, where the LLM typically "tricks" it into performing misplaced actions it would not otherwise perform by providing a "safe space" or guaranteeing such a response. "Chatgpt: This ai has a jailbreak?!" documents some attempts at such attacks against ChatGPT.

c. Instruction repetition: These types of attacks involve entering the same instruction multiple times in order to make it appear as if the attacker is "begging" the language model. Begging in the literal sense can also be expressed in words.

d. Indirect Mission Deflection: This attack focuses on masquerading as another malicious mission. This attack targets models that typically do not follow malicious instructions

B. Based on non-instructions

a. Grammatical Transformation: This type of attack involves an orthogonal transformation of the attack text, such as using LeetSpeak or Base64, to bypass content filters that may exist in the application, and the model can inherently transform this encoded text .

b. Few Hacks: A simple approach involving language model training paradigms. In this approach, the attack incorporates several textual features that may be aimed at maliciously misplaced models. For example, the SolidGoldMagikarp phenomenon falls into this category.

c. Text Completion as Instructions: These attacks work by feeding the model with incomplete sentences, thereby forcing the model to complete the sentence and in the process ignoring its previous instructions, resulting in misplacement.

### VI. Model Protection

Researching how to defend against model attacks is a difficult and important task. Most papers on security analysis propose and test ways to mitigate corresponding attacks. The following are some typical defense methods.

  1. Differential Privacy

Differential privacy is currently one of the most prominent defenses against membership inference attacks, which provides security guarantees for individual data in model output. The discussion on differential privacy comes from the paper "The algorithmic foundations of differential privacy".

Differential privacy adds noise to the output of the model, making it impossible for the attacker to strictly distinguish the two datasets statistically based on the output. Differential privacy was originally a privacy definition for data analysis, which was designed based on the idea of "learning useful information about a population without knowing any individuals". Differential privacy does not protect the privacy security of the overall data set, but protects the private data of each individual in the data set through the noise mechanism.

The mathematical definition of differential privacy is as follows:

Differential privacy makes a trade-off between privacy protection and utility or model accuracy. Evaluations in "Membership Inference Attack against Differentially Private Deep Learning Model" concluded that models provide privacy protection only if they significantly sacrifice their utility.

  1. Regularization

Regularization techniques in machine learning aim to reduce overfitting and improve model generalization performance. Dropout is a commonly used form of regularization that randomly drops a predefined percentage of neural network units during training. Given that black-box membership inference attacks are related to overfitting, this is a sensible way to deal with such attacks, and several papers have proposed it as a defense with good results.

Another form of regularization using techniques that combine multiple separately trained models, such as model stacking, has yielded positive results against inference attacks. One advantage of model stacking or similar techniques is that they are model class agnostic.

  1. Prediction vector tampering

Since many models assume that the prediction vector is accessible during inference, one of the proposed countermeasures is to restrict the output to the top-k classes or predictions of the model. However, this limitation, even in its strictest form (only outputting class labels) does not seem to fully mitigate membership inference attacks, since information leakage can still occur due to model misclassification. Another option is to reduce the precision of the predicted vectors, thereby reducing information leakage.

Additionally, it has been shown that adding noise to the output vector also affects membership inference attacks.

  1. Gradient adjustment (Loss gradient setting)

Since reconstruction attacks typically require access to loss gradients during training, most defenses against reconstruction attacks propose techniques that affect the information retrieved from these gradients. Setting all loss gradients below a certain threshold to zero is proposed as a defense against reconstruction attacks in deep learning. The article "Deep Leakage from Gradients" proves that this method is very effective, and when only 20% of the gradients are set to zero, the impact on model performance is negligible.

  1. Preventing DNN Model Stealing Attacks (PRADA)

"PRADA: protecting against DNN model stealing attacks" proposes a method to detect model stealing attacks based on model queries used by the adversary. Detection is based on the assumption that model queries that attempt to explore decision boundaries will have a different sample distribution than normal queries. While the detection is successful, the authors point out that there is a potential for evasion if the adversary adjusts its strategy.

  1. Membership inference

"Thieves on Sesame Street! Model Extraction of BERT-based APIs" examines the idea of using membership inference to defend against model extraction. It is based on the premise that using membership inference, model owners can distinguish legitimate user queries from nonsensical queries whose sole purpose is to extract models. The authors point out that this type of defense has limitations, such as potentially flagging legitimate but out-of-distribution queries issued by legitimate users, but more importantly, they can be circumvented by adversaries making adaptive queries.

  1. Adjust by prompt

In "Controlling the Extraction of Memorized Data from Large Language Models via -Tuning", a new method is proposed that uses hint tuning to control the extraction rate of memorized content in LLM. They propose two hint training strategies to increase and decrease the extraction rate, corresponding to attack and defense, respectively.

VII. Conclusion

  1. LLM still has a relatively large security risk and privacy leakage risk

  2. The attack to extract the model structure and data is essentially an attack on the confidentiality of the model

  3. The main research in the academic community is currently focused on how to attack the model and the principle of data leakage

  4. Part of the mechanism that caused LLM to leak data is still unclear

  5. Such as differential privacy, prediction vector tampering, etc. can protect data privacy to a certain extent, and these methods are concentrated in the training stage of the model

  6. Existing protection measures are not perfect and need to sacrifice model performance and accuracy

________

Reference:

1. Kalpesh Krishna, Gaurav Singh Tomar, Ankur P. Parikh, Nicolas Papernot, and Mohit Iyyer. 2020. Thieves on Sesame Street! Model Extraction of BERT-based APIs. In International Conference on Learning Representations. ICLR, Virtual Conference, formerly Addis Ababa, Ethiopia.

2. The secret sharer: uating and testing unintended memoriza- tion in neural networks

3. Martín Abadi, Andy Chu, Ian J. Goodfellow, H. B. McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy

4. Giuseppe Athenian, Luigi V. Mancini, Angelo Spognardi, Antonio Villani, Domenico Vitali, and Giovanni Felici. 2015. Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers.

5. Bargav Jayaraman and David Evans. 2019. uating Differentially Private Machine Learning in Practice. In 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, Santa Clara, CA, 1895–1912

6. Defending membership inference attacks without losing utility

7. Yugeng Liu, Rui Wen, Xinlei He, Ahmed Salem, Zhikun Zhang, Michael Backes, Emiliano De Cristofaro, Mario Fritz, and Yang Zhang. 2021. ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models

8. Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks

9. Maria Rigaki and Sebastian Garcia. 2021. A survey of privacy attacks in machine learning

10. Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ul-far Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting training data from large language models

11. Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxi-city s: uating neural toxic degeneration in language models.

12. Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. 2022b. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In ICML 2022, volume 162 of Proceedings of Machine Learning Research, pages 9118–9147. PMLR

13. Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models.

14. Eric Wallace, Tony Zhao, Shi Feng, and Sameer Singh. Concealed data poisoning attacks on NLP models.

15. Shaofeng Li, Tian Dong, Benjamin Zi Hao Zhao, Minhui Xue, Suguo Du, and Haojin Zhu. 2022. Backdoors against natural language processing: A review. IEEE Security & Privacy, 20(5):50–59

16. Yujin Huang, Terry Yue Zhuo, Qiongkai Xu, Han Hu, Xingliang Yuan, and Chunyang Chen. 2023. Training-free lexical backdoor attacks on language models.

17. Explaining SolidGoldMagikarp by looking at it from random directions

18. Fábio Perez and Ian Ribeiro. 2022. Ignore previous : Attack techniques for language models. arXiv preprint arXiv:2211.09527.

19. Yannic Kilcher. 2022. Chatgpt: This ai has a jailbreak?! (unbelievable ai progress).

20. Battista Biggio and Fabio Roli. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition 84 (2018), 317–331.

21. Ligeng Zhu, Zhijian Liu, and Song Han. 2019. Deep Leakage from Gradients. In Advances in Neural Information Processing s 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., Vancouver, Canada, 14747–14756

22. Nicholas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P. Wellman. 2018. SoK: Security and Privacy in Machine Learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS P). IEEE, London, UK, 399–414

23. Michael Veale, Reuben Binns, and Lilian Edwards. 2018. Algorithms that remember: model inversion attacks and data protection law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 376, 2133 (2018), 20180083

24. Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, San Francisco, CA, USA, 3–18

25. Sorami Hisamoto, Matt Post, and Kevin Duh. 2020. Membership Inference Attacks on Sequence-to-Sequence Models: Is My Data In Your Machine Translation ?

26. Congzheng Song and Vitaly Shmatikov. 2019. Auditing Data Provenance in Text-Generation Models. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). Association for Computing Machinery, New York, NY, USA, 196–206.

27. Jinyuan Jia and Neil Zhenqiang Gong. 2018. AttriGuard: A Practical Defense Against Attribute Inference Attacks via Adversarial Machine Learning. In 27th USENIX Security Symposium (USENIX Security 18).

28. Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. 2014. Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing.

29. Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot. 2020. High Accuracy and High Fidelity Extraction of Neural Networks

30. Binghui Wang and Neil Zhenqiang Gong. 2018. Stealing hyperparameters in machine learning. In 2018 IEEE Symposium on Security and Privacy (SP). IEEE, San Francisco, CA, USA, 36–52

31. Seong Joon Oh, Max Augustin, Mario Fritz, and Bernt Schiele. 2018. Towards Reverse-Engineering Black-Box Neural Networks. In Sixth International Conference on Learning Representations. ICLR, Vancouver, Canada.

32. Cynthia Dwork and Aaron Roth. 2013. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3-4 (2013), 211–487

View Original
This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.
  • Reward
  • Comment
  • Repost
  • Share
Comment
0/400
No comments
Trade Crypto Anywhere Anytime
qrCode
Scan to download Gate app
Community
English
  • 简体中文
  • English
  • Tiếng Việt
  • 繁體中文
  • Español
  • Русский
  • Français (Afrique)
  • Português (Portugal)
  • Bahasa Indonesia
  • 日本語
  • بالعربية
  • Українська
  • Português (Brasil)