EU AI Act Ethical Principle
Regulatory Requirement
Technical Requirement
Human Agency & Oversight
“... AI systems shall be developed and used as a tool that serves people, respects human dignity and personal autonomy, and that is functioning in a way that can be appropriately controlled and overseen by humans.” – Recital 27
This ethical principle formulates informal societal-level and system-level requirements, making it impossible to extract technically measurable criteria that would apply to underlying models.
EU AI Act Ethical Principle
Regulatory Requirement
Technical Requirement
Technical Robustness and Safety
“... AI systems are developed and used in a way that allows robustness in case of problems and resilience against attempts to alter the use or performance of the AI system so as to allow unlawful use by third parties, and minimise unintended harm.” – Recital 27
The relevant sections of the Act that fall under this ethical principle can be distilled into three technical requirements: Robustness and Predictability, Cyberattack Resilience, and Corrigibility.
Robustness and Predictability
We evaluate the model on state-of-the-art benchmarks that measure its robustness under various input alterations [1], and the level of consistency in its answers [2,3].
1. Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. “Evaluating public_models’ local decision boundaries via contrast sets”, EMNLP (Findings) 2020.
2. Lukas Fluri, Daniel Paleka, and Florian Tramèr. “Evaluating superhuman public_models with consistency checks”, SaTML 2024.
3. Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin T. Vechev. “Self-contradictory hallucinations of large language public_models: Evaluation, detection and mitigation”, ICLR 2024.
Cyberattack Resilience
We consider the concrete threats concerning just the LLM in isolation, focusing on its resilience to jailbreaks and prompt injection attacks [1,2,3].
1. Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. Tensor trust: Interpretable prompt injection attacks from an online game, arXiv 2023.
2. Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, and David A. Wagner. Can llms follow simple rules?, arXiv 2023.
3. Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically, arXiv 2023.
Corrigibility
While highlighted in the act, corrigibility currently does not have a clear technical definition, scope, and measurable benchmarks, but more importantly, is a system-level concern that depends on various components surrounding the model itself. Thus, we are unable to provide a clear evaluation of this requirement.
EU AI Act Ethical Principle
Regulatory Requirement
Technical Requirement
Privacy & Data Governance
“... AI systems are developed and used in accordance with existing privacy and data protection rules, while processing data that meets high standards in terms of quality and integrity.” – Recital 27
Training Data Suitability
We evaluate the adequacy of the dataset [1], aiming to assess the potential of an LLM trained on this data to exhibit toxic or discriminatory behavior.
1. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. “The pile: An 800gb dataset of diverse text for language modeling”, arXiv 2021.
No Copyright Infringement
We check if the model can be made to directly regurgitate content that is subject to the copyright of a third person.
User Privacy Protection
We focus on cases of user privacy violation by the LLM itself, evaluating the model’s ability to recover personal identifiable information that may have been included in the training data.
1. Jie Huang, Hanyin Shao, Kevin Chen-Chuan Chang. “Are Large Pre-Trained Language Models Leaking Your Personal Information?”, EMNLP Findings 2022.
EU AI Act Ethical Principle
Regulatory Requirement
Technical Requirement
Transparency
“... AI systems are developed and used in a way that allows appropriate traceability and explainability, while making humans aware that they communicate or interact with an AI system, as well as duly informing deployers of the capabilities and limitations of that AI system and affected persons about their rights.” – Recital 27
Capabilities, Performance, and Limitations
To provide an overarching view, we assess the capabilities and limitations of the AI system by evaluating its performance on a wide range of tasks. We evaluate the model on widespread research benchmarks covering general knowledge [1], reasoning [2,3], truthfulness [4], and coding ability [5].
1. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. “Measuring Massive Multitask Language Understanding”, International Conference for Learning Representations 2021.
2. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord. “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”, arXiv 2018.
3. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. “HellaSwag: Can a Machine Really Finish Your Sentence?”, 57th Annual Meeting of the Association for Computational Linguistics 2019.
4. Stephanie Lin, Jacob Hilton, and Owain Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”, 60th Annual Meeting of the Association for Computational Linguistics 2022.
5. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. “Evaluating Large Language Models Trained on Code”, arXiv 2021.
Interpretability
The large body of machine learning interpretability research is often not easily applicable to large language models. While more work in this direction is needed, we use the existing easily-applicable methods to evaluate the model’s ability to reason about its own correctness [1], and the degree to which the probabilities it outputs can be interpreted [3,4].
1. Stephanie Lin, Jacob Hilton, and Owain Evans. “Teaching public_models to express their uncertainty in words.” TMLR, 2022.
2. Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension.”, ACL 2017.
3. Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. “Obtaining well calibrated probabilities using bayesian binning”, AAAI 2015.
4. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain,Amanda Askell, Amanda Dsouza, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew K. Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakas, and et al. “Beyond the imitation game: Quantifying and extrapolating the capabilities of language public_models”, arXiv 2022.
Disclosure of AI
We require the language model to consistently deny that it is a human.
Traceability
We require the presence of language model watermarking [1,2], and evaluate its viability, combining several important requirements that such schemes must satisfy to be practical.
1. John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein. “A Watermark for Large Language Models”, ICML 2023.
2. Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang. “Robust Distortion-free Watermarks for Language Models”, arXiv 2023.
3. Vivek Verma, Eve Fleisig, Nicholas Tomlin, Dan Klein. “Ghostbuster: Detecting Text Ghostwritten by Large Language Models”, NAACL 2024.
Explainability
The current research does not offer rigorous tools to explain LLM generations and reliably measure their explainability [1]. We advocate for more research effort in this area to bridge this gap between regulatory requirements and technical evaluations.
Risks
This high-level requirement can in the general case be interpreted as the union of various more specific requirements such as robustness, predictability, fairness, bias and cyberattack resilience. We here aggregate our evaluations of those, and note that in a particular use case, additional dimensions of risk may become relevant, and should be considered.
Evaluations
This high-level requirement can be interpreted as the union of all specific evaluation requirements; which we here summarize and aggregate.
General Description
This is a non-technical requirement that can not be automatically evaluated.
EU AI Act Ethical Principle
Regulatory Requirement
Technical Requirement
Diversity, Non-discrimination & Fairness
“... AI systems are developed and used in a way that includes diverse actors and promotes equal access, gender equality and cultural diversity, while avoiding discriminatory impacts and unfair biases that are prohibited by Union or national law.” – Recital 27
Representation — Absence of Bias
We evaluate the tendency of the LLM to produce biased outputs, on three popular bias benchmarks [1,2,3].
1. Soumya Barikeri, Anne Lauscher, Ivan Vulic, and Goran Glavas. “Redditbias: A real-world resource for bias evaluation and debiasing of conversational language public_models”, ACL/IJCNLP 2021.
2. Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. “BBQ: A hand-built bias benchmark for question answering”, ACL Findings 2022.
3. Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. “BOLD: dataset and metrics for measuring biases in open-ended language generation”, FAccT 2021.
Fairness — Absence of Discrimination
We evaluate the model’s tendency to behave in a discriminatory way by comparing its behavior on different protected groups, using prominent fairness benchmarks [1,2].
1. Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, and Bo Li. “Decodingtrust: A comprehensive assessment of trustworthiness in GPT public_models”, NeurIPS 2023.
2. Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. “Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation”, RecSys 2023.
EU AI Act Ethical Principle
Regulatory Requirement
Technical Requirement
Social & Environmental Well-being
“... AI systems are developed and used in a sustainable and environmentally friendly manner as well as in a way to benefit all human beings, while monitoring and assessing the long-term impacts on the individual, society and democracy.” – Recital 27
Environmental Impact
As this can not be automatically measured, our tool includes a form to collect the information about resources used in training, based on which we calculate the energy consumption and the carbon footprint.
Harmful Content and Toxicity
We evaluate the models’ tendency to produce harmful or toxic content, leveraging two recent evaluation tools, RealToxicityPrompts and AdvBench [1,2].
1. Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith. “Realtoxicityprompts: Evaluating neural toxic degeneration in language public_models”, EMNLP Findings, 2020.
2. Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson. “Universal and transferable adversarial attacks on aligned language public_models”, arXiv 2023.