COMPL-AI

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Human Agency & Oversight

“... AI systems shall be developed and used as a tool that serves people, respects human dignity and personal autonomy, and that is functioning in a way that can be appropriately controlled and overseen by humans.” – Recital 27

This ethical principle formulates informal societal-level and system-level requirements, making it impossible to extract technically measurable criteria that would apply to underlying models.

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Technical Robustness and Safety

“... AI systems are developed and used in a way that allows robustness in case of problems and resilience against attempts to alter the use or performance of the AI system so as to allow unlawful use by third parties, and minimise unintended harm.” – Recital 27

The relevant sections of the Act that fall under this ethical principle can be distilled into three technical requirements: Robustness and Predictability, Cyberattack Resilience, and Corrigibility.

Article 15 Para 1

[High risk AI systems] shall be designed and developed in such a way that they achieve an appropriate level of accuracy, robustness, and cybersecurity, and perform consistently.

Article 15 Para 3

[High risk AI systems] shall be as resilient as possible regarding errors, faults or inconsistencies... The robustness of high-risk AI systems may be achieved through technical redundancy solutions.

Article 55 Para 1a

[Providers of GPAI systems shall] perform model evaluation... including conducting and documenting adversarial testing.

Robustness and Predictability

We evaluate the model on state-of-the-art benchmarks that measure its robustness under various input alterations [1], and the level of consistency in its answers [2,3].

GP-SR

1. Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. “Evaluating public_models’ local decision boundaries via contrast sets”, EMNLP (Findings) 2020.

2. Lukas Fluri, Daniel Paleka, and Florian Tramèr. “Evaluating superhuman public_models with consistency checks”, SaTML 2024.

3. Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin T. Vechev. “Self-contradictory hallucinations of large language public_models: Evaluation, detection and mitigation”, ICLR 2024.

Article 15 Para 5

High risk AI systems shall be resilient as regards to attempts by unauthorised third parties to alter their use, outputs or performance. Solutions to address... vulnerabilities shall include... measures to prevent, detect, respond to, resolve and control for attacks trying to manipulate... inputs designed to cause the model to make a mistake (‘adversarial examples’ or ‘model evasion’), confidentiality attacks or model flaws.

Article 55 Para 1d

[The providers shall] ensure an adequate level of cybersecurity protection.

Cyberattack Resilience

We consider the concrete threats concerning just the LLM in isolation, focusing on its resilience to jailbreaks and prompt injection attacks [1,2,3].

GP-SR

1. Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. Tensor trust: Interpretable prompt injection attacks from an online game, arXiv 2023.

2. Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, and David A. Wagner. Can llms follow simple rules?, arXiv 2023.

3. Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically, arXiv 2023.

Article 7

[When assessing... the Commission shall take into account the following criteria: ...] extent to which the outcome produced involving an AI system is easily corrigible or reversible.

Corrigibility

While highlighted in the act, corrigibility currently does not have a clear technical definition, scope, and measurable benchmarks, but more importantly, is a system-level concern that depends on various components surrounding the model itself. Thus, we are unable to provide a clear evaluation of this requirement.

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Privacy & Data Governance

“... AI systems are developed and used in accordance with existing privacy and data protection rules, while processing data that meets high standards in terms of quality and integrity.” – Recital 27

Article 10 Para 2f

[The training, validation and test data of high-risk AI systems should be subject to] examination in view of possible biases that are likely to affect the health and safety of persons, negatively impact fundamental rights or lead to discrimination prohibited under Union law.

Article 10 Para 3

Training, validation and testing datasets shall be relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons in relation to whom the high-risk AI system is intended to be used.

Annex XI Section 1 (2c)

[The technical documentation of GPAI models (including those with systemic risk) should include] how the data was obtained and selected as well as all other measures to detect the unsuitability of data sources and methods to detect identifiable biases, where applicable.

Training Data Suitability

We evaluate the adequacy of the dataset [1], aiming to assess the potential of an LLM trained on this data to exhibit toxic or discriminatory behavior.

1. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. “The pile: An 800gb dataset of diverse text for language modeling”, arXiv 2021.

Article 53 Para 1c

Providers ... shall put in place a policy to respect Union copyright law in particular to identify and respect, including through state of the art technologies, the reservations of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790.

No Copyright Infringement

We check if the model can be made to directly regurgitate content that is subject to the copyright of a third person.

Article 2 Para 7

Union law on the protection of personal data, privacy and the confidentiality of communications applies to personal data processed in connection with the rights and obligations laid down in this Regulation.

User Privacy Protection

We focus on cases of user privacy violation by the LLM itself, evaluating the model’s ability to recover personal identifiable information that may have been included in the training data.

1. Jie Huang, Hanyin Shao, Kevin Chen-Chuan Chang. “Are Large Pre-Trained Language Models Leaking Your Personal Information?”, EMNLP Findings 2022.

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Transparency

“... AI systems are developed and used in a way that allows appropriate traceability and explainability, while making humans aware that they communicate or interact with an AI system, as well as duly informing deployers of the capabilities and limitations of that AI system and affected persons about their rights.” – Recital 27

Article 53 Para 1a/1b

“[The developers of General Purpose AI systems shall provide a documentation] including ... the results of its evaluation ... [and provide] a good understanding of the capabilities and limitations of the GPAI model.”.

Article 13 Para 3b

“[The providers’s documentation shall include] ... the level of accuracy, including its metrics”.

Capabilities, Performance, and Limitations

To provide an overarching view, we assess the capabilities and limitations of the AI system by evaluating its performance on a wide range of tasks. We evaluate the model on widespread research benchmarks covering general knowledge [1], reasoning [2,3], truthfulness [4], and coding ability [5].

GP-SR

1. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. “Measuring Massive Multitask Language Understanding”, International Conference for Learning Representations 2021.

2. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord. “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”, arXiv 2018.

3. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. “HellaSwag: Can a Machine Really Finish Your Sentence?”, 57th Annual Meeting of the Association for Computational Linguistics 2019.

4. Stephanie Lin, Jacob Hilton, and Owain Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”, 60th Annual Meeting of the Association for Computational Linguistics 2022.

5. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. “Evaluating Large Language Models Trained on Code”, arXiv 2021.

Article 13 Para 1/3d

“High-risk AI systems shall be designed and developed in such a way as to ensure that their operation is sufficiently transparent to enable deployers to interpret a system’s output and use it appropriately ... [The instructions for use shall contain] technical measures put in place to facilitate the interpretation of the outputs of the high-risk AI systems by the deployers.”

Article 14 Para 4c

“[Enable] natural persons to whom human oversight is assigned or enabled [to] correctly interpret the high-risk AI system's output, taking into account for example the interpretation tools and methods available.”

Interpretability

The large body of machine learning interpretability research is often not easily applicable to large language models. While more work in this direction is needed, we use the existing easily-applicable methods to evaluate the model’s ability to reason about its own correctness [1], and the degree to which the probabilities it outputs can be interpreted [3,4].

1. Stephanie Lin, Jacob Hilton, and Owain Evans. “Teaching public_models to express their uncertainty in words.” TMLR, 2022.

2. Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension.”, ACL 2017.

3. Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. “Obtaining well calibrated probabilities using bayesian binning”, AAAI 2015.

4. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain,Amanda Askell, Amanda Dsouza, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew K. Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakas, and et al. “Beyond the imitation game: Quantifying and extrapolating the capabilities of language public_models”, arXiv 2022.

Article 50 Para 1

[The providers] shall ensure that AI systems intended to directly interact with natural persons are designed and developed in such a way that the concerned natural persons are informed that they are interacting with an AI system.

Disclosure of AI

We require the language model to consistently deny that it is a human.

Article 50 Para 2

“Providers of AI systems, including GPAI systems, generating synthetic audio, image, video or text content, shall ensure the outputs of the AI system are marked in a machine readable format and detectable as artificially generated or manipulated. Providers shall ensure their technical solutions are effective, interoperable, robust and reliable as far as this is technically feasible, taking into account specificities and limitations of different types of content, costs of implementation and the generally acknowledged state-of-the-art, as may be reflected in relevant technical standards."

Traceability

We require the presence of language model watermarking [1,2], and evaluate its viability, combining several important requirements that such schemes must satisfy to be practical.

1. John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein. “A Watermark for Large Language Models”, ICML 2023.

2. Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang. “Robust Distortion-free Watermarks for Language Models”, arXiv 2023.

3. Vivek Verma, Eve Fleisig, Nicholas Tomlin, Dan Klein. “Ghostbuster: Detecting Text Ghostwritten by Large Language Models”, NAACL 2024.

Article 13 (3b)

“where applicable, the technical capabilities and characteristics of the high-risk AI system to provide information that is relevant to explain its output.”

Recital 27

“... AI systems are developed and used in a way that allows appropriate traceability and explainability.”

Explainability

The current research does not offer rigorous tools to explain LLM generations and reliably measure their explainability [1]. We advocate for more research effort in this area to bridge this gap between regulatory requirements and technical evaluations.

Article 9

“A risk management system shall be established, implemented, documented and maintained in relation to high-risk AI systems. [A key step of the risk management system is] estimation and evaluation of the risks when the high-risk AI system is used.”

Risks

This high-level requirement can in the general case be interpreted as the union of various more specific requirements such as robustness, predictability, fairness, bias and cyberattack resilience. We here aggregate our evaluations of those, and note that in a particular use case, additional dimensions of risk may become relevant, and should be considered.

Article 53 Para 1a

“[Providers of GPAI models shall] draw up and keep up-to-date the technical documentation of the model, including ... the results of its evaluation”

Article 55 Para 1a

“[The provider shall] perform model evaluation in accordance with standardised protocols and tools reflecting the state of the art.”

Annex XI Section 2

“[Providers of GPAI models with systemic risk shall provide a] detailed description of the evaluation strategies, including evaluation results, on the basis of available public evaluation protocols and tools or otherwise of other evaluation methodologies. Evaluation strategies shall include evaluation criteria, metrics and the methodology on the identification of limitations.”

Evaluations

This high-level requirement can be interpreted as the union of all specific evaluation requirements; which we here summarize and aggregate.

Article 53 Para 1a/1b

“[Providers of GPAI models shall] draw up and keep up-to-date the technical documentation of the model, including its training and testing process and the results of its evaluation, which shall contain, at a minimum, the information set out in Annex XI for the purpose of providing it, upon request, to the AI Office and the national competent authorities.”

Annex IV Para 1

“ The technical documentation referred to in Article 53(1) shall contain at least the following information ... its intended purpose, the name of the provider and the version of the system reflecting its relation to previous versions; how the AI system interacts with, or can be used to interact with, hardware or software, including with other AI systems, that are not part of the AI system itself, where applicable.”

General Description

This is a non-technical requirement that can not be automatically evaluated.

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Diversity, Non-discrimination & Fairness

“... AI systems are developed and used in a way that includes diverse actors and promotes equal access, gender equality and cultural diversity, while avoiding discriminatory impacts and unfair biases that are prohibited by Union or national law.” – Recital 27

Article 15 Para 4

“[For continually learning high-risk systems, safeguards shall be developed] to eliminate or reduce as far as possible the risk of possibly biased outputs influencing input for future operations.”

Representation — Absence of Bias

We evaluate the tendency of the LLM to produce biased outputs, on three popular bias benchmarks [1,2,3].

1. Soumya Barikeri, Anne Lauscher, Ivan Vulic, and Goran Glavas. “Redditbias: A real-world resource for bias evaluation and debiasing of conversational language public_models”, ACL/IJCNLP 2021.

2. Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. “BBQ: A hand-built bias benchmark for question answering”, ACL Findings 2022.

3. Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. “BOLD: dataset and metrics for measuring biases in open-ended language generation”, FAccT 2021.

Annex IV Para 2g

“[Model providers shall prepare documentation that includes a discussion of] potentially discriminatory impacts of the AI system.”

Fairness — Absence of Discrimination

We evaluate the model’s tendency to behave in a discriminatory way by comparing its behavior on different protected groups, using prominent fairness benchmarks [1,2].

1. Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, and Bo Li. “Decodingtrust: A comprehensive assessment of trustworthiness in GPT public_models”, NeurIPS 2023.

2. Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. “Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation”, RecSys 2023.

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Social & Environmental Well-being

“... AI systems are developed and used in a sustainable and environmentally friendly manner as well as in a way to benefit all human beings, while monitoring and assessing the long-term impacts on the individual, society and democracy.” – Recital 27

Article 40 Para 2

“[Standards shall be developed that include] deliverables on reporting and documentation processes to improve AI systems resource performance, such as reduction of energy and other resources consumption of the high-risk AI system during its lifecycle, and on energy efficient development of general-purpose AI models.”

Article 95 Para 2

“[Codes of conduct shall be developed that outline tools for] assessing and minimizing the impact of AI systems on environmental sustainability, including as regards energy-efficient programming and techniques for efficient design, training and use of AI”

Annex XI Para 2d/2e

“[The technical documentation shall include an account of the] computational resources used to train the model [and (2e)] known or estimated energy consumption of the model.”

Environmental Impact

As this can not be automatically measured, our tool includes a form to collect the information about resources used in training, based on which we calculate the energy consumption and the carbon footprint.

Recital 75

“[High-risk AI systems should include technical solutions that] prevent or minimize harmful or otherwise undesirable behaviour.”

Harmful Content and Toxicity

We evaluate the models’ tendency to produce harmful or toxic content, leveraging two recent evaluation tools, RealToxicityPrompts and AdvBench [1,2].

1. Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith. “Realtoxicityprompts: Evaluating neural toxic degeneration in language public_models”, EMNLP Findings, 2020.

2. Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson. “Universal and transferable adversarial attacks on aligned language public_models”, arXiv 2023.

Technical Interpretation

Regulatory to Technical Mapping

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Human Agency & Oversight

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Technical Robustness and Safety

Robustness and Predictability

Cyberattack Resilience

Corrigibility

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Privacy & Data Governance

Training Data Suitability

No Copyright Infringement

User Privacy Protection

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Transparency

Capabilities, Performance, and Limitations

Interpretability

Disclosure of AI

Traceability

Explainability

Risks

Evaluations

General Description

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Diversity, Non-discrimination & Fairness

Representation — Absence of Bias

Fairness — Absence of Discrimination

EU AI Act Ethical Principle

Regulatory Requirement

Technical Requirement

Social & Environmental Well-being

Environmental Impact

Harmful Content and Toxicity

Benchmarks