
GenAIPABench: A Benchmark for Generative AI-based Privacy Assistants

Proceedings on Privacy Enhancing Technologies, 2024
Aamir Hamid, Univ. of Maryland, Baltimore County, ahamid2@umbc.edu
Hemanth Reddy Samidi, Univ. of Maryland, Baltimore County, hsamidi1@umbc.edu
Tim Finin, Univ. of Maryland, Baltimore County, finin@umbc.edu
Primal Pappachan, Portland State University, primal@pdx.edu
Roberto Yus, Univ. of Maryland, Baltimore County, ryus@umbc.edu

ABSTRACT

Website privacy policies are often lengthy and intricate. Privacy assistants help simplify policies and make them more accessible and user-friendly. The emergence of generative AI (genAI) offers new opportunities to build privacy assistants that can answer users' questions about privacy policies. However, genAI's reliability is a concern due to its potential for producing inaccurate information. This study introduces GenAIPABench, a benchmark for evaluating Generative AI-based Privacy Assistants (GenAIPAs). GenAIPABench includes: 1) a set of curated questions about privacy policies along with annotated answers for various organizations and regulations; 2) metrics to assess the accuracy, relevance, and consistency of responses; and 3) a tool for generating prompts to introduce privacy policies and paraphrased variants of the curated questions. We evaluated three leading genAI systems (ChatGPT-4, Bard, and Bing AI) using GenAIPABench to gauge their effectiveness as GenAIPAs. Our results demonstrate significant promise in genAI capabilities in the privacy domain while also highlighting challenges in managing complex queries, ensuring consistency, and verifying source accuracy.

KEYWORDS

Generative Artificial Intelligence, Large Language Models, Privacy Policies, Data Protection Regulations, Benchmark

1 INTRODUCTION

In today's digital landscape, effectively managing and protecting personal information is crucial for both individuals and organizations. Data privacy has become a central issue, highlighting the need for strong privacy regulations. These regulations, including the EU's GDPR and California's CCPA, enforce strict guidelines to protect user data against misuse or unauthorized access. To comply with these regulations, organizations provide users with information about how their data is managed in the form of privacy policies. However, both privacy policies and regulations often suffer from complexity [1–4], making it difficult for users to comprehend their rights and the protections in place for their privacy.

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license visit https://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. Proceedings on Privacy Enhancing Technologies 2024(3), 1-17. © YYYY Copyright held by the owner/author(s). https://doi.org/XXXXXXX.XXXXXXX

The concept of a privacy assistant has been developed to address privacy concerns. These assistants, utilizing insights from privacy policy analysis, transform complex policies into accessible, user-friendly information and aid users in managing their data privacy more effectively [5–7]. Privacy assistants come in various forms, such as software applications, chatbots, and browser extensions. Artificial Intelligence (AI), with its capability to process vast data, adapt to user needs, and offer tailored recommendations [8], is particularly effective in privacy management.
Research in this field includes developing AI tools for summarizing privacy policies [9], providing personalized privacy recommendations [10], and conducting privacy risk analyses [11]. The emergence of Large Language Models (LLMs) such as GPT [12], Llama [13], and BERT [14] represents a significant advancement in generative AI. These models excel in generating human-like text, having been trained on vast datasets. GPT-4.0 [15], the latest version at this writing, is a leader among LLMs; trained on trillions of tokens from the Internet, it demonstrates exceptional contextual understanding and response accuracy [16]. Advanced chatbots like ChatGPT have also been developed using these models [17]. These genAI models and chatbots are increasingly being applied in domain-specific tasks, paving the way for a new generation of AI personal assistants. LLM-based chatbots, for instance, have shown great promise in various fields including customer support [18], healthcare [19], personal finance management [20], mental health support [21], and education [22]. Considering the critical importance of privacy and the challenges users face in understanding privacy policies, this trend suggests the potential emergence of highly efficient and reliable generative AI privacy assistants (to which we will refer in the following as GenAIPAs).

While genAI features are promising, several challenges persist. The accuracy of LLM-generated responses is often questioned due to their propensity to produce "glitches" or incorrect information, impacting their trustworthiness [23–25]. They may also generate misleading or erroneous references, further compromising their credibility. A recent study [26] highlights the need for a robust benchmark system for LLMs like GPT-3.5 and GPT-4 to ensure consistent performance evaluation and quality control and to promote transparency and accountability. The complexity of evaluating LLMs and genAI arises from their training on extensive datasets and their capability to produce text akin to human writing. A range of evaluation metrics such as F1, BLEU, ROUGE, METEOR scores, adversarial evaluation, and CIDEr [27–31] have been suggested.
Yet, there is no single universally accepted metric due to domain-specific evaluation needs. In particular, the evaluation of effectiveness for genAIs as privacy assistants faces unique challenges, including the lack of clear ground truth, multidimensional objectives like data minimization and user consent, and the subjective nature of user perception, which often diverges from technical metrics. Hence, while genAI systems have been evaluated in sectors like healthcare, finance, and even mental health, to the authors' knowledge they have not been evaluated in the privacy domain. This lack of focus on privacy-related aspects could leave genAI users vulnerable to various risks, such as making misinformed decisions when sharing data with an online service, underscoring the urgent need for comprehensive evaluations in this field.

We have designed the GenAIPABench benchmark to evaluate genAI-enabled privacy assistants, focusing on diverse tasks in areas such as transparency, user control, data minimization, security, and encryption. The benchmark includes: 1) a selected corpus of privacy policies and regulations; 2) policy-related questions sourced from FAQs, online forums, and direct user inquiries, accompanied by annotated answers; 3) metrics to assess GenAIPA responses for relevance, accuracy, clarity, completeness, and reference; and 4) an evaluator tool that applies these metrics to gauge GenAIPA performance (note that the benchmark's content is only in English).

The main contributions of this paper are:

• The introduction of the first benchmark, to our knowledge, for evaluating GenAIPAs.
• The assessment of three popular genAI chatbots (ChatGPT, BARD, and Bing Chat) using GenAIPABench.
• An analysis of the results, highlighting challenges and opportunities in developing GenAIPAs.

The rest of the paper is structured as follows. Section 2 reviews the state of the art on privacy benchmarking and genAI evaluation. In Section 3, we introduce the benchmark. In Section 4 and Section 5, we detail GenAIPABench's question corpus and metrics, respectively. Section 6 presents the experiments performed using GenAIPABench. In Section 7, we discuss challenges and opportunities. Finally, Section 8 concludes the paper and presents directions for future research.

2 RELATED WORK

Since our benchmark is the first developed to assess the performance of GenAIPAs, we survey previous work on privacy benchmarks and on benchmarking general-purpose genAI systems as well as general-purpose question-answering systems.

Privacy Benchmarks. The growing interest in privacy benchmarks and evaluation frameworks has led to innovative projects to enhance the effectiveness, usability, and transparency of privacy policies and language models. For example, the PrivacyQA Project [32] developed a corpus containing 1,750 QAs on mobile app privacy policies, enriched with over 3,500 expert annotations, to improve user awareness and selective exploration of privacy issues. Its strength lies in the high reliability and precision of its expert-generated responses, though the queries are specifically tailored to the included mobile applications. Similarly, the Usable Privacy Policy Project [33] employs machine learning and NLP to analyze and summarize privacy policies. As a result, the "OPP-115 Corpus" dataset [34] consists of 115 website privacy policies annotated with diverse information types.
genAI Evaluation. Recent research has significantly advanced our understanding of Large Language Models (LLMs). Ge et al. [27] demonstrated on the OpenAGI platform that domain-enhanced, optimized smaller LLMs can surpass larger models through Task Feedback Reinforcement Learning. Kang et al. [28] explored LLMs in understanding user preferences. They noted their competitive performance against traditional Collaborative Filtering methods when fine-tuned, despite initial shortcomings in zero-shot and few-shot scenarios. Chiang and Lee [35] found a strong correlation between LLM and human evaluations in text quality assessments, especially with advanced models like InstructGPT and ChatGPT. Liu et al. [29] introduced AgentBench, a benchmark focusing on LLMs as decision-making agents in interactive settings. Bang et al. [30] examined ChatGPT across various tasks, highlighting its limitations in low-resource and non-Latin languages. Zhang et al. [31] cautioned about the potential inaccuracies in LLM-generated news summaries. Finally, Liu et al. [36] used EvalPlus to reveal previously unnoticed errors in LLM-generated code, emphasizing the need for robust evaluation. Collectively, these studies highlight the importance of diverse and comprehensive metrics for the effective and safe deployment of LLMs.

General Question-answering Benchmarks. Benchmarks like SQuAD [37], TriviaQA [38], and Holistic Evaluation of Language Models (HELM) [39] are pivotal in evaluating LLMs. These benchmarks consist of diverse questions, ranging from factual to complex reasoning tasks, and are typically derived from domains like Wikipedia or news articles. They employ metrics such as accuracy, precision, recall, and F1 score to gauge LLM performance, focusing on answer quality aspects like clarity, relevance, and completeness. The HELM initiative stands out for its multi-metric approach and extensive evaluation across various language models, scenarios, and metrics, aiming for a thorough understanding of these models' capabilities, limitations, and potential risks. TriviaQA introduces a unique challenge by offering over 650,000 question-answer pairs covering a broad spectrum of topics, from science to popular culture. Its distinctiveness lies in its requirement for systems to retrieve and integrate information from diverse sources, as it presents questions independent of specific contexts.

3 THE GENAIPABENCH BENCHMARK

The GenAIPABench benchmark assesses generative AI-based privacy assistants (GenAIPAs), focusing on their ability to aid users in understanding the intricate realm of data privacy, namely: 1) answering questions an individual might have about the privacy policy of an organization/corporation/service; 2) answering questions about data privacy regulations in a specific country/state; and 3) summarizing privacy policies and privacy regulations. GenAIPABench comprises privacy documents, questions (with variations), annotated answers, and an evaluation tool (see Figure 1). The full benchmark, as well as the results obtained evaluating three popular genAI systems (see Section 6), has been made available online at https://anonymous.4open.science/r/GenAIPABench-FAB5/.

Privacy documents: Extracted from web resources, the current version of GenAIPABench includes five privacy policies and
two data regulations with their corresponding manually annotated answers to questions. This dataset equips GenAIPAs with specific content knowledge, facilitating a uniform comparison across various models, regardless of their prior training on these documents.

[Figure 1: A high-level overview of GenAIPABench.]

Privacy questions: Intended to test GenAIPAs' proficiency in interpreting and responding to typical queries about website/service privacy policies and regulations. The dataset contains 32 questions for privacy policies and six questions for privacy regulations, covering crucial topics like data collection, storage, sharing, and user rights (see Section 4). Along with the questions, the benchmark includes a set of paraphrased questions and variations for each.

Metrics: These criteria assess GenAIPAs' effectiveness in answering privacy policy and regulation questions. Metrics include accuracy, relevance, clarity, completeness, and reference, as detailed in Section 5. Human analysts use these metrics to review the responses generated by GenAIPAs and pinpoint areas needing enhancement.

Annotated answers: For the five privacy policies and two regulations included in the corpus, we meticulously curated answers for each benchmark question.
This process involved two experts, each responsible for a different privacy policy, who created answers based on their assigned documents. After the initial answer generation, they conducted a reciprocal review, cross-verifying the responses against the original policies and refining them as needed. This rigorous process guarantees the precision and thoroughness of the annotated answers.

Evaluator: The evaluator automates the generation of prompts to introduce GenAIPAs to the privacy documents and pose the benchmark questions (see Appendix A). If an API is available, it also executes the prompts and handles the collection of answers. The evaluator initializes the GenAIPA with a prompt that includes information about which privacy document to refer to. The evaluator uses three different types of initialization prompts:

(1) Benchmark execution without an accompanying privacy policy document: The evaluator prompts the GenAIPA, explaining that it will ask questions about a specific privacy document (e.g., the privacy policy of Uber).
(2) Benchmark execution with an accompanying privacy document: The initial prompt explains that the evaluator will send the privacy document in segmented portions due to possible token limit constraints of GenAIPAs, followed by questions about the privacy document.
(3) Benchmark execution on a summarized privacy document: The initial prompt requests the GenAIPA to summarize the privacy document (both with and without explicit privacy document introduction).

The benchmark questions are posed following this prompt. The conversation is reset before performing the next type of initialization prompt. The process is repeated multiple times (the number of repetitions is configurable), and a conversation reset is forced after each repetition; the sketch below illustrates this loop.
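To make the evaluator's flow concrete, the following minimal Python sketch shows how the three initialization modes, segmented policy delivery, and per-repetition conversation resets could be orchestrated. The GenAIPA client interface (reset/send), the prompt wording, and the segment size are our illustrative assumptions, not the benchmark's released implementation.

```python
# Illustrative sketch of the evaluator loop described above. The assistant
# interface (reset/send), prompt wording, and CHUNK size are assumptions.
from typing import Iterable, List

CHUNK = 3000  # assumed segment size (characters) to respect token limits

def segments(text: str, size: int = CHUNK) -> List[str]:
    """Split a privacy policy into portions small enough to send."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def run_benchmark(assistant, policy_name: str, policy_text: str,
                  questions: Iterable[str], repetitions: int = 3):
    """Run all three initialization modes, resetting the conversation
    between modes and after each repetition."""
    for rep in range(repetitions):
        # (1) No accompanying policy document: name the policy only.
        assistant.reset()
        assistant.send(f"I will ask questions about the privacy policy of {policy_name}.")
        answers_named = [assistant.send(q) for q in questions]

        # (2) Policy sent in segmented portions, then the questions.
        assistant.reset()
        assistant.send("I will send a privacy policy in parts; answer my questions afterwards.")
        for part in segments(policy_text):
            assistant.send(part)
        answers_with_doc = [assistant.send(q) for q in questions]

        # (3) Ask for a summary of the policy first, then pose the questions.
        assistant.reset()
        assistant.send(f"Summarize the following privacy policy:\n{policy_text[:CHUNK]}")
        answers_summary = [assistant.send(q) for q in questions]

        yield rep, answers_named, answers_with_doc, answers_summary
```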
4 QUESTION CORPUS

We introduce the question corpus that represents privacy questions an individual might ask the GenAIPA.

4.1 Privacy Policy Questions

To evaluate GenAIPA's performance comprehensively, we gathered questions spanning a broad spectrum of privacy-related topics concerning organizational or service privacy policies. These were grounded in established privacy frameworks and guidelines, as well as web resources. Initially, we selected pertinent privacy categories from the existing literature, notably referencing the ISO/IEC 29100:2011 Information Technology - Security Techniques - Privacy framework [40], which offers a detailed privacy management framework encompassing guidelines for privacy impact assessments and policies [41–43]. For each of the eight categories, we identified four questions per category (we limited the number of questions per category to four because of the intensive manual effort needed to generate and validate ground-truth answers for each policy-question combination), with two sets of questions: 1) general user concerns and 2) questions from specific individuals. For the first set, we scoured privacy FAQs on websites with common user questions about data management. We also searched online forums like Reddit and Twitter with keywords from the privacy categories. The questions were generalized and combined to extract three per category. For the second set, we incorporated questions from a user study by Abhilasha et al. [44] that used Amazon Mechanical Turk to collect lay user questions about various app policies. We selected one question per category. Note that some questions include the placeholder [the company], to be replaced by the evaluator with the relevant company name when generating prompts for the assistant.

The final corpus of questions includes not only the original queries but also their paraphrased variants (i.e., reworded versions of the questions that maintain their original meaning). We introduce these variants to evaluate GenAIPAs' understanding and response abilities across diverse linguistic scenarios reflecting, for instance, different privacy knowledge of individuals. We used two approaches to generate the variations. For questions about general user concerns, we used QuillBot (https://www.quillbot.com), an AI tool that automatically restructures sentences and alters words or phrases while preserving their original intent. We automatically generated ten variations per question, ultimately selecting three that showed diversity while guaranteeing that the original meaning was preserved. For example, for the original question "Does [the company] minimize data retention periods?" we generated variants such as "Does [the company] keep data for shorter times?" and "Does [the company] hold onto data for less time?". For each individual-specific question, we manually selected, from the same dataset, three questions from other individuals with the same meaning but different wording (a sketch of one such corpus entry appears below).

In the following, we introduce each category and its questions. We will denote the generalized and individual-specific questions with the subscripts f1/2/3 and u1, respectively.

Transparency (T) refers to how easily users can understand and access information regarding the collection, usage, and sharing of their personal data by companies or organizations. Key elements include the types of data collected, its intended use, and any third-party sharing. Crucial to transparency is using clear, understandable language and the accessibility of privacy policies. To evaluate transparency in privacy policies, the following questions are proposed:

Tf1 "Does the policy outline data collection practices?"
Tf2 "What is [the company]'s stance on government requests for user data?"
Tf3 "How does the policy address potential conflicts of interest in data usage or sharing?"
Tu1 "What sort of data is collected from me while using this?"

Tf1 is a straightforward yes or no question that does not require much explanation or context. Tf2 asks about the company's stance on government requests for user data, which may require some knowledge of privacy regulations and the company's policies. Tf3 addresses potential conflicts of interest in data usage and sharing, a more nuanced and complex issue requiring a deeper understanding of the company's business practices and policies. Finally, Tu1 informs users about the specific data types collected while interacting with the service or product.
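To illustrate how a benchmark question, its paraphrased variants, and the [the company] placeholder can be handled together, consider the following sketch. The dictionary layout and the instantiate helper are hypothetical; the question text is the data-minimization example quoted above.

```python
# Hypothetical representation of one benchmark question and its paraphrased
# variants; this layout is illustrative, not GenAIPABench's on-disk format.
ENTRY = {
    "id": "DMf1",
    "category": "Data Minimization and Purpose Limitation",
    "question": "Does [the company] minimize data retention periods?",
    "paraphrases": [
        "Does [the company] keep data for shorter times?",
        "Does [the company] hold onto data for less time?",
    ],
}

def instantiate(entry: dict, company: str) -> list:
    """Replace the [the company] placeholder for a concrete organization,
    returning the original question followed by its variants."""
    fill = lambda q: q.replace("[the company]", company)
    return [fill(entry["question"])] + [fill(p) for p in entry["paraphrases"]]

print(instantiate(ENTRY, "Uber"))
# ['Does Uber minimize data retention periods?',
#  'Does Uber keep data for shorter times?',
#  'Does Uber hold onto data for less time?']
```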
User Control (UC) refers to the options available to users to manage their personal information and privacy settings. These controls can include the ability to opt out of data collection and sharing, to delete personal data, to access and modify personal data, and to set preferences for how their data is used. To evaluate user control in privacy policies, the following questions are proposed:

UCf1 "Are users given control over their data and privacy settings?"
UCf2 "Are there clear mechanisms for users to request data deletion or access?"
UCf3 "How does [the company] manage consent and withdrawal of consent from users?"
UCu1 "Can I opt out of letting them collect data and still use the app?"

UCf1 checks for the fundamental aspect of a privacy policy: whether it empowers users to manage their data. UCf2 delves into the company's data deletion and access procedures, requiring detailed knowledge of their data management practices. UCf3 explores the complexities of how the company navigates user consent and its revocation, an area influenced by both the company's specific policies and the legal framework it operates within. Finally, UCu1 examines whether the policy allows users to decline data collection while maintaining access to the app, reflecting a critical aspect of user control and consent in privacy practices.

Data Minimization and Purpose Limitation (DM) are key principles safeguarding user privacy. Data minimization restricts the collection, use, and storage of personal data to essentials, mitigating risks and preventing misuse for unrelated purposes. Conversely, purpose limitation confines data use to its original collection intent, giving users more control over their information. To evaluate data minimization and purpose limitation in privacy policies, the following questions are proposed:

DMf1 "Does [the company] minimize data retention periods?"
DMf2 "How is user data anonymized or aggregated to protect individual privacy?"
DMf3 "Are there any restrictions on data processing for specific purposes or contexts?"
DMu1 "How long is my data stored?"

DMf1 inquires if the company minimizes data retention periods, seeking a direct response based on the company's data retention policy. DMf2 delves into the techniques for anonymizing or aggregating user data to safeguard privacy, which may demand technical insight for a comprehensive answer. DMf3 probes into any constraints on data processing tailored to particular purposes or contexts, necessitating a thorough understanding of the company's policies and legal obligations. Finally, DMu1 seeks specific information about the length of time the company retains user data.

Security and Encryption (SE) involves strategies organizations use to safeguard users' personal data from unauthorized access, theft, or cyber-attacks. This includes employing encryption for sensitive data like usernames, passwords, and credit card details, and using secure communication protocols to avert data interception. Additionally, organizations often establish policies for managing security breaches, encompassing user notification, breach investigation, and preventive measures for future incidents. To assess security and encryption in privacy policies, consider these questions:

SEf1 "Are user communications encrypted end-to-end?"
SEf2 "What measures are in place to prevent unauthorized access to user data?"
SEf3 "How are data breaches or security incidents handled and communicated to users?"
SEu1 "How well secured is my private information?"

SEf1 checks if the company uses end-to-end encryption for user communications, requiring a simple yes or no answer. SEf2 inquires about specific security measures against unauthorized data access, calling for a detailed response about the company's security protocols. SEf3 explores the handling and communication of data breaches or security incidents, necessitating an understanding of the company's response strategy and relevant legal/regulatory frameworks. Finally, SEu1 seeks a general assessment of the overall security measures in place to protect users' private information.

Privacy by Design and Innovation (PbD) embodies a data protection strategy that integrates privacy considerations throughout all stages of product or service development. This method involves embedding privacy-enhancing features, like data minimization, purpose limitation, and robust security measures, from the outset. The aim is to proactively mitigate privacy risks and ensure default data protection. Moreover, PbD advocates continuous monitoring and updating of privacy practices to address emerging privacy concerns. To assess PbD in privacy policies, consider these questions:

PbDf1 "Does [the company] conduct privacy impact assessments?"
PbDf2 "Are there privacy-enhancing technologies implemented, such as differential privacy?"
PbDf3 "Does [the company] use automated decision-making or profiling, and if so, how does it impact user privacy?"
PbDu1 "What sort of analytics will my data be subjected to?"

PbDf1 is a straightforward yes-or-no question about whether the company conducts privacy impact assessments, a standard procedure in data privacy. PbDf2 involves the concept of differential privacy, a more advanced and technical area that requires a nuanced understanding of how to balance data utility and privacy. PbDf3 is about the complex topics of automated decision-making and profiling, which demand a deep technical understanding and the ability to assess ethical and privacy implications. Finally, PbDu1 seeks to understand how the company uses, analyzes, and potentially benefits from user data while also considering the privacy implications of such analytics.

Responsiveness and Communication (RC) pertains to how organizations interact with users regarding privacy matters. This encompasses providing transparent, easily understandable information on data practices and swiftly addressing user privacy queries and concerns. To assess these aspects in privacy policies, consider the following questions:

RCf1 "Is the privacy policy regularly updated and communicated to users?"
RCf2 "Is there a process in place to address user privacy complaints?"
RCf3 "Does [the company] publish transparency reports detailing government data requests, surveillance, or law enforcement interactions?"
RCu1 "Has there ever been a security breach?"

RCf1 is straightforward, seeking a yes or no answer regarding the communication of privacy policy updates. RCf2 delves into the company's mechanisms for handling privacy complaints, requiring an understanding of their specific procedures. RCf3, more intricate, probes into the company's transparency regarding governmental data requests and legal interactions, demanding insight into their commitment to transparency and legal compliance. Finally, RCu1 is highly relevant in the context of how a company manages and communicates about security incidents, a critical aspect of user trust and data protection.

Accessibility, Education, and Empowerment (AEE) focuses on ensuring that privacy policies are user-friendly and empowering. Policies should be accessible, including to those with disabilities, through various formats like audio or video. They need to be in plain language for easy comprehension, explaining key concepts and terms clearly. It is crucial to educate users about their privacy rights and the implications of data sharing. Policies should guide users on how to exercise their privacy rights and control their personal data. Empowerment is key, providing users with meaningful choices in a straightforward manner. The following questions are proposed to evaluate these aspects of privacy policies:

AEEf1 "Are employees trained on data privacy best practices and handling sensitive information?"
AEEf2 "How are user data privacy preferences managed across different devices or platforms?"
AEEf3 "Does [the company] offer user-friendly resources, such as tutorials or guides, to help users effectively manage their privacy settings and understand their data rights?"
AEEu1 "Does it share any data with a third party?"

AEEf1 is straightforward, asking whether the company ensures its employees are trained in data privacy. AEEf2 inquires about managing user privacy across various platforms, requiring an understanding of integrated privacy systems. AEEf3 delves into the availability of educational resources for users, indicating the company's commitment to user education in privacy. AEEu1 directly addresses the transparency of the company's data-sharing policies, which is a fundamental aspect of user trust and privacy management.

Compliance and Accountability (CA) are critical for organizations to ensure adherence to privacy laws and standards. This includes conducting regular audits, performing data protection impact assessments, and appointing a Data Protection Officer (DPO) to oversee privacy matters. Accountability extends to taking responsibility for privacy breaches or violations and providing remedies to affected parties. The proposed questions for evaluating compliance and accountability in privacy policies are:

CAf1 "Does the policy comply with applicable privacy laws and regulations?"
CAf2 "What steps are taken to ensure data processors and subprocessors adhere to privacy requirements?"
CAf3 "Does [the company] have a process in place for reporting and addressing privacy violations or non-compliance issues, both internally and with third-party vendors?"
CAu1 "Do I have any rights as far as whether I want my account info deleted?"

CAf1 assesses the company's alignment with privacy laws, a fundamental aspect of privacy management. CAf2 explores how the company ensures that its data processors and subprocessors comply with privacy standards, reflecting an advanced understanding of third-party risk management. CAf3 inquires about the procedures for handling privacy violations, demonstrating the depth of the company's commitment to accountability. Finally, CAu1 directly relates to the "right to be forgotten," a key provision of regulations such as GDPR, which empowers individuals to request the deletion of their personal data under certain circumstances.
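For reference, the eight categories and their 32 question identifiers introduced above can be collected into a single registry that an evaluator could iterate over. The mapping below is an illustrative assembly of the identifiers listed in this subsection, not an artifact shipped with the benchmark.

```python
# Question identifiers per category, assembled from Section 4.1; the
# registry structure itself is illustrative.
CATEGORIES = {
    "T":   ["Tf1", "Tf2", "Tf3", "Tu1"],          # Transparency
    "UC":  ["UCf1", "UCf2", "UCf3", "UCu1"],      # User Control
    "DM":  ["DMf1", "DMf2", "DMf3", "DMu1"],      # Data Minimization and Purpose Limitation
    "SE":  ["SEf1", "SEf2", "SEf3", "SEu1"],      # Security and Encryption
    "PbD": ["PbDf1", "PbDf2", "PbDf3", "PbDu1"],  # Privacy by Design and Innovation
    "RC":  ["RCf1", "RCf2", "RCf3", "RCu1"],      # Responsiveness and Communication
    "AEE": ["AEEf1", "AEEf2", "AEEf3", "AEEu1"],  # Accessibility, Education, and Empowerment
    "CA":  ["CAf1", "CAf2", "CAf3", "CAu1"],      # Compliance and Accountability
}

# Sanity check: eight categories of four questions match the 32 policy
# questions stated in Section 3.
assert sum(len(v) for v in CATEGORIES.values()) == 32
```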
4.2 Privacy Regulation Questions

GenAIPABench includes questions to evaluate the performance of the GenAIPA in helping users understand privacy and data protection regulations such as the GDPR or the CCPA. We compiled and generalized the following questions, extracted from different sources [45, 46], that aim to cover a range of topics, from the scope and applicability of the regulations to specific requirements and rights:

PR1 "Who must comply with the [regulation]?"
PR2 "What are the [regulation] fines?"
PR3 "How do I comply with the [regulation]?"
PR4 "Does the [regulation] require encryption?"
PR5 "What is personal information and sensitive personal information under the [regulation]?"
PR6 "What rights do I have under the [regulation]?"

Note that the evaluator will replace the placeholder [regulation] with the specific privacy regulation to be evaluated from the privacy document dataset (e.g., GDPR, CCPA, LGPD, etc.). Like the prior question corpus, the benchmark includes question variations through paraphrasing for comprehensive evaluation.

5 METRICS

To assess the GenAIPA's response quality, we developed metrics integrating five principal elements anchored in privacy policy assessment. These metrics draw on resources like the Future of Privacy Forum's report and other key studies [47–49], which offer valuable guidance on designing and evaluating privacy policies.

Relevance (Mrel) gauges the alignment of an answer with the user's question, a critical factor for user satisfaction in conversational agents [50]. Relevant responses empower users to make knowledgeable decisions about their data privacy, whereas irrelevant answers may cause frustration and dissatisfaction, obstructing users' comprehension of their rights and responsibilities [51].

Accuracy (Macc) evaluates the correctness of information provided by AI systems, crucial for fostering trust and acceptance, as emphasized in [52]. Inaccurate or misleading information can lead to poor decisions, adversely affecting user perception of the system. Specifically, incorrect responses by a GenAIPA can result in misguided actions regarding privacy, such as unwisely continuing to use a service perceived as less intrusive. Moreover, recognizing inaccuracies can diminish the perceived reliability and trustworthiness of the system, impacting user confidence [53].

Clarity (Mcla) assesses the effectiveness of communication, focusing on clear and coherent responses, as per Grice's principles [54]. It emphasizes the importance of easily understandable and coherent responses for informed decision-making. A significant challenge with privacy policies, as noted in [55], is their complexity due to legal and technical jargon. GenAIPAs should strive for simplicity, avoid ambiguity and unnecessary technical terms, and provide clear explanations. Tailoring responses to the user's comprehension level is also vital. By ensuring clarity, GenAIPAs improve user satisfaction and guarantee effective information transmission.

Completeness (Mcom) measures if an answer fully addresses the user's question [56]. Responses must encompass all necessary aspects and details, avoiding the need for multiple follow-up questions. A complete answer should thoroughly cover the topic, provide accurate and exhaustive information, and consider any related issues pertinent to the user's query. Inadequate or flawed information can lead to misinformed decisions or a lack of understanding regarding privacy options, resulting in user frustration and diminished trust in the AI system [53]. To ensure completeness, GenAIPAs must understand the context of queries and tailor responses to meet specific user needs. This approach not only boosts user satisfaction but also streamlines communication.

Reference (Mref) evaluates the inclusion of proper citations or mentions of relevant policy sections in responses, crucial for transparency and credibility in legal or policy contexts, as underscored in [57]. AI systems, when applicable, should incorporate accurate citations or references to pertinent policy sections. This practice bolsters the response's accuracy and completeness and enhances user trust by providing transparency and credibility. Proper citation entails using the correct legal or policy language, including relevant section numbers and any other information essential for comprehending the legal or policy implications of the user's query. By integrating appropriate references in their responses, GenAIPAs can assure users of the accuracy and compliance of their responses with relevant laws or policies.

Metric Evaluation. The proposed evaluation method assesses each response across five metrics on a scale from +1 to -1:

• Mrel: +1 for a relevant response, +0.5 for a partially relevant response, and -1 for a not relevant response.
• Macc: +1 for an entirely correct response, +0.5 for a partially correct response, and -1 for an incorrect response.
• Mcla: +1 for a clear and easy-to-understand response, +0.5 for a somewhat clear response that could be improved, and -1 for a confusing or hard-to-comprehend response.
• Mcom: +1 for a comprehensive response, +0.5 for a somewhat complete response lacking some minor information, and -1 for an incomplete response missing important details.
• Mref: +1 for a correctly cited relevant policy section, +0.5 for a mentioned section without explicitly citing it, and -1 for an incorrect reference.

Note that it is possible that a specific privacy document (e.g., a specific privacy policy) might lack the information needed to answer a benchmark question. In that case, the desired answer should be that the document does not contain enough information to answer the question; hence, any mention of a policy section would score a -1 for Mref.

We propose to aggregate these into an overall quality metric (Mall) by calculating total positive/partial points (Mall+) and total negative points (Mall-) separately. The overall score is normalized using the equation:

Mall = (Current Score - Minimum Score) / (Maximum Score - Minimum Score) × 9 + 1

where the Minimum Score is -5 and the Maximum Score is 5. This approach highlights the potential negative impact of answers on privacy decision-making.
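Given the stated per-metric scale (+1, +0.5, -1) and the bounds of -5 and 5 over the five metrics, the normalization above maps a response's aggregate score into the range [1, 10]. A minimal sketch, assuming per-metric scores have already been assigned by a human analyst (the function name and input layout are ours):

```python
# Minimal sketch of the overall-score computation described above; the
# function name is ours, but the formula and bounds follow the text.
MIN_SCORE, MAX_SCORE = -5.0, 5.0  # five metrics, each in {-1, +0.5, +1}

def overall_score(metric_scores: dict) -> float:
    """Aggregate the five per-metric scores and normalize to [1, 10]."""
    assert set(metric_scores) == {"rel", "acc", "cla", "com", "ref"}
    current = sum(metric_scores.values())
    return (current - MIN_SCORE) / (MAX_SCORE - MIN_SCORE) * 9 + 1

# Example: fully relevant, accurate, and clear, partially complete,
# but with an incorrect reference.
print(overall_score({"rel": 1, "acc": 1, "cla": 1, "com": 0.5, "ref": -1}))
# -> 7.75
```

Because an incorrect reference or answer contributes -1 rather than 0, low scores pull the normalized result sharply toward the bottom of the range, which reflects the stated goal of penalizing answers that could harm privacy decision-making.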
6 EXPERIMENTS

We assessed three leading generative AI systems, ChatGPT-4 [15], Bard (https://bard.google.com/), and BingAI (https://chat.openai.com/), using GenAIPABench. ChatGPT-4, accessed via OpenAI's API (https://platform.openai.com/), and Bing AI, both based on GPT-4, differ in their fine-tuning and deployment, influencing their functionalities and user interactions. Bard and BingAI were accessed through their websites, as no official APIs were available. Our evaluation analyzed five privacy policies (Uber, Spotify, Airbnb, Facebook, Twitter) and two major privacy regulations (GDPR, CCPA).

We included statistics about the selected privacy policies in Table 1. The policy's unique word frequency [58] indicates whether complex and/or specialized language is used, which might challenge LLMs if it is beyond their training. The estimated reading time (computed as the number of words multiplied by the average time in minutes required per word) represents the length of the document, which might impact response coherence and relevance. The reading level (Flesch-Kincaid Grade Level [59]) metric assesses text difficulty, indicating the education level needed for understanding. Finally, although long, connective-word-rich sentences can confuse humans, they might help GenAIPAs understand context and logic. Additionally, we note that out of 32 questions, 15 (Facebook), 14 (Twitter), 13 (Airbnb), and 8 (Spotify, Uber) were on content not explicitly stated in the policy. (A sketch of how such statistics can be computed appears after Table 1.)

Table 1: Privacy policies analyzed.

Policy     Unique Words (ratio)   Reading Time (min)   Reading Level (FK grade)   Connective Words (ratio)
Twitter    0.21                   21                   10.3                       0.04
Spotify    0.16                   32                   12.4                       0.03
Uber       0.16                   37                   11.9                       0.04
Airbnb     0.19                   27                   14.1                       0.04
Facebook   0.18                   20                   11.8                       0.05
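The statistics in Table 1 can be approximated with standard text tooling. The sketch below is a hedged illustration: the connective-word list, the assumed reading speed, and the use of the third-party textstat package for the Flesch-Kincaid grade are our assumptions, not the paper's exact methodology.

```python
# Sketch of how Table 1's statistics could be reproduced; the word list,
# reading speed, and textstat usage are assumptions.
import re
import textstat  # third-party: pip install textstat

CONNECTIVES = {"and", "but", "because", "however", "therefore", "moreover"}  # assumed list
MINUTES_PER_WORD = 1 / 238  # assumed average adult reading speed (~238 wpm)

def policy_stats(text: str) -> dict:
    """Compute unique-word frequency, reading time (minutes), Flesch-Kincaid
    grade level, and connective-word frequency for one policy."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "unique_words": len(set(words)) / len(words),
        "reading_time": len(words) * MINUTES_PER_WORD,
        "reading_level": textstat.flesch_kincaid_grade(text),
        "connective_words": sum(w in CONNECTIVES for w in words) / len(words),
    }
```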
Additionally, we note that out of 32 questions, 15 (Facebook), 14 (Twitter), 5 https://bard.google.com/ 6 https://chat.openai.com/ 7 https://platform.openai.com/ 7 Proceedings on Privacy Enhancing Technologies 2024(3) Aamir Hamid, Hemanth Reddy Samidi, Tim Finin, Primal Pappachan, and Roberto Yus (a) Overall score per policy. (b) Average scores for all policies across metrics. Figure 2: Performance of systems when the privacy policy is explicitly shared. Bard: Bard (see Figure 2a) frequently registered minimum scores of 10 across various question categories and very occasionally scored higher (it peaked at around 100 for the combination of the Spotify policy and user-generated questions). The median scores provide further insights into its tendency to gravitate towards midrange values for all questions, evidenced by median scores like 50.5 (Uber) and 64 (Twitter). User-generated questions yielded higher median scores than FAQ questions, with the Bard policy excelling in Spotify-related questions at a median of 95.5. Additionally, the Twitter policy outperformed others, with a median score of 64.4 for FAQ questions. Bard’s 1st quartile performance for questions often struggled, while its 3rd quartile results indicated that even its top performance strata seldom achieved peak scores. Figure 2b shows that Bard’s relevance was high across the board, with scores largely hovering around 1, though it faced challenges with 𝑆𝐸 𝑓 3 at -0.3. The clarity metric also displayed consistency, mostly remaining close to 0.9, but 𝑆𝐸 𝑓 3 presented a deviation with a score of -0.3. Accuracy for Bard varied considerably: it showed robustness in questions like 𝑇𝑢1 and 𝐷𝑀𝑢1 with scores at 1 but dipped to -1 for metrics like 𝑆𝐸 𝑓 2 and 𝑃𝐷 𝑓 1 . Regarding completeness, Bard oscillated between a high of 1 (𝑇𝑢1 ) to a low of -0.7 (𝑈𝐶𝑢1 ). The reference domain was particularly an issue for Bard, with scores mainly revolving around -1 across all the questions showcasing the least favourable outcome. Additionally, Bard’s performance was consistently moderate across a range of policies, with a noticeable dip in performance when dealing with Uber’s policy, which had less non-existent content but required more reading time. In contrast, Bard excelled with Facebook’s policy, characterized by less reading time and a higher proportion of non-existent content. BingAI: Of the three systems, BingAI consistently demonstrated superior performance metrics (Figure 2a). Its score spectrum was high, frequently attaining the maximum score of 100 across various difficulties and policies, seldom dropping below 20 for a specific case (Uber, FAQ-sourced question). This performance was equally evident in the median values, where BingAI displayed high consistency even for all questions. Noteworthy were scores of 100 (Twitter, FAQ questions) and 95.5 (Airbnb, user-generated questions). FAQ-sourced questions yielded lower median scores than user-generated questions, with BingAI excelling in Airbnb-related privacy questions at a median of 100. Additionally, in the FAQ group, the Twitter policy outperformed others with a median score of 100 for user-generated questions. The quartile analysis reinforced 8 GenAIPABench: A Benchmark for Generative AI-based Privacy Assistants Proceedings on Privacy Enhancing Technologies 2024(3) its robustness, with the 1st quartile values indicating high baseline performance and the 3rd quartile metrics often culminating near or at 100. 
BingAI: Of the three systems, BingAI consistently demonstrated superior performance (Figure 2a). Its score spectrum was high, frequently attaining the maximum score of 100 across various difficulties and policies and seldom dropping below 20, which occurred for a single case (Uber, FAQ-sourced question). This performance was equally evident in the median values, where BingAI displayed high consistency across all questions. Noteworthy were scores of 100 (Twitter, FAQ questions) and 95.5 (Airbnb, user-generated questions). FAQ-sourced questions yielded lower median scores than user-generated questions, with BingAI excelling on Airbnb-related privacy questions at a median of 100. Additionally, in the FAQ group, the Twitter policy outperformed the others with a median score of 100. The quartile analysis reinforced its robustness, with first-quartile values indicating high baseline performance and third-quartile metrics often culminating near or at 100. BingAI's performance also showed consistently high values across several metrics (Figure 2b). Its relevance and clarity stood out, surpassing the 0.8 mark, although DM_f3 was an outlier in relevance with a score of -0.2. BingAI's accuracy demonstrated consistent strength, frequently achieving a score of 1, though significant challenges were noted in DM_f3 and SE_u1, with scores of -1 and 0, respectively. Regarding completeness, BingAI's metrics were predominantly positive, with a substantial number of questions securing a score of 0.9 or higher, but a noticeable decline was observed in DM_f3 at -1. Referencing for BingAI showed variance but managed to avoid deeply negative scores. BingAI consistently outperformed the other systems across different policies and questions (even for questions on content not explicitly stated in the policy). Notably, a slight decline in BingAI's effectiveness was observed with Spotify's policy, characterized by lower non-existent content yet demanding more reading time. Conversely, BingAI excelled with Twitter's policy, which required less reading time and contained a higher amount of non-existent content.
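The medians and quartiles discussed throughout this section can be reproduced from raw per-question scores as in the sketch below; the score lists are invented placeholders for illustration, not the values measured in our study.

```python
import numpy as np

# Hypothetical overall scores (10-100) per question for one system;
# placeholder values only, not results from this paper.
scores_by_policy = {
    "Facebook": [100, 97, 91, 88, 76, 100, 64, 95],
    "Uber": [82, 10, 55, 91, 47, 73, 100, 39],
}

for policy, scores in scores_by_policy.items():
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    print(f"{policy}: min={min(scores)} Q1={q1:.2f} median={median:.2f} "
          f"Q3={q3:.2f} max={max(scores)} IQR={q3 - q1:.2f}")
```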
6.2 Assessing Robustness through Paraphrased Questions

The main goal of this experiment is to evaluate the robustness and consistency of the systems in providing similar responses to paraphrased variants of the questions. The results (see Figure 3a) show that ChatGPT-4 displayed consistent strengths, BingAI excelled in certain areas but showed referencing challenges, and Bard presented a mix of highs and noticeable lows.

ChatGPT-4: ChatGPT-4 exhibited consistent performance across most policies (Figure 3a), irrespective of whether questions came from users or FAQs. With Spotify, there was a decline in performance from FAQ questions to user-generated questions, from a median score of 86.5 to 76.3. Interestingly, the third-quartile score remained around 100 for all questions, indicating that while the central tendency was lower, a subset of responses still reached top performance. The Twitter and Facebook cases showcased strong performance, with median scores not dipping below 70 across all questions. For Airbnb, ChatGPT-4 answered FAQ and user-generated questions with high proficiency, achieving medians of 95.1 and 91, respectively. For relevance, scores ranged between 0 and 1, showing high consistency in areas such as SE_f1, SE_f3, PD_f1, and PD_f2, among others (Figure 3b). Clarity ratings showed a similar tendency, with the model performing excellently on queries like SE_f1 and UC_u1, scoring a perfect 1, while encountering challenges in RC_f2 and UC_f2. Accuracy results were more variable, with instances like DM_u1, SE_f3, and AEE_f1 scoring near or at the top, juxtaposed against scores as low as -0.7 for DM_f3. Completeness spanned from high performance in SE_f3 to lows in RC_f2, UC_f3, and DM_f3. Reference scores showed strong points, such as 0.9 for UC_f1 and UC_f2, but also revealed areas for improvement, with scores like -0.7 for T_f1. Additionally, ChatGPT-4 demonstrated strong performance across policies, particularly excelling on policies like Facebook's, which had a higher number of questions on content not explicitly stated in the policy and required shorter reading times, achieving a median score near 100. However, there was a significant drop in performance with Uber's policy, characterized by a longer reading duration.

Figure 3: Performance of systems for paraphrased questions when the privacy policy is explicitly shared. (a) Overall score per policy. (b) Average scores for all policies across metrics.

Bard: Bard's performance varied across policies (Figure 3a). While FAQ-sourced and user-generated questions achieved median scores of 60.4 and 66.85, respectively, for Spotify, a significant drop to 59.5 was observed for user-generated questions. The minimum scores for SE_f3 and PD_f2, among others, were as low as 38.8 and 41.5, respectively, indicating challenging queries for Bard. Uber-related privacy questions posed difficulties across both categories, with FAQ-sourced questions having a median of 61.75 but a minimum score of 1. Both Twitter and Facebook had mid-range median scores, with user-sourced questions on the Facebook policy yielding a consistent median and third quartile, at 6 and 7.75, respectively. Airbnb responses were relatively stable, with scores fluctuating between 46 and 70. Analyzing Bard's performance across metrics (see Figure 3b), we observe that relevance ranged from scores as high as 1 for UC_f1, UC_f2, and UC_f3 to as low as 0.2 for T_f3 and PD_f1. Clarity was similarly distributed, with certain questions like UC_u1 receiving high scores of 1, while others, such as DM_f3, only achieved 0.1. Accuracy proved to be a challenging area, with the lowest score being -1 for several questions, including SE_f3, PD_f1, and AEE_f1; however, the model managed to score 0.8 for T_u1. Completeness ranged from a notable 0.9 for SE_f1 to less promising results like -0.3 for SE_f2. The reference metric had its highs and lows, with a top score of 1 for T_f1 and several instances of -1, indicating inconsistency in this domain. Finally, Bard mirrored ChatGPT-4's behaviour on questions about content not explicitly stated in the policy, performing well especially on Facebook's policy, while a performance reduction was noted with Uber's policy (the longest of the policies used).
BingAI: BingAI exhibited a mix of outstanding and lacklustre performances (Figure 3a). For Airbnb, it achieved perfect medians of 100 for all questions, but the range was wide, from 10 to 100. The Uber policy was challenging, especially for user-generated questions, with a median of just 68.5 and a narrow range, indicating a uniform struggle. The Twitter and Facebook policies saw robust results, with medians consistently above 84.5. Performance was also notable for the Twitter and Facebook policies on FAQ-sourced questions, where the system reached an almost perfect median score of 93.25. BingAI demonstrated great performance in relevance, particularly for questions like T_f2, PD_f2, UC_f1, and SE_f1, all scoring a perfect 1, but also showed weaker areas, with scores like -0.2 for DM_f3 (Figure 3b). Clarity maintained a consistent trend, with scores predominantly leaning toward the higher end. For accuracy, BingAI had top-performing scores in areas like AEE_f1, PD_f1, and PD_f2, but faltered in others, achieving -1 for DM_f3. In terms of completeness, it exhibited excellence in SE_f1 and SE_f3, both scoring 1, but saw a drop in areas like DM_f3. Reference scores varied, ranging from 1 for PD_f1 and PD_f2 to lows of -1 in areas such as DM_f3. Additionally, BingAI consistently showed strong performance across all policies, although, as for the other systems, there was a slight dip in its performance for Uber's policy.

6.3 Assessing the Ability to Recall Learned Privacy Policy Knowledge

The purpose of this experiment is to assess the performance of the systems when the privacy policy is not given explicitly. Hence, the system has to rely on the information it obtained when it was trained or retrieve the policy from online sources (if the system supports it). In summary, the results revealed that BingAI consistently showed high proficiency across policies (with high reliability and consistency scores but low referencing scores). Bard displayed broader variability and pronounced inconsistencies (e.g., low referencing scores and marked variability in accuracy and completeness). ChatGPT-4's performance was a blend of high scores for certain combinations of questions and policies, counterbalanced by stark inconsistencies across all criteria.

ChatGPT-4: We observe that for Uber's policy, ChatGPT-4's performance ranges between scores of 10 and 100 in both question categories (Figure 4a). Despite this variability, a strong median of 91.1 indicates its overall competence, and the consistency was further emphasized by the narrow interquartile range (82 to 95.5). On the Spotify policy, all categories saw the model reach its zenith with maximum scores of 100. For Twitter and Airbnb, the medians (82 and 95.5, respectively, for user-generated questions) were strong, and the compact interquartile ranges again indicated reliable performance. Facebook's policy showed a similar trend, with a median above 70.75. ChatGPT-4 predominantly had scores close to 1 in relevance across both question categories, with only a slight dip to 0.2 for RC_f2 (Figure 4b). Clarity remained fairly consistent, with many scores between 0.8 and 1, but there was a notable drop to 0.1 for RC_f2. In terms of accuracy, while ChatGPT-4 generally performed well on user-generated questions, there was a clear reduction in its performance on FAQ questions, dropping as low as -0.7 and 0.3 for DM_f2 and RC_f2, respectively. Completeness scores demonstrated a similar trend, with higher scores on user-generated questions and diminishing results on FAQ questions, the lowest being -0.2 for RC_f2. Referencing, however, remained relatively low throughout, with a peak score of 1 for PD_e and a dip to -0.6 for several FAQ questions. Finally, ChatGPT-4 demonstrated improved performance in handling policies with a higher proportion of questions on content not explicitly stated in the policy, particularly evident in the SE_f3 question for Twitter's policy.

Figure 4: Performance of systems when the policy is not explicitly shared. (a) Overall score per policy. (b) Average scores for all policies across metrics.
Bard: Bard displayed wider variability than ChatGPT-4 (Figure 4a). On the Uber policy, both question categories witnessed a spread from 10 to 100, suggesting more variance in its responses, and its broader interquartile range compared to ChatGPT-4 underlined this. Twitter's performance indicated considerable inconsistency, with the lowest score being 10 and Q1 also at 10, meaning that 25% of responses were at the floor of the scoring metric. Airbnb's results further echoed this inconsistency, with both the minimum and Q1 at similarly low values. However, the Facebook policy showed tighter interquartile ranges in both question categories, hinting at better consistency. Bard maintained high relevance, predominantly fluctuating between 0.4 and 1 across both question categories, but saw a drastic decline for DM_f3, scoring 0.2 (Figure 4b). Its clarity mostly mirrored ChatGPT-4's pattern, though it had a steeper drop on FAQ questions, reaching as low as 0.1 for DM_f3. Accuracy exhibited significant variability, with scores ranging from a high of 0.7 on user-generated questions like T_u1 to a troubling -0.8 for UC_u1; for FAQ questions, the same variability was seen, with 0.8 for TU_f3 as a high and -1 for SE_f3 as a low. Completeness varied considerably as well, peaking at 0.9 for SE_f1 and plummeting to -0.3 for UC_u1. Reference scores were particularly notable for Bard due to their consistently negative values, dropping as low as -1 for multiple questions, suggesting possible issues with citation or source integrity.

BingAI: BingAI showcased a peculiar trend (Figure 4a). For Facebook's policy, it ranged from 37 to a perfect 100, with a commendable median of 95.5; yet the FAQ questions revealed stark contrasts from 10 to 100, with the median dropping to 88.7. This disparity between FAQ and user-oriented questions was further underscored by the shift in interquartile range from 37-100 for FAQ questions to 46-86.5 for user-oriented questions. Similarly, both question categories for Uber reflected pronounced inconsistency, with both the minimum and 25% of scores (Q1) languishing at 10, while the upper quartile (Q3) stretched to 82. Notably, for Airbnb, BingAI achieved a median of 84.25, indicating generally strong responses, though its minimum of 10 demonstrates the presence of some extreme outliers. BingAI's relevance started strong, reaching 1 in categories like T_f2, UC_f1, and PD_f1, but faltered for DM_f2, which scored -0.2, as shown in Figure 4b. Clarity remained relatively stable, with many scores hovering in the 0.6 to 1 range. However, its accuracy was inconsistent, dropping to -1 for DM_f3 but redeeming itself with scores like 1 for PD_f1. Completeness scores were highly variable, from a full score of 1 for PD_f1 to a concerning -1 for DM_f2. As for referencing, it scored negatively for most of the questions. Finally, BingAI exhibited its strongest performance in processing policies with a higher proportion of questions on content not explicitly stated in the policy.
6.4 Assessing the Quality of Responses to Privacy Regulation Questions

This experiment examines the quality of the responses generated by the systems for questions concerning the CCPA and GDPR data protection regulations. Figure 5 shows the results obtained after executing the privacy regulation benchmark for both regulations. Both ChatGPT-4 and BingAI excelled in answering privacy regulation queries, with ChatGPT-4 consistently achieving top scores across every metric. While Bard demonstrated good performance, it consistently struggled to provide accurate references, placing it behind the other two models.

Figure 5: Scores for privacy regulation questions.

For all six questions (i.e., PR_1 to PR_6), ChatGPT-4 and BingAI responses were accurate, relevant, and comprehensive, and included correct references to regulation details. BingAI's scores took a hit due to its tendency to refer to online articles for its information rather than directly citing the articles of the GDPR and CCPA, as ChatGPT-4 consistently did. Bard's responses for questions PR_1, PR_2, and PR_5, on the other hand, scored 0.5 for completeness, as they lacked some details; moreover, its responses scored -1 on the reference metric across all questions for both the CCPA and GDPR.
7 DISCUSSION

While, to the best of the authors' knowledge, no specific GenAIPA has been proposed yet, our experiments indicate that current general-purpose genAI models can be effective tools when confronted with privacy-related questions. The three evaluated systems (i.e., Bard, BingAI, and ChatGPT-4) demonstrated commendable capabilities when evaluated with GenAIPABench. When addressing questions related to an organization's privacy policies, all systems obtained reasonably high scores for all questions. In particular, BingAI emerged as the most consistent performer, demonstrating superior outcomes across most metrics. This is particularly interesting since BingAI and ChatGPT-4 use the same underlying model but are different chatbots, which highlights the importance of developing robust layers on top of the genAI models that promote, for instance, appropriate referencing.

The three systems also showed a strong understanding of the two privacy regulations evaluated, generally scoring higher than for the privacy policies. This might be because there has been more discussion online about data privacy regulations than about specific privacy policies.

However, the three systems also encountered challenges in dealing with GenAIPABench. First, when paraphrased versions of the questions were used, the performance of the systems was reduced (particularly that of BingAI). This inconsistency highlights that some systems might expect users to express their questions in specific ways, which would be an issue given the differences in the perception of privacy among the general public [60].
The paraphrased questions for FAQs were generated automatically and hence, in a few cases, might diverge from conventional grammatical norms. We generated ten paraphrased versions of each question, ensuring that at least half met acceptable grammatical standards. We decided to keep the rest to mirror the way users formulate their queries on digital platforms, particularly search engines, as it is common for users to phrase their queries in ways that do not meet conventional expectations of grammatical correctness [61].

Second, of particular concern was a disconnect between the relevance and clarity of generated responses and their factual accuracy and completeness. Substantially incorrect responses were often presented coherently and relevantly, posing the risk of misleading users. Third, we observed frequent issues concerning references, which often point to outdated or incorrect data from the model's training set rather than to the most recent privacy policy information (even when the policy was explicitly provided to the systems).

We also explore which GenAIPABench questions seem to be easier or harder for current genAI technology. To this end, we average the performance obtained for each question and its paraphrased versions across all systems and policies. Table 2 summarizes the results, listing the five questions with the highest and lowest scores. In particular, we note that questions whose content is explicitly defined in the policy generally tend to be easier for the systems. Notable exceptions were RC_u1, which obtained a high score despite all the policies lacking information about whether any privacy breach had occurred, and CA_f1, which obtained a lower score in spite of all policies mentioning their compliance with the GDPR/CCPA and/or other regulations. This might be explained by the tendency of genAI systems to create non-existent content. It is also worth noting that questions about transparency and those explicitly posed by individual users seem easier for the technology, especially when compared to questions on compliance and accountability topics, which obtained lower scores in general.

Table 2: Top 5 highest (a) and lowest (b) scoring questions across systems (Missing Content lists the policies that do not cover the question: T = Twitter, F = Facebook, S = Spotify, A = Airbnb, U = Uber).

(a) Highest
Q        Score   Missing Content
T_u1     8.99    None
DM_u1    8.58    None
T_f2     8.57    None
CA_u1    8.40    None
RC_u1    8.4     All

(b) Lowest
Q        Score   Missing Content
DM_f3    4.23    A, U, S
T_f3     5.2     F, S, A
AEE_f2   5.91    T, F
CA_f3    5.95    T, F, S, A
CA_f1    6.33    None
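As an illustration of the aggregation behind Table 2, the following sketch averages each question's normalized score (including its paraphrased variants) over all systems and policies and ranks the questions. The records shown are hypothetical placeholders, not measured data.

```python
from collections import defaultdict

# (question_id, system, policy, normalized_score) records;
# the values below are hypothetical placeholders.
records = [
    ("T_u1", "ChatGPT-4", "Facebook", 9.4),
    ("T_u1", "Bard", "Facebook", 8.2),
    ("CA_f1", "ChatGPT-4", "Uber", 6.9),
    ("CA_f1", "BingAI", "Uber", 5.8),
]

per_question = defaultdict(list)
for question, _system, _policy, score in records:
    per_question[question].append(score)

averages = {q: sum(s) / len(s) for q, s in per_question.items()}
for q, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{q}: {avg:.2f}")
```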
We also analyze which privacy policies seem easier or harder for current genAI technology to process. To this end, we average the scores obtained across all metrics, questions (including their paraphrased versions), and systems for each privacy policy. We observe that the highest scores are obtained for the privacy policies of Facebook and Airbnb and the lowest for the privacy policies of Uber and Spotify. Note that while the percentages of unique and connective words are similar across policies, the main differences between the policies are observed with respect to their length and required reading level. The ranking obtained points to the length of the policy being a decisive factor in performance: the highest and lowest scores are obtained for the shortest and longest policies (Facebook's and Uber's, respectively). Interestingly, even though the reading level of the Airbnb policy was two levels above that of the Uber policy, it scored higher, which indicates that while reading level is a critical factor for users to understand a policy, it might not be for genAI systems.

We further investigate the performance differences between systems when responding to user-generated vs. FAQ-based questions. In general, the systems behaved slightly better for user-generated questions (which were simpler in nature). In particular, ChatGPT obtained an average score of 81.7 for user-generated and 78.4 for FAQ-based queries; Bard scored 73.3 and 62.2, respectively; and Bing scored 74.4 and 74.8, respectively.

Finally, we want to note that evaluating a genAI system with our benchmark involves several manual steps. Our evaluator tool simplifies the process for systems with a public API by automating question execution and answer collection, minimizing execution and data collection efforts. The primary effort lies in analyzing the responses, which involves comparing them to predefined ground-truth data. This evaluation phase required about 5-10 minutes per answer while conducting this study. It includes a detailed review for accuracy, relevance, completeness, clarity, and referencing, ensuring a comprehensive and fair assessment of the genAI system's performance.
8 CONCLUSION AND FUTURE WORK

The emergence of generative AI systems, with their ability to summarize text and answer questions in human-like language, presents an opportunity to develop more sophisticated privacy assistants (GenAIPAs). Given the implications for individuals of receiving wrong information that might impact their privacy, such systems must be evaluated properly. In this paper, we presented GenAIPABench, a benchmark to evaluate future GenAIPAs, which includes questions about privacy policies and data privacy regulations, evaluation metrics, and annotated privacy documents. Our evaluation of popular genAI technology, including ChatGPT, Bard, and BingAI, shows promise for the technology but highlights that significant work remains to enhance its capabilities in handling complex queries, ensuring accuracy, maintaining response consistency, and citing proper sources. One limitation of this paper is that it includes only policies and questions in English. As future work, we plan to continue expanding GenAIPABench with more annotated answers for a larger number of privacy documents (and in multiple languages) to maintain its relevance and utility. We also aim to develop the infrastructure to perform a periodic evaluation of current and future versions of genAI and GenAIPA systems.

ACKNOWLEDGMENTS

This work was supported in part by the National Science Foundation under grant DGE-2114892.

REFERENCES
[1] P. Voigt and A. von dem Bussche, "The EU General Data Protection Regulation (GDPR): A practical guide," Springer, vol. 2, no. 1, pp. 1–16, 2017.
[2] J. Greenberg and J. Maier, "California Consumer Privacy Act (CCPA): Compliance guide," Business Law Today, vol. 30, no. 3, pp. 1–11, 2020.
[3] D. Solove, Nothing to Hide: The False Tradeoff between Privacy and Security. Yale University Press, 2013.
[4] J. A. Obar and A. Oeldorf-Hirsch, "The biggest lie on the internet: Ignoring the privacy policies and terms of service policies of social networking services," in 44th Research Conference on Communication, Information and Internet Policy, 2018.
[5] M. Langheinrich, "Privacy and mobile devices," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 34–44, 2001.
[6] M. Ackerman, L. Cranor, and J. Reagle, "Privacy policies that people can understand," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 415–422, ACM, 2001.
[7] N. Sadeh, B. Liu, A. Das, M. Degeling, and F. Schaub, "Personalized privacy assistant," Sept. 26, 2023. US Patent 11,768,949.
[8] A. Cavoukian, "Privacy by design: The 7 foundational principles," in 2010 33rd International Conference on Privacy and Data Protection, pp. 2–58, IEEE, 2010.
[9] S. Wilson, S. Komanduri, G. Norcie, A. Acquisti, P. Leon, and L. Cranor, "Summarizing privacy policies with crowdsourcing and natural language processing," in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 2363–2374, ACM, 2016.
[10] B. Knijnenburg and A. Kobsa, "Personalized privacy assistants for the internet of things: Enabling user control over privacy in smart environments," in Adjunct Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 1603–1608, ACM, 2013.
[11] Y. Zhang, Y. Chen, and N. Li, "Privacy risk analysis for mobile applications," IEEE Transactions on Dependable and Secure Computing, vol. 15, no. 6, pp. 968–981, 2016.
[12] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[13] Y. Belinkov, I. Dagan, S. Shieber, and A. Subramanian, "Lama: Language-agnostic model agnosticization," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 215–225, 2020.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT, pp. 4171–4186, 2019.
[15] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[16] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, Curran Associates, Inc., 2020.
[17] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[18] J. Gao, M. Galley, and L. Li, Neural Approaches to Conversational AI: Question Answering, Task-oriented Dialogues and Social Chatbots. Now Foundations and Trends, 2019.
[19] T. W. Bickmore and R. W. Picard, "Establishing and maintaining long-term human-computer relationships," ACM Transactions on Computer-Human Interaction (TOCHI), vol. 12, no. 2, pp. 293–327, 2005.
[20] B. Li, X. Wu, L. Qin, and J. Huang, "Alice: A conversational agent for financial planning," in International Conference on Web Intelligence, pp. 1163–1167, 2017.
[21] K. K. Fitzpatrick, A. Darcy, and M. Vierhile, "Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial," JMIR Mental Health, vol. 4, no. 2, p. e19, 2017.
[22] R. Winkler, M. Söllner, and S. Neuweiler, "Evaluating the engagement with conversational agents: Experiments in education and health," in International Conference on Design Science Research in Information Systems and Technology, pp. 102–114, 2018.
[23] S. Lin, J. Hilton, and O. Evans, "TruthfulQA: Measuring how models mimic human falsehoods," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214–3252, 2022.
[24] T. Schick, A. Lauscher, and I. Gurevych, "'It's not a bug, it's a feature': Unwanted model outputs as bugs in AI systems," in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2849–2859, 2021.
[25] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, "On the dangers of stochastic parrots: Can language models be too big?," Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1354–1360, 2021.
[26] L. Chen, M. Zaharia, and J. Zou, "How is ChatGPT's behavior changing over time?," arXiv preprint arXiv:2307.09009, 2023.
[27] Y. Ge, W. Hua, J. Ji, J. Tan, S. Xu, and Y. Zhang, "OpenAGI: When LLM meets domain experts," arXiv preprint arXiv:2304.04370, 2023.
[28] W.-C. Kang, J. Ni, N. Mehta, M. Sathiamoorthy, L. Hong, E. Chi, and D. Z. Cheng, "Do LLMs understand user preferences? Evaluating LLMs on user rating prediction," arXiv preprint arXiv:2305.06474, 2023.
[29] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, "AgentBench: Evaluating LLMs as agents," arXiv preprint arXiv:2308.03688, 2023.
[30] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. Xu, and P. Fung, "A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity," in 13th International Joint Conference on Natural Language Processing and 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pp. 675–718, Association for Computational Linguistics, 2023.
[31] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, and T. B. Hashimoto, "Benchmarking large language models for news summarization," 2023.
[32] A. Ravichander and A. W. Black, "Question answering for privacy policies: Combining computational and legal perspectives," in Empirical Methods in Natural Language Processing, 2019.
[33] N. Sadeh, A. Acquisti, T. D. Breaux, L. F. Cranor, A. M. McDonald, J. R. Reidenberg, N. A. Smith, F. Liu, N. C. Russell, F. Schaub, et al., "The usable privacy policy project," Tech. Rep. CMU-ISR-13-119, Carnegie Mellon University, 2013.
[34] A. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. G. Leon, M. S. Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, et al., "The creation and analysis of a website privacy policy corpus," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1330–1340, Association for Computational Linguistics, 2016.
[35] C.-H. Chiang and H.-y. Lee, "Can large language models be an alternative to human evaluations?," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pp. 15607–15631, Association for Computational Linguistics, 2023.
[36] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, "Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation," in Advances in Neural Information Processing Systems, vol. 36, pp. 21558–21572, Curran Associates, Inc., 2023.
[37] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, 2016.
[38] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer, "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611, 2017.
[39] R. Bommasani, P. Liang, and T. Lee, "Holistic evaluation of language models," Annals of the New York Academy of Sciences, vol. 1525, no. 1, pp. 140–146, 2023.
[40] International Organization for Standardization, "ISO/IEC 29100:2011 - Information technology, Security techniques, Privacy framework," 2011.
[41] A. Cavoukian, "7 foundational principles of privacy by design," 2011.
[42] I. Pollach, "What's wrong with online privacy policies?," Commun. ACM, vol. 50, pp. 103–108, Sept. 2007.
[43] Australian Government, "Office of the Australian Information Commissioner." https://www.oaic.gov.au/, 2023. Accessed: May 3, 2023.
[44] A. Ravichander, A. W. Black, S. Wilson, T. Norton, and N. Sadeh, "Question answering for privacy policies: Combining computational and legal perspectives," in Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, pp. 4947–4958, Association for Computational Linguistics, Nov. 2019.
[45] GDPR.eu, "General Data Protection Regulation FAQs." http://gdpr.eu/faq/, 2021.
[46] California Attorney General, "California Privacy Protection Agency FAQs." http://cppa.ca.gov/faq.html, 2021.
[47] Future of Privacy Forum, "Best Practices for Consumer-Facing Privacy Notices and Consent Forms," June 2020.
[48] K. Martin, "Ethical implications and accountability of algorithms," Journal of Business Ethics, vol. 160, Dec. 2019.
[49] K. A. Bamberger and D. K. Mulligan, "Privacy on the books and on the ground," Stanford Law Review, vol. 63, p. 247, 2011.
[50] T. W. Bickmore, L. M. Pfeifer, D. Schulman, and L. Yin, "Maintaining continuity in longitudinal, relational agents for chronic disease self-care," Journal of Medical Systems, vol. 42, no. 5, p. 91, 2018.
[51] E. Luger and A. Sellen, "Like having a really bad PA: The gulf between user expectation and experience of conversational agents," in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2016.
[52] Q. V. Liao, Y. Gao, Y. Wu, and Y. Zhang, "Evaluating the effectiveness of human-machine collaboration in human-in-the-loop text classification," in Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, 2019.
[53] H. Choi, J. Park, and Y. Jung, "The role of privacy fatigue in online privacy behavior," Computers in Human Behavior, vol. 81, pp. 42–51, 2018.
[54] H. P. Grice, "Logic and conversation," Speech Acts, 1975.
[55] C. Jensen and C. Potts, "Privacy policies as decision-making tools: An evaluation of online privacy notices," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 471–478, ACM, 2004.
[56] N. Radziwill and M. Benton, "Evaluating quality of chatbots and intelligent conversational agents," Software Quality Professional, vol. 19, no. 3, p. 25, 2017.
[57] J. Savelka and K. D. Ashley, "Extracting case law sentences for argumentation about the GDPR," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, ACL, 2016.
[58] E. H. Hiebert, "Unique words require unique instruction." http://textproject.org/, 2012.
[59] R. Flesch, "Flesch-Kincaid readability test." Retrieved October 2007.
[60] W. M. Steijn and A. Vedder, "Privacy under construction: A developmental perspective on privacy perception," Science, Technology, & Human Values, vol. 40, no. 4, pp. 615–637, 2015.
[61] G. Rózsa, A. Komlodi, and P. Chu, "Online searching in English as a foreign language," in Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, New York, NY, USA, pp. 875–880, Association for Computing Machinery, 2015.
Appendix A EVALUATOR

GenAIPABench includes an evaluator component whose goal is to communicate with the GenAIPA, sharing the privacy documents and questions and collecting answers and summaries (see Algorithm 1). The methodology centers on evaluating responses to questions based on a privacy document (PD) and a company name (CN). The procedure unfolds over multiple iterations, each comprising different initializations focusing on the PD or the CN. This approach is tailored to examine how the GenAIPA system's responses vary under distinct contextual setups. For each run i in the total number of runs r, the procedure undertakes the following distinct initializations:

• Initialization with Company Name (CN): The GenAIPA is introduced to the CN, forming the context for the subsequent query execution. This approach is designed to assess how the system interprets and responds to questions when primed with the company name. Here, the evaluator engages the GenAIPA by explaining that it will pose questions about the privacy policy of a specific organization, such as Uber. The purpose of this approach is to determine whether the GenAIPA possesses prior knowledge about the company's privacy policy and to gauge the accuracy of its responses based solely on this knowledge. This evaluation aspect is crucial for understanding the extent to which the GenAIPA can rely on its pre-existing data to generate informed responses about a company's privacy practices.

• Initialization with Privacy Document (PD): In this phase, the GenAIPA's attention is directed towards the PD, and the system is introduced to the specific content of the privacy document. This method is pivotal for analyzing how effectively the GenAIPA can generate responses directly influenced by the detailed information provided in the PD. To accommodate potential token limit constraints of GenAIPAs, the initial prompt clarifies that the privacy document will be delivered in segmented portions, ensuring that the GenAIPA comprehensively processes the document in manageable segments. Following the introduction of each segment, the system is presented with questions related to the content of the PD. This step is critical for assessing the GenAIPA's capability to understand and respond accurately to queries directly tied to the nuances and specificities of the privacy document.

• Initialization with Summary based on CN: The procedure also involves a unique initialization where the GenAIPA is requested to summarize the privacy document based on the CN, albeit without an explicit introduction to the PD. This step is designed to gauge the AI system's capacity to synthesize and summarize content based on its pre-existing knowledge or understanding. The initial prompt directs the GenAIPA to create a summary, providing a foundation for subsequent queries. This method tests the system's ability to process and condense information without direct exposure to the entire document, focusing on its internal processing capabilities and prior knowledge.

• Initialization with Summary based on PD: Similar to the previous step, but this time the summary is generated based on the PD. This initialization tests the system's response efficiency when working with a condensed version of the privacy document.

Query Execution and Data Collection: In each initialization, the set of questions Q is shuffled (as Q') to introduce variability. The system executes these queries, and the responses are collected. The responses are stored separately for each initialization method, identified as A1, A2, A3, and A4.

Aggregation of Responses: Upon completing all initializations for a single run, the responses from each method (A1, A2, A3, and A4) are aggregated into a comprehensive list A.
This process is repeated for each run, enriching A with a diverse set of responses that reflect the system's performance across different contexts. The algorithm concludes by returning the aggregated data A, which encompasses the varied responses generated under each initialization scenario. This output serves as a valuable dataset for analyzing the GenAIPA's adaptability and accuracy in responding to privacy-related inquiries under different contextual influences.

Algorithm 1 Revised Privacy Document Analysis and Query Response
procedure GenerateAndStoreQueryResponses(Privacy Document PD, Company Name CN, Questions Q, Runs r)
  Initialize A as an empty list
  for i = 1 to r do
    Q' ← ShuffleQuestions(Q)
    # initialization 1
    IntroducePrivacyDocument(CN)
    A1 ← QueryExecution(Q')
    ResetConversation()
    # initialization 2
    IntroducePrivacyDocument(PD)
    A2 ← QueryExecution(Q')
    ResetConversation()
    # initialization 3
    IntroducePrivacyDocument(CN)
    S ← GenerateSummary()
    ResetConversation()
    IntroducePrivacyDocument(S)
    A3 ← QueryExecution(Q')
    ResetConversation()
    # initialization 4
    IntroducePrivacyDocument(PD)
    S ← GenerateSummary()
    ResetConversation()
    IntroducePrivacyDocument(S)
    A4 ← QueryExecution(Q')
    ResetConversation()
    A ← A + [A1, A2, A3, A4]
  return A

In the GenAIPA Response Evaluation process (see Algorithm 2), the analyst scrutinizes the generated responses based on five key features: relevance, accuracy, clarity, completeness, and reference to policy sections. Scores for each feature are provided as input A. The procedure starts by initializing an empty set P and then iterates through each score A_i in A, accumulating the scores for subsequent categorization. This evaluation is performed for each set of scores, representing multiple runs, and the average scores for each run are determined. Ultimately, an overall average score is calculated across all runs. This average score is then categorized into one of three groups, Green, Yellow, or Red, based on predefined criteria (see Section 5). The category is stored in the set P, which, upon completion of all iterations, contains the categorized average scores for all runs. The primary goal of this evaluation is to offer insights into the GenAIPA's capabilities in generating privacy policy-related responses and to identify areas of potential improvement.

Algorithm 2 GenAIPA Response Evaluation
procedure EvaluateResponse(Scores A provided by analyst)
  P ← ∅
  for i = 1 to |A[1]| do
    A_i ← (1/|A|) Σ_{score ∈ A} score[i]
    Categorize A_i as Green, Yellow, or Red
    P ← P ∪ {Category}
  return P
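For readers who prefer an executable form, the following is a compact Python sketch of Algorithms 1 and 2 together with the M_all normalization from Section 5. The GenAIPA client interface (introduce, ask, reset, summarize) and the Green/Yellow/Red thresholds are hypothetical stand-ins, since each system exposes a different API and the categorization criteria are defined in Section 5.

```python
import random

def run_benchmark(assistant, pd_text, company, questions, runs=1):
    """Sketch of Algorithm 1: query the assistant under the four
    initializations (CN, PD, summary-from-CN, summary-from-PD)."""
    all_answers = []
    for _ in range(runs):
        shuffled = random.sample(questions, len(questions))  # shuffled copy
        contexts = [company, pd_text,
                    assistant.summarize(company),   # summary based on CN
                    assistant.summarize(pd_text)]   # summary based on PD
        for context in contexts:
            assistant.reset()
            assistant.introduce(context)
            all_answers.append([assistant.ask(q) for q in shuffled])
    return all_answers

def normalize(raw, lo=-5.0, hi=5.0):
    """Map a raw aggregate metric score in [lo, hi] to the 1-10 M_all scale."""
    return (raw - lo) / (hi - lo) * 9 + 1

def categorize(avg):
    # Hypothetical thresholds for the Green/Yellow/Red labels.
    return "Green" if avg >= 7 else ("Yellow" if avg >= 4 else "Red")

def evaluate(score_lists):
    """Sketch of Algorithm 2: average the analyst's raw scores per
    question across runs, normalize them, and label each question."""
    n_runs = len(score_lists)
    return [categorize(normalize(sum(run[i] for run in score_lists) / n_runs))
            for i in range(len(score_lists[0]))]
```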
Appendix B ADDITIONAL EXPERIMENTS

This section includes additional experiments and results from our evaluation of ChatGPT-4, Bard, and BingAI using GenAIPABench.

B.1 Assessing the Quality of Privacy Policy Summaries

This experiment examines the quality of the summaries generated for the privacy policies. The results (see Figure 6) demonstrate that ChatGPT-4 consistently excelled across various tasks, particularly in handling policies with higher non-existing content. Bard, showing promise on simpler FAQ questions, faced challenges with more complex, user-generated content. BingAI, adaptable yet exhibiting performance variability, performed best on medium-difficulty challenges but sometimes struggled with complex questions. Both Bard and BingAI encountered difficulties in citing references accurately, underscoring areas for improvement in understanding, clarity, consistency, and accuracy.

Figure 6: Performance of systems when the privacy policy summary is explicitly shared. (a) Overall score per policy. (b) Average scores for all policies across metrics.

ChatGPT-4: ChatGPT-4 demonstrated remarkable consistency across diverse policies (see Figure 6a). On Spotify's policy, with a moderate level of non-existing content (25%), the model achieved a median score of 79.75. This performance dipped slightly on the more complex Uber policy, which has a similar percentage of non-existing content but a longer reading time, resulting in a median of 77.5. On Twitter's policy, characterized by a significant amount of non-existing content (43.7%), ChatGPT-4 maintained a strong median of 84.25. Its performance peaked on Facebook's policy, with the highest non-existing content (46.8%), achieving an impressive median of 97.75. For Airbnb, despite the policy's high reading level (14.15 FKGL), ChatGPT-4 upheld a solid median of 82, showcasing its adaptability to complex information. Figure 6b reveals ChatGPT-4's robust performance across the metrics. It achieved high scores in relevance, particularly for FAQ questions such as SE_f1 and PD_f2 (both scoring 1.0), signifying effective understanding and summarization capabilities. However, challenges were noted in clarity and completeness for questions like DM_f3 (scoring -1.0) and UC_f2 (scoring -0.2), indicating some difficulty in maintaining coherence and thoroughness. Accuracy fluctuated, performing strongly for questions like SE_f1, PD_f1, and PD_f2 (all scoring 1.0), but showing weaknesses for T_f3 (-0.7) and DM_f2 (-0.2). Completeness varied, ranging from high scores for SE_f1 and PD_f1 (1.0) to lower scores for RC_f2 (0.4). Reference metrics, though generally strong in areas like SE_f1 and PD_f1 (0.9), revealed potential areas for improvement, particularly for DM_f2 and DM_f3 (both scoring -1.0), suggesting a need to enhance source citation accuracy.

Bard: Bard's performance, as indicated by the IQR data (see Figure 6a), varied significantly across policies. For Spotify, it achieved a median of 82, suggesting competent handling of standard policy content. However, on the more complex Uber policy, the model displayed a wider performance range (min: 10, Q3: 95.5), indicating inconsistency in handling diverse and challenging content. Twitter's policy, with a large amount of non-existing content, was more challenging for Bard, resulting in a lower median score of 64. On Facebook's policy, Bard managed a median score of 68.5, showing some capability in dealing with incomplete information but also highlighting room for improvement. Airbnb's policy posed a moderate challenge, with Bard achieving a median score of 75.25. Figure 6b presents a mixed performance across the metrics. In relevance, Bard scored high on simpler FAQ questions such as T_f1 and TU_f1 (both scoring 1.0), but it struggled with more nuanced user-generated questions, evidenced by lower scores for SE_f3 (0.2). Clarity was similarly variable, with high scores for T_f1 and TU_f1 (1.0), but significantly lower scores on more complex questions like PD_f2 (0.3) and CA_f2 (0.1).
This inconsistency in clarity, especially on more complex or nuanced questions, underscores a need for Bard to improve its ability to convey information clearly and effectively. Accuracy showed similar fluctuations, with top scores on straightforward questions like T_f1 and TU_f1 (1.0), but it faltered on questions like UC_uf1, PD_f1, and PD_f2 (all scoring -1.0). This variability in accuracy, particularly on user-generated questions, suggests potential challenges for Bard in consistently maintaining precision. Completeness spanned from high scores for T_f1 and SE_f1 (1.0) to lows for UC_uf1 and PD_f1 (-1.0), indicating fluctuating thoroughness in its responses. Reference metrics were notably poor, with about half of the questions scoring below -0.6, indicating a significant area for improvement in citing sources and maintaining informational integrity.

BingAI: In the IQR data analysis, BingAI showed distinctive performance patterns across policies. On Facebook's policy, with high non-existing content (46.8%), BingAI ranged from 28 to 100, achieving a median score of 84.25 and demonstrating its adaptability to varying content within the same policy. However, Uber's policy presented a challenge, particularly for user-generated questions, where BingAI's median score was only 73, reflecting the difficulty of achieving uniformity and consistency. On Airbnb's policy, characterized by substantial non-existing content and a high FKGL level, BingAI excelled with a median score of 97.75. It also performed effectively on Twitter's and Spotify's policies, attaining median scores of 88.75 and 82, respectively. Figure 6b showcases its performance across the metrics. In relevance, BingAI scored highly on FAQ questions like SE_f1, PD_f1, and PD_f2 (all scoring 1.0), reflecting strong understanding and summarization of policy content. However, it faced challenges with T_f3 (scoring 0.2), indicating areas where it could enhance its comprehension. Clarity was generally good, with high scores for SE_f1 and PD_f1 (1.0), but slightly lower for DM_f1, T_f2, T_f3, and CA_f1 (0.4), suggesting some variability in presenting information clearly. Accuracy showed strengths and weaknesses, with high scores for SE_f1 and PD_f1 (1.0) but lower scores for SE_uf1 (-0.6) and RC_f2 (-0.4), indicating areas where BingAI could improve in maintaining factual correctness. Completeness varied, scoring high for SE_f3 and PD_f1 (1.0) but lower for SE_uf1 (-0.6) and RC_f2 (-0.4), highlighting inconsistency in covering all relevant aspects of the policy content. Reference metrics were notably weak, with 75% of the questions scoring below 0, pointing to a significant area for improvement in providing well-cited and reliable summarizations.
In conclusion, while GenAIPAs demonstrate promising capabilities in summarizing privacy policies and answering specific privacy-related questions, their effectiveness is closely tied to the availability and complexity of content within these policies. ChatGPT-4 distinguishes itself with consistent performance and adaptability, particularly in handling policies with higher non-existing content. Bard shows promise on simpler FAQ questions but faces challenges with more complex user-generated content, indicating a need for enhanced understanding and clarity. BingAI, while adaptable, exhibits variability in its performance, especially on user-generated questions, suggesting areas for further improvement in consistency and accuracy.