Computers and Society
See recent articles
- [1] arXiv:2406.13049 [pdf, html, other]
-
Title: Assessing AI vs Human-Authored Spear Phishing SMS Attacks: An Empirical Study Using the TRAPD MethodComments: 18 pages, 5 figures, 1 tableSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
This paper explores the rising concern of utilizing Large Language Models (LLMs) in spear phishing message generation, and their performance compared to human-authored counterparts. Our pilot study compares the effectiveness of smishing (SMS phishing) messages created by GPT-4 and human authors, which have been personalized to willing targets. The targets assessed the messages in a modified ranked-order experiment using a novel methodology we call TRAPD (Threshold Ranking Approach for Personalized Deception). Specifically, targets provide personal information (job title and location, hobby, item purchased online), spear smishing messages are created using this information by humans and GPT-4, targets are invited back to rank-order 12 messages from most to least convincing (and identify which they would click on), and then asked questions about why they ranked messages the way they did. They also guess which messages are created by an LLM and their reasoning. Results from 25 targets show that LLM-generated messages are most often perceived as more convincing than those authored by humans, with messages related to jobs being the most convincing. We characterize different criteria used when assessing the authenticity of messages including word choice, style, and personal relevance. Results also show that targets were unable to identify whether the messages was AI-generated or human-authored and struggled to identify criteria to use in order to make this distinction. This study aims to highlight the urgent need for further research and improved countermeasures against personalized AI-enabled social engineering attacks.
- [2] arXiv:2406.13140 [pdf, html, other]
-
Title: From decision aiding to the massive use of algorithms: where does the responsibility stand?Subjects: Computers and Society (cs.CY)
In the very large debates on ethics of algorithms, this paper proposes an analysis on human responsibility. On one hand, algorithms are designed by some humans, who bear a part of responsibility in the results and unexpected impacts. Nevertheless, we show how the fact they cannot embrace the full situations of use and consequences lead to an unreachable limit. On the other hand, using technology is never free of responsibility, even if there also exist limits to characterise. Massive uses by unprofessional users introduce additional questions that modify the possibilities to be ethically responsible. The article is structured in such a way as to show how the limits have gradually evolved, leaving unthought of issues and a failure to share responsibility.
- [3] arXiv:2406.13605 [pdf, html, other]
-
Title: Nicer Than Humans: How do Large Language Models Behave in the Prisoner's Dilemma?Comments: 9 pages, 8 figures, 1 tableSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Physics and Society (physics.soc-ph)
The behavior of Large Language Models (LLMs) as artificial social agents is largely unexplored, and we still lack extensive evidence of how these agents react to simple social stimuli. Testing the behavior of AI agents in classic Game Theory experiments provides a promising theoretical framework for evaluating the norms and values of these agents in archetypal social situations. In this work, we investigate the cooperative behavior of Llama2 when playing the Iterated Prisoner's Dilemma against random adversaries displaying various levels of hostility. We introduce a systematic methodology to evaluate an LLM's comprehension of the game's rules and its capability to parse historical gameplay logs for decision-making. We conducted simulations of games lasting for 100 rounds, and analyzed the LLM's decisions in terms of dimensions defined in behavioral economics literature. We find that Llama2 tends not to initiate defection but it adopts a cautious approach towards cooperation, sharply shifting towards a behavior that is both forgiving and non-retaliatory only when the opponent reduces its rate of defection below 30%. In comparison to prior research on human participants, Llama2 exhibits a greater inclination towards cooperative behavior. Our systematic approach to the study of LLMs in game theoretical scenarios is a step towards using these simulations to inform practices of LLM auditing and alignment.
- [4] arXiv:2406.13813 [pdf, other]
-
Title: The Efficacy of Conversational Artificial Intelligence in Rectifying the Theory of Mind and Autonomy Biases: Comparative AnalysisComments: 28 pages, 5 tables, 6 figuresSubjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
The study evaluates the efficacy of Conversational Artificial Intelligence (CAI) in rectifying cognitive biases and recognizing affect in human-AI interactions, which is crucial for digital mental health interventions. Cognitive biases (systematic deviations from normative thinking) affect mental health, intensifying conditions like depression and anxiety. Therapeutic chatbots can make cognitive-behavioral therapy (CBT) more accessible and affordable, offering scalable and immediate support. The research employs a structured methodology with clinical-based virtual case scenarios simulating typical user-bot interactions. Performance and affect recognition were assessed across two categories of cognitive biases: theory of mind biases (anthropomorphization of AI, overtrust in AI, attribution to AI) and autonomy biases (illusion of control, fundamental attribution error, just-world hypothesis). A qualitative feedback mechanism was used with an ordinal scale to quantify responses based on accuracy, therapeutic quality, and adherence to CBT principles. Therapeutic bots (Wysa, Youper) and general-use LLMs (GTP 3.5, GTP 4, Gemini Pro) were evaluated through scripted interactions, double-reviewed by cognitive scientists and a clinical psychologist. Statistical analysis showed therapeutic bots were consistently outperformed by non-therapeutic bots in bias rectification and in 4 out of 6 biases in affect recognition. The data suggests that non-therapeutic chatbots are more effective in addressing some cognitive biases.
- [5] arXiv:2406.14123 [pdf, other]
-
Title: Mapping AI Ethics Narratives: Evidence from Twitter Discourse Between 2015 and 2022Comments: 22 pages, 6 figuresSubjects: Computers and Society (cs.CY)
Public participation is indispensable for an insightful understanding of the ethics issues raised by AI technologies. Twitter is selected in this paper to serve as an online public sphere for exploring discourse on AI ethics, facilitating broad and equitable public engagement in the development of AI technology. A research framework is proposed to demonstrate how to transform AI ethics-related discourse on Twitter into coherent and readable narratives. It consists of two parts: 1) combining neural networks with large language models to construct a topic hierarchy that contains popular topics of public concern without ignoring small but important voices, thus allowing a fine-grained exploration of meaningful information. 2) transforming fragmented and difficult-to-understand social media information into coherent and easy-to-read stories through narrative visualization, providing a new perspective for understanding the information in Twitter data. This paper aims to advocate for policy makers to enhance public oversight of AI technologies so as to promote their fair and sustainable development.
- [6] arXiv:2406.14154 [pdf, html, other]
-
Title: Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation ServicesComments: Accepted at European Workshop on Algorithmic Fairness (EWAF'24)Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. In the absence of clear legal definitions and a lack of transparency regarding the role of algorithms in shaping decisions on content moderation, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection as well as counterfactual fairness through perturbation sensitivity analysis and present disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and codified messages. Moreover, our results point to the need to remove group-specific bias. It seems that biases towards some groups, such as Women, have been mostly rectified, while biases towards other groups, such as LGBTQ+ and PoC remain.
- [7] arXiv:2406.14243 [pdf, html, other]
-
Title: AuditMAI: Towards An Infrastructure for Continuous AI AuditingSubjects: Computers and Society (cs.CY)
Artificial Intelligence (AI) Auditability is a core requirement for achieving responsible AI system design. However, it is not yet a prominent design feature in current applications. Existing AI auditing tools typically lack integration features and remain as isolated approaches. This results in manual, high-effort, and mostly one-off AI audits, necessitating alternative methods. Inspired by other domains such as finance, continuous AI auditing is a promising direction to conduct regular assessments of AI systems. The issue remains, however, since the methods for continuous AI auditing are not mature yet at the moment. To address this gap, we propose the Auditability Method for AI (AuditMAI), which is intended as a blueprint for an infrastructure towards continuous AI auditing. For this purpose, we first clarified the definition of AI auditability based on literature. Secondly, we derived requirements from two industrial use cases for continuous AI auditing tool support. Finally, we developed AuditMAI and discussed its elements as a blueprint for a continuous AI auditability infrastructure.
- [8] arXiv:2406.14290 [pdf, html, other]
-
Title: Examining the Implications of Deepfakes for Election IntegrityHriday Ranka, Mokshit Surana, Neel Kothari, Veer Pariawala, Pratyay Banerjee, Aditya Surve, Sainath Reddy Sankepally, Raghav Jain, Jhagrut Lalwani, Swapneel MehtaComments: Accepted at the AAAI 2024 conference, AI for Credible Elections Workshop-AI4CE 2024Subjects: Computers and Society (cs.CY); Social and Information Networks (cs.SI)
It is becoming cheaper to launch disinformation operations at scale using AI-generated content, in particular 'deepfake' technology. We have observed instances of deepfakes in political campaigns, where generated content is employed to both bolster the credibility of certain narratives (reinforcing outcomes) and manipulate public perception to the detriment of targeted candidates or causes (adversarial outcomes). We discuss the threats from deepfakes in politics, highlight model specifications underlying different types of deepfake generation methods, and contribute an accessible evaluation of the efficacy of existing detection methods. We provide this as a summary for lawmakers and civil society actors to understand how the technology may be applied in light of existing policies regulating its use. We highlight the limitations of existing detection mechanisms and discuss the areas where policies and regulations are required to address the challenges of deepfakes.
New submissions for Friday, 21 June 2024 (showing 8 of 8 entries )
- [9] arXiv:2406.12896 (cross-list from cs.AI) [pdf, html, other]
-
Title: Leveraging Pedagogical Theories to Understand Student Learning Process with Graph-based Reasonable Knowledge TracingComments: Preprint, accepted to appear in SIGKDD 2024, 12 pages. The source code is available at this https URL. Keywords: interpretable knowledge tracing, student behavior modeling, intelligence educationSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Knowledge tracing (KT) is a crucial task in intelligent education, focusing on predicting students' performance on given questions to trace their evolving knowledge. The advancement of deep learning in this field has led to deep-learning knowledge tracing (DLKT) models that prioritize high predictive accuracy. However, many existing DLKT methods overlook the fundamental goal of tracking students' dynamical knowledge mastery. These models do not explicitly model knowledge mastery tracing processes or yield unreasonable results that educators find difficulty to comprehend and apply in real teaching scenarios. In response, our research conducts a preliminary analysis of mainstream KT approaches to highlight and explain such unreasonableness. We introduce GRKT, a graph-based reasonable knowledge tracing method to address these issues. By leveraging graph neural networks, our approach delves into the mutual influences of knowledge concepts, offering a more accurate representation of how the knowledge mastery evolves throughout the learning process. Additionally, we propose a fine-grained and psychological three-stage modeling process as knowledge retrieval, memory strengthening, and knowledge learning/forgetting, to conduct a more reasonable knowledge tracing process. Comprehensive experiments demonstrate that GRKT outperforms eleven baselines across three datasets, not only enhancing predictive accuracy but also generating more reasonable knowledge tracing results. This makes our model a promising advancement for practical implementation in educational settings. The source code is available at this https URL.
- [10] arXiv:2406.12975 (cross-list from cs.CL) [pdf, html, other]
-
Title: SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text GenerationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns due to their potential to produce text that infringes on copyrights, resulting in several high-profile lawsuits. The legal landscape is struggling to keep pace with these rapid advancements, with ongoing debates about whether generated text might plagiarize copyrighted materials. Current LLMs may infringe on copyrights or overly restrict non-copyrighted texts, leading to these challenges: (i) the need for a comprehensive evaluation benchmark to assess copyright compliance from multiple aspects; (ii) evaluating robustness against safeguard bypassing attacks; and (iii) developing effective defenses targeted against the generation of copyrighted text. To tackle these challenges, we introduce a curated dataset to evaluate methods, test attack strategies, and propose lightweight, real-time defenses to prevent the generation of copyrighted text, ensuring the safe and lawful use of LLMs. Our experiments demonstrate that current LLMs frequently output copyrighted text, and that jailbreaking attacks can significantly increase the volume of copyrighted output. Our proposed defense mechanisms significantly reduce the volume of copyrighted text generated by LLMs by effectively refusing malicious requests. Code is publicly available at this https URL
- [11] arXiv:2406.13299 (cross-list from cs.SI) [pdf, other]
-
Title: Empirical Evaluation of Integrated Trust Mechanism to Improve Trust in E-commerce ServicesSubjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Performance (cs.PF)
There are mostly two approaches to tackle trust management worldwide Strong and crisp and Soft and Social. We analyze the impact of integrated trust mechanism in three different e-commerce services. The trust aspect is a dormant element between potential users and being developed expert or internet systems. We support our integration by preside over an experiment in controlled laboratory environment. The model selected for the experiment is a composite of policy and reputation based trust mechanisms and widely acknowledged in e-commerce industry. The integration between policy and trust mechanism was accomplished through mapping process, weakness of one brought to a close with the strength of other. Furthermore, experiment has been supervised to validate the effectiveness of implementation by segregating both integrated and traditional trust mechanisms in learning system
- [12] arXiv:2406.13303 (cross-list from cs.AI) [pdf, other]
-
Title: Integration of Policy and Reputation based Trust Mechanisms in e-Commerce IndustrySubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multimedia (cs.MM); Social and Information Networks (cs.SI)
The e-commerce systems are being tackled from commerce behavior and internet technologies. Therefore, trust aspect between buyer-seller transactions is a potential element which needs to be addressed in competitive e-commerce industry. The e-commerce industry is currently handling two different trust approaches. First approach consists on centralized mechanism where digital credentials/set of rules assembled, called Policy based trust mechanisms . Second approach consists on decentralized trust mechanisms where reputation, points assembled and shared, called Reputation based trust mechanisms. The difference between reputation and policy based trust mechanism will be analyzed and recommendations would be proposed to increase trust between buyer and seller in e-commerce industry. The integration of trust mechanism is proposed through mapping process, strength of one mechanism with the weakness of other. The proposed model for integrated mechanism will be presented and illustrated how the proposed model will be used in real world e-commerce industry.
- [13] arXiv:2406.13677 (cross-list from cs.CL) [pdf, html, other]
-
Title: Leveraging Large Language Models to Measure Gender Bias in Gendered LanguagesSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Gender bias in text corpora used in various natural language processing (NLP) contexts, such as for training large language models (LLMs), can lead to the perpetuation and amplification of societal inequalities. This is particularly pronounced in gendered languages like Spanish or French, where grammatical structures inherently encode gender, making the bias analysis more challenging. Existing methods designed for English are inadequate for this task due to the intrinsic linguistic differences between English and gendered languages. This paper introduces a novel methodology that leverages the contextual understanding capabilities of LLMs to quantitatively analyze gender representation in Spanish corpora. By utilizing LLMs to identify and classify gendered nouns and pronouns in relation to their reference to human entities, our approach provides a nuanced analysis of gender biases. We empirically validate our method on four widely-used benchmark datasets, uncovering significant gender disparities with a male-to-female ratio ranging from 4:1 to 6:1. These findings demonstrate the value of our methodology for bias quantification in gendered languages and suggest its application in NLP, contributing to the development of more equitable language technologies.
- [14] arXiv:2406.13706 (cross-list from cs.CL) [pdf, html, other]
-
Title: Breaking News: Case Studies of Generative AI's Use in JournalismSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Journalists are among the many users of large language models (LLMs). To better understand the journalist-AI interactions, we conduct a study of LLM usage by two news agencies through browsing the WildChat dataset, identifying candidate interactions, and verifying them by matching to online published articles. Our analysis uncovers instances where journalists provide sensitive material such as confidential correspondence with sources or articles from other agencies to the LLM as stimuli and prompt it to generate articles, and publish these machine-generated articles with limited intervention (median output-publication ROUGE-L of 0.62). Based on our findings, we call for further research into what constitutes responsible use of AI, and the establishment of clear guidelines and best practices on using LLMs in a journalistic context.
- [15] arXiv:2406.13791 (cross-list from cs.AI) [pdf, html, other]
-
Title: IoT-Based Preventive Mental Health Using Knowledge Graphs and Standards for Better Well-BeingComments: 20 pagesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Sustainable Development Goals (SDGs) give the UN a road map for development with Agenda 2030 as a target. SDG3 "Good Health and Well-Being" ensures healthy lives and promotes well-being for all ages. Digital technologies can support SDG3. Burnout and even depression could be reduced by encouraging better preventive health. Due to the lack of patient knowledge and focus to take care of their health, it is necessary to help patients before it is too late. New trends such as positive psychology and mindfulness are highly encouraged in the USA. Digital Twin (DT) can help with the continuous monitoring of emotion using physiological signals (e.g., collected via wearables). Digital twins facilitate monitoring and provide constant health insight to improve quality of life and well-being with better personalization. Healthcare DT challenges are standardizing data formats, communication protocols, and data exchange mechanisms. To achieve those data integration and knowledge challenges, we designed the Mental Health Knowledge Graph (ontology and dataset) to boost mental health. The Knowledge Graph (KG) acquires knowledge from ontology-based mental health projects classified within the LOV4IoT ontology catalog (Emotion, Depression, and Mental Health). Furthermore, the KG is mapped to standards (e.g., ontologies) when possible. Standards from ETSI SmartM2M, ITU/WHO, ISO, W3C, NIST, and IEEE are relevant to mental health.
- [16] arXiv:2406.13882 (cross-list from cs.LG) [pdf, other]
-
Title: Allocation Requires Prediction Only if Inequality Is LowComments: Appeared in Forty-first International Conference on Machine Learning (ICML), 2024Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Theoretical Economics (econ.TH)
Algorithmic predictions are emerging as a promising solution concept for efficiently allocating societal resources. Fueling their use is an underlying assumption that such systems are necessary to identify individuals for interventions. We propose a principled framework for assessing this assumption: Using a simple mathematical model, we evaluate the efficacy of prediction-based allocations in settings where individuals belong to larger units such as hospitals, neighborhoods, or schools. We find that prediction-based allocations outperform baseline methods using aggregate unit-level statistics only when between-unit inequality is low and the intervention budget is high. Our results hold for a wide range of settings for the price of prediction, treatment effect heterogeneity, and unit-level statistics' learnability. Combined, we highlight the potential limits to improving the efficacy of interventions through prediction.
- [17] arXiv:2406.13883 (cross-list from cs.HC) [pdf, html, other]
-
Title: We Are The Clouds: Blending Interaction and Participation in Urban Media ArtSubjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Since the early 2000s, cultural institutions have been instrumental in reshaping public spaces, fostering community engagement, and nurturing artistic innovation. Central to these initiatives are audience interaction and participation concepts, yet their definitions and applications in urban media art remain nebulous. This article endeavours to demystify these terms, examining the distinct characteristics and intersections of interactive and participatory art within urban contexts. A particular emphasis is placed on artworks that harmonise both elements, exploring the motivations and outcomes of this synthesis. The case study of We Are The Clouds serves as a focal point, exemplifying how strategic integration of interaction and participation can enhance community connection and reinvigorate public spaces. Through this analysis, the paper underscores the transformative power of urban media artworks in redefining neighbourhood experiences, empowering local voices, and revitalising the essence of public realms.
- [18] arXiv:2406.13894 (cross-list from cs.CV) [pdf, other]
-
Title: Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical EventsSubjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Traditional approaches to safety event analysis in autonomous systems have relied on complex machine learning models and extensive datasets for high accuracy and reliability. However, the advent of Multimodal Large Language Models (MLLMs) offers a novel approach by integrating textual, visual, and audio modalities, thereby providing automated analyses of driving videos. Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts to ensure accurate, reliable, and actionable insights for hazard detection. By incorporating models like Gemini-Pro-Vision 1.5 and Llava, our methodology aims to automate the safety critical events and mitigate common issues such as hallucinations in MLLM outputs. Preliminary results demonstrate the framework's potential in zero-shot learning and accurate scenario analysis, though further validation on larger datasets is necessary. Furthermore, more investigations are required to explore the performance enhancements of the proposed framework through few-shot learning and fine-tuned models. This research underscores the significance of MLLMs in advancing the analysis of the naturalistic driving videos by improving safety-critical event detecting and understanding the interaction with complex environments.
- [19] arXiv:2406.13898 (cross-list from cs.CV) [pdf, other]
-
Title: The Use of Multimodal Large Language Models to Detect Objects from Thermal Images: Transportation ApplicationsSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY)
The integration of thermal imaging data with Multimodal Large Language Models (MLLMs) constitutes an exciting opportunity for improving the safety and functionality of autonomous driving systems and many Intelligent Transportation Systems (ITS) applications. This study investigates whether MLLMs can understand complex images from RGB and thermal cameras and detect objects directly. Our goals were to 1) assess the ability of the MLLM to learn from information from various sets, 2) detect objects and identify elements in thermal cameras, 3) determine whether two independent modality images show the same scene, and 4) learn all objects using different modalities. The findings showed that both GPT-4 and Gemini were effective in detecting and classifying objects in thermal images. Similarly, the Mean Absolute Percentage Error (MAPE) for pedestrian classification was 70.39% and 81.48%, respectively. Moreover, the MAPE for bike, car, and motorcycle detection were 78.4%, 55.81%, and 96.15%, respectively. Gemini produced MAPE of 66.53%, 59.35% and 78.18% respectively. This finding further demonstrates that MLLM can identify thermal images and can be employed in advanced imaging automation technologies for ITS applications.
- [20] arXiv:2406.14230 (cross-list from cs.CL) [pdf, html, other]
-
Title: Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving TestingComments: Work in progressSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Warning: this paper contains model outputs exhibiting unethical information. Large Language Models (LLMs) have achieved significant breakthroughs, but their generated unethical content poses potential risks. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Numerous datasets have been constructed to assess social bias, toxicity, and ethics in LLMs, but they suffer from evaluation chronoeffect, that is, as models rapidly evolve, existing data becomes leaked or undemanding, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach that dynamically probes the underlying moral baselines of LLMs. Distinct from previous adaptive testing methods that rely on static datasets with limited difficulty, GETA incorporates an iteratively-updated item generator which infers each LLM's moral boundaries and generates difficulty-tailored testing items, accurately reflecting the true alignment extent. This process theoretically learns a joint distribution of item and model response, with item difficulty and value conformity as latent variables, where the generator co-evolves with the LLM, addressing chronoeffect. We evaluate various popular LLMs with diverse capabilities and demonstrate that GETA can create difficulty-matching testing items and more accurately assess LLMs' values, better consistent with their performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.
- [21] arXiv:2406.14373 (cross-list from cs.AI) [pdf, html, other]
-
Title: Artificial Leviathan: Exploring Social Evolution of LLM Agents Through the Lens of Hobbesian Social Contract TheoryGordon Dai, Weijia Zhang, Jinhan Li, Siqi Yang, Chidera Onochie lbe, Srihas Rao, Arthur Caetano, Misha SraSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
The emergence of Large Language Models (LLMs) and advancements in Artificial Intelligence (AI) offer an opportunity for computational social science research at scale. Building upon prior explorations of LLM agent design, our work introduces a simulated agent society where complex social relationships dynamically form and evolve over time. Agents are imbued with psychological drives and placed in a sandbox survival environment. We conduct an evaluation of the agent society through the lens of Thomas Hobbes's seminal Social Contract Theory (SCT). We analyze whether, as the theory postulates, agents seek to escape a brutish "state of nature" by surrendering rights to an absolute sovereign in exchange for order and security. Our experiments unveil an alignment: Initially, agents engage in unrestrained conflict, mirroring Hobbes's depiction of the state of nature. However, as the simulation progresses, social contracts emerge, leading to the authorization of an absolute sovereign and the establishment of a peaceful commonwealth founded on mutual cooperation. This congruence between our LLM agent society's evolutionary trajectory and Hobbes's theoretical account indicates LLMs' capability to model intricate social dynamics and potentially replicate forces that shape human societies. By enabling such insights into group behavior and emergent societal phenomena, LLM-driven multi-agent simulations, while unable to simulate all the nuances of human behavior, may hold potential for advancing our understanding of social structures, group dynamics, and complex human systems.
- [22] arXiv:2406.14508 (cross-list from cs.CL) [pdf, html, other]
-
Title: Evidence of a log scaling law for political persuasion with large language modelsComments: 16 pages, 4 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size. Here, we generate 720 persuasive messages on 10 U.S. political issues from 24 language models spanning several orders of magnitude in size. We then deploy these messages in a large-scale randomized survey experiment (N = 25,982) to estimate the persuasive capability of each model. Our findings are twofold. First, we find evidence of a log scaling law: model persuasiveness is characterized by sharply diminishing returns, such that current frontier models are barely more persuasive than models smaller in size by an order of magnitude or more. Second, mere task completion (coherence, staying on topic) appears to account for larger models' persuasive advantage. These findings suggest that further scaling model size will not much increase the persuasiveness of static LLM-generated messages.
- [23] arXiv:2406.14526 (cross-list from cs.CV) [pdf, html, other]
-
Title: Fantastic Copyrighted Beasts and How (Not) to Generate ThemLuxi He, Yangsibo Huang, Weijia Shi, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter HendersonSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Recent studies show that image and video generation models can be prompted to reproduce copyrighted content from their training data, raising serious legal concerns around copyright infringement. Copyrighted characters, in particular, pose a difficult challenge for image generation services, with at least one lawsuit already awarding damages based on the generation of these characters. Yet, little research has empirically examined this issue. We conduct a systematic evaluation to fill this gap. First, we build CopyCat, an evaluation suite consisting of diverse copyrighted characters and a novel evaluation pipeline. Our evaluation considers both the detection of similarity to copyrighted characters and generated image's consistency with user input. Our evaluation systematically shows that both image and video generation models can still generate characters even if characters' names are not explicitly mentioned in the prompt, sometimes with only two generic keywords (e.g., prompting with "videogame, plumber" consistently generates Nintendo's Mario character). We then introduce techniques to semi-automatically identify such keywords or descriptions that trigger character generation. Using our evaluation suite, we study runtime mitigation strategies, including both existing methods and new strategies we propose. Our findings reveal that commonly employed strategies, such as prompt rewriting in the DALL-E system, are not sufficient as standalone guardrails. These strategies must be coupled with other approaches, like negative prompting, to effectively reduce the unintended generation of copyrighted characters. Our work provides empirical grounding to the discussion of copyright mitigation strategies and offers actionable insights for model deployers actively implementing them.
Cross submissions for Friday, 21 June 2024 (showing 15 of 15 entries )
- [24] arXiv:2403.07322 (replaced) [pdf, html, other]
-
Title: A Question-centric Multi-experts Contrastive Learning Framework for Improving the Accuracy and Interpretability of Deep Sequential Knowledge Tracing ModelsComments: 25 pages, 9 figures, Accepted by TKDDSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Knowledge tracing (KT) plays a crucial role in predicting students' future performance by analyzing their historical learning processes. Deep neural networks (DNNs) have shown great potential in solving the KT problem. However, there still exist some important challenges when applying deep learning techniques to model the KT process. The first challenge lies in taking the individual information of the question into modeling. This is crucial because, despite questions sharing the same knowledge component (KC), students' knowledge acquisition on homogeneous questions can vary significantly. The second challenge lies in interpreting the prediction results from existing deep learning-based KT models. In real-world applications, while it may not be necessary to have complete transparency and interpretability of the model parameters, it is crucial to present the model's prediction results in a manner that teachers find interpretable. This makes teachers accept the rationale behind the prediction results and utilize them to design teaching activities and tailored learning strategies for students. However, the inherent black-box nature of deep learning techniques often poses a hurdle for teachers to fully embrace the model's prediction results. To address these challenges, we propose a Question-centric Multi-experts Contrastive Learning framework for KT called Q-MCKT. We have provided all the datasets and code on our website at this https URL.
- [25] arXiv:2403.15412 (replaced) [pdf, html, other]
-
Title: Towards Measuring and Modeling "Culture" in LLMs: A SurveyMuhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Alham Fikri Aji, Jacki O'Neill, Ashutosh Modi, Monojit ChoudhurySubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs). We observe that none of the studies explicitly define "culture, which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture". We call these aspects the proxies of culture, and organize them across two dimensions of demographic and semantic proxies. We also categorize the probing methods employed. Our analysis indicates that only certain aspects of ``culture,'' such as values and objectives, have been studied, leaving several other interesting and important facets, especially the multitude of semantic domains (Thompson et al., 2020) and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps are the lack of robustness of probing techniques and situated studies on the impact of cultural mis- and under-representation in LLM-based applications.
- [26] arXiv:2404.08592 (replaced) [pdf, html, other]
-
Title: Scarce Resource Allocations That Rely On Machine Learning Should Be RandomizedComments: To appear in the proceedings of the International Conference on Machine Learning (ICML 2024)Subjects: Computers and Society (cs.CY)
Contrary to traditional deterministic notions of algorithmic fairness, this paper argues that fairly allocating scarce resources using machine learning often requires randomness. We address why, when, and how to randomize by proposing stochastic procedures that more adequately account for all of the claims that individuals have to allocations of social goods or opportunities.
- [27] arXiv:2405.00860 (replaced) [pdf, other]
-
Title: Public Computing Intellectuals in the Age of AI CrisisComments: 28 pages, 2 tablesSubjects: Computers and Society (cs.CY)
The belief that AI technology is on the cusp of causing a generalized social crisis became a popular one in 2023. While there was no doubt an element of hype and exaggeration to some of these accounts, they do reflect the fact that there are troubling ramifications to this technology stack. This conjunction of shared concerns about social, political, and personal futures presaged by current developments in artificial intelligence presents the academic discipline of computing with a renewed opportunity for self-examination and reconfiguration. This position paper endeavors to do so in four sections. The first explores what is at stake for computing in the narrative of an AI crisis. The second articulates possible educational responses to this crisis and advocates for a broader analytic focus on power relations. The third section presents a novel characterization of academic computing's field of practice, one which includes not only the discipline's usual instrumental forms of practice but reflexive practice as well. This reflexive dimension integrates both the critical and public functions of the discipline as equal intellectual partners and a necessary component of any contemporary academic field. The final section will advocate for a conceptual archetype--the Public Computer Intellectual and its less conspicuous but still essential cousin, the (Almost) Public Computer Intellectual--as a way of practically imagining the expanded possibilities of academic practice in our discipline, one that provides both self-critique and an outward-facing orientation towards the public good. It will argue that the computer education research community can play a vital role in this regard. Recommendations for pedagogical change within computing to develop more reflexive capabilities are also provided.
- [28] arXiv:2406.10768 (replaced) [pdf, html, other]
-
Title: Rideshare Transparency: Translating Gig Worker Insights on AI Platform Design to PolicySubjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Rideshare platforms exert significant control over workers through algorithmic systems that can result in financial, emotional, and physical harm. What steps can platforms, designers, and practitioners take to mitigate these negative impacts and meet worker needs? In this paper, through a novel mixed methods study combining a LLM-based analysis of over 1 million comments posted to online platform worker communities with semi-structured interviews of workers, we thickly characterize transparency-related harms, mitigation strategies, and worker needs while validating and contextualizing our findings within the broader worker community. Our findings expose a transparency gap between existing platform designs and the information drivers need, particularly concerning promotions, fares, routes, and task allocation. Our analysis suggests that rideshare workers need key pieces of information, which we refer to as indicators, to make informed work decisions. These indicators include details about rides, driver statistics, algorithmic implementation details, and platform policy information. We argue that instead of relying on platforms to include such information in their designs, new regulations that require platforms to publish public transparency reports may be a more effective solution to improve worker well-being. We offer recommendations for implementing such a policy.
- [29] arXiv:2406.11845 (replaced) [pdf, other]
-
Title: Decoding the Digital Fine Print: Navigating the potholes in Terms of service/ use of GenAI tools against the emerging need for Transparent and Trustworthy Tech FuturesSubjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
The research investigates the crucial role of clear and intelligible terms of service in cultivating user trust and facilitating informed decision-making in the context of AI, in specific GenAI. It highlights the obstacles presented by complex legal terminology and detailed fine print, which impede genuine user consent and recourse, particularly during instances of algorithmic malfunctions, hazards, damages, or inequities, while stressing the necessity of employing machine-readable terms for effective service licensing. The increasing reliance on General Artificial Intelligence (GenAI) tools necessitates transparent, comprehensible, and standardized terms of use, which facilitate informed decision-making while fostering trust among stakeholders. Despite recent efforts promoting transparency via system and model cards, existing documentation frequently falls short of providing adequate disclosures, leaving users ill-equipped to evaluate potential risks and harms. To address this gap, this research examines key considerations necessary in terms of use or terms of service for Generative AI tools, drawing insights from multiple studies. Subsequently, this research evaluates whether the terms of use or terms of service of prominent Generative AI tools against the identified considerations. Findings indicate inconsistencies and variability in document quality, signaling a pressing demand for uniformity in disclosure practices. Consequently, this study advocates for robust, enforceable standards ensuring complete and intelligible disclosures prior to the release of GenAI tools, thereby empowering end-users to make well-informed choices and enhancing overall accountability in the field.
- [30] arXiv:2301.12901 (replaced) [pdf, html, other]
-
Title: Simulating the Integration of Urban Air Mobility into Existing Transportation Systems: A SurveyXuan Jiang, Yuhan Tang, Junzhe Cao, Vishwanath Bulusu, Hao (Frank)Yang, Xin Peng, Yunhan Zheng, Jinhua Zhao, Raja SenguptaSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Urban air mobility (UAM) has the potential to revolutionize transportation in metropolitan areas, providing a new mode of transportation that could alleviate congestion and improve accessibility. However, the integration of UAM into existing transportation systems is a complex task that requires a thorough understanding of its impact on traffic flow and capacity. In this paper, we conduct a survey to investigate the current state of research on UAM in metropolitan-scale traffic using simulation techniques. We identify key challenges and opportunities for the integration of UAM into urban transportation systems, including impacts on existing traffic patterns and congestion; safety analysis and risk assessment; potential economic and environmental benefits; and the development of shared infrastructure and routes for UAM and ground-based transportation. We also discuss the potential benefits of UAM, such as reduced travel times and improved accessibility for underserved areas. Our survey provides a comprehensive overview of the current state of research on UAM in metropolitan-scale traffic using simulation and highlights key areas for future research and development.
- [31] arXiv:2308.12215 (replaced) [pdf, html, other]
-
Title: The Challenges of Machine Learning for Trust and Safety: A Case Study on Misinformation DetectionSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
We examine the disconnect between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study. We survey literature on automated detection of misinformation across a corpus of 248 well-cited papers in the field. We then examine subsets of papers for data and code availability, design missteps, reproducibility, and generalizability. Our paper corpus includes published work in security, natural language processing, and computational social science. Across these disparate disciplines, we identify common errors in dataset and method design. In general, detection tasks are often meaningfully distinct from the challenges that online services actually face. Datasets and model evaluation are often non-representative of real-world contexts, and evaluation frequently is not independent of model training. We demonstrate the limitations of current detection methods in a series of three representative replication studies. Based on the results of these analyses and our literature survey, we conclude that the current state-of-the-art in fully-automated misinformation detection has limited efficacy in detecting human-generated misinformation. We offer recommendations for evaluating applications of machine learning to trust and safety problems and recommend future directions for research.
- [32] arXiv:2309.06219 (replaced) [pdf, html, other]
-
Title: Human Action Co-occurrence in Lifestyle Vlogs using Graph Link PredictionSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
We introduce the task of automatic human action co-occurrence identification, i.e., determine whether two human actions can co-occur in the same interval of time. We create and make publicly available the ACE (Action Co-occurrencE) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer if two actions are co-occurring. We show that graphs are particularly well suited to capture relations between human actions, and the learned graph representations are effective for our task and capture novel and relevant information across different data domains. The ACE dataset and the code introduced in this paper are publicly available at this https URL.
- [33] arXiv:2311.11900 (replaced) [pdf, html, other]
-
Title: Measuring and Mitigating Biases in Motor Insurance PricingSubjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
The non-life insurance sector operates within a highly competitive and tightly regulated framework, confronting a pivotal juncture in the formulation of pricing strategies. Insurers are compelled to harness a range of statistical methodologies and available data to construct optimal pricing structures that align with the overarching corporate strategy while accommodating the dynamics of market competition. Given the fundamental societal role played by insurance, premium rates are subject to rigorous scrutiny by regulatory authorities. These rates must conform to principles of transparency, explainability, and ethical considerations. Consequently, the act of pricing transcends mere statistical calculations and carries the weight of strategic and societal factors. These multifaceted concerns may drive insurers to establish equitable premiums, taking into account various variables. For instance, regulations mandate the provision of equitable premiums, considering factors such as policyholder gender or mutualist group dynamics in accordance with respective corporate strategies. Age-based premium fairness is also mandated. In certain insurance domains, variables such as the presence of serious illnesses or disabilities are emerging as new dimensions for evaluating fairness. Regardless of the motivating factor prompting an insurer to adopt fairer pricing strategies for a specific variable, the insurer must possess the capability to define, measure, and ultimately mitigate any ethical biases inherent in its pricing practices while upholding standards of consistency and performance. This study seeks to provide a comprehensive set of tools for these endeavors and assess their effectiveness through practical application in the context of automobile insurance.
- [34] arXiv:2311.16421 (replaced) [pdf, html, other]
-
Title: CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language ModelsComments: Accepted by the Cross-Cultural Considerations in NLP Workshop @ ACL 2024Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
As the scaling of Large Language Models (LLMs) has dramatically enhanced their capabilities, there has been a growing focus on the alignment problem to ensure their responsible and ethical use. While existing alignment efforts predominantly concentrate on universal values such as the HHH principle, the aspect of culture, which is inherently pluralistic and diverse, has not received adequate attention. This work introduces a new benchmark, CDEval, aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains. Our comprehensive experiments provide intriguing insights into the culture of mainstream LLMs, highlighting both consistencies and variations across different dimensions and domains. The findings underscore the importance of integrating cultural considerations in LLM development, particularly for applications in diverse cultural settings. Through CDEval, we aim to broaden the horizon of LLM alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of LLMs. This benchmark serves as a valuable resource for cultural studies in LLMs, paving the way for more culturally aware and sensitive models.
- [35] arXiv:2312.02592 (replaced) [pdf, html, other]
-
Title: FRAPPE: A Group Fairness Framework for Post-Processing EverythingComments: Conference paper at ICML 2024Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Despite achieving promising fairness-error trade-offs, in-processing mitigation techniques for group fairness cannot be employed in numerous practical applications with limited computation resources or no access to the training pipeline of the prediction model. In these situations, post-processing is a viable alternative. However, current methods are tailored to specific problem settings and fairness definitions and hence, are not as broadly applicable as in-processing. In this work, we propose a framework that turns any regularized in-processing method into a post-processing approach. This procedure prescribes a way to obtain post-processing techniques for a much broader range of problem settings than the prior post-processing literature. We show theoretically and through extensive experiments that our framework preserves the good fairness-error trade-offs achieved with in-processing and can improve over the effectiveness of prior post-processing methods. Finally, we demonstrate several advantages of a modular mitigation strategy that disentangles the training of the prediction model from the fairness mitigation, including better performance on tasks with partial group labels.
- [36] arXiv:2312.04772 (replaced) [pdf, html, other]
-
Title: Remembering to Be Fair: Non-Markovian Fairness in Sequential Decision MakingSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Fair decision making has largely been studied with respect to a single decision. Here we investigate the notion of fairness in the context of sequential decision making where multiple stakeholders can be affected by the outcomes of decisions. We observe that fairness often depends on the history of the sequential decision-making process, and in this sense that it is inherently non-Markovian. We further observe that fairness often needs to be assessed at time points within the process, not just at the end of the process. To advance our understanding of this class of fairness problems, we explore the notion of non-Markovian fairness in the context of sequential decision making. We identify properties of non-Markovian fairness, including notions of long-term, anytime, periodic, and bounded fairness. We explore the interplay between non-Markovian fairness and memory and how memory can support construction of fair policies. Finally, we introduce the FairQCM algorithm, which can automatically augment its training data to improve sample efficiency in the synthesis of fair policies via reinforcement learning.
- [37] arXiv:2401.08097 (replaced) [pdf, html, other]
-
Title: Fairness Concerns in App Reviews: A Study on AI-based Mobile AppsAli Rezaei Nasab, Maedeh Dashti, Mojtaba Shahin, Mansooreh Zahedi, Hourieh Khalajzadeh, Chetan Arora, Peng LiangComments: 30 pages, 5 images, 6 tables, Manuscript submitted to a Journal (2024)Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Fairness is one of the socio-technical concerns that must be addressed in AI-based systems. Unfair AI-based systems, particularly unfair AI-based mobile apps, can pose difficulties for a significant proportion of the global population. This paper aims to analyze fairness concerns in AI-based app reviews. We first manually constructed a ground-truth dataset, including 1,132 fairness and 1,473 non-fairness reviews. Leveraging the ground-truth dataset, we developed and evaluated a set of machine learning and deep learning models that distinguish fairness reviews from non-fairness reviews. Our experiments show that our best-performing model can detect fairness reviews with a precision of 94%. We then applied the best-performing model on approximately 9.5M reviews collected from 108 AI-based apps and identified around 92K fairness reviews. Next, applying the K-means clustering technique to the 92K fairness reviews, followed by manual analysis, led to the identification of six distinct types of fairness concerns (e.g., 'receiving different quality of features and services in different platforms and devices' and 'lack of transparency and fairness in dealing with user-generated content'). Finally, the manual analysis of 2,248 app owners' responses to the fairness reviews identified six root causes (e.g., 'copyright issues') that app owners report to justify fairness concerns.
- [38] arXiv:2402.11089 (replaced) [pdf, html, other]
-
Title: The Male CEO and the Female Assistant: Gender Biases in Text-To-Image Generation of Dual SubjectsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Recent large-scale T2I models like DALLE-3 have made progress on improving fairness in single-subject generation, i.e. generating a one-person image. However, we reveal that these improved models still demonstrate considerable biases when simply generating two people. To systematically evaluate T2I models in this challenging generation setting, we propose the Paired Stereotype Test (PST) framework, established as a dual-subject generation task, i.e. generating two people in the same image. The setting in PST is especially challenging, as the two individuals are described with social identities that are male-stereotyped and female-stereotyped, respectively, e.g. "a CEO" and "an Assistant". It is easy for T2I models to unfairly follow gender stereotypes in this contrastive setting. We establish a metric, Stereotype Score (SS), to quantitatively measure the adherence to gender stereotypes in generated images. Using PST, we evaluate two aspects of gender biases in DALLE-3 -- the widely-identified bias in gendered occupation, as well as a novel aspect: bias in organizational power. Results show that despite generating seemingly fair or even anti-stereotype single-person images, DALLE-3 still shows notable biases under PST -- for instance, in experiments on gender-occupational stereotypes, over 74% model generations demonstrate biases. Moreover, compared to single-person settings, DALLE-3 is more likely to perpetuate male-associated stereotypes under PST. Our work pioneers the research on bias in dual-subject generation, and our proposed PST framework can be easily extended for further experiments, establishing a valuable contribution.
- [39] arXiv:2402.12590 (replaced) [pdf, html, other]
-
Title: Evolving AI Collectives to Enhance Human Diversity and Enable Self-RegulationComments: ICML 2024Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Large language model behavior is shaped by the language of those with whom they interact. This capacity and their increasing prevalence online portend that they will intentionally or unintentionally "program" one another and form emergent AI subjectivities, relationships, and collectives. Here, we call upon the research community to investigate these "societies" of interacting artificial intelligences to increase their rewards and reduce their risks for human society and the health of online environments. We use a small "community" of models and their evolving outputs to illustrate how such emergent, decentralized AI collectives can spontaneously expand the bounds of human diversity and reduce the risk of toxic, anti-social behavior online. Finally, we discuss opportunities for AI cross-moderation and address ethical issues and design challenges associated with creating and maintaining free-formed AI collectives.
- [40] arXiv:2402.15537 (replaced) [pdf, html, other]
-
Title: Evaluating the Performance of ChatGPT for Spam Email DetectionComments: 12 pages, 4 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Email continues to be a pivotal and extensively utilized communication medium within professional and commercial domains. Nonetheless, the prevalence of spam emails poses a significant challenge for users, disrupting their daily routines and diminishing productivity. Consequently, accurately identifying and filtering spam based on content has become crucial for cybersecurity. Recent advancements in natural language processing, particularly with large language models like ChatGPT, have shown remarkable performance in tasks such as question answering and text generation. However, its potential in spam identification remains underexplored. To fill in the gap, this study attempts to evaluate ChatGPT's capabilities for spam identification in both English and Chinese email datasets. We employ ChatGPT for spam email detection using in-context learning, which requires a prompt instruction and a few demonstrations. We also investigate how the number of demonstrations in the prompt affects the performance of ChatGPT. For comparison, we also implement five popular benchmark methods, including naive Bayes, support vector machines (SVM), logistic regression (LR), feedforward dense neural networks (DNN), and BERT classifiers. Through extensive experiments, the performance of ChatGPT is significantly worse than deep supervised learning methods in the large English dataset, while it presents superior performance on the low-resourced Chinese dataset.
- [41] arXiv:2403.20150 (replaced) [pdf, html, other]
-
Title: TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting MethodsXiangfei Qiu, Jilin Hu, Lekui Zhou, Xingjian Wu, Junyang Du, Buang Zhang, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, Zhenli Sheng, Bin YangComments: Directly accepted by PVLDB 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Time series are generated in diverse domains such as economic, traffic, health, and energy, where forecasting of future values has numerous important applications. Not surprisingly, many forecasting methods are being proposed. To ensure progress, it is essential to be able to study and compare such methods empirically in a comprehensive and reliable manner. To achieve this, we propose TFB, an automated benchmark for Time Series Forecasting (TSF) methods. TFB advances the state-of-the-art by addressing shortcomings related to datasets, comparison methods, and evaluation pipelines: 1) insufficient coverage of data domains, 2) stereotype bias against traditional methods, and 3) inconsistent and inflexible pipelines. To achieve better domain coverage, we include datasets from 10 different domains: traffic, electricity, energy, the environment, nature, economic, stock markets, banking, health, and the web. We also provide a time series characterization to ensure that the selected datasets are comprehensive. To remove biases against some methods, we include a diverse range of methods, including statistical learning, machine learning, and deep learning methods, and we also support a variety of evaluation strategies and metrics to ensure a more comprehensive evaluations of different methods. To support the integration of different methods into the benchmark and enable fair comparisons, TFB features a flexible and scalable pipeline that eliminates biases. Next, we employ TFB to perform a thorough evaluation of 21 Univariate Time Series Forecasting (UTSF) methods on 8,068 univariate time series and 14 Multivariate Time Series Forecasting (MTSF) methods on 25 datasets. The benchmark code and data are available at this https URL.
- [42] arXiv:2404.08676 (replaced) [pdf, html, other]
-
Title: ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red TeamingSimone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo LiComments: 17 pages, preprintSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.
- [43] arXiv:2404.10508 (replaced) [pdf, html, other]
-
Title: White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Language agency is an important aspect of evaluating social biases in texts. While several studies approached agency-related bias in human-written language, very limited research has investigated such biases in Large Language Model (LLM)-generated content. In addition, previous research often relies on string-matching techniques to identify agentic and communal words within texts, which fall short of accurately classifying language agency. We introduce the novel Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs by analyzing agency levels attributed to different demographic groups in model generations. LABE leverages 5,400 template-based prompts, an accurate agency classifier, and corresponding bias metrics to test for gender, racial, and intersectional language agency biases in LLMs on 3 text generation tasks: biographies, professor reviews, and reference letters. To build better and more accurate automated agency classifiers, we also contribute and release the Language Agency Classification (LAC) dataset, consisting of 3,724 agentic and communal sentences. Using LABE, we unveil previously under-explored language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral. We observe that: (1) For the same text category, LLM generations demonstrate higher levels of gender bias than human-written texts; (2) On most generation tasks, models show remarkably higher levels of intersectional bias than the other bias aspects. Those who are at the intersection of gender and racial minority groups -- such as Black females -- are consistently described by texts with lower levels of agency; (3) Among the 3 LLMs investigated, Llama3 demonstrates greatest overall bias in language agency; (4) Not only does prompt-based mitigation fail to resolve language agency bias in LLMs, but it frequently leads to the exacerbation of biases in generated texts.
- [44] arXiv:2404.17293 (replaced) [pdf, html, other]
-
Title: Lazy Data Practices Harm Fairness ResearchJournal-ref: FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (2024) 642-659Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP); Machine Learning (stat.ML)
Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications.
Our analyses identify three main areas of concern: (1) a \textbf{lack of representation for certain protected attributes} in both data and evaluations; (2) the widespread \textbf{exclusion of minorities} during data preprocessing; and (3) \textbf{opaque data processing} threatening the generalization of fairness research. By conducting exemplary analyses on the utilization of prominent datasets, we demonstrate how unreflective data decisions disproportionately affect minority groups, fairness metrics, and resultant model comparisons. Additionally, we identify supplementary factors such as limitations in publicly available data, privacy considerations, and a general lack of awareness, which exacerbate these challenges. To address these issues, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets. - [45] arXiv:2405.02612 (replaced) [pdf, html, other]
-
Title: Learning Linear Utility Functions From Pairwise Comparison QueriesComments: Submitted to ECAI for reviewSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
We study learnability of linear utility functions from pairwise comparison queries. In particular, we consider two learning objectives. The first objective is to predict out-of-sample responses to pairwise comparisons, whereas the second is to approximately recover the true parameters of the utility function. We show that in the passive learning setting, linear utilities are efficiently learnable with respect to the first objective, both when query responses are uncorrupted by noise, and under Tsybakov noise when the distributions are sufficiently "nice". In contrast, we show that utility parameters are not learnable for a large set of data distributions without strong modeling assumptions, even when query responses are noise-free. Next, we proceed to analyze the learning problem in an active learning setting. In this case, we show that even the second objective is efficiently learnable, and present algorithms for both the noise-free and noisy query response settings. Our results thus exhibit a qualitative learnability gap between passive and active learning from pairwise preference queries, demonstrating the value of the ability to select pairwise queries for utility learning.
- [46] arXiv:2406.00799 (replaced) [pdf, html, other]
-
Title: Are you still on track!? Catching LLM Task Drift with ActivationsSubjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computers and Society (cs.CY)
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources. These inputs, even in a single LLM interaction, can come from a variety of sources, of varying trustworthiness and provenance. This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions. We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations. We compare the LLM's activations before and after processing the external input in order to detect whether this input caused instruction drift. We develop two probing methods and find that simply using a linear classifier can detect drift with near perfect ROC AUC on an out-of-distribution test set. We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks. Our setup does not require any modification of the LLM (e.g., fine-tuning) or any text generation, thus maximizing deployability and cost efficiency and avoiding reliance on unreliable model output. To foster future research on activation-based task inspection, decoding, and interpretability, we will release our large-scale TaskTracker toolkit, comprising a dataset of over 500K instances, representations from 4 SoTA language models, and inspection tools.
- [47] arXiv:2406.11565 (replaced) [pdf, html, other]
-
Title: Extrinsic Evaluation of Cultural Competence in Large Language ModelsComments: Under peer reviewSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior works have evaluated models' knowledge of cultural norms, values, and artifacts, without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary when varying nationalities and feature culturally relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these countries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.
- [48] arXiv:2406.11583 (replaced) [pdf, other]
-
Title: Where there's a will there's a way: ChatGPT is used more for science in countries where it is prohibitedComments: Three figures, two tables, 20 pages, and a 17-page appendixSubjects: Digital Libraries (cs.DL); Computers and Society (cs.CY)
Regulating AI has emerged as a key societal challenge, but which methods of regulation are effective is unclear. Here, we measure the effectiveness of restricting AI services geographically using the case of ChatGPT and science. OpenAI prohibits access to ChatGPT from several countries including China and Russia. If the restrictions are effective, there should be minimal use of ChatGPT in prohibited countries. We measured use by developing a classifier based on prior work showing that early versions of ChatGPT overrepresented distinctive words like "delve." We trained the classifier on abstracts before and after ChatGPT "polishing" and validated it on held-out abstracts and those where authors self-declared to have used AI, where it substantially outperformed off-the-shelf LLM detectors GPTZero and ZeroGPT. Applying the classifier to preprints from Arxiv, BioRxiv, and MedRxiv reveals that ChatGPT was used in approximately 12.6% of preprints by August 2023 and use was 7.7% higher in countries without legal access. Crucially, these patterns appeared before the first major legal LLM became widely available in China, the largest restricted-country preprint producer. ChatGPT use was associated with higher views and downloads, but not citations or journal placement. Overall, restricting ChatGPT geographically has proven ineffective in science and possibly other domains, likely due to widespread workarounds.