Computer Science
See recent articles
Showing new listings for Friday, 15 November 2024
- [1] arXiv:2411.08881 [pdf, html, other]
-
Title: Can We Trust AI Agents? An Experimental Study Towards Trustworthy LLM-Based Multi-Agent Systems for AI EthicsJosé Antonio Siqueira de Cerqueira, Mamia Agbese, Rebekah Rousi, Nannan Xi, Juho Hamari, Pekka AbrahamssonSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
AI-based systems, including Large Language Models (LLMs), impact millions by supporting diverse tasks but face issues like misinformation, bias, and misuse. Ethical AI development is crucial as new technologies and concerns emerge, but objective, practical ethical guidance remains debated. This study examines LLMs in developing ethical AI systems, assessing how trustworthiness-enhancing techniques affect ethical AI output generation. Using the Design Science Research (DSR) method, we identify techniques for LLM trustworthiness: multi-agents, distinct roles, structured communication, and multiple rounds of debate. We design the multi-agent prototype LLM-BMAS, where agents engage in structured discussions on real-world ethical AI issues from the AI Incident Database. The prototype's performance is evaluated through thematic analysis, hierarchical clustering, ablation studies, and source code execution. Our system generates around 2,000 lines per run, compared to only 80 lines in the ablation study. Discussions reveal terms like bias detection, transparency, accountability, user consent, GDPR compliance, fairness evaluation, and EU AI Act compliance, showing LLM-BMAS's ability to generate thorough source code and documentation addressing often-overlooked ethical AI issues. However, practical challenges in source code integration and dependency management may limit smooth system adoption by practitioners. This study aims to shed light on enhancing trustworthiness in LLMs to support practitioners in developing ethical AI-based systems.
- [2] arXiv:2411.08882 [pdf, html, other]
-
Title: A Novel Multimodal System to Predict Agitation in People with Dementia Within Clinical Settings: A Proof of ConceptSubjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Dementia is a neurodegenerative condition that combines several diseases and impacts millions around the world and those around them. Although cognitive impairment is profoundly disabling, it is the noncognitive features of dementia, referred to as Neuropsychiatric Symptoms (NPS), that are most closely associated with a diminished quality of life. Agitation and aggression (AA) in people living with dementia (PwD) contribute to distress and increased healthcare demands. Current assessment methods rely on caregiver intervention and reporting of incidents, introducing subjectivity and bias. Artificial Intelligence (AI) and predictive algorithms offer a potential solution for detecting AA episodes in PwD when utilized in real-time. We present a 5-year study system that integrates a multimodal approach, utilizing the EmbracePlus wristband and a video detection system to predict AA in severe dementia patients. We conducted a pilot study with three participants at the Ontario Shores Mental Health Institute to validate the functionality of the system. The system collects and processes raw and digital biomarkers from the EmbracePlus wristband to accurately predict AA. The system also detected pre-agitation patterns at least six minutes before the AA event, which was not previously discovered from the EmbracePlus wristband. Furthermore, the privacy-preserving video system uses a masking tool to hide the features of the people in frames and employs a deep learning model for AA detection. The video system also helps identify the actual start and end time of the agitation events for labeling. The promising results of the preliminary data analysis underscore the ability of the system to predict AA events. The ability of the proposed system to run autonomously in real-time and identify AA and pre-agitation symptoms without external assistance represents a significant milestone in this research field.
- [3] arXiv:2411.08883 [pdf, html, other]
-
Title: KisanQRS: A Deep Learning-based Automated Query-Response System for Agricultural Decision-MakingSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Delivering prompt information and guidance to farmers is critical in agricultural decision-making. Farmers helpline centres are heavily reliant on the expertise and availability of call centre agents, leading to inconsistent quality and delayed responses. To this end, this article presents Kisan Query Response System (KisanQRS), a Deep Learning-based robust query-response framework for the agriculture sector. KisanQRS integrates semantic and lexical similarities of farmers queries and employs a rapid threshold-based clustering method. The clustering algorithm is based on a linear search technique to iterate through all queries and organize them into clusters according to their similarity. For query mapping, LSTM is found to be the optimal method. Our proposed answer retrieval method clusters candidate answers for a crop, ranks these answer clusters based on the number of answers in a cluster, and selects the leader of each cluster. The dataset used in our analysis consists of a subset of 34 million call logs from the Kisan Call Centre (KCC), operated under the Government of India. We evaluated the performance of the query mapping module on the data of five major states of India with 3,00,000 samples and the quantifiable outcomes demonstrate that KisanQRS significantly outperforms traditional techniques by achieving 96.58% top F1-score for a state. The answer retrieval module is evaluated on 10,000 samples and it achieves a competitive NDCG score of 96.20%. KisanQRS is useful in enabling farmers to make informed decisions about their farming practices by providing quick and pertinent responses to their queries.
- [4] arXiv:2411.08884 [pdf, html, other]
-
Title: Quantifying Risk Propensities of Large Language Models: Ethical Focus and Bias Detection through Role-PlaySubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
As Large Language Models (LLMs) become more prevalent, concerns about their safety, ethics, and potential biases have risen. Systematically evaluating LLMs' risk decision-making tendencies and attitudes, particularly in the ethical domain, has become crucial. This study innovatively applies the Domain-Specific Risk-Taking (DOSPERT) scale from cognitive science to LLMs and proposes a novel Ethical Decision-Making Risk Attitude Scale (EDRAS) to assess LLMs' ethical risk attitudes in depth. We further propose a novel approach integrating risk scales and role-playing to quantitatively evaluate systematic biases in LLMs. Through systematic evaluation and analysis of multiple mainstream LLMs, we assessed the "risk personalities" of LLMs across multiple domains, with a particular focus on the ethical domain, and revealed and quantified LLMs' systematic biases towards different groups. This research helps understand LLMs' risk decision-making and ensure their safe and reliable application. Our approach provides a tool for identifying and mitigating biases, contributing to fairer and more trustworthy AI systems. The code and data are available.
- [5] arXiv:2411.08885 [pdf, html, other]
-
Title: Enhancing Lie Detection Accuracy: A Comparative Study of Classic ML, CNN, and GCN Models using Audio-Visual FeaturesComments: 11 pages, 18 figuresSubjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Inaccuracies in polygraph tests often lead to wrongful convictions, false information, and bias, all of which have significant consequences for both legal and political systems. Recently, analyzing facial micro-expressions has emerged as a method for detecting deception; however, current models have not reached high accuracy and generalizability. The purpose of this study is to aid in remedying these problems. The unique multimodal transformer architecture used in this study improves upon previous approaches by using auditory inputs, visual facial micro-expressions, and manually transcribed gesture annotations, moving closer to a reliable non-invasive lie detection model. Visual and auditory features were extracted using the Vision Transformer and OpenSmile models respectively, which were then concatenated with the transcriptions of participants micro-expressions and gestures. Various models were trained for the classification of lies and truths using these processed and concatenated features. The CNN Conv1D multimodal model achieved an average accuracy of 95.4%. However, further research is still required to create higher-quality datasets and even more generalized models for more diverse applications.
- [6] arXiv:2411.08888 [pdf, html, other]
-
Title: Exploring Capabilities of Time Series Foundation Models in Building AnalyticsComments: 7 pages, 1 figures, and 4 tablesSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
The growing integration of digitized infrastructure with Internet of Things (IoT) networks has transformed the management and optimization of building energy consumption. By leveraging IoT-based monitoring systems, stakeholders such as building managers, energy suppliers, and policymakers can make data-driven decisions to improve energy efficiency. However, accurate energy forecasting and analytics face persistent challenges, primarily due to the inherent physical constraints of buildings and the diverse, heterogeneous nature of IoT-generated data. In this study, we conduct a comprehensive benchmarking of two publicly available IoT datasets, evaluating the performance of time series foundation models in the context of building energy analytics. Our analysis shows that single-modal models demonstrate significant promise in overcoming the complexities of data variability and physical limitations in buildings, with future work focusing on optimizing multi-modal models for sustainable energy management.
- [7] arXiv:2411.08889 [pdf, html, other]
-
Title: Multilingual Standalone Trustworthy Voice-Based Social Network for Disaster SituationsComments: Accepted for publication in IEEE UEMCON 2024, to appear in December 2024. 7 pages, 3 figuresSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In disaster scenarios, effective communication is crucial, yet language barriers often hinder timely and accurate information dissemination, exacerbating vulnerabilities and complicating response efforts. This paper presents a novel, multilingual, voice-based social network specifically designed to address these challenges. The proposed system integrates advanced artificial intelligence (AI) with blockchain technology to enable secure, asynchronous voice communication across multiple languages. The application operates independently of external servers, ensuring reliability even in compromised environments by functioning offline through local networks. Key features include AI-driven real-time translation of voice messages, ensuring seamless cross-linguistic communication, and blockchain-enabled storage for secure, immutable records of all interactions, safeguarding message integrity. Designed for cross-platform use, the system offers consistent performance across devices, from mobile phones to desktops, making it highly adaptable in diverse disaster situations. Evaluation metrics demonstrate high accuracy in speech recognition and translation, low latency, and user satisfaction, validating the system's effectiveness in enhancing communication during crises. This solution represents a significant advancement in disaster communication, bridging language gaps to support more inclusive and efficient emergency response.
- [8] arXiv:2411.08890 [pdf, other]
-
Title: Spotlight Session on Autonomous Weapons Systems at ICRC 34th International ConferenceComments: 8 pages, 2415 words, 1 figure. Panelist notes for the Spotlight Session on Autonomous Weapons Systems at the ICRC 34th International Conference 28-31 Oct 2024Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Autonomous weapons systems (AWS) change the way humans make decisions, the effect of those decisions and who is accountable for decisions made. We must remain vigilant, informed and human-centred as we tackle our deliberations on developing norms regarding their development, use and justification. Ways to enhance compliance in international humanitarian law (IHL) include: Training weapons decision makers in IHL; developing best practice in weapons reviews including requirements for industry to ensure that any new weapon, means or method of warfare is capable of being used lawfully; develop human-centred test and evaluation methods; invest in digital infrastructure to increase knowledge of the civilian environment in a conflict and its dynamics; invest in research on the real effects and consequences of civilian harms to the achievement of military and political objectives; improve secure communications between stakeholders in a conflict; and finally to upskill governments and NGOs in what is technically achievable with emerging technologies so that they can contribute to system requirements, test and evaluation protocols and operational rules of use and engagement. Governments are responsible for setting requirements for weapons systems. They are responsible for driving ethicality as well as lethality. Governments can require systems to be made and used to better protect civilians and protected objects. The UN can advocate for compliance with IHL, human rights, human-centred use of weapons systems and improved mechanisms to monitor and trace military decision making including those decisions affected by autonomous functionality.
- [9] arXiv:2411.08891 [pdf, html, other]
-
Title: Calibrated Decision-Making through LLM-Assisted RetrievalSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Recently, large language models (LLMs) have been increasingly used to support various decision-making tasks, assisting humans in making informed decisions. However, when LLMs confidently provide incorrect information, it can lead humans to make suboptimal decisions. To prevent LLMs from generating incorrect information on topics they are unsure of and to improve the accuracy of generated content, prior works have proposed Retrieval Augmented Generation (RAG), where external documents are referenced to generate responses. However, traditional RAG methods focus only on retrieving documents most relevant to the input query, without specifically aiming to ensure that the human user's decisions are well-calibrated. To address this limitation, we propose a novel retrieval method called Calibrated Retrieval-Augmented Generation (CalibRAG), which ensures that decisions informed by the retrieved documents are well-calibrated. Then we empirically validate that CalibRAG improves calibration performance as well as accuracy, compared to other baselines across various datasets.
- [10] arXiv:2411.08892 [pdf, other]
-
Title: Auto-assessment of assessment: A conceptual framework towards fulfilling the policy gaps in academic assessment practicesWasiq Khan, Luke K. Topham, Peter Atherton, Raghad Al-Shabandar, Hoshang Kolivand, Iftikhar Khan, Abir HussainComments: 20 Pages, 5 Figures, submitted for journal peer-reviewSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Education is being transformed by rapid advances in Artificial Intelligence (AI), including emerging Generative Artificial Intelligence (GAI). Such technology can significantly support academics and students by automating monotonous tasks and making personalised suggestions. However, despite the potential of the technology, there are significant concerns regarding AI misuse, particularly by students in assessments. There are two schools of thought: one advocates for a complete ban on it, while the other views it as a valuable educational tool, provided it is governed by a robust usage policy. This contradiction clearly indicates a major policy gap in academic practices, and new policies are required to uphold academic standards while enabling staff and students to benefit from technological advancements. We surveyed 117 academics from three countries (UK, UAE, and Iraq), and identified that most academics retain positive opinions regarding AI in education. For example, the majority of experienced academics do not favour complete bans, and they see the potential benefits of AI for students, teaching staff, and academic institutions. Importantly, academics specifically identified the particular benefits of AI for autonomous assessment (71.79% of respondents agreed). Therefore, for the first time, we propose a novel AI framework for autonomously evaluating students' work (e.g., reports, coursework, etc.) and automatically assigning grades based on their knowledge and in-depth understanding of the submitted content. The survey results further highlight a significant lack of awareness of modern AI-based tools (e.g., ChatGPT) among experienced academics, a gap that must be addressed to uphold educational standards.
- [11] arXiv:2411.08894 [pdf, html, other]
-
Title: Temporal Patterns of Multiple Long-Term Conditions in Welsh Individuals with Intellectual Disabilities: An Unsupervised Clustering Approach to Disease TrajectoriesRania Kousovista, Georgina Cosma, Emeka Abakasanga, Ashley Akbari, Francesco Zaccardi, Gyuchan Thomas Jun, Reza Kiani, Satheesh GangadharanSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)
Identifying and understanding the co-occurrence of multiple long-term conditions (MLTC) in individuals with intellectual disabilities (ID) is vital for effective healthcare management. These individuals often face earlier onset and higher prevalence of MLTCs, yet specific co-occurrence patterns remain unexplored. This study applies an unsupervised approach to characterise MLTC clusters based on shared disease trajectories using electronic health records (EHRs) from 13069 individuals with ID in Wales (2000-2021). The population consisted of 52.3% males and 47.7% females, with an average of 4.5 conditions per patient. Disease associations and temporal directionality were assessed, followed by spectral clustering to group shared trajectories. Males under 45 formed a single cluster dominated by neurological conditions (32.4%), while males above 45 had three clusters, the largest featuring circulatory conditions (51.8%). Females under 45 formed one cluster with digestive conditions (24.6%) as most prevalent, while those aged 45 and older showed two clusters: one dominated by circulatory conditions (34.1%), and the other by digestive (25.9%) and musculoskeletal (21.9%) issues. Mental illness, epilepsy, and reflux were common across groups. Individuals above 45 had higher rates of circulatory and musculoskeletal issues. These clusters offer insights into disease progression in individuals with ID, informing targeted interventions and personalised healthcare strategies.
- [12] arXiv:2411.08895 [pdf, html, other]
-
Title: Performance-Complexity-Latency Trade-offs of Concatenated RS-SDBCH CodesComments: Submitted to IEEE Journal of Lightwave Technology (JLT)Subjects: Information Theory (cs.IT)
Concatenated bit-interleaved and multilevel coded modulation with outer Reed--Solomon codes, inner Chase-algorithm-based soft-decision-decoded Bose--Ray-Chaudhuri--Hocquenghem codes, and four-level pulse amplitude modulation is considered. A semi-analytical formula is derived for estimating the decoded frame error rate (FER) at the output the additive white Gaussian noise channel, obviating the need for time-consuming Monte Carlo simulations. The formula is used to search a large space of codes (including the KP4 code) to find those achieving good trade-offs among performance (measured by the gap to the constrained Shannon limit at $10^{-13}$ FER), complexity (measured by the number of elementary decoder operations), and latency (measured by overall block length).
- [13] arXiv:2411.08897 [pdf, other]
-
Title: Comment on Is Complexity an Illusion?Comments: Comment on arXiv:2404.07227Subjects: Artificial Intelligence (cs.AI)
The paper "Is Complexity an Illusion?" (Bennett, 2024) provides a formalism for complexity, learning, inference, and generalization, and introduces a formal definition for a "policy". This reply shows that correct policies do not exist for a simple task of supervised multi-class classification, via mathematical proof and exhaustive search. Implications of this result are discussed, as well as possible responses and amendments to the theory.
- [14] arXiv:2411.08901 [pdf, html, other]
-
Title: SoccerGuard: Investigating Injury Risk Factors for Professional Soccer Players with Machine LearningSubjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
We present SoccerGuard, a novel framework for predicting injuries in women's soccer using Machine Learning (ML). This framework can ingest data from multiple sources, including subjective wellness and training load reports from players, objective GPS sensor measurements, third-party player statistics, and injury reports verified by medical personnel. We experiment with a number of different settings related to synthetic data generation, input and output window sizes, and ML models for prediction. Our results show that, given the right configurations and feature combinations, injury event prediction can be undertaken with considerable accuracy. The optimal results are achieved when input windows are reduced and larger combined output windows are defined, in combination with an ideally balanced data set. The framework also includes a dashboard with a user-friendly Graphical User Interface (GUI) to support interactive analysis and visualization.
- [15] arXiv:2411.08905 [pdf, html, other]
-
Title: Synthesis Method for Obtaining Characteristic Modes of Multi-Structure SystemsSubjects: Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
This paper introduces an efficient method of characteristic mode decomposition for multi-structure systems. Our approach leverages the translation and rotation matrices associated with vector spherical wavefunctions, enabling the synthesis of a total system's characteristic modes through independent simulation of each constituent structure. We simplify the computationally demanding translation problem by dividing it into three manageable sub-tasks: rotation, z-axis translation, and inverse rotation, which collectively enhance computational efficiency. Furthermore, this method facilitates the exploration of structural orientation effects without incurring additional computational overhead. To demonstrate the effectiveness of our approach, we present a series of compelling numerical examples that not only validate the accuracy of the method but also highlight its significant advantages.
- [16] arXiv:2411.08906 [pdf, html, other]
-
Title: Assessing the Auditability of AI-integrating Systems: A Framework and Learning Analytics Case StudySubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Audits contribute to the trustworthiness of Learning Analytics (LA) systems that integrate Artificial Intelligence (AI) and may be legally required in the future. We argue that the efficacy of an audit depends on the auditability of the audited system. Therefore, systems need to be designed with auditability in mind. We present a framework for assessing the auditability of AI-integrating systems that consists of three parts: (1) Verifiable claims about the validity, utility and ethics of the system, (2) Evidence on subjects (data, models or the system) in different types (documentation, raw sources and logs) to back or refute claims, (3) Evidence must be accessible to auditors via technical means (APIs, monitoring tools, explainable AI, etc.). We apply the framework to assess the auditability of Moodle's dropout prediction system and a prototype AI-based LA. We find that Moodle's auditability is limited by incomplete documentation, insufficient monitoring capabilities and a lack of available test data. The framework supports assessing the auditability of AI-based LA systems in use and improves the design of auditable systems and thus of audits.
- [17] arXiv:2411.08907 [pdf, html, other]
-
Title: From Simulators to Digital Twins for Enabling Emerging Cellular Networks: A Tutorial and SurveyMarvin Manalastas, Muhammad Umar Bin Farooq, Syed Muhammad Asad Zaidi, Haneya Naeem Qureshi, Yusuf Sambo, Ali ImranSubjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Simulators are indispensable parts of the research and development necessary to advance countless industries, including cellular networks. With simulators, the evaluation, analysis, testing, and experimentation of novel designs and algorithms can be executed in a more cost-effective and convenient manner without the risk of real network service disruption. Additionally, recent trends indicate that the advancement of these Digital System Models (DSM), such as system-level simulators, will hold a pivotal role in advancing cellular networks by facilitating the development of digital twins. Given this growing significance, in this survey and tutorial paper, we present an extensive review of the currently available DSMs for 5G and beyond (5G&B) networks. Specifically, we begin with a tutorial on the fundamental concepts of 5G&B network simulations, followed by an identification of the essential design requirements needed to model the key features of these networks. We also devised a taxonomy of different types of 5G&B network simulators. In contrast to existing simulator surveys, which mostly leverage traditional metrics applicable to legacy networks, we devise and use 5G-specific evaluation metrics that capture three key facets of a network simulator, namely realism, completeness, and computational efficiency. We evaluate each simulator according to the devised metrics to generate an applicability matrix that maps different 5G&B simulators vis-a-vis the different research themes they can potentially enable. We also present the current challenges in developing 5G&B simulators while laying out several potential solutions to address the issues. Finally, we discuss the future challenges related to simulator design provisions that will arise with the emergence of 6G networks.
- [18] arXiv:2411.08908 [pdf, other]
-
Title: Balancing Innovation and Sustainability: Addressing the Environmental Impact of Bitcoin MiningComments: 20 pages, 5 fuguresSubjects: Computers and Society (cs.CY)
This study explores the intersection of technological innovation and environmental sustainability in the context of Bitcoin mining. With Bitcoin's growing adoption, concerns surrounding the energy consumption and environmental impact of mining activities have intensified. The study examines the core process of Bitcoin mining, focusing on its energy-intensive proof-of-work mechanism, and provides a detailed analysis of its ecological footprint, especially in terms of carbon emissions and electronic waste. Various models estimate that Bitcoin's energy consumption rivals that of entire nations, highlighting serious sustainability concerns. To address these issues, the paper unearths potential technological innovations, such as energy-efficient mining hardware and the integration of renewable energy sources, as viable strategies to reduce environmental impact. Additionally, the study reviews current sustainability initiatives, including efforts to lower carbon footprints and manage electronic waste effectively. Regulatory developments and market-based approaches are also discussed as possible pathways to mitigate the environmental harm associated with Bitcoin mining. Ultimately, the paper advocates for a balanced approach that fosters technological innovation while promoting environmental responsibility, suggesting that, with appropriate policy and technological interventions, Bitcoin mining can evolve to be both innovative and sustainable.
- [19] arXiv:2411.08910 [pdf, html, other]
-
Title: Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended ResponsesSami Baral, Eamon Worden, Wen-Chiang Lim, Zhuang Luo, Christopher Santorelli, Ashish Gurung, Neil HeffernanComments: 12 pages including references, 4 figures, 9 tablesSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The effectiveness of feedback in enhancing learning outcomes is well documented within Educational Data Mining (EDM). Various prior research has explored methodologies to enhance the effectiveness of feedback. Recent developments in Large Language Models (LLMs) have extended their utility in enhancing automated feedback systems. This study aims to explore the potential of LLMs in facilitating automated feedback in math education. We examine the effectiveness of LLMs in evaluating student responses by comparing 3 different models: Llama, SBERT-Canberra, and GPT4 model. The evaluation requires the model to provide both a quantitative score and qualitative feedback on the student's responses to open-ended math problems. We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems. A similar approach was taken for training the SBERT model as well, while the GPT4 model used a zero-shot learning approach. We evaluate the model's performance in scoring accuracy and the quality of feedback by utilizing judgments from 2 teachers. The teachers utilized a shared rubric in assessing the accuracy and relevance of the generated feedback. We conduct both quantitative and qualitative analyses of the model performance. By offering a detailed comparison of these methods, this study aims to further the ongoing development of automated feedback systems and outlines potential future directions for leveraging generative LLMs to create more personalized learning experiences.
- [20] arXiv:2411.08912 [pdf, html, other]
-
Title: Can Large-Language Models Help us Better Understand and Teach the Development of Energy-Efficient Software?Ryan Hasler (1), Konstantin Läufer (1), George K. Thiruvathukal (1), Huiyun Peng (2), Kyle Robinson (2), Kirsten Davis (2), Yung-Hsiang Lu (2), James C. Davis (2) ((1) Loyola University Chicago, (2) Purdue University)Subjects: Computers and Society (cs.CY); Software Engineering (cs.SE)
Computing systems are consuming an increasing and unsustainable fraction of society's energy footprint, notably in data centers. Meanwhile, energy-efficient software engineering techniques are often absent from undergraduate curricula. We propose to develop a learning module for energy-efficient software, suitable for incorporation into an undergraduate software engineering class. There is one major problem with such an endeavor: undergraduate curricula have limited space for mastering energy-related systems programming aspects. To address this problem, we propose to leverage the domain expertise afforded by large language models (LLMs). In our preliminary studies, we observe that LLMs can generate energy-efficient variations of basic linear algebra codes tailored to both ARM64 and AMD64 architectures, as well as unit tests and energy measurement harnesses. On toy examples suitable for classroom use, this approach reduces energy expenditure by 30-90%. These initial experiences give rise to our vision of LLM-based meta-compilers as a tool for students to transform high-level algorithms into efficient, hardware-specific implementations. Complementing this tooling, we will incorporate systems thinking concepts into the learning module so that students can reason both locally and globally about the effects of energy optimizations.
- [21] arXiv:2411.08913 [pdf, other]
-
Title: System Reliability Engineering in the Age of Industry 4.0: Challenges and InnovationsAntoine Tordeux, Tim M. Julitz, Isabelle Müller, Zikai Zhang, Jannis Pietruschka, Nicola Fricke, Nadine Schlüter, Stefan Bracke, Manuel LöwerComments: 34 pages, 19 figuresSubjects: Computers and Society (cs.CY)
In the era of Industry 4.0, system reliability engineering faces both challenges and opportunities. On the one hand, the complexity of cyber-physical systems, the integration of novel numerical technologies, and the handling of large amounts of data pose new difficulties for ensuring system reliability. On the other hand, innovations such as AI-driven prognostics, digital twins, and IoT-enabled systems enable the implementation of new methodologies that are transforming reliability engineering. Condition-based monitoring and predictive maintenance are examples of key advancements, leveraging real-time sensor data collection and AI to predict and prevent equipment failures. These approaches reduce failures and downtime, lower costs, and extend equipment lifespan and sustainability. However, it also brings challenges such as data management, integrating complexity, and the need for fast and accurate models and algorithms. Overall, the convergence of advanced technologies in Industry 4.0 requires a rethinking of reliability tasks, emphasising adaptability and real-time data processing. In this chapter, we propose to review recent innovations in the field, related methods and applications, as well as challenges and barriers that remain to be explored. In the red lane, we focus on smart manufacturing and automotive engineering applications with sensor-based monitoring and driver assistance systems.
- [22] arXiv:2411.08914 [pdf, other]
-
Title: Future-Proofing IoT: Unleashing the Power of AWS Greengrass in Propelling Smart Devices to New HeightsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The advent of edge computing is set to revolutionize cloud computing in various sectors, including Agriculture, Health, and more. AWS Greengrass Core Device plays a pivotal role in this transformative process by bridging connections between IoT devices by improving data sharing between them, unlocking new possibilities in Smart Home, Agriculture, Health, Vehicular Cloud, Smart City, Industry Automation, and beyond. However, these advancements also introduce novel challenges for testing and quality assurance in cloud computing. This paper explores the impact of AWS Greengrass in different fields, addressing challenges, opportunities, and potential benefits of edge and cloud computing in terms of processing speed, latency, and bandwidth usage.
- [23] arXiv:2411.08916 [pdf, html, other]
-
Title: Enhanced Secure Transmission of Medical Images through OFDM using Hyperchaotic SystemsComments: Fourth International Conference on Technological Advances in Electrical Engineering (ICTAEE 23), May 23-34 2023Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR); Image and Video Processing (eess.IV)
Orthogonal Frequency Division Multiplexing (OFDM) is a popular modulation technique for transmitting digital data over wireless radio channels, including medical images due to its high transmission capacity, low interference, bandwidth efficiency, and scalability. However, the security of medical images is a major concern, and combining OFDM with encryption techniques such as chaos-based image encryption can enhance security measures. This study proposes a secure medical image transmission system that combines OFDM, 6D hyperchaotic system, and Fibonacci Q-matrix and analyzes its impact on image transmission quality using simulation results obtained through MATLAB. The study examines the Q-PSK constellation diagram, fast Fourier transform (IFFT) signal, cyclic prefix (CP) techniques, NIST, signal noise ratio (SNR), and bit error rate (BER). The results provide insights into the effectiveness of OFDM in securely transmitting high-quality medical images.
- [24] arXiv:2411.08918 [pdf, html, other]
-
Title: Wireless Federated Learning over UAV-enabled Integrated Sensing and CommunicationComments: Accepted to IEEE Conference on Standards for Communications and Networking (CSCN), 6 pagesSubjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
This paper studies a new latency optimization problem in unmanned aerial vehicles (UAVs)-enabled federated learning (FL) with integrated sensing and communication. In this setup, distributed UAVs participate in model training using sensed data and collaborate with a base station (BS) serving as FL aggregator to build a global model. The objective is to minimize the FL system latency over UAV networks by jointly optimizing UAVs' trajectory and resource allocation of both UAVs and the BS. The formulated optimization problem is troublesome to solve due to its non-convexity. Hence, we develop a simple yet efficient iterative algorithm to find a high-quality approximate solution, by leveraging block coordinate descent and successive convex approximation techniques. Simulation results demonstrate the effectiveness of our proposed joint optimization strategy under practical parameter settings, saving the system latency up to 68.54\% compared to benchmark schemes.
- [25] arXiv:2411.08923 [pdf, html, other]
-
Title: Aligning Visual Contrastive learning models via Preference OptimizationAmirabbas Afzali, Borna Khodabandeh, Ali Rasekh, Mahyar JafariNodeh, Sepehr kazemi, Simon GottschalkSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Contrastive learning models have demonstrated impressive abilities to capture semantic similarities by aligning representations in the embedding space. However, their performance can be limited by the quality of the training data and its inherent biases. While Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been applied to generative models to align them with human preferences, their use in contrastive learning has yet to be explored. This paper introduces a novel method for training contrastive learning models using Preference Optimization (PO) to break down complex concepts. Our method systematically aligns model behavior with desired preferences, enhancing performance on the targeted task. In particular, we focus on enhancing model robustness against typographic attacks, commonly seen in contrastive models like CLIP. We further apply our method to disentangle gender understanding and mitigate gender biases, offering a more nuanced control over these sensitive attributes. Our experiments demonstrate that models trained using PO outperform standard contrastive learning techniques while retaining their ability to handle adversarial challenges and maintain accuracy on other downstream tasks. This makes our method well-suited for tasks requiring fairness, robustness, and alignment with specific preferences. We evaluate our method on several vision-language tasks, tackling challenges such as typographic attacks. Additionally, we explore the model's ability to disentangle gender concepts and mitigate gender bias, showcasing the versatility of our approach.
- [26] arXiv:2411.08924 [pdf, html, other]
-
Title: The Geometry of Codes for Random Access in DNA StorageSubjects: Information Theory (cs.IT); Combinatorics (math.CO)
Effective and reliable data retrieval is critical for the feasibility of DNA storage, and the development of random access efficiency plays a key role in its practicality and reliability. In this paper, we study the Random Access Problem, which asks to compute the expected number of samples one needs in order to recover an information strand. Unlike previous work, we took a geometric approach to the problem, aiming to understand which geometric structures lead to codes that perform well in terms of reducing the random access expectation (Balanced Quasi-Arcs). As a consequence, two main results are obtained. The first is a construction for $k=3$ that outperforms previous constructions aiming to reduce the random access expectation. The second, exploiting a result from [1], is the proof of a conjecture from [2] for rate $1/2$ codes in any dimension.
- [27] arXiv:2411.08925 [pdf, html, other]
-
Title: Retrieval of sun-induced plant fluorescence in the O$_2$-A absorption band from DESIS imageryJim Buffat, Miguel Pato, Kevin Alonso, Stefan Auer, Emiliano Carmona, Stefan Maier, Rupert Müller, Patrick Rademske, Uwe Rascher, Hanno ScharrComments: submitted to ECCV CVPPA 2024, 14 pages, 8 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
We provide the first method allowing to retrieve spaceborne SIF maps at 30 m ground resolution with a strong correlation ($r^2=0.6$) to high-quality airborne estimates of sun-induced fluorescence (SIF). SIF estimates can provide explanatory information for many tasks related to agricultural management and physiological studies. While SIF products from airborne platforms are accurate and spatially well resolved, the data acquisition of such products remains science-oriented and limited to temporally constrained campaigns. Spaceborne SIF products on the other hand are available globally with often sufficient revisit times. However, the spatial resolution of spaceborne SIF products is too small for agricultural applications. In view of ESA's upcoming FLEX mission we develop a method for SIF retrieval in the O$_2$-A band of hyperspectral DESIS imagery to provide first insights for spaceborne SIF retrieval at high spatial resolution. To this end, we train a simulation-based self-supervised network with a novel perturbation based regularizer and test performance improvements under additional supervised regularization of atmospheric variable prediction. In a validation study with corresponding HyPlant derived SIF estimates at 740 nm we find that our model reaches a mean absolute difference of 0.78 mW / nm / sr / m$^2$.
- [28] arXiv:2411.08930 [pdf, html, other]
-
Title: Structured Pattern Expansion with Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Recent advances in diffusion models have significantly improved the synthesis of materials, textures, and 3D shapes. By conditioning these models via text or images, users can guide the generation, reducing the time required to create digital assets. In this paper, we address the synthesis of structured, stationary patterns, where diffusion models are generally less reliable and, more importantly, less controllable.
Our approach leverages the generative capabilities of diffusion models specifically adapted for the pattern domain. It enables users to exercise direct control over the synthesis by expanding a partially hand-drawn pattern into a larger design while preserving the structure and details of the input. To enhance pattern quality, we fine-tune an image-pretrained diffusion model on structured patterns using Low-Rank Adaptation (LoRA), apply a noise rolling technique to ensure tileability, and utilize a patch-based approach to facilitate the generation of large-scale assets.
We demonstrate the effectiveness of our method through a comprehensive set of experiments, showing that it outperforms existing models in generating diverse, consistent patterns that respond directly to user input. - [29] arXiv:2411.08932 [pdf, html, other]
-
Title: PyGen: A Collaborative Human-AI Approach to Python Package CreationComments: 33 pages, 13 figuresSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The principles of automation and innovation serve as foundational elements for advancement in contemporary science and technology. Here, we introduce Pygen, an automation platform designed to empower researchers, technologists, and hobbyists to bring abstract ideas to life as core, usable software tools written in Python. Pygen leverages the immense power of autoregressive large language models to augment human creativity during the ideation, iteration, and innovation process. By combining state-of-the-art language models with open-source code generation technologies, Pygen has significantly reduced the manual overhead of tool development. From a user prompt, Pygen automatically generates Python packages for a complete workflow from concept to package generation and documentation. The findings of our work show that Pygen considerably enhances the researcher's productivity by enabling the creation of resilient, modular, and well-documented packages for various specialized purposes. We employ a prompt enhancement approach to distill the user's package description into increasingly specific and actionable. While being inherently an open-ended task, we have evaluated the generated packages and the documentation using Human Evaluation, LLM-based evaluation, and CodeBLEU, with detailed results in the results section. Furthermore, we documented our results, analyzed the limitations, and suggested strategies to alleviate them. Pygen is our vision of ethical automation, a framework that promotes inclusivity, accessibility, and collaborative development. This project marks the beginning of a large-scale effort towards creating tools where intelligent agents collaborate with humans to improve scientific and technological development substantially.
Our code and generated examples are open-sourced at [this https URL] - [30] arXiv:2411.08933 [pdf, html, other]
-
Title: Confidence-aware Denoised Fine-tuning of Off-the-shelf Models for Certified RobustnessComments: 26 pages; TMLR 2024; Code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
The remarkable advances in deep learning have led to the emergence of many off-the-shelf classifiers, e.g., large pre-trained models. However, since they are typically trained on clean data, they remain vulnerable to adversarial attacks. Despite this vulnerability, their superior performance and transferability make off-the-shelf classifiers still valuable in practice, demanding further work to provide adversarial robustness for them in a post-hoc manner. A recently proposed method, denoised smoothing, leverages a denoiser model in front of the classifier to obtain provable robustness without additional training. However, the denoiser often creates hallucination, i.e., images that have lost the semantics of their originally assigned class, leading to a drop in robustness. Furthermore, its noise-and-denoise procedure introduces a significant distribution shift from the original distribution, causing the denoised smoothing framework to achieve sub-optimal robustness. In this paper, we introduce Fine-Tuning with Confidence-Aware Denoised Image Selection (FT-CADIS), a novel fine-tuning scheme to enhance the certified robustness of off-the-shelf classifiers. FT-CADIS is inspired by the observation that the confidence of off-the-shelf classifiers can effectively identify hallucinated images during denoised smoothing. Based on this, we develop a confidence-aware training objective to handle such hallucinated images and improve the stability of fine-tuning from denoised images. In this way, the classifier can be fine-tuned using only images that are beneficial for adversarial robustness. We also find that such a fine-tuning can be done by updating a small fraction of parameters of the classifier. Extensive experiments demonstrate that FT-CADIS has established the state-of-the-art certified robustness among denoised smoothing methods across all $\ell_2$-adversary radius in various benchmarks.
- [31] arXiv:2411.08934 [pdf, html, other]
-
Title: Predicting household socioeconomic position in Mozambique using satellite and household imageryCarles Milà, Teodimiro Matsena, Edgar Jamisse, Jovito Nunes, Quique Bassat, Paula Petrone, Elisa Sicuri, Charfudin Sacoor, Cathryn TonneSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Many studies have predicted SocioEconomic Position (SEP) for aggregated spatial units such as villages using satellite data, but SEP prediction at the household level and other sources of imagery have not been yet explored. We assembled a dataset of 975 households in a semi-rural district in southern Mozambique, consisting of self-reported asset, expenditure, and income SEP data, as well as multimodal imagery including satellite images and a ground-based photograph survey of 11 household elements. We fine-tuned a convolutional neural network to extract feature vectors from the images, which we then used in regression analyzes to model household SEP using different sets of image types. The best prediction performance was found when modeling asset-based SEP using random forest models with all image types, while the performance for expenditure- and income-based SEP was lower. Using SHAP, we observed clear differences between the images with the largest positive and negative effects, as well as identified the most relevant household elements in the predictions. Finally, we fitted an additional reduced model using only the identified relevant household elements, which had an only slightly lower performance compared to models using all images. Our results show how ground-based household photographs allow to zoom in from an area-level to an individual household prediction while minimizing the data collection effort by using explainable machine learning. The developed workflow can be potentially integrated into routine household surveys, where the collected household imagery could be used for other purposes, such as refined asset characterization and environmental exposure assessment.
- [32] arXiv:2411.08935 [pdf, html, other]
-
Title: Classification of Keratitis from Eye Corneal Photographs using Deep LearningMaria Miguel Beirão, João Matos, Tiago Gonçalves, Camila Kase, Luis Filipe Nakayama, Denise de Freitas, Jaime S. CardosoComments: 6 pages; Accepted at IEEE's International Conference on Bioinformatics and Biomedicine (2024)Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keratitis is an inflammatory corneal condition responsible for 10% of visual impairment in low- and middle-income countries (LMICs), with bacteria, fungi, or amoeba as the most common infection etiologies. While an accurate and timely diagnosis is crucial for the selected treatment and the patients' sight outcomes, due to the high cost and limited availability of laboratory diagnostics in LMICs, diagnosis is often made by clinical observation alone, despite its lower accuracy. In this study, we investigate and compare different deep learning approaches to diagnose the source of infection: 1) three separate binary models for infection type predictions; 2) a multitask model with a shared backbone and three parallel classification layers (Multitask V1); and, 3) a multitask model with a shared backbone and a multi-head classification layer (Multitask V2). We used a private Brazilian cornea dataset to conduct the empirical evaluation. We achieved the best results with Multitask V2, with an area under the receiver operating characteristic curve (AUROC) confidence intervals of 0.7413-0.7740 (bacteria), 0.8395-0.8725 (fungi), and 0.9448-0.9616 (amoeba). A statistical analysis of the impact of patient features on models' performance revealed that sex significantly affects amoeba infection prediction, and age seems to affect fungi and bacteria predictions.
- [33] arXiv:2411.08937 [pdf, html, other]
-
Title: Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary HeadComments: PreprintSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Traditional knowledge distillation focuses on aligning the student's predicted probabilities with both ground-truth labels and the teacher's predicted probabilities. However, the transition to predicted probabilities from logits would obscure certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, for exploiting the latent information of logits. Unfortunately, we empirically find that the amalgamation of the newly introduced logit-level loss and the previous probability-level loss will lead to performance degeneration, even trailing behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on the neural collapse theory. Specifically, the gradients of the two loss functions exhibit contradictions in the linear classifier yet display no such conflict within the backbone. Drawing from the theoretical analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance against state-of-the-art counterparts.
- [34] arXiv:2411.08954 [pdf, html, other]
-
Title: Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better SamplesNoël Vouitsis, Rasa Hosseinzadeh, Brendan Leigh Ross, Valentin Villecroze, Satya Krishna Gorti, Jesse C. Cresswell, Gabriel Loaiza-GanemComments: NeurIPS 2024 ATTRIB WorkshopSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Although diffusion models can generate remarkably high-quality samples, they are intrinsically bottlenecked by their expensive iterative sampling procedure. Consistency models (CMs) have recently emerged as a promising diffusion model distillation method, reducing the cost of sampling by generating high-fidelity samples in just a few iterations. Consistency model distillation aims to solve the probability flow ordinary differential equation (ODE) defined by an existing diffusion model. CMs are not directly trained to minimize error against an ODE solver, rather they use a more computationally tractable objective. As a way to study how effectively CMs solve the probability flow ODE, and the effect that any induced error has on the quality of generated samples, we introduce Direct CMs, which \textit{directly} minimize this error. Intriguingly, we find that Direct CMs reduce the ODE solving error compared to CMs but also result in significantly worse sample quality, calling into question why exactly CMs work well in the first place. Full code is available at: this https URL.
- [35] arXiv:2411.08968 [pdf, html, other]
-
Title: Sparse Upcycling: Inference Inefficient FinetuningComments: 12 pages, 4 figures, To appear in the 4th NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP), 2024Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Small, highly trained, open-source large language models are widely used due to their inference efficiency, but further improving their quality remains a challenge. Sparse upcycling is a promising approach that transforms a pretrained dense model into a Mixture-of-Experts (MoE) architecture, increasing the model's parameter count and quality. In this work, we compare the effectiveness of sparse upcycling against continued pretraining (CPT) across different model sizes, compute budgets, and pretraining durations. Our experiments show that sparse upcycling can achieve better quality, with improvements of over 20% relative to CPT in certain scenarios. However, this comes with a significant inference cost, leading to 40% slowdowns in high-demand inference settings for larger models. Our findings highlight the trade-off between model quality and inference efficiency, offering insights for practitioners seeking to balance model quality and deployment constraints.
- [36] arXiv:2411.08972 [pdf, other]
-
Title: Designing Automated Market Makers for Combinatorial Securities: A Geometric ViewpointComments: 38 pages, 1 figure, accepted at SODA'25Subjects: Computer Science and Game Theory (cs.GT); Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS)
Designing automated market makers (AMMs) for prediction markets on combinatorial securities over large outcome spaces poses significant computational challenges. Prior research has primarily focused on combinatorial prediction markets within specific set systems (e.g., intervals, permutations). We introduce a framework for designing AMMs on arbitrary set systems by building a novel connection to the range query problem in computational geometry. This connection enables the analysis of computational complexity and the design of efficient AMMs.
We first demonstrate the equivalence between price queries and trade updates under the popular combinatorial logarithmic market scoring rule market and the range query and range update problem. Building on this equivalence, we construct sublinear time algorithms when the VC dimension of the set system is bounded and show the non-existence of such algorithms for unbounded VC dimension cases. We then extend this approach to AMMs for combinatorial prediction markets with quadratic and power scoring rules. Finally, we show that the multi-resolution market design can be naturally integrated into the partition-tree scheme.
Additionally, we introduce the combinatorial swap operation problem for automated market makers in decentralized finance and show that it can be efficiently reduced to range update problems. - [37] arXiv:2411.08977 [pdf, html, other]
-
Title: Robustness and Confounders in the Demographic Alignment of LLMs with Human Perceptions of OffensivenessComments: 18 pages, 8 figures, ACL'25Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Large language models (LLMs) are known to exhibit demographic biases, yet few studies systematically evaluate these biases across multiple datasets or account for confounding factors. In this work, we examine LLM alignment with human annotations in five offensive language datasets, comprising approximately 220K annotations. Our findings reveal that while demographic traits, particularly race, influence alignment, these effects are inconsistent across datasets and often entangled with other factors. Confounders -- such as document difficulty, annotator sensitivity, and within-group agreement -- account for more variation in alignment patterns than demographic traits alone. Specifically, alignment increases with higher annotator sensitivity and group agreement, while greater document difficulty corresponds to reduced alignment. Our results underscore the importance of multi-dataset analyses and confounder-aware methodologies in developing robust measures of demographic bias in LLMs.
- [38] arXiv:2411.08979 [pdf, html, other]
-
Title: CoCoP: Enhancing Text Classification with LLM through Code Completion PromptSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Text classification is a fundamental task in natural language processing (NLP), and large language models (LLMs) have demonstrated their capability to perform this task across various domains. However, the performance of LLMs heavily depends on the quality of their input prompts. Recent studies have also shown that LLMs exhibit remarkable results in code-related tasks. To leverage the capabilities of LLMs in text classification, we propose the Code Completion Prompt (CoCoP) method, which transforms the text classification problem into a code completion task. CoCoP significantly improves text classification performance across diverse datasets by utilizing LLMs' code-completion capability. For instance, CoCoP enhances the accuracy of the SST2 dataset by more than 20%. Moreover, when CoCoP integrated with LLMs specifically designed for code-related tasks (code models), such as CodeLLaMA, this method demonstrates better or comparable performance to few-shot learning techniques while using only one-tenth of the model size. The source code of our proposed method will be available to the public upon the acceptance of the paper.
- [39] arXiv:2411.08981 [pdf, html, other]
-
Title: Reliability, Resilience and Human Factors Engineering for Trustworthy AI SystemsSubjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
As AI systems become integral to critical operations across industries and services, ensuring their reliability and safety is essential. We offer a framework that integrates established reliability and resilience engineering principles into AI systems. By applying traditional metrics such as failure rate and Mean Time Between Failures (MTBF) along with resilience engineering and human reliability analysis, we propose an integrate framework to manage AI system performance, and prevent or efficiently recover from failures. Our work adapts classical engineering methods to AI systems and outlines a research agenda for future technical studies. We apply our framework to a real-world AI system, using system status data from platforms such as openAI, to demonstrate its practical applicability. This framework aligns with emerging global standards and regulatory frameworks, providing a methodology to enhance the trustworthiness of AI systems. Our aim is to guide policy, regulation, and the development of reliable, safe, and adaptable AI technologies capable of consistent performance in real-world environments.
- [40] arXiv:2411.08982 [pdf, html, other]
-
Title: Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert SelectionSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Mixture-of-Experts (MoE) architectures have recently gained popularity in enabling efficient scaling of large language models. However, we uncover a fundamental tension: while MoEs are designed for selective expert activation, production serving requires request batching, which forces the activation of all experts and negates MoE's efficiency benefits during the decode phase. We present Lynx, a system that enables efficient MoE inference through dynamic, batch-aware expert selection. Our key insight is that expert importance varies significantly across tokens and inference phases, creating opportunities for runtime optimization. Lynx leverages this insight through a lightweight framework that dynamically reduces active experts while preserving model accuracy. Our evaluations show that Lynx achieves up to 1.55x reduction in inference latency while maintaining negligible accuracy loss from baseline model across complex code generation and mathematical reasoning tasks.
- [41] arXiv:2411.08986 [pdf, html, other]
-
Title: A Priori Error Bounds and Parameter Scalings for the Time Relaxation Reduced Order ModelSubjects: Numerical Analysis (math.NA)
The a priori error analysis of reduced order models (ROMs) for fluids is relatively scarce. In this paper, we take a step in this direction and conduct numerical analysis of the recently introduced time relaxation ROM (TR-ROM), which uses spatial filtering to stabilize ROMs for convection-dominated flows. Specifically, we prove stability, an a priori error bound, and parameter scalings for the TR-ROM. Our numerical investigation shows that the theoretical convergence rate and the parameter scalings with respect to ROM dimension and filter radius are recovered numerically. In addition, the parameter scaling can be used to extrapolate the time relaxation parameter to other ROM dimensions and filter radii. Moreover, the parameter scaling with respect to filter radius is also observed in the predictive regime.
- [42] arXiv:2411.08989 [pdf, html, other]
-
Title: Nearly Tight Bounds on Testing of Metric PropertiesSubjects: Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS)
Given a non-negative $n \times n$ matrix viewed as a set of distances between $n$ points, we consider the property testing problem of deciding if it is a metric. We also consider the same problem for two special classes of metrics, tree metrics and ultrametrics. For general metrics, our paper is the first to consider these questions. We prove an upper bound of $O(n^{2/3}/\epsilon^{4/3})$ on the query complexity for this problem. Our algorithm is simple, but the analysis requires great care in bounding the variance on the number of violating triangles in a sample. When $\epsilon$ is a slowly decreasing function of $n$ (rather than a constant, as is standard), we prove a lower bound of matching dependence on $n$ of $\Omega (n^{2/3})$, ruling out any property testers with $o(n^{2/3})$ query complexity unless their dependence on $1/\epsilon$ is super-polynomial.
Next, we turn to tree metrics and ultrametrics. While there were known upper and lower bounds, we considerably improve these bounds showing essentially tight bounds of $\tilde{O}(1/\epsilon )$ on the sample complexity. We also show a lower bound of $\Omega ( 1/\epsilon^{4/3} )$ on the query complexity. Our upper bounds are derived by doing a more careful analysis of a natural, simple algorithm. For the lower bounds, we construct distributions on NO instances, where it is hard to find a witness showing that these are not ultrametrics. - [43] arXiv:2411.08994 [pdf, html, other]
-
Title: A characterization of positive spanning sets with ties to strongly connected digraphsSubjects: Discrete Mathematics (cs.DM); Combinatorics (math.CO); Optimization and Control (math.OC)
Positive spanning sets (PSSs) are families of vectors that span a given linear space through non-negative linear combinations. Despite certain classes of PSSs being well understood, a complete characterization of PSSs remains elusive. In this paper, we explore a relatively understudied relationship between positive spanning sets and strongly edge-connected digraphs, in that the former can be viewed as a generalization of the latter. We leverage this connection to define a decomposition structure for positive spanning sets inspired by the ear decomposition from digraph theory.
- [44] arXiv:2411.08995 [pdf, html, other]
-
Title: Computed tomography using meta-opticsSubjects: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Computer vision tasks require processing large amounts of data to perform image classification, segmentation, and feature extraction. Optical preprocessors can potentially reduce the number of floating point operations required by computer vision tasks, enabling low-power and low-latency operation. However, existing optical preprocessors are mostly learned and hence strongly depend on the training data, and thus lack universal applicability. In this paper, we present a metaoptic imager, which implements the Radon transform obviating the need for training the optics. High quality image reconstruction with a large compression ratio of 0.6% is presented through the use of the Simultaneous Algebraic Reconstruction Technique. Image classification with 90% accuracy is presented on an experimentally measured Radon dataset through neural network trained on digitally transformed images.
- [45] arXiv:2411.08999 [pdf, html, other]
-
Title: Learning-Based Control Barrier Function with Provably Safe Guarantees: Reducing Conservatism with Heading-Aware Safety MarginComments: 8 pages, 5 figuresSubjects: Robotics (cs.RO); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
We propose a learning-based Control Barrier Function (CBF) to reduce conservatism in collision avoidance of car-like robots. Traditional CBFs often use Euclidean distance between robots' centers as safety margin, neglecting headings and simplifying geometries to circles. While this ensures smooth, differentiable safety functions required by CBFs, it can be overly conservative in tight environments. To address this limitation, we design a heading-aware safety margin that accounts for the robots' orientations, enabling a less conservative and more accurate estimation of safe regions. Since the function computing this safety margin is non-differentiable, we approximate it with a neural network to ensure differentiability and facilitate integration with CBFs. We describe how we achieve bounded learning error and incorporate the upper bound into the CBF to provide formal safety guarantees through forward invariance. We show that our CBF is a high-order CBF with relative degree two for a system with two robots whose dynamics are modeled by the nonlinear kinematic bicycle model. Experimental results in overtaking and bypassing scenarios reveal a 33.5 % reduction in conservatism compared to traditional methods, while maintaining safety. Code: this https URL
- [46] arXiv:2411.09001 [pdf, other]
-
Title: Virtual teaching assistant for undergraduate students using natural language processing & deep learningSubjects: Computers and Society (cs.CY)
Online education's popularity has been continuously increasing over the past few years. Many universities were forced to switch to online education as a result of COVID-19. In many cases, even after more than two years of online instruction, colleges were unable to resume their traditional classroom programs. A growing number of institutions are considering blended learning with some parts in-person and the rest of the learning taking place online. Nevertheless, many online education systems are inefficient, and this results in a poor rate of student retention. In this paper, we are offering a primary dataset, the initial implementation of a virtual teaching assistant named VTA-bot, and its system architecture. Our primary implementation of the suggested system consists of a chatbot that can be queried about the content and topics of the fundamental python programming language course. Students in their first year of university will be benefited from this strategy, which aims to increase student participation and involvement in online education.
- [47] arXiv:2411.09003 [pdf, html, other]
-
Title: Refusal in LLMs is an Affine FunctionSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering model behavior correspond to subsets of terms of this decomposition. We then provide a derivation of ACE and test it on refusal using Llama 3 8B and Hermes Eagle RWKV v5. ACE ultimately combines affine subspace projection and activation addition to reliably control the model's refusal responses across prompt types. We evaluate the results using LLM-based scoring on a collection of harmful and harmless prompts. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior and generalizes to models where directional ablation via affine subspace projection alone produces incoherent outputs. Code for reproducing our results is available at this https URL .
- [48] arXiv:2411.09004 [pdf, html, other]
-
Title: The geometry of the deep linear networkSubjects: Neural and Evolutionary Computing (cs.NE); Dynamical Systems (math.DS); Probability (math.PR); Adaptation and Self-Organizing Systems (nlin.AO)
This article provides an expository account of training dynamics in the Deep Linear Network (DLN) from the perspective of the geometric theory of dynamical systems. Rigorous results by several authors are unified into a thermodynamic framework for deep learning.
The analysis begins with a characterization of the invariant manifolds and Riemannian geometry in the DLN. This is followed by exact formulas for a Boltzmann entropy, as well as stochastic gradient descent of free energy using a Riemannian Langevin Equation. Several links between the DLN and other areas of mathematics are discussed, along with some open questions. - [49] arXiv:2411.09007 [pdf, html, other]
-
Title: Scale Contrastive Learning with Selective Attentions for Blind Image Quality AssessmentSubjects: Computer Vision and Pattern Recognition (cs.CV)
Blind image quality assessment (BIQA) serves as a fundamental task in computer vision, yet it often fails to consistently align with human subjective perception. Recent advances show that multi-scale evaluation strategies are promising due to their ability to replicate the hierarchical structure of human vision. However, the effectiveness of these strategies is limited by a lack of understanding of how different image scales influence perceived quality. This paper addresses two primary challenges: the significant redundancy of information across different scales, and the confusion caused by combining features from these scales, which may vary widely in quality. To this end, a new multi-scale BIQA framework is proposed, namely Contrast-Constrained Scale-Focused IQA Framework (CSFIQA). CSFIQA features a selective focus attention mechanism to minimize information redundancy and highlight critical quality-related information. Additionally, CSFIQA includes a scale-level contrastive learning module equipped with a noise sample matching mechanism to identify quality discrepancies across the same image content at different scales. By exploring the intrinsic relationship between image scales and the perceived quality, the proposed CSFIQA achieves leading performance on eight benchmark datasets, e.g., achieving SRCC values of 0.967 (versus 0.947 in CSIQ) and 0.905 (versus 0.876 in LIVEC).
- [50] arXiv:2411.09009 [pdf, html, other]
-
Title: Cut Your Losses in Large-Vocabulary Language ModelsComments: Code is available at this https URLSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.
- [51] arXiv:2411.09018 [pdf, html, other]
-
Title: Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted CaptionsSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Recent research increasingly focuses on training vision-language models (VLMs) with long, detailed image captions. However, small-scale VLMs often struggle to balance the richness of these captions with the risk of hallucinating content during fine-tuning. In this paper, we explore how well VLMs adapt to such captions. To quantify caption quality, we propose Decomposed NLI (DNLI), an evaluation framework that breaks down generated captions into individual propositions, assessing each in isolation. This fine-grained analysis reveals a critical balance between capturing descriptive details and preventing hallucinations. Our findings show that simply reducing caption complexity or employing standard data curation techniques does not effectively resolve this issue. To tackle this challenge, we introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model's existing knowledge and visual understanding. KnowAda minimizes hallucinations while preserving high descriptiveness. We validate this approach across several small-scale VLMs (up to 7B parameters) and dense caption datasets, demonstrating that KnowAda effectively balances hallucination reduction and descriptiveness. Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations. We will release our code and models.
- [52] arXiv:2411.09020 [pdf, html, other]
-
Title: Predictive Visuo-Tactile Interactive Perception Framework for Object Properties InferenceSubjects: Robotics (cs.RO)
Interactive exploration of the unknown physical properties of objects such as stiffness, mass, center of mass, friction coefficient, and shape is crucial for autonomous robotic systems operating continuously in unstructured environments. Precise identification of these properties is essential to manipulate objects in a stable and controlled way, and is also required to anticipate the outcomes of (prehensile or non-prehensile) manipulation actions such as pushing, pulling, lifting, etc. Our study focuses on autonomously inferring the physical properties of a diverse set of various homogeneous, heterogeneous, and articulated objects utilizing a robotic system equipped with vision and tactile sensors. We propose a novel predictive perception framework for identifying object properties of the diverse objects by leveraging versatile exploratory actions: non-prehensile pushing and prehensile pulling. As part of the framework, we propose a novel active shape perception to seamlessly initiate exploration. Our innovative dual differentiable filtering with Graph Neural Networks learns the object-robot interaction and performs consistent inference of indirectly observable time-invariant object properties. In addition, we formulate a $N$-step information gain approach to actively select the most informative actions for efficient learning and inference. Extensive real-robot experiments with planar objects show that our predictive perception framework results in better performance than the state-of-the-art baseline and demonstrate our framework in three major applications for i) object tracking, ii) goal-driven task, and iii) change in environment detection.
- [53] arXiv:2411.09022 [pdf, html, other]
-
Title: DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language ModelsYongdong Wang, Runze Xiao, Jun Younes Louhi Kasahara, Ryosuke Yajima, Keiji Nagatani, Atsushi Yamashita, Hajime AsamaComments: Submitted to the 2025 IEEE International Conference on Robotics & Automation on September 15, 2024Subjects: Robotics (cs.RO)
Large Language Models (LLMs) have demonstrated significant reasoning capabilities in robotic systems. However, their deployment in multi-robot systems remains fragmented and struggles to handle complex task dependencies and parallel execution. This study introduces the DART-LLM (Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models) system, designed to address these challenges. DART-LLM utilizes LLMs to parse natural language instructions, decomposing them into multiple subtasks with dependencies to establish complex task sequences, thereby enhancing efficient coordination and parallel execution in multi-robot systems. The system includes the QA LLM module, Breakdown Function modules, Actuation module, and a Vision-Language Model (VLM)-based object detection module, enabling task decomposition and execution from natural language instructions to robotic actions. Experimental results demonstrate that DART-LLM excels in handling long-horizon tasks and collaborative tasks with complex dependencies. Even when using smaller models like Llama 3.1 8B, the system achieves good performance, highlighting DART-LLM's robustness in terms of model size. Please refer to the project website \url{this https URL} for videos and code.
- [54] arXiv:2411.09023 [pdf, html, other]
-
Title: CoMiX: Cross-Modal Fusion with Deformable Convolutions for HSI-X Semantic SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Improving hyperspectral image (HSI) semantic segmentation by exploiting complementary information from a supplementary data type (referred to X-modality) is promising but challenging due to differences in imaging sensors, image content, and resolution. Current techniques struggle to enhance modality-specific and modality-shared information, as well as to capture dynamic interaction and fusion between different modalities. In response, this study proposes CoMiX, an asymmetric encoder-decoder architecture with deformable convolutions (DCNs) for HSI-X semantic segmentation. CoMiX is designed to extract, calibrate, and fuse information from HSI and X data. Its pipeline includes an encoder with two parallel and interacting backbones and a lightweight all-multilayer perceptron (ALL-MLP) decoder. The encoder consists of four stages, each incorporating 2D DCN blocks for the X model to accommodate geometric variations and 3D DCN blocks for HSIs to adaptively aggregate spatial-spectral features. Additionally, each stage includes a Cross-Modality Feature enhancement and eXchange (CMFeX) module and a feature fusion module (FFM). CMFeX is designed to exploit spatial-spectral correlations from different modalities to recalibrate and enhance modality-specific and modality-shared features while adaptively exchanging complementary information between them. Outputs from CMFeX are fed into the FFM for fusion and passed to the next stage for further information learning. Finally, the outputs from each FFM are integrated by the ALL-MLP decoder for final prediction. Extensive experiments demonstrate that our CoMiX achieves superior performance and generalizes well to various multimodal recognition tasks. The CoMiX code will be released.
- [55] arXiv:2411.09026 [pdf, other]
-
Title: On the Complexity of Hazard-Free FormulasSubjects: Computational Complexity (cs.CC)
This paper studies the hazard-free formula complexity of Boolean functions.
As our main result, we prove that unate functions are the only Boolean functions for which the monotone formula complexity of the hazard-derivative equals the hazard-free formula complexity of the function itself. Consequently, every non-unate function breaks the so-called monotone barrier, as introduced and discussed by Ikenmeyer, Komarath, and Saurabh (ITCS 2023).
Our second main result shows that the hazard-free formula complexity of random Boolean functions is at most $2^{(1+o(1))n}$. Prior to this, no better upper bound than $O(3^n)$ was known. Notably, unlike in the general case of Boolean circuits and formulas, where the typical complexity matches that of the multiplexer function, the hazard-free formula complexity is smaller than the optimal hazard-free formula for the multiplexer by an exponential factor in $n$.
Additionally, we explore the hazard-free formula complexity of block composition of Boolean functions and obtain a result in the hazard-free setting that is analogous to a result of Karchmer, Raz, and Wigderson (Computational Complexity, 1995) in the monotone setting. We demonstrate that our result implies a lower bound on the hazard-free formula depth of the block composition of the set covering function with the multiplexer function, which breaks the monotone barrier. - [56] arXiv:2411.09027 [pdf, html, other]
-
Title: Transformer-based Time-Series Biomarker Discovery for COPD DiagnosisComments: Accepted as a workshop paper to NeurIPS 2024Subjects: Machine Learning (cs.LG)
Chronic Obstructive Pulmonary Disorder (COPD) is an irreversible and progressive disease which is highly heritable. Clinically, COPD is defined using the summary measures derived from a spirometry test but these are not always adequate. Here we show that using the high-dimensional raw spirogram can provide a richer signal compared to just using the summary measures. We design a transformer-based deep learning technique to process the raw spirogram values along with demographic information and predict clinically-relevant endpoints related to COPD. Our method is able to perform better than prior works while being more computationally efficient. Using the weights learned by the model, we make the framework more interpretable by identifying parts of the spirogram that are important for the model predictions. Pairing up with a board-certified pulmonologist, we also provide clinical insights into the different aspects of the spirogram and show that the explanations obtained from the model align with underlying medical knowledge.
- [57] arXiv:2411.09029 [pdf, html, other]
-
Title: Newton's Method Applied to Nonlinear Boundary Value Problems: A Numerical ApproachComments: 9 pages, 2 figuresSubjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
This work investigates the application of the Newton's method for the numerical solution of a nonlinear boundary value problem formulated through an ordinary differential equation (ODE). Nonlinear ODEs arise in various mathematical modeling contexts, where an exact solution is often unfeasible due to the intrinsic complexity of these equations. Thus, a numerical approach is employed, using Newton's method to solve the system resulting from the discretization of the original problem. The procedure involves the iterative formulation of the method, which enables the approximation of solutions and the evaluation of convergence with respect to the problem parameters. The results demonstrate that Newton's method provides a robust and efficient solution, highlighting its applicability to complex boundary value problems and reinforcing its relevance for the numerical analysis of nonlinear systems. It is concluded that the methodology discussed is suitable for solving a wide range of boundary value problems, ensuring precision and stability in the results.
- [58] arXiv:2411.09037 [pdf, html, other]
-
Title: A Transformer-Based Visual Piano Transcription AlgorithmComments: 9 pages, 2 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Automatic music transcription (AMT) for musical performances is a long standing problem in the field of Music Information Retrieval (MIR). Visual piano transcription (VPT) is a multimodal subproblem of AMT which focuses on extracting a symbolic representation of a piano performance from visual information only (e.g., from a top-down video of the piano keyboard). Inspired by the success of Transformers for audio-based AMT, as well as their recent successes in other computer vision tasks, in this paper we present a Transformer based architecture for VPT. The proposed VPT system combines a piano bounding box detection model with an onset and pitch detection model, allowing our system to perform well in more naturalistic conditions like imperfect image crops around the piano and slightly tilted images.
- [59] arXiv:2411.09047 [pdf, html, other]
-
Title: Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and DatasetMohammad Saiful Islam, Mohamed Sami Rakha, William Pourmajidi, Janakan Sivaloganathan, John Steinbacher, Andriy MiranskyySubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
As Large-Scale Cloud Systems (LCS) become increasingly complex, effective anomaly detection is critical for ensuring system reliability and performance. However, there is a shortage of large-scale, real-world datasets available for benchmarking anomaly detection methods.
To address this gap, we introduce a new high-dimensional dataset from IBM Cloud, collected over 4.5 months from the IBM Cloud Console. This dataset comprises 39,365 rows and 117,448 columns of telemetry data. Additionally, we demonstrate the application of machine learning models for anomaly detection and discuss the key challenges faced in this process.
This study and the accompanying dataset provide a resource for researchers and practitioners in cloud system monitoring. It facilitates more efficient testing of anomaly detection methods in real-world data, helping to advance the development of robust solutions to maintain the health and performance of large-scale cloud infrastructures. - [60] arXiv:2411.09050 [pdf, html, other]
-
Title: The Systems Engineering Approach in Times of Large Language ModelsComments: This paper has been accepted for the upcoming 58th Hawaii International Conference on System Sciences (HICSS-58)Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
Using Large Language Models (LLMs) to address critical societal problems requires adopting this novel technology into socio-technical systems. However, the complexity of such systems and the nature of LLMs challenge such a vision. It is unlikely that the solution to such challenges will come from the Artificial Intelligence (AI) community itself. Instead, the Systems Engineering approach is better equipped to facilitate the adoption of LLMs by prioritising the problems and their context before any other aspects. This paper introduces the challenges LLMs generate and surveys systems research efforts for engineering AI-based systems. We reveal how the systems engineering principles have supported addressing similar issues to the ones LLMs pose and discuss our findings to provide future directions for adopting LLMs.
- [61] arXiv:2411.09052 [pdf, html, other]
-
Title: ClevrSkills: Compositional Language and Visual Reasoning in RoboticsComments: To appear at NeurIPS 2024 (D&B track)Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Robotics tasks are highly compositional by nature. For example, to perform a high-level task like cleaning the table a robot must employ low-level capabilities of moving the effectors to the objects on the table, pick them up and then move them off the table one-by-one, while re-evaluating the consequently dynamic scenario in the process. Given that large vision language models (VLMs) have shown progress on many tasks that require high level, human-like reasoning, we ask the question: if the models are taught the requisite low-level capabilities, can they compose them in novel ways to achieve interesting high-level tasks like cleaning the table without having to be explicitly taught so? To this end, we present ClevrSkills - a benchmark suite for compositional reasoning in robotics. ClevrSkills is an environment suite developed on top of the ManiSkill2 simulator and an accompanying dataset. The dataset contains trajectories generated on a range of robotics tasks with language and visual annotations as well as multi-modal prompts as task specification. The suite includes a curriculum of tasks with three levels of compositional understanding, starting with simple tasks requiring basic motor skills. We benchmark multiple different VLM baselines on ClevrSkills and show that even after being pre-trained on large numbers of tasks, these models fail on compositional reasoning in robotics tasks.
- [62] arXiv:2411.09053 [pdf, html, other]
-
Title: Information Need in Metaverse Recordings -- A Field StudyComments: 12 pages, 3 Figures, 8 TablesSubjects: Information Retrieval (cs.IR); Multimedia (cs.MM)
Metaverse Recordings (MVRs) represent an emerging and underexplored media type within the field of Multimedia Information Retrieval (MMIR). This paper presents findings from a field study aimed at understanding the users information needs and search behaviors specific to MVR retrieval. By conducting and analyzing expert interviews, the study identifies application scenarios and highlights challenges in retrieving multimedia content from the metaverse. The results reveal existing application scenarios of MVRs and confirm the relevance of capturing time-series data from the graphical rendering process and related input-output devices, which are also highly relevant to user needs. Furthermore, the study provides a foundation for developing retrieval systems tailored to MVRs by defining use cases, user stereotypes, and specific requirements for MVR Retrieval systems. The findings contribute to a better understanding of information search behaviors in MVR Retrieval and pave the way for future research and system design in this field.
- [63] arXiv:2411.09055 [pdf, other]
-
Title: SAFELOC: Overcoming Data Poisoning Attacks in Heterogeneous Federated Machine Learning for Indoor LocalizationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Machine learning (ML) based indoor localization solutions are critical for many emerging applications, yet their efficacy is often compromised by hardware/software variations across mobile devices (i.e., device heterogeneity) and the threat of ML data poisoning attacks. Conventional methods aimed at countering these challenges show limited resilience to the uncertainties created by these phenomena. In response, in this paper, we introduce SAFELOC, a novel framework that not only minimizes localization errors under these challenging conditions but also ensures model compactness for efficient mobile device deployment. Our framework targets a distributed and co-operative learning environment that uses federated learning (FL) to preserve user data privacy and assumes heterogeneous mobile devices carried by users (just like in most real-world scenarios). Within this heterogeneous FL context, SAFELOC introduces a novel fused neural network architecture that performs data poisoning detection and localization, with a low model footprint. Additionally, a dynamic saliency map-based aggregation strategy is designed to adapt based on the severity of the detected data poisoning scenario. Experimental evaluations demonstrate that SAFELOC achieves improvements of up to 5.9x in mean localization error, 7.8x in worst-case localization error, and a 2.1x reduction in model inference latency compared to state-of-the-art indoor localization frameworks, across diverse building floorplans, mobile devices, and ML data poisoning attack scenarios.
- [64] arXiv:2411.09056 [pdf, html, other]
-
Title: Optimisation Strategies for Ensuring Fairness in Machine Learning: With and Without DemographicsComments: PhD thesis. arXiv admin note: text overlap with arXiv:2310.11407Subjects: Machine Learning (cs.LG)
Ensuring fairness has emerged as one of the primary concerns in AI and its related algorithms. Over time, the field of machine learning fairness has evolved to address these issues. This paper provides an extensive overview of this field and introduces two formal frameworks to tackle open questions in machine learning fairness.
In one framework, operator-valued optimisation and min-max objectives are employed to address unfairness in time-series problems. This approach showcases state-of-the-art performance on the notorious COMPAS benchmark dataset, demonstrating its effectiveness in real-world scenarios.
In the second framework, the challenge of lacking sensitive attributes, such as gender and race, in commonly used datasets is addressed. This issue is particularly pressing because existing algorithms in this field predominantly rely on the availability or estimations of such attributes to assess and mitigate unfairness. Here, a framework for a group-blind bias-repair is introduced, aiming to mitigate bias without relying on sensitive attributes. The efficacy of this approach is showcased through analyses conducted on the Adult Census Income dataset.
Additionally, detailed algorithmic analyses for both frameworks are provided, accompanied by convergence guarantees, ensuring the robustness and reliability of the proposed methodologies. - [65] arXiv:2411.09059 [pdf, html, other]
-
Title: Sublinear Metric Steiner Tree via Improved Bounds for Set CoverSubjects: Data Structures and Algorithms (cs.DS)
We study the metric Steiner tree problem in the sublinear query model. In this problem, for a set of $n$ points $V$ in a metric space given to us by means of query access to an $n\times n$ matrix $w$, and a set of terminals $T\subseteq V$, the goal is to find the minimum-weight subset of the edges that connects all the terminal vertices.
Recently, Chen, Khanna and Tan [SODA'23] gave an algorithm that uses $\widetilde{O}(n^{13/7})$ queries and outputs a $(2-\eta)$-estimate of the metric Steiner tree weight, where $\eta>0$ is a universal constant. A key component in their algorithm is a sublinear algorithm for a particular set cover problem where, given a set system $(U, F)$, the goal is to provide a multiplicative-additive estimate for $|U|-\textsf{SC}(U, F)$. Here $U$ is the set of elements, $F$ is the collection of sets, and $\textsf{SC}(U, F)$ denotes the optimal set cover size of $(U, F)$. In particular, their algorithm returns a $(1/4, \varepsilon\cdot|U|)$-multiplicative-additive estimate for this set cover problem using $\widetilde{O}(|F|^{7/4})$ membership oracle queries (querying whether a set $S$ contains an $e$), where $\varepsilon$ is a fixed constant.
In this work, we improve the query complexity of $(2-\eta)$-estimating the metric Steiner tree weight to $\widetilde{O}(n^{5/3})$ by showing a $(1/2, \varepsilon \cdot |U|)$-estimate for the above set cover problem using $\widetilde{O}(|F|^{5/3})$ membership queries. To design our set cover algorithm, we estimate the size of a random greedy maximal matching for an auxiliary multigraph that the algorithm constructs implicitly, without access to its adjacency list or matrix. - [66] arXiv:2411.09062 [pdf, html, other]
-
Title: Multimodal Object Detection using Depth and Image Data for Manufacturing PartsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Manufacturing requires reliable object detection methods for precise picking and handling of diverse types of manufacturing parts and components. Traditional object detection methods utilize either only 2D images from cameras or 3D data from lidars or similar 3D sensors. However, each of these sensors have weaknesses and limitations. Cameras do not have depth perception and 3D sensors typically do not carry color information. These weaknesses can undermine the reliability and robustness of industrial manufacturing systems. To address these challenges, this work proposes a multi-sensor system combining an red-green-blue (RGB) camera and a 3D point cloud sensor. The two sensors are calibrated for precise alignment of the multimodal data captured from the two hardware devices. A novel multimodal object detection method is developed to process both RGB and depth data. This object detector is based on the Faster R-CNN baseline that was originally designed to process only camera images. The results show that the multimodal model significantly outperforms the depth-only and RGB-only baselines on established object detection metrics. More specifically, the multimodal model improves mAP by 13% and raises Mean Precision by 11.8% in comparison to the RGB-only baseline. Compared to the depth-only baseline, it improves mAP by 78% and raises Mean Precision by 57%. Hence, this method facilitates more reliable and robust object detection in service to smart manufacturing applications.
- [67] arXiv:2411.09065 [pdf, html, other]
-
Title: Language-Model Prior Overcomes Cold-Start ItemsComments: This paper is dedicated to cold-start item recommendation using language-model priorsSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The growth of recommender systems (RecSys) is driven by digitization and the need for personalized content in areas such as e-commerce and video streaming. The content in these systems often changes rapidly and therefore they constantly face the ongoing cold-start problem, where new items lack interaction data and are hard to value. Existing solutions for the cold-start problem, such as content-based recommenders and hybrid methods, leverage item metadata to determine item similarities. The main challenge with these methods is their reliance on structured and informative metadata to capture detailed item similarities, which may not always be available. This paper introduces a novel approach for cold-start item recommendation that utilizes the language model (LM) to estimate item similarities, which are further integrated as a Bayesian prior with classic recommender systems. This approach is generic and able to boost the performance of various recommenders. Specifically, our experiments integrate it with both sequential and collaborative filtering-based recommender and evaluate it on two real-world datasets, demonstrating the enhanced performance of the proposed approach.
- [68] arXiv:2411.09066 [pdf, html, other]
-
Title: A multidimensional measurement of photorealistic avatar quality of experienceComments: arXiv admin note: text overlap with arXiv:2204.06784Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Photorealistic avatars are human avatars that look, move, and talk like real people. The performance of photorealistic avatars has significantly improved recently based on objective metrics such as PSNR, SSIM, LPIPS, FID, and FVD. However, recent photorealistic avatar publications do not provide subjective tests of the avatars to measure human usability factors. We provide an open source test framework to subjectively measure photorealistic avatar performance in ten dimensions: realism, trust, comfortableness using, comfortableness interacting with, appropriateness for work, creepiness, formality, affinity, resemblance to the person, and emotion accuracy. We show that the correlation of nine of these subjective metrics with PSNR, SSIM, LPIPS, FID, and FVD is weak, and moderate for emotion accuracy. The crowdsourced subjective test framework is highly reproducible and accurate when compared to a panel of experts. We analyze a wide range of avatars from photorealistic to cartoon-like and show that some photorealistic avatars are approaching real video performance based on these dimensions. We also find that for avatars above a certain level of realism, eight of these measured dimensions are strongly correlated. In particular, for photorealistic avatars there is a linear relationship between avatar affinity and realism; in other words, there is no uncanny valley effect for photorealistic avatars in the telecommunication scenario. We provide several extensions of this test framework for future work and discuss design implications for telecommunication systems. The test framework is available at this https URL.
- [69] arXiv:2411.09068 [pdf, other]
-
Title: Liner Shipping Network Design with Reinforcement LearningSubjects: Artificial Intelligence (cs.AI)
This paper proposes a novel reinforcement learning framework to address the Liner Shipping Network Design Problem (LSNDP), a challenging combinatorial optimization problem focused on designing cost-efficient maritime shipping routes. Traditional methods for solving the LSNDP typically involve decomposing the problem into sub-problems, such as network design and multi-commodity flow, which are then tackled using approximate heuristics or large neighborhood search (LNS) techniques. In contrast, our approach employs a model-free reinforcement learning algorithm on the network design, integrated with a heuristic-based multi-commodity flow solver, to produce competitive results on the publicly available LINERLIB benchmark. Additionally, our method also demonstrates generalization capabilities by producing competitive solutions on the benchmark instances after training on perturbed instances.
- [70] arXiv:2411.09072 [pdf, html, other]
-
Title: Continuous GNN-based Anomaly Detection on Edge using Efficient Adaptive Knowledge Graph LearningComments: Accepted to DATE 2025Subjects: Machine Learning (cs.LG)
The increasing demand for robust security solutions across various industries has made Video Anomaly Detection (VAD) a critical task in applications such as intelligent surveillance, evidence investigation, and violence detection. Traditional approaches to VAD often rely on finetuning large pre-trained models, which can be computationally expensive and impractical for real-time or resource-constrained environments. To address this, MissionGNN introduced a more efficient method by training a graph neural network (GNN) using a fixed knowledge graph (KG) derived from large language models (LLMs) like GPT-4. While this approach demonstrated significant efficiency in computational power and memory, it faces limitations in dynamic environments where frequent updates to the KG are necessary due to evolving behavior trends and shifting data patterns. These updates typically require cloud-based computation, posing challenges for edge computing applications. In this paper, we propose a novel framework that facilitates continuous KG adaptation directly on edge devices, overcoming the limitations of cloud dependency. Our method dynamically modifies the KG through a three-phase process: pruning, alternating, and creating nodes, enabling real-time adaptation to changing data trends. This continuous learning approach enhances the robustness of anomaly detection models, making them more suitable for deployment in dynamic and resource-constrained environments.
- [71] arXiv:2411.09073 [pdf, html, other]
-
Title: Code-mixed LLM: Improve Large Language Models' Capability to Handle Code-Mixing through Reinforcement Learning from AI FeedbackComments: initial version: 5 pages, 2 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Code-mixing(CM) or code-switching(CSW) refers to the juxtaposition of linguistic units from two or more languages during the conversation or sometimes even a single utterance. Code-mixing introduces unique challenges in daily life, such as syntactic mismatches and semantic blending, that are rarely encountered in monolingual settings. Large language models (LLMs) have revolutionized the field of natural language processing (NLP) by offering unprecedented capabilities in understanding human languages. However, the effectiveness of current state-of-the-art multilingual LLMs has not yet been fully explored in the CM scenario. To fill this gap, we first benchmark the performance of multilingual LLMs on various code-mixing NLP tasks. Then we propose to improve the multilingual LLMs' ability to understand code-mixing through reinforcement learning from human feedback (RLHF) and code-mixed machine translation tasks. Given the high-cost and time-consuming preference labeling procedure, we improve this by utilizing LLMs as annotators to perform the reinforcement learning from AI feedback (RLAIF). The experiments show the effectiveness of the proposed method.
- [72] arXiv:2411.09077 [pdf, html, other]
-
Title: Drone Detection using Deep Neural Networks Trained on Pure Synthetic DataComments: 12 pages, 8 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Drone detection has benefited from improvements in deep neural networks, but like many other applications, suffers from the availability of accurate data for training. Synthetic data provides a potential for low-cost data generation and has been shown to improve data availability and quality. However, models trained on synthetic datasets need to prove their ability to perform on real-world data, known as the problem of sim-to-real transferability. Here, we present a drone detection Faster-RCNN model trained on a purely synthetic dataset that transfers to real-world data. We found that it achieves an AP_50 of 97.0% when evaluated on the MAV-Vid - a real dataset of flying drones - compared with 97.8% for an equivalent model trained on real-world data. Our results show that using synthetic data for drone detection has the potential to reduce data collection costs and improve labelling quality. These findings could be a starting point for more elaborate synthetic drone datasets. For example, realistic recreations of specific scenarios could de-risk the dataset generation of safety-critical applications such as the detection of drones at airports. Further, synthetic data may enable reliable drone detection systems, which could benefit other areas, such as unmanned traffic management systems. The code is available this https URL alongside the datasets this https URL.
- [73] arXiv:2411.09080 [pdf, html, other]
-
Title: Language Models for Music Medicine GenerationEmmanouil Nikolakakis, Joann Ching, Emmanouil Karystinaios, Gabrielle Sipin, Gerhard Widmer, Razvan MarinescuComments: Late-Breaking / Demo Session Extended Abstract, ISMIR 2024 ConferenceSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Music therapy has been shown in recent years to provide multiple health benefits related to emotional wellness. In turn, maintaining a healthy emotional state has proven to be effective for patients undergoing treatment, such as Parkinson's patients or patients suffering from stress and anxiety. We propose fine-tuning MusicGen, a music-generating transformer model, to create short musical clips that assist patients in transitioning from negative to desired emotional states. Using low-rank decomposition fine-tuning on the MTG-Jamendo Dataset with emotion tags, we generate 30-second clips that adhere to the iso principle, guiding patients through intermediate states in the valence-arousal circumplex. The generated music is evaluated using a music emotion recognition model to ensure alignment with intended emotions. By concatenating these clips, we produce a 15-minute "music medicine" resembling a music therapy session. Our approach is the first model to leverage Language Models to generate music medicine. Ultimately, the output is intended to be used as a temporary relief between music therapy sessions with a board-certified therapist.
- [74] arXiv:2411.09089 [pdf, html, other]
-
Title: Set-Based Retrograde Analysis: Precomputing the Solution to 24-card Bridge Double Dummy DealsSubjects: Artificial Intelligence (cs.AI)
Retrograde analysis is used in game-playing programs to solve states at the end of a game, working backwards toward the start of the game. The algorithm iterates through and computes the perfect-play value for as many states as resources allow. We introduce setrograde analysis which achieves the same results by operating on sets of states that have the same game value. The algorithm is demonstrated by computing exact solutions for Bridge double dummy card-play. For deals with 24 cards remaining to be played ($10^{27}$ states, which can be reduced to $10^{15}$ states using preexisting techniques), we strongly solve all deals. The setrograde algorithm performs a factor of $10^3$ fewer search operations than a standard retrograde algorithm, producing a database with a factor of $10^4$ fewer entries. For applicable domains, this allows retrograde searching to reach unprecedented search depths.
- [75] arXiv:2411.09090 [pdf, html, other]
-
Title: A Vectorial Envelope Maxwell Formulation for Electromagnetic Waveguides with Application to Nonlinear Fiber OpticsSubjects: Numerical Analysis (math.NA)
This article presents an ultraweak discontinuous Petrov-Galerkin (DPG) formulation of the time-harmonic Maxwell equations for the vectorial envelope of the electromagnetic field in a weakly-guiding multi-mode fiber waveguide. This formulation is derived using an envelope ansatz for the vector-valued electric and magnetic field components, factoring out an oscillatory term of $exp(-i \mathsf{k}z)$ with a user-defined wavenumber $\mathsf{k}$, where $z$ is the longitudinal fiber axis and field propagation direction. The resulting formulation is a modified system of the time-harmonic Maxwell equations for the vectorial envelope of the propagating field. This envelope is less oscillatory in the $z$-direction than the original field, so that it can be more efficiently discretized and computed, enabling solution of the vectorial DPG Maxwell system for $1000\times$ longer fibers than previously possible. Different approaches for incorporating a perfectly matched layer for absorbing the outgoing wave modes at the fiber end are derived and compared numerically. The resulting formulation is used to solve a 3D Maxwell model of an ytterbium-doped active gain fiber amplifier, coupled with the heat equation for including thermal effects. The nonlinear model is then used to simulate thermally-induced transverse mode instability (TMI). The numerical experiments demonstrate that it is computationally feasible to perform simulations and analysis of real-length optical fiber laser amplifiers using discretizations of the full vectorial time-harmonic Maxwell equations. The approach promises a new high-fidelity methodology for analyzing TMI in high-power fiber laser systems and is extendable to including other nonlinearities.
- [76] arXiv:2411.09100 [pdf, html, other]
-
Title: General linear threshold models with application to influence maximizationComments: 30 pages, 10 figuresSubjects: Social and Information Networks (cs.SI); Methodology (stat.ME)
A number of models have been developed for information spread through networks, often for solving the Influence Maximization (IM) problem. IM is the task of choosing a fixed number of nodes to "seed" with information in order to maximize the spread of this information through the network, with applications in areas such as marketing and public health. Most methods for this problem rely heavily on the assumption of known strength of connections between network members (edge weights), which is often unrealistic. In this paper, we develop a likelihood-based approach to estimate edge weights from the fully and partially observed information diffusion paths. We also introduce a broad class of information diffusion models, the general linear threshold (GLT) model, which generalizes the well-known linear threshold (LT) model by allowing arbitrary distributions of node activation thresholds. We then show our weight estimator is consistent under the GLT and some mild assumptions. For the special case of the standard LT model, we also present a much faster expectation-maximization approach for weight estimation. Finally, we prove that for the GLT models, the IM problem can be solved by a natural greedy algorithm with standard optimality guarantees if all node threshold distributions have concave cumulative distribution functions. Extensive experiments on synthetic and real-world networks demonstrate that the flexibility in the choice of threshold distribution combined with the estimation of edge weights significantly improves the quality of IM solutions, spread prediction, and the estimates of the node activation probabilities.
- [77] arXiv:2411.09101 [pdf, html, other]
-
Title: Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing ImagerySubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Vision Transformers (ViT) have recently brought a new wave of research in the field of computer vision. These models have done particularly well in the field of image classification and segmentation. Research on semantic and instance segmentation has emerged to accelerate with the inception of the new architecture, with over 80\% of the top 20 benchmarks for the iSAID dataset being either based on the ViT architecture or the attention mechanism behind its success. This paper focuses on the heuristic comparison of three key factors of using (or not using) ViT for semantic segmentation of remote sensing aerial images on the iSAID. The experimental results observed during the course of the research were under the scrutinization of the following objectives: 1. Use of weighted fused loss function for the maximum mean Intersection over Union (mIoU) score, Dice score, and minimization or conservation of entropy or class representation, 2. Comparison of transfer learning on Meta's MaskFormer, a ViT-based semantic segmentation model, against generic UNet Convolutional Neural Networks (CNNs) judged over mIoU, Dice scores, training efficiency, and inference time, and 3. What do we lose for what we gain? i.e., the comparison of the two models against current state-of-art segmentation models. We show the use of the novel combined weighted loss function significantly boosts the CNN model's performance capacities as compared to transfer learning the ViT. The code for this implementation can be found on \url{this https URL}.
- [78] arXiv:2411.09102 [pdf, html, other]
-
Title: Provocation: Who benefits from "inclusion" in Generative AI?Comments: 3 pages, 1 figure. Published as a Short Paper in the NeurIPS 2024 Workshop on Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AISubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
The demands for accurate and representative generative AI systems means there is an increased demand on participatory evaluation structures. While these participatory structures are paramount to to ensure non-dominant values, knowledge and material culture are also reflected in AI models and the media they generate, we argue that dominant structures of community participation in AI development and evaluation are not explicit enough about the benefits and harms that members of socially marginalized groups may experience as a result of their participation. Without explicit interrogation of these benefits by AI developers, as a community we may remain blind to the immensity of systemic change that is needed as well. To support this provocation, we present a speculative case study, developed from our own collective experiences as AI researchers. We use this speculative context to itemize the barriers that need to be overcome in order for the proposed benefits to marginalized communities to be realized, and harms mitigated.
- [79] arXiv:2411.09105 [pdf, html, other]
-
Title: VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video CognitionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recent advancements in Large Video-Language Models (LVLMs) have driven the development of benchmarks designed to assess cognitive abilities in video-based tasks. However, most existing benchmarks heavily rely on web-collected videos paired with human annotations or model-generated questions, which limit control over the video content and fall short in evaluating advanced cognitive abilities involving symbolic elements and abstract concepts. To address these limitations, we introduce VCBench, a controllable benchmark to assess LVLMs' cognitive abilities, involving symbolic and abstract concepts at varying difficulty levels. By generating video data with the Python-based engine, VCBench allows for precise control over the video content, creating dynamic, task-oriented videos that feature complex scenes and abstract concepts. Each task pairs with tailored question templates that target specific cognitive challenges, providing a rigorous evaluation test. Our evaluation reveals that even state-of-the-art (SOTA) models, such as Qwen2-VL-72B, struggle with simple video cognition tasks involving abstract concepts, with performance sharply dropping by 19% as video complexity rises. These findings reveal the current limitations of LVLMs in advanced cognitive tasks and highlight the critical role of VCBench in driving research toward more robust LVLMs for complex video cognition challenges.
- [80] arXiv:2411.09109 [pdf, html, other]
-
Title: Personalized Help for Optimizing Low-Skilled Users' StrategyFeng Gu, Wichayaporn Wongkamjan, Jordan Lee Boyd-Graber, Jonathan K. Kummerfeld, Denis Peskoff, Jonathan MayComments: 9 pages, 3 figuresSubjects: Computation and Language (cs.CL)
AIs can beat humans in game environments; however, how helpful those agents are to human remains understudied. We augment CICERO, a natural language agent that demonstrates superhuman performance in Diplomacy, to generate both move and message advice based on player intentions. A dozen Diplomacy games with novice and experienced players, with varying advice settings, show that some of the generated advice is beneficial. It helps novices compete with experienced players and in some instances even surpass them. The mere presence of advice can be advantageous, even if players do not follow it.
- [81] arXiv:2411.09110 [pdf, html, other]
-
Title: Information-Optimal Multi-Spacecraft Positioning for Interstellar Object ExplorationComments: IEEE Aerospace Conference, Preprint Version, Accepted: November 2024Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Robotics (cs.RO); Optimization and Control (math.OC)
Interstellar objects (ISOs), astronomical objects not gravitationally bound to the sun, could present valuable opportunities to advance our understanding of the universe's formation and composition. In response to the unpredictable nature of their discoveries that inherently come with large and rapidly changing uncertainty in their state, this paper proposes a novel multi-spacecraft framework for locally maximizing information to be gained through ISO encounters with formal probabilistic guarantees. Given some approximated control and estimation policies for fully autonomous spacecraft operations, we first construct an ellipsoid around its terminal position, where the ISO would be located with a finite probability. The large state uncertainty of the ISO is formally handled here through the hierarchical property in stochastically contracting nonlinear systems. We then propose a method to find the terminal positions of the multiple spacecraft optimally distributed around the ellipsoid, which locally maximizes the information we can get from all the points of interest (POIs). This utilizes a probabilistic information cost function that accounts for spacecraft positions, camera specifications, and ISO position uncertainty, where the information is defined as visual data collected by cameras. Numerical simulations demonstrate the efficacy of this approach using synthetic ISO candidates generated from quasi-realistic empirical populations. Our method allows each spacecraft to optimally select its terminal state and determine the ideal number of POIs to investigate, potentially enhancing the ability to study these rare and fleeting interstellar visitors while minimizing resource utilization.
- [82] arXiv:2411.09111 [pdf, other]
-
Title: Reducing Reasoning Costs -- The Path of Optimization for Chain of Thought via Sparse Attention MechanismComments: The main text is 9 pages, totaling 13 pages; 5 figures, 3 tables; preprints have been submitted to NeurIPS 2024 Workshop MusIML and OpenReviewSubjects: Machine Learning (cs.LG)
In order to address the chain of thought in the large language model inference cost surge, this research proposes to use a sparse attention mechanism that only focuses on a few relevant tokens. The researcher constructed a new attention mechanism and used GiantRabbit trained with custom GPTs as an experimental tool. The experiment tested and compared the reasoning time, correctness score and chain of thought length of this model and o1 Preview in solving the linear algebra test questions of MIT OpenCourseWare. The results show that GiantRabbit's reasoning time and chain of thought length are significantly lower than o1 Preview, confirming the feasibility of the sparse attention mechanism in reducing chain of thought reasoning. Detailed architectural details and experimental process have been uploaded to Github, the link is:this https URL.
- [83] arXiv:2411.09113 [pdf, html, other]
-
Title: Convergence rates of Landweber-type methods for inverse problems in Banach spacesSubjects: Numerical Analysis (math.NA)
Landweber-type methods are prominent for solving ill-posed inverse problems in Banach spaces and their convergence has been well-understood. However, how to derive their convergence rates remains a challenging open question. In this paper, we tackle the challenge of deriving convergence rates for Landweber-type methods applied to ill-posed inverse problems, where forward operators map from a Banach space to a Hilbert space. Under a benchmark source condition, we introduce a novel strategy to derive convergence rates when the method is terminated by either an {\it a priori} stopping rule or the discrepancy principle. Our results offer substantial flexibility regarding step sizes, by allowing the use of variable step sizes. By extending the strategy to deal with the stochastic mirror descent method for solving nonlinear ill-posed systems with exact data, under a benchmark source condition we also obtain an almost sure convergence rate in terms of the number of iterations.
- [84] arXiv:2411.09114 [pdf, other]
-
Title: Comparative genomics with succinct colored de Bruijn graphsSubjects: Data Structures and Algorithms (cs.DS)
DNA technologies have evolved significantly in the past years enabling the sequencing of a large number of genomes in a short time. Nevertheless, the underlying computational problem is hard, and many technical factors and limitations complicate obtaining the complete sequence of a genome. Many genomes are left in a draft state, in which each chromosome is represented by a set of sequences with partial information on their relative order. Recently, some approaches have been proposed to compare draft genomes by comparing paths in de Bruijn graphs, which are constructed by many practical genome assemblers. In this article we introduce gcBB, a method for comparing genomes represented as succinct colored de Bruijn graphs directly, without resorting to sequence alignments, by means of the entropy and expectation measures based on the Burrows-Wheeler Similarity Distribution. We also introduce an improved version of gcBB, called mgcBB, that improves the time performance considerably through the selection of different data structures. We have compared phylogenies of genomes obtained by other methods to those obtained with gcBB, achieving promising results.
- [85] arXiv:2411.09116 [pdf, html, other]
-
Title: P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMsYidan Zhang, Boyi Deng, Yu Wan, Baosong Yang, Haoran Wei, Fei Huang, Bowen Yu, Junyang Lin, Fei Huang, Jingren ZhouSubjects: Computation and Language (cs.CL)
Recent advancements in large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. To alleviate this drawback, we aim to present a comprehensive multilingual multitask benchmark. First, we present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks, i.e., their ability to differentiate between models being evaluated. Leveraging this pipeline, we introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. Furthermore, P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. Finally, we conduct extensive experiments on representative multilingual model series to compare performances across models, analyze dataset effectiveness, examine prompt impacts on model performances, and explore the relationship between multilingual performances and factors such as tasks, model sizes, and languages. These insights offer valuable guidance for future research. The dataset is available at this https URL.
- [86] arXiv:2411.09117 [pdf, html, other]
-
Title: Efficiently learning and sampling multimodal distributions with data-based initializationSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Probability (math.PR); Machine Learning (stat.ML)
We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure. Although mixing can be arbitrarily slow, we show that if the Markov chain has a $k$th order spectral gap, initialization from a set of $\tilde O(k/\varepsilon^2)$ samples from the stationary distribution will, with high probability over the samples, efficiently generate a sample whose conditional law is $\varepsilon$-close in TV distance to the stationary measure. In particular, this applies to mixtures of $k$ distributions satisfying a Poincaré inequality, with faster convergence when they satisfy a log-Sobolev inequality. Our bounds are stable to perturbations to the Markov chain, and in particular work for Langevin diffusion over $\mathbb R^d$ with score estimation error, as well as Glauber dynamics combined with approximation error from pseudolikelihood estimation. This justifies the success of data-based initialization for score matching methods despite slow mixing for the data distribution, and improves and generalizes the results of Koehler and Vuong (2023) to have linear, rather than exponential, dependence on $k$ and apply to arbitrary semigroups. As a consequence of our results, we show for the first time that a natural class of low-complexity Ising measures can be efficiently learned from samples.
- [87] arXiv:2411.09120 [pdf, html, other]
-
Title: Neural Graph Simulator for Complex SystemsSubjects: Machine Learning (cs.LG)
Numerical simulation is a predominant tool for studying the dynamics in complex systems, but large-scale simulations are often intractable due to computational limitations. Here, we introduce the Neural Graph Simulator (NGS) for simulating time-invariant autonomous systems on graphs. Utilizing a graph neural network, the NGS provides a unified framework to simulate diverse dynamical systems with varying topologies and sizes without constraints on evaluation times through its non-uniform time step and autoregressive approach. The NGS offers significant advantages over numerical solvers by not requiring prior knowledge of governing equations and effectively handling noisy or missing data with a robust training scheme. It demonstrates superior computational efficiency over conventional methods, improving performance by over $10^5$ times in stiff problems. Furthermore, it is applied to real traffic data, forecasting traffic flow with state-of-the-art accuracy. The versatility of the NGS extends beyond the presented cases, offering numerous potential avenues for enhancement.
- [88] arXiv:2411.09121 [pdf, html, other]
-
Title: AutoQ 2.0: From Verification of Quantum Circuits to Verification of Quantum ProgramsYu-Fang Chen, Kai-Min Chung, Min-Hsiu Hsieh, Wei-Jia Huang, Ondřej Lengál, Jyun-Ao Lin, Wei-Lun TsaiComments: regular tool paper submitted to TACAS 2025Subjects: Logic in Computer Science (cs.LO); Formal Languages and Automata Theory (cs.FL)
We present a verifier of quantum programs called AutoQ 2.0. Quantum programs extend quantum circuits (the domain of AutoQ 1.0) by classical control flow constructs, which enable users to describe advanced quantum algorithms in a formal and precise manner. The extension is highly non-trivial, as we needed to tackle both theoretical challenges (such as the treatment of measurement, the normalization problem, and lifting techniques for verification of classical programs with loops to the quantum world), and engineering issues (such as extending the input format with a~support for specifying loop invariants). We have successfully used AutoQ 2.0 to verify two types of advanced quantum programs that cannot be expressed using only quantum circuits: the \emph{repeat-until-success} (RUS) algorithm and the weak-measurement-based version of Grover's search algorithm. AutoQ 2.0 can efficiently verify all our benchmarks: all RUS algorithms were verified instantly and, for the weak-measurement-based version of Grover's search, we were able to handle the case of 100 qubits in $\sim$20 minutes.
- [89] arXiv:2411.09125 [pdf, html, other]
-
Title: DROJ: A Prompt-Driven Attack against Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Due to their training on internet-sourced datasets, LLMs can sometimes generate objectionable content, necessitating extensive alignment with human feedback to avoid such outputs. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks, which usually are manipulated prompts designed to circumvent safety mechanisms and elicit harmful responses. Here, we introduce a novel approach, Directed Rrepresentation Optimization Jailbreak (DROJ), which optimizes jailbreak prompts at the embedding level to shift the hidden representations of harmful queries towards directions that are more likely to elicit affirmative responses from the model. Our evaluations on LLaMA-2-7b-chat model show that DROJ achieves a 100\% keyword-based Attack Success Rate (ASR), effectively preventing direct refusals. However, the model occasionally produces repetitive and non-informative responses. To mitigate this, we introduce a helpfulness system prompt that enhances the utility of the model's responses. Our code is available at this https URL.
- [90] arXiv:2411.09126 [pdf, html, other]
-
Title: SCAN: Bootstrapping Contrastive Pre-training for Data EfficiencySubjects: Computer Vision and Pattern Recognition (cs.CV)
While contrastive pre-training is widely employed, its data efficiency problem has remained relatively under-explored thus far. Existing methods often rely on static coreset selection algorithms to pre-identify important data for training. However, this static nature renders them unable to dynamically track the data usefulness throughout pre-training, leading to subpar pre-trained models. To address this challenge, our paper introduces a novel dynamic bootstrapping dataset pruning method. It involves pruning data preparation followed by dataset mutation operations, both of which undergo iterative and dynamic updates. We apply this method to two prevalent contrastive pre-training frameworks: \textbf{CLIP} and \textbf{MoCo}, representing vision-language and vision-centric domains, respectively. In particular, we individually pre-train seven CLIP models on two large-scale image-text pair datasets, and two MoCo models on the ImageNet dataset, resulting in a total of 16 pre-trained models. With a data pruning rate of 30-35\% across all 16 models, our method exhibits only marginal performance degradation (less than \textbf{1\%} on average) compared to corresponding models trained on the full dataset counterparts across various downstream datasets, and also surpasses several baselines with a large performance margin. Additionally, the byproduct from our method, \ie coresets derived from the original datasets after pre-training, also demonstrates significant superiority in terms of downstream performance over other static coreset selection approaches.
- [91] arXiv:2411.09127 [pdf, html, other]
-
Title: Complexity-Aware Training of Deep Neural Networks for Optimal Structure DiscoveryComments: 28 pages, 4 figures, 5 tablesSubjects: Machine Learning (cs.LG)
We propose a novel algorithm for combined unit/filter and layer pruning of deep neural networks that functions during training and without requiring a pre-trained network to apply. Our algorithm optimally trades-off learning accuracy and pruning levels while balancing layer vs. unit/filter pruning and computational vs. parameter complexity using only three user-defined parameters, which are easy to interpret and tune. The optimal network structure is found as the solution of a stochastic optimization problem over the network weights and the parameters of variational Bernoulli distributions for 0/1 Random Variables scaling the units and layers of the network. Pruning occurs when a variational parameter converges to 0 rendering the corresponding structure permanently inactive, thus saving computations during training and prediction. A key contribution of our approach is to define a cost function that combines the objectives of prediction accuracy and network pruning in a computational/parameter complexity-aware manner and the automatic selection of the many regularization parameters. We show that the solutions of the optimization problem to which the algorithm converges are deterministic networks. We analyze the ODE system that underlies our stochastic optimization algorithm and establish domains of attraction around zero for the dynamics of the network parameters. These results provide theoretical support for safely pruning units/filters and/or layers during training and lead to practical pruning conditions. We evaluate our method on the CIFAR-10/100 and ImageNet datasets using ResNet architectures and demonstrate that our method improves upon layer only or unit only pruning and favorably competes with combined unit/filter and layer pruning algorithms requiring pre-trained networks with respect to pruning ratios and test accuracy.
- [92] arXiv:2411.09128 [pdf, html, other]
-
Title: Performance Analysis of uRLLC in scalable Cell-free RAN SystemSubjects: Information Theory (cs.IT); Applications (stat.AP)
As an essential part of mobile communication systems that beyond the fifth generation (B5G) and sixth generation (6G), ultra reliable low latency communication (uRLLC) places strict requirements on latency and reliability. In recent years, with the improvement of mobile communication network performance, centralized and distributed processing of cell-free mMIMO has been widely studied, and wireless access networks (RAN) have also become a widely studied topic in academia. This paper analyzes the performance of a novel scalable cell-free RAN (CF-RAN) architecture with multiple edge distributed units (EDUs) in the scenario of finite block length. The upper and lower bounds on its spectral efficiency (SE) performance are derived, and the complete set's formula and distributed processing can be used as their two exceptional cases, respectively. Secondly, the paper further considers the distribution of users and large-scale fading models and studies the position distribution of remote radio units (RRUs). It is found that a uniform distribution of RRUs is beneficial for improving the SE of finite block length under specific error rate performance, and RRUs need to be interwoven as much as possible under multiple EDUs. This is different from traditional multi-node clustering centralized collaborative processing. The paper compares the performance of Monte Carlo simulation and multi-RRU clustering group collaborative processing. At the same time, this article verifies the accuracy of the space-time exchange theory in the CF-RAN scenario. Through scalable EDU deployment, a trade-off between latency and reliability can be achieved in practical systems and exchanged with spatial degrees of freedom. This implementation can be seen as a distributed and scalable implementation of the space-time exchange theory.
- [93] arXiv:2411.09134 [pdf, html, other]
-
Title: ABCI 3.0: Evolution of the leading AI infrastructure in JapanComments: 4 pages, 2 figuresSubjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
ABCI 3.0 is the latest version of the ABCI, a large-scale open AI infrastructure that AIST has been operating since August 2018 and will be fully operational in January 2025. ABCI 3.0 consists of computing servers equipped with 6128 of the NVIDIA H200 GPUs and an all-flash storage system. Its peak performance is 6.22 exaflops in half precision and 3.0 exaflops in single precision, which is 7 to 13 times faster than the previous system, ABCI 2.0. It also more than doubles both storage capacity and theoretical read/write performance. ABCI 3.0 is expected to accelerate research and development, evaluation, and workforce development of cutting-edge AI technologies, with a particular focus on generative AI.
- [94] arXiv:2411.09140 [pdf, html, other]
-
Title: Adversarial Vessel-Unveiling Semi-Supervised Segmentation for Retinopathy of Prematurity DiagnosisComments: 10 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Accurate segmentation of retinal images plays a crucial role in aiding ophthalmologists in diagnosing retinopathy of prematurity (ROP) and assessing its severity. However, due to their underdeveloped, thinner vessels, manual annotation in infant fundus images is very complex, and this presents challenges for fully supervised learning. To address the scarcity of annotations, we propose a semi supervised segmentation framework designed to advance ROP studies without the need for extensive manual vessel annotation. Unlike previous methods that rely solely on limited labeled data, our approach leverages teacher student learning by integrating two powerful components: an uncertainty weighted vessel unveiling module and domain adversarial learning. The vessel unveiling module helps the model effectively reveal obscured and hard to detect vessel structures, while adversarial training aligns feature representations across different domains, ensuring robust and generalizable vessel segmentations. We validate our approach on public datasets (CHASEDB, STARE) and an in-house ROP dataset, demonstrating its superior performance across multiple evaluation metrics. Additionally, we extend the model's utility to a downstream task of ROP multi-stage classification, where vessel masks extracted by our segmentation model improve diagnostic accuracy. The promising results in classification underscore the model's potential for clinical application, particularly in early-stage ROP diagnosis and intervention. Overall, our work offers a scalable solution for leveraging unlabeled data in pediatric ophthalmology, opening new avenues for biomarker discovery and clinical research.
- [95] arXiv:2411.09142 [pdf, html, other]
-
Title: Laplace Transform Interpretation of Differential PrivacySubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
We introduce a set of useful expressions of Differential Privacy (DP) notions in terms of the Laplace transform of the privacy loss distribution. Its bare form expression appears in several related works on analyzing DP, either as an integral or an expectation. We show that recognizing the expression as a Laplace transform unlocks a new way to reason about DP properties by exploiting the duality between time and frequency domains. Leveraging our interpretation, we connect the $(q, \rho(q))$-Rényi DP curve and the $(\epsilon, \delta(\epsilon))$-DP curve as being the Laplace and inverse-Laplace transforms of one another. This connection shows that the Rényi divergence is well-defined for complex orders $q = \gamma + i \omega$. Using our Laplace transform-based analysis, we also prove an adaptive composition theorem for $(\epsilon, \delta)$-DP guarantees that is exactly tight (i.e., matches even in constants) for all values of $\epsilon$. Additionally, we resolve an issue regarding symmetry of $f$-DP on subsampling that prevented equivalence across all functional DP notions.
- [96] arXiv:2411.09145 [pdf, html, other]
-
Title: UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction VideosSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Egocentric Hand Object Interaction (HOI) videos provide valuable insights into human interactions with the physical world, attracting growing interest from the computer vision and robotics communities. A key task in fully understanding the geometry and dynamics of HOI scenes is dense pointclouds sequence reconstruction. However, the inherent motion of both hands and the camera makes this challenging. Current methods often rely on time-consuming test-time optimization, making them impractical for reconstructing internet-scale videos. To address this, we introduce UniHOI, a model that unifies the estimation of all variables necessary for dense 4D reconstruction, including camera intrinsic, camera poses, and video depth, for egocentric HOI scene in a fast feed-forward manner. We end-to-end optimize all these variables to improve their consistency in 3D space. Furthermore, our model could be trained solely on large-scale monocular video dataset, overcoming the limitation of scarce labeled HOI data. We evaluate UniHOI with both in-domain and zero-shot generalization setting, surpassing all baselines in pointclouds sequence reconstruction and long-term 3D scene flow recovery. UniHOI is the first approach to offer fast, dense, and generalizable monocular egocentric HOI scene reconstruction in the presence of motion. Code and trained model will be released in the future.
- [97] arXiv:2411.09146 [pdf, html, other]
-
Title: Secrecy Energy Efficiency Maximization in IRS-Assisted VLC MISO Networks with RSMA: A DS-PPO approachYangbo Guo, Jianhui Fan, Ruichen Zhang, Baofang Chang, Derrick Wing Kwan Ng, Dusit Niyato, Dong In KimComments: 13 pages, 10 figuresSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This paper investigates intelligent reflecting surface (IRS)-assisted multiple-input single-output (MISO) visible light communication (VLC) networks utilizing the rate-splitting multiple access (RSMA) scheme. {In these networks,} an eavesdropper (Eve) attempts to eavesdrop on communications intended for legitimate users (LUs). To enhance information security and energy efficiency simultaneously, we formulate a secrecy energy efficiency (SEE) maximization problem. In the formulated problem, beamforming vectors, RSMA common rates, direct current (DC) bias, and IRS alignment matrices are jointly optimized subject to constraints on total power budget, quality of service (QoS) requirements, linear operating region of light emitting diodes (LEDs), and common information rate allocation. Due to the non-convex and NP-hard nature of the formulated problem, we propose a deep reinforcement learning (DRL)-based dual-sampling proximal policy optimization (DS-PPO) approach. {The approach leverages} dual sample strategies and generalized advantage estimation (GAE). In addition, to further simplify the design, we adopt the maximum ratio transmission (MRT) and zero-forcing (ZF) as beamforming vectors in the action space. Simulation results show that the proposed DS-PPO approach outperforms traditional baseline approaches in terms of achievable SEE and significantly improves convergence speed compared to the original PPO approach. Moreover, implementing the RSMA scheme and IRS contributes to overall system performance, {achieving approximately $19.67\%$ improvement over traditional multiple access schemes and $25.74\%$ improvement over networks without IRS deployment.
- [98] arXiv:2411.09148 [pdf, html, other]
-
Title: Toward Democratized Generative AI in Next-Generation Mobile Edge NetworksRuichen Zhang, Jiayi He, Xiaofeng Luo, Dusit Niyato, Jiawen Kang, Zehui Xiong, Yonghui Li, Biplab SikdarComments: 9 pages, 5 figuresSubjects: Networking and Internet Architecture (cs.NI)
The rapid development of generative AI technologies, including large language models (LLMs), has brought transformative changes to various fields. However, deploying such advanced models on mobile and edge devices remains challenging due to their high computational, memory, communication, and energy requirements. To address these challenges, we propose a model-centric framework for democratizing generative AI deployment on mobile and edge networks. First, we comprehensively review key compact model strategies, such as quantization, model pruning, and knowledge distillation, and present key performance metrics to optimize generative AI for mobile deployment. Next, we provide a focused review of mobile and edge networks, emphasizing the specific challenges and requirements of these environments. We further conduct a case study demonstrating the effectiveness of these strategies by deploying LLMs on real mobile edge devices. Experimental results highlight the practicality of democratized LLMs, with significant improvements in generalization accuracy, hallucination rate, accessibility, and resource consumption. Finally, we discuss potential research directions to further advance the deployment of generative AI in resource-constrained environments.
- [99] arXiv:2411.09151 [pdf, html, other]
-
Title: Mono2Stereo: Monocular Knowledge Transfer for Enhanced Stereo MatchingComments: 8 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
The generalization and performance of stereo matching networks are limited due to the domain gap of the existing synthetic datasets and the sparseness of GT labels in the real datasets. In contrast, monocular depth estimation has achieved significant advancements, benefiting from large-scale depth datasets and self-supervised strategies. To bridge the performance gap between monocular depth estimation and stereo matching, we propose leveraging monocular knowledge transfer to enhance stereo matching, namely Mono2Stereo. We introduce knowledge transfer with a two-stage training process, comprising synthetic data pre-training and real-world data fine-tuning. In the pre-training stage, we design a data generation pipeline that synthesizes stereo training data from monocular images. This pipeline utilizes monocular depth for warping and novel view synthesis and employs our proposed Edge-Aware (EA) inpainting module to fill in missing contents in the generated images. In the fine-tuning stage, we introduce a Sparse-to-Dense Knowledge Distillation (S2DKD) strategy encouraging the distributions of predictions to align with dense monocular depths. This strategy mitigates issues with edge blurring in sparse real-world labels and enhances overall consistency. Experimental results demonstrate that our pre-trained model exhibits strong zero-shot generalization capabilities. Furthermore, domain-specific fine-tuning using our pre-trained model and S2DKD strategy significantly increments in-domain performance. The code will be made available soon.
- [100] arXiv:2411.09152 [pdf, html, other]
-
Title: GRAINRec: Graph and Attention Integrated Approach for Real-Time Session-Based Item RecommendationsComments: Accepted to the 2024 IEEE International Conference on Big Data (IEEE BigData 2024)Subjects: Machine Learning (cs.LG)
Recent advancements in session-based recommendation models using deep learning techniques have demonstrated significant performance improvements. While they can enhance model sophistication and improve the relevance of recommendations, they also make it challenging to implement a scalable real-time solution. To addressing this challenge, we propose GRAINRec: a Graph and Attention Integrated session-based recommendation model that generates recommendations in real-time. Our scope of work is item recommendations in online retail where a session is defined as an ordered sequence of digital guest actions, such as page views or adds to cart. The proposed model generates recommendations by considering the importance of all items in the session together, letting us predict relevant recommendations dynamically as the session evolves. We also propose a heuristic approach to implement real-time inferencing that meets Target platform's service level agreement (SLA). The proposed architecture lets us predict relevant recommendations dynamically as the session evolves, rather than relying on pre-computed recommendations for each item. Evaluation results of the proposed model show an average improvement of 1.5% across all offline evaluation metrics. A/B tests done over a 2 week duration showed an increase of 10% in click through rate and 9% increase in attributable demand. Extensive ablation studies are also done to understand our model performance for different parameters.
- [101] arXiv:2411.09153 [pdf, html, other]
-
Title: VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot ManipulationComments: Accepted to NeurIPS 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Recent advancements utilizing large-scale video data for learning video generation models demonstrate significant potential in understanding complex physical dynamics. It suggests the feasibility of leveraging diverse robot trajectory data to develop a unified, dynamics-aware model to enhance robot manipulation. However, given the relatively small amount of available robot data, directly fitting data without considering the relationship between visual observations and actions could lead to suboptimal data utilization. To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two-stage training mechanism inspired by dual-process theory from neuroscience to enhance stability and improve data utilization efficiency. Specifically, in the first stage, VidMan is pre-trained on the Open X-Embodiment dataset (OXE) for predicting future visual trajectories in a video denoising diffusion manner, enabling the model to develop a long horizontal awareness of the environment's dynamics. In the second stage, a flexible yet effective layer-wise self-attention adapter is introduced to transform VidMan into an efficient inverse dynamics model that predicts action modulated by the implicit dynamics knowledge via parameter sharing. Our VidMan framework outperforms state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving a 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset. These results provide compelling evidence that world models can significantly enhance the precision of robot action prediction. Codes and models will be public.
- [102] arXiv:2411.09154 [pdf, html, other]
-
Title: STAR-RIS Enabled ISAC Systems: Joint Rate Splitting and Beamforming OptimizationComments: 13 pages, 9 figuresSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This paper delves into an integrated sensing and communication (ISAC) system bolstered by a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS). Within this system, a base station (BS) is equipped with communication and radar capabilities, enabling it to communicate with ground terminals (GTs) and concurrently probe for echo signals from a target of interest. Moreover, to manage interference and improve communication quality, the rate splitting multiple access (RSMA) scheme is incorporated into the system. The signal-to-interference-plus-noise ratio (SINR) of the received sensing echo signals is a measure of sensing performance. We formulate a joint optimization problem of common rates, transmit beamforming at the BS, and passive beamforming vectors of the STAR-RIS. The objective is to maximize sensing SINR while guaranteeing the communication rate requirements for each GT. We present an iterative algorithm to address the non-convex problem by invoking Dinkelbach's transform, semidefinite relaxation (SDR), majorization-minimization, and sequential rank-one constraint relaxation (SROCR) theories. Simulation results manifest that the performance of the studied ISAC network enhanced by the STAR-RIS and RSMA surpasses other benchmarks considerably. The results evidently indicate the superior performance improvement of the ISAC system with the proposed RSMA-based transmission strategy design and the dynamic optimization of both transmission and reflection beamforming at STAR-RIS.
- [103] arXiv:2411.09156 [pdf, html, other]
-
Title: DyGASR: Dynamic Generalized Exponential Splatting with Surface Alignment for Accelerated 3D Mesh ReconstructionSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Recent advancements in 3D Gaussian Splatting (3DGS), which lead to high-quality novel view synthesis and accelerated rendering, have remarkably improved the quality of radiance field reconstruction. However, the extraction of mesh from a massive number of minute 3D Gaussian points remains great challenge due to the large volume of Gaussians and difficulty of representation of sharp signals caused by their inherent low-pass characteristics. To address this issue, we propose DyGASR, which utilizes generalized exponential function instead of traditional 3D Gaussian to decrease the number of particles and dynamically optimize the representation of the captured signal. In addition, it is observed that reconstructing mesh with Generalized Exponential Splatting(GES) without modifications frequently leads to failures since the generalized exponential distribution centroids may not precisely align with the scene surface. To overcome this, we adopt Sugar's approach and introduce Generalized Surface Regularization (GSR), which reduces the smallest scaling vector of each point cloud to zero and ensures normal alignment perpendicular to the surface, facilitating subsequent Poisson surface mesh reconstruction. Additionally, we propose a dynamic resolution adjustment strategy that utilizes a cosine schedule to gradually increase image resolution from low to high during the training stage, thus avoiding constant full resolution, which significantly boosts the reconstruction speed. Our approach surpasses existing 3DGS-based mesh reconstruction methods, as evidenced by extensive evaluations on various scene datasets, demonstrating a 25\% increase in speed, and a 30\% reduction in memory usage.
- [104] arXiv:2411.09158 [pdf, html, other]
-
Title: The \emph{Optimist}: Towards Fully Automated Graph Theory ResearchSubjects: Artificial Intelligence (cs.AI); Combinatorics (math.CO)
This paper introduces the \emph{Optimist}, an autonomous system developed to advance automated conjecture generation in graph theory. Leveraging mixed-integer programming (MIP) and heuristic methods, the \emph{Optimist} generates conjectures that both rediscover established theorems and propose novel inequalities. Through a combination of memory-based computation and agent-like adaptability, the \emph{Optimist} iteratively refines its conjectures by integrating new data, enabling a feedback process with minimal human (\emph{or machine}) intervention. Initial experiments reveal the \emph{Optimist}'s potential to uncover foundational results in graph theory, as well as to produce conjectures of interest for future exploration. This work also outlines the \emph{Optimist}'s evolving integration with a counterpart agent, the \emph{Pessimist} (a human \emph{or machine} agent), to establish a dueling system that will drive fully automated graph theory research.
- [105] arXiv:2411.09159 [pdf, html, other]
-
Title: PIMCOMP: An End-to-End DNN Compiler for Processing-In-Memory AcceleratorsSubjects: Hardware Architecture (cs.AR)
Various processing-in-memory (PIM) accelerators based on various devices, micro-architectures, and interfaces have been proposed to accelerate deep neural networks (DNNs). How to deploy DNNs onto PIM-based accelerators is the key to explore PIM's high performance and energy efficiency. The scale of DNN models, the diversity of PIM accelerators, and the complexity of deployment are far beyond the human deployment capability. Hence, an automatic deployment methodology is indispensable. In this work, we propose PIMCOMP, an end-to-end DNN compiler tailored for PIM accelerators, achieving efficient deployment of DNN models on PIM hardware. PIMCOMP can adapt to various PIM architectures by using an abstract configurable PIM accelerator template with a set of pseudo-instructions, which is a high-level abstraction of the hardware's fundamental functionalities. Through a generic multi-level optimization framework, PIMCOMP realizes an end-to-end conversion from a high-level DNN description to pseudo-instructions, which can be further converted to specific hardware intrinsics/primitives. The compilation addresses two critical issues in PIM-accelerated inference from a system perspective: resource utilization and dataflow scheduling. PIMCOMP adopts a flexible unfolding format to reshape and partition convolutional layers, adopts a weight-layout guided computation-storage-mapping approach to enhance resource utilization, and balances the system's computation, memory access, and communication characteristics. For dataflow scheduling, we design two scheduling algorithms with different inter-layer pipeline granularities to support varying application scenarios while ensuring high computational parallelism. Experiments demonstrate that PIMCOMP improves throughput, latency, and energy efficiency across various architectures. PIMCOMP is open-sourced at \url{this https URL}.
- [106] arXiv:2411.09160 [pdf, html, other]
-
Title: Rationality based Innate-Values-driven Reinforcement LearningComments: arXiv admin note: substantial text overlap with arXiv:2401.05572Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Innate values describe agents' intrinsic motivations, which reflect their inherent interests and preferences to pursue goals and drive them to develop diverse skills satisfying their various needs. The essence of reinforcement learning (RL) is learning from interaction based on reward-driven behaviors, much like natural agents. It is an excellent model to describe the innate-values-driven (IV) behaviors of AI agents. Especially developing the awareness of the AI agent through balancing internal and external utilities based on its needs in different tasks is a crucial problem for individuals learning to support AI agents integrating human society with safety and harmony in the long term. This paper proposes a hierarchical compound intrinsic value reinforcement learning model -- innate-values-driven reinforcement learning termed IVRL to describe the complex behaviors of AI agents' interaction. We formulated the IVRL model and proposed two IVRL models: DQN and A2C. By comparing them with benchmark algorithms such as DQN, DDQN, A2C, and PPO in the Role-Playing Game (RPG) reinforcement learning test platform VIZDoom, we demonstrated that rationally organizing various individual needs can effectively achieve better performance.
- [107] arXiv:2411.09162 [pdf, html, other]
-
Title: An Asymptotic-Preserving Scheme for Isentropic Flow in Pipe NetworksSubjects: Numerical Analysis (math.NA)
We consider the simulation of isentropic flow in pipelines and pipe networks. Standard operating conditions in pipe networks suggest an emphasis to simulate low Mach and high friction regimes -- however, the system is stiff in these regimes and conventional explicit approximation techniques prove quite costly and often impractical. To combat these inefficiencies, we develop a novel asymptotic-preserving scheme that is uniformly consistent and stable for all Mach regimes. The proposed method for a single pipeline follows the flux splitting suggested in [Haack et al., Commun. Comput. Phys., 12 (2012), pp. 955--980], in which the flux is separated into stiff and non-stiff portions then discretized in time using an implicit-explicit approach. The non-stiff part is advanced in time by an explicit hyperbolic solver; we opt for the second-order central-upwind finite volume scheme. The stiff portion is advanced in time implicitly using an approach based on Rosenbrock-type Runge-Kutta methods, which ultimately reduces this implicit stage to a discretization of a linear elliptic equation.
To extend to full pipe networks, the scheme on a single pipeline is paired with coupling conditions defined at pipe-to-pipe intersections to ensure a mathematically well-posed problem. We show that the coupling conditions remain well-posed in the low Mach/high friction limit -- which, when used to define the ghost cells of each pipeline, results in a method that is accurate across these intersections in all regimes. The proposed method is tested on several numerical examples and produces accurate, non-oscillatory results with run times independent of the Mach number. - [108] arXiv:2411.09166 [pdf, html, other]
-
Title: Unstructured Text Enhanced Open-domain Dialogue System: A Systematic SurveyComments: 45 pages, 3 Figures, 11 TablesJournal-ref: ACM Transactions on Information Systems 40(1): 9:1-9:44 (2022)Subjects: Computation and Language (cs.CL)
Incorporating external knowledge into dialogue generation has been proven to benefit the performance of an open-domain Dialogue System (DS), such as generating informative or stylized responses, controlling conversation topics. In this article, we study the open-domain DS that uses unstructured text as external knowledge sources (\textbf{U}nstructured \textbf{T}ext \textbf{E}nhanced \textbf{D}ialogue \textbf{S}ystem, \textbf{UTEDS}). The existence of unstructured text entails distinctions between UTEDS and traditional data-driven DS and we aim to analyze these differences. We first give the definition of the UTEDS related concepts, then summarize the recently released datasets and models. We categorize UTEDS into Retrieval and Generative models and introduce them from the perspective of model components. The retrieval models consist of Fusion, Matching, and Ranking modules, while the generative models comprise Dialogue and Knowledge Encoding, Knowledge Selection, and Response Generation modules. We further summarize the evaluation methods utilized in UTEDS and analyze the current models' performance. At last, we discuss the future development trends of UTEDS, hoping to inspire new research in this field.
- [109] arXiv:2411.09167 [pdf, html, other]
-
Title: Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature AugmentationSubjects: Sound (cs.SD); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
AI-synthesized speech, also known as deepfake speech, has recently raised significant concerns due to the rapid advancement of speech synthesis and speech conversion techniques. Previous works often rely on distinguishing synthesizer artifacts to identify deepfake speech. However, excessive reliance on these specific synthesizer artifacts may result in unsatisfactory performance when addressing speech signals created by unseen synthesizers. In this paper, we propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features as complementary for detection. Specifically, we propose a dual-stream feature decomposition learning strategy that decomposes the learned speech representation using a synthesizer stream and a content stream. The synthesizer stream specializes in learning synthesizer features through supervised training with synthesizer labels. Meanwhile, the content stream focuses on learning synthesizer-independent content features, enabled by a pseudo-labeling-based supervised learning method. This method randomly transforms speech to generate speed and compression labels for training. Additionally, we employ an adversarial learning technique to reduce the synthesizer-related components in the content stream. The final classification is determined by concatenating the synthesizer and content features. To enhance the model's robustness to different synthesizer characteristics, we further propose a synthesizer feature augmentation strategy that randomly blends the characteristic styles within real and fake audio features and randomly shuffles the synthesizer features with the content features. This strategy effectively enhances the feature diversity and simulates more feature combinations.
- [110] arXiv:2411.09168 [pdf, html, other]
-
Title: Theory of Mind Enhances Collective IntelligenceComments: 20 pages, 2 figures, 1 tableSubjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Adaptation and Self-Organizing Systems (nlin.AO)
Collective Intelligence plays a central role in a large variety of fields, from economics and evolutionary theory to neural networks and eusocial insects, and it is also core to much of the work on emergence and self-organisation in complex systems theory. However, in human collective intelligence there is still much more to be understood in the relationship between specific psychological processes at the individual level and the emergence of self-organised structures at the social level. Previously psychological factors have played a relatively minor role in the study of collective intelligence as the principles are often quite general and applicable to humans just as readily as insects or other agents without sophisticated psychologies. In this article we emphasise, with examples from other complex adaptive systems, the broad applicability of collective intelligence principles while the mechanisms and time-scales differ significantly between examples. We contend that flexible collective intelligence in human social settings is improved by our use of a specific cognitive tool: our Theory of Mind. We identify several key characteristics of psychologically mediated collective intelligence and show that the development of a Theory of Mind is a crucial factor distinguishing social collective intelligence from general collective intelligence. We then place these capabilities in the context of the next steps in artificial intelligence embedded in a future that includes an effective human-AI hybrid social ecology.
- [111] arXiv:2411.09169 [pdf, html, other]
-
Title: Artificial Theory of Mind and Self-Guided Social OrganisationComments: 4 pagesSubjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Adaptation and Self-Organizing Systems (nlin.AO)
One of the challenges artificial intelligence (AI) faces is how a collection of agents coordinate their behaviour to achieve goals that are not reachable by any single agent. In a recent article by Ozmen et al this was framed as one of six grand challenges: That AI needs to respect human cognitive processes at the human-AI interaction frontier. We suggest that this extends to the AI-AI frontier and that it should also reflect human psychology, as it is the only successful framework we have from which to build out. In this extended abstract we first make the case for collective intelligence in a general setting, drawing on recent work from single neuron complexity in neural networks and ant network adaptability in ant colonies. From there we introduce how species relate to one another in an ecological network via niche selection, niche choice, and niche conformity with the aim of forming an analogy with human social network development as new agents join together and coordinate. From there we show how our social structures are influenced by our neuro-physiology, our psychology, and our language. This emphasises how individual people within a social network influence the structure and performance of that network in complex tasks, and that cognitive faculties such as Theory of Mind play a central role. We finish by discussing the current state of the art in AI and where there is potential for further development of a socially embodied collective artificial intelligence that is capable of guiding its own social structures.
- [112] arXiv:2411.09170 [pdf, html, other]
-
Title: Towards Scalable Handwriting Communication via EEG Decoding and Latent Embedding IntegrationComments: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer InterfaceSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
In recent years, brain-computer interfaces have made advances in decoding various motor-related tasks, including gesture recognition and movement classification, utilizing electroencephalogram (EEG) data. These developments are fundamental in exploring how neural signals can be interpreted to recognize specific physical actions. This study centers on a written alphabet classification task, where we aim to decode EEG signals associated with handwriting. To achieve this, we incorporate hand kinematics to guide the extraction of the consistent embeddings from high-dimensional neural recordings using auxiliary variables (CEBRA). These CEBRA embeddings, along with the EEG, are processed by a parallel convolutional neural network model that extracts features from both data sources simultaneously. The model classifies nine different handwritten characters, including symbols such as exclamation marks and commas, within the alphabet. We evaluate the model using a quantitative five-fold cross-validation approach and explore the structure of the embedding space through visualizations. Our approach achieves a classification accuracy of 91 % for the nine-class task, demonstrating the feasibility of fine-grained handwriting decoding from EEG.
- [113] arXiv:2411.09171 [pdf, html, other]
-
Title: Optimizing Metamorphic Testing: Prioritizing Relations Through Execution Profile DissimilaritySubjects: Software Engineering (cs.SE)
An oracle determines whether the output of a program for executed test cases is correct. For machine learning programs, such an oracle is often unavailable or impractical to apply. Metamorphic testing addresses this by using metamorphic relations (MRs), which are essential properties of the software under test, to verify program correctness. Prioritizing MRs enhances fault detection effectiveness and improves testing efficiency. However, fault-based MR prioritization is costly, and code coverage-based approaches often yield inaccurate results. To address these challenges, we propose a statement centrality-based prioritization approach that leverages diversity in the execution profiles of source and follow-up test cases. Our approach improves fault detection effectiveness by up to 31% compared to code coverage-based methods and reduces fault detection time by 29% compared to random MR execution. Overall, it increases the average fault detection rate, reduces fault detection time, and enhances testing efficiency.
- [114] arXiv:2411.09174 [pdf, html, other]
-
Title: Advancing Diffusion Models: Alias-Free Resampling and Enhanced Rotational EquivarianceComments: 13 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Recent advances in image generation, particularly via diffusion models, have led to impressive improvements in image synthesis quality. Despite this, diffusion models are still challenged by model-induced artifacts and limited stability in image fidelity. In this work, we hypothesize that the primary cause of this issue is the improper resampling operation that introduces aliasing in the diffusion model and a careful alias-free resampling dictated by image processing theory can improve the model's performance in image synthesis. We propose the integration of alias-free resampling layers into the UNet architecture of diffusion models without adding extra trainable parameters, thereby maintaining computational efficiency. We then assess whether these theory-driven modifications enhance image quality and rotational equivariance. Our experimental results on benchmark datasets, including CIFAR-10, MNIST, and MNIST-M, reveal consistent gains in image quality, particularly in terms of FID and KID scores. Furthermore, we propose a modified diffusion process that enables user-controlled rotation of generated images without requiring additional training. Our findings highlight the potential of theory-driven enhancements such as alias-free resampling in generative models to improve image quality while maintaining model efficiency and pioneer future research directions to incorporate them into video-generating diffusion models, enabling deeper exploration of the applications of alias-free resampling in generative modeling.
- [115] arXiv:2411.09176 [pdf, html, other]
-
Title: Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual ForagingSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Imagine searching a collection of coins for quarters ($0.25$), dimes ($0.10$), nickels ($0.05$), and pennies ($0.01$)-a hybrid foraging task where observers look for multiple instances of multiple target types. In such tasks, how do target values and their prevalence influence foraging and eye movement behaviors (e.g., should you prioritize rare quarters or common nickels)? To explore this, we conducted human psychophysics experiments, revealing that humans are proficient reward foragers. Their eye fixations are drawn to regions with higher average rewards, fixation durations are longer on more valuable targets, and their cumulative rewards exceed chance, approaching the upper bound of optimal foragers. To probe these decision-making processes of humans, we developed a transformer-based Visual Forager (VF) model trained via reinforcement learning. Our VF model takes a series of targets, their corresponding values, and the search image as inputs, processes the images using foveated vision, and produces a sequence of eye movements along with decisions on whether to collect each fixated item. Our model outperforms all baselines, achieves cumulative rewards comparable to those of humans, and approximates human foraging behavior in eye movements and foraging biases within time-limited environments. Furthermore, stress tests on out-of-distribution tasks with novel targets, unseen values, and varying set sizes demonstrate the VF model's effective generalization. Our work offers valuable insights into the relationship between eye movements and decision-making, with our model serving as a powerful tool for further exploration of this connection. All data, code, and models will be made publicly available.
- [116] arXiv:2411.09177 [pdf, html, other]
-
Title: Enhancing reinforcement learning for population setpoint tracking in co-culturesSebastián Espinel-Ríos, Joyce Qiaoxi Mo, Dongda Zhang, Ehecatl Antonio del Rio-Chanona, José L. AvalosSubjects: Systems and Control (eess.SY)
Efficient multiple setpoint tracking can enable advanced biotechnological applications, such as maintaining desired population levels in co-cultures for optimal metabolic division of labor. In this study, we employ reinforcement learning as a control method for population setpoint tracking in co-cultures, focusing on policy-gradient techniques where the control policy is parameterized by neural networks. However, achieving accurate tracking across multiple setpoints is a significant challenge in reinforcement learning, as the agent must effectively balance the contributions of various setpoints to maximize the expected system performance. Traditional return functions, such as those based on a quadratic cost, often yield suboptimal performance due to their inability to efficiently guide the agent toward the simultaneous satisfaction of all setpoints. To overcome this, we propose a novel return function that rewards the simultaneous satisfaction of multiple setpoints and diminishes overall reward gains otherwise, accounting for both stage and terminal system performance. This return function includes parameters to fine-tune the desired smoothness and steepness of the learning process. We demonstrate our approach considering an $\textit{Escherichia coli}$ co-culture in a chemostat with optogenetic control over amino acid synthesis pathways, leveraging auxotrophies to modulate growth.
- [117] arXiv:2411.09178 [pdf, html, other]
-
Title: SAFES: Sequential Privacy and Fairness Enhancing Data Synthesis for Responsible AISubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
As data-driven and AI-based decision making gains widespread adoption in most disciplines, it is crucial that both data privacy and decision fairness are appropriately addressed. While differential privacy (DP) provides a robust framework for guaranteeing privacy and several widely accepted methods have been proposed for improving fairness, the vast majority of existing literature treats the two concerns independently. For methods that do consider privacy and fairness simultaneously, they often only apply to a specific machine learning task, limiting their generalizability. In response, we introduce SAFES, a Sequential PrivAcy and Fairness Enhancing data Synthesis procedure that sequentially combines DP data synthesis with a fairness-aware data transformation. SAFES allows full control over the privacy-fairness-utility trade-off via tunable privacy and fairness parameters. We illustrate SAFES by combining AIM, a graphical model-based DP data synthesizer, with a popular fairness-aware data pre-processing transformation. Empirical evaluations on the Adult and COMPAS datasets demonstrate that for reasonable privacy loss, SAFES-generated synthetic data achieve significantly improved fairness metrics with relatively low utility loss.
- [118] arXiv:2411.09180 [pdf, html, other]
-
Title: LEAP:D -- A Novel Prompt-based Approach for Domain-Generalized Aerial Object DetectionComments: ICIP 2024 Workshop accepted paperSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Drone-captured images present significant challenges in object detection due to varying shooting conditions, which can alter object appearance and shape. Factors such as drone altitude, angle, and weather cause these variations, influencing the performance of object detection algorithms. To tackle these challenges, we introduce an innovative vision-language approach using learnable prompts. This shift from conventional manual prompts aims to reduce domain-specific knowledge interference, ultimately improving object detection capabilities. Furthermore, we streamline the training process with a one-step approach, updating the learnable prompt concurrently with model training, enhancing efficiency without compromising performance. Our study contributes to domain-generalized object detection by leveraging learnable prompts and optimizing training processes. This enhances model robustness and adaptability across diverse environments, leading to more effective aerial object detection.
- [119] arXiv:2411.09181 [pdf, html, other]
-
Title: DeBaTeR: Denoising Bipartite Temporal Graph for RecommendationSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Due to the difficulty of acquiring large-scale explicit user feedback, implicit feedback (e.g., clicks or other interactions) is widely applied as an alternative source of data, where user-item interactions can be modeled as a bipartite graph. Due to the noisy and biased nature of implicit real-world user-item interactions, identifying and rectifying noisy interactions are vital to enhance model performance and robustness. Previous works on purifying user-item interactions in collaborative filtering mainly focus on mining the correlation between user/item embeddings and noisy interactions, neglecting the benefit of temporal patterns in determining noisy interactions. Time information, while enhancing the model utility, also bears its natural advantage in helping to determine noisy edges, e.g., if someone usually watches horror movies at night and talk shows in the morning, a record of watching a horror movie in the morning is more likely to be noisy interaction. Armed with this observation, we introduce a simple yet effective mechanism for generating time-aware user/item embeddings and propose two strategies for denoising bipartite temporal graph in recommender systems (DeBaTeR): the first is through reweighting the adjacency matrix (DeBaTeR-A), where a reliability score is defined to reweight the edges through both soft assignment and hard assignment; the second is through reweighting the loss function (DeBaTeR-L), where weights are generated to reweight user-item samples in the losses. Extensive experiments have been conducted to demonstrate the efficacy of our methods and illustrate how time information indeed helps identifying noisy edges.
- [120] arXiv:2411.09184 [pdf, other]
-
Title: Dynamic technology impact analysis: A multi-task learning approach to patent citation predictionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Machine learning (ML) models are valuable tools for analyzing the impact of technology using patent citation information. However, existing ML-based methods often struggle to account for the dynamic nature of the technology impact over time and the interdependencies of these impacts across different periods. This study proposes a multi-task learning (MTL) approach to enhance the prediction of technology impact across various time frames by leveraging knowledge sharing and simultaneously monitoring the evolution of technology impact. First, we quantify the technology impacts and identify patterns through citation analysis over distinct time periods. Next, we develop MTL models to predict citation counts using multiple patent indicators over time. Finally, we examine the changes in key input indicators and their patterns over different periods using the SHapley Additive exPlanation method. We also offer guidelines for validating and interpreting the results by employing statistical methods and natural language processing techniques. A case study on battery technologies demonstrates that our approach not only deepens the understanding of technology impact, but also improves prediction accuracy, yielding valuable insights for both academia and industry.
- [121] arXiv:2411.09189 [pdf, html, other]
-
Title: Improvement and Implementation of a Speech Emotion Recognition Model Based on Dual-Layer LSTMSubjects: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper builds upon an existing speech emotion recognition model by adding an additional LSTM layer to improve the accuracy and processing efficiency of emotion recognition from audio data. By capturing the long-term dependencies within audio sequences through a dual-layer LSTM network, the model can recognize and classify complex emotional patterns more accurately. Experiments conducted on the RAVDESS dataset validated this approach, showing that the modified dual layer LSTM model improves accuracy by 2% compared to the single-layer LSTM while significantly reducing recognition latency, thereby enhancing real-time performance. These results indicate that the dual-layer LSTM architecture is highly suitable for handling emotional features with long-term dependencies, providing a viable optimization for speech emotion recognition systems. This research provides a reference for practical applications in fields like intelligent customer service, sentiment analysis and human-computer interaction.
- [122] arXiv:2411.09198 [pdf, html, other]
-
Title: Risk-aware MPPI for Stochastic Hybrid SystemsHardik Parwana, Mitchell Black, Bardh Hoxha, Hideki Okamoto, Georgios Fainekos, Danil Prokhorov, Dimitra PanagouSubjects: Robotics (cs.RO)
Path Planning for stochastic hybrid systems presents a unique challenge of predicting distributions of future states subject to a state-dependent dynamics switching function. In this work, we propose a variant of Model Predictive Path Integral Control (MPPI) to plan kinodynamic paths for such systems. Monte Carlo may be inaccurate when few samples are chosen to predict future states under state-dependent disturbances. We employ recently proposed Unscented Transform-based methods to capture stochasticity in the states as well as the state-dependent switching surfaces. This is in contrast to previous works that perform switching based only on the mean of predicted states. We focus our motion planning application on the navigation of a mobile robot in the presence of dynamically moving agents whose responses are based on sensor-constrained attention zones. We evaluate our framework on a simulated mobile robot and show faster convergence to a goal without collisions when the robot exploits the hybrid human dynamics versus when it does not.
- [123] arXiv:2411.09199 [pdf, other]
-
Title: Ghost-Connect Net: A Generalization-Enhanced Guidance For Sparse Deep Networks Under Distribution ShiftsComments: 21 pages, 4 figures, 3 subfigures, 42 tablesSubjects: Machine Learning (cs.LG)
Sparse deep neural networks (DNNs) excel in real-world applications like robotics and computer vision, by reducing computational demands that hinder usability. However, recent studies aim to boost DNN efficiency by trimming redundant neurons or filters based on task relevance, but neglect their adaptability to distribution shifts. We aim to enhance these existing techniques by introducing a companion network, Ghost Connect-Net (GC-Net), to monitor the connections in the original network with distribution generalization advantage. GC-Net's weights represent connectivity measurements between consecutive layers of the original network. After pruning GC-Net, the pruned locations are mapped back to the original network as pruned connections, allowing for the combination of magnitude and connectivity-based pruning methods. Experimental results using common DNN benchmarks, such as CIFAR-10, Fashion MNIST, and Tiny ImageNet show promising results for hybridizing the method, and using GC-Net guidance for later layers of a network and direct pruning on earlier layers. We provide theoretical foundations for GC-Net's approach to improving generalization under distribution shifts.
- [124] arXiv:2411.09200 [pdf, other]
-
Title: Advancing Software Security and Reliability in Cloud Platforms through AI-based Anomaly DetectionComments: 10 pagesSubjects: Software Engineering (cs.SE)
Continuous Integration/Continuous Deployment (CI/CD) is fundamental for advanced software development, supporting faster and more efficient delivery of code changes into cloud environments. However, security issues in the CI/CD pipeline remain challenging, and incidents (e.g., DDoS, Bot, Log4j, etc.) are happening over the cloud environments. While plenty of literature discusses static security testing and CI/CD practices, only a few deal with network traffic pattern analysis to detect different cyberattacks. This research aims to enhance CI/CD pipeline security by implementing anomaly detection through AI (Artificial Intelligence) support. The goal is to identify unusual behaviour or variations from network traffic patterns in pipeline and cloud platforms. The system shall integrate into the workflow to continuously monitor pipeline activities and cloud infrastructure. Additionally, it aims to explore adaptive response mechanisms to mitigate the detected anomalies or security threats. This research employed two popular network traffic datasets, CSE-CIC-IDS2018 and CSE-CIC-IDS2017. We implemented a combination of Convolution Neural Network(CNN) and Long Short-Term Memory (LSTM) to detect unusual traffic patterns. We achieved an accuracy of 98.69% and 98.30% and generated log files in different CI/CD pipeline stages that resemble the network anomalies affected to address security challenges in modern DevOps practices, contributing to advancing software security and reliability.
- [125] arXiv:2411.09201 [pdf, html, other]
-
Title: Noncontact Multi-Point Vital Sign Monitoring with mmWave MIMO RadarComments: 15 pagesSubjects: Hardware Architecture (cs.AR); Signal Processing (eess.SP)
Multi-point vital sign monitoring is essential for providing detailed insights into physiological changes. Traditional single-sensor approaches are inadequate for capturing multi-point vibrations. Existing contact-based solutions, while addressing this need, can cause discomfort and skin allergies, whereas noncontact optical and acoustic methods are highly susceptible to light interference and environmental noise. In this paper, we aim to develop a non-contact, multi-point vital sign monitoring technique using MIMO radar, focused on physically differentiating and precisely measuring chest-wall surface vibrations at multiple points induced by cardiopulmonary mechanical activity. The primary challenges in developing such a technique involve developing algorithms to extract and separate entangled signals, as well as establishing a reliable method for validating detection accuracy. To address these limitations, we introduce MultiVital, a wireless system that leverages mmWave Multiple-input Multiple-output (MIMO) radar for synchronous multi-point vital sign monitoring. It integrates two reference modalities: five-channel seismocardiography (SCG) sensors and a one-channel electrocardiogram (ECG) electrode, enabling comprehensive radar-based research and performance validation across multiple physiological metrics. Additionally, we have developed a multi-modal signal processing framework, consisting of a radar signal processing module, an SCG calibration module, and a spatial alignment scheme. To evaluate the radar signal processing module, we conducted mathematical derivation and simulation. The experimental results indicate that the noncontact MultiVital system achieves multi-point synchronous monitoring with high precision, highly consistent with the results from reference modalities.
- [126] arXiv:2411.09205 [pdf, html, other]
-
Title: FlexFlood: Efficiently Updatable Learned Multi-dimensional IndexSubjects: Data Structures and Algorithms (cs.DS)
A learned multi-dimensional index is a data structure that efficiently answers multi-dimensional orthogonal queries by understanding the data distribution using machine learning models. One of the existing problems is that the search performance significantly decreases when the distribution of data stored in the data structure becomes skewed due to update operations. To overcome this problem, we propose FlexFlood, a flexible variant of Flood. FlexFlood partially reconstructs the internal structure when the data distribution becomes skewed. Moreover, FlexFlood is the first learned multi-dimensional index that guarantees the time complexity of the update operation. Through experiments using both artificial and real-world data, we demonstrate that the search performance when the data distribution becomes skewed is up to 10 times faster than existing methods. We also found that partial reconstruction takes only about twice as much time as naive data updating.
- [127] arXiv:2411.09209 [pdf, html, other]
-
Title: JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Audio-driven portrait animation has made significant advances with diffusion-based models, improving video quality and lipsync accuracy. However, the increasing complexity of these models has led to inefficiencies in training and inference, as well as constraints on video length and inter-frame continuity. In this paper, we propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. Then, in the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity. Finally, a generator trained in the first stage uses the 3D facial representation and the generated motion sequences as inputs to render high-quality animations. With the decoupled facial representation and the identity-independent motion generation process, JoyVASA extends beyond human portraits to animate animal faces seamlessly. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the applications in portrait animation. The code will be available at: this https URL.
- [128] arXiv:2411.09211 [pdf, other]
-
Title: Dynamic Neural Communication: Convergence of Computer Vision and Brain-Computer InterfaceComments: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer InterfaceSubjects: Artificial Intelligence (cs.AI)
Interpreting human neural signals to decode static speech intentions such as text or images and dynamic speech intentions such as audio or video is showing great potential as an innovative communication tool. Human communication accompanies various features, such as articulatory movements, facial expressions, and internal speech, all of which are reflected in neural signals. However, most studies only generate short or fragmented outputs, while providing informative communication by leveraging various features from neural signals remains challenging. In this study, we introduce a dynamic neural communication method that leverages current computer vision and brain-computer interface technologies. Our approach captures the user's intentions from neural signals and decodes visemes in short time steps to produce dynamic visual outputs. The results demonstrate the potential to rapidly capture and reconstruct lip movements during natural speech attempts from human neural signals, enabling dynamic neural communication through the convergence of computer vision and brain--computer interface.
- [129] arXiv:2411.09213 [pdf, html, other]
-
Title: Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question AnsweringSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those from medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets for testing LLMs' ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provides valuable insights and future directions for developing RAG systems in this critical medical domain.
- [130] arXiv:2411.09214 [pdf, html, other]
-
Title: HateGPT: Unleashing GPT-3.5 Turbo to Combat Hate Speech on XComments: Accepted at FIRE 2024 (Track: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC)). arXiv admin note: text overlap with arXiv:2411.05039, arXiv:2411.06946Subjects: Computation and Language (cs.CL)
The widespread use of social media platforms like Twitter and Facebook has enabled people of all ages to share their thoughts and experiences, leading to an immense accumulation of user-generated content. However, alongside the benefits, these platforms also face the challenge of managing hate speech and offensive content, which can undermine rational discourse and threaten democratic values. As a result, there is a growing need for automated methods to detect and mitigate such content, especially given the complexity of conversations that may require contextual analysis across multiple languages, including code-mixed languages like Hinglish, German-English, and Bangla. We participated in the English task where we have to classify English tweets into two categories namely Hate and Offensive and Non Hate-Offensive. In this work, we experiment with state-of-the-art large language models like GPT-3.5 Turbo via prompting to classify tweets into Hate and Offensive or Non Hate-Offensive. In this study, we evaluate the performance of a classification model using Macro-F1 scores across three distinct runs. The Macro-F1 score, which balances precision and recall across all classes, is used as the primary metric for model evaluation. The scores obtained are 0.756 for run 1, 0.751 for run 2, and 0.754 for run 3, indicating a high level of performance with minimal variance among the runs. The results suggest that the model consistently performs well in terms of precision and recall, with run 1 showing the highest performance. These findings highlight the robustness and reliability of the model across different runs.
- [131] arXiv:2411.09217 [pdf, html, other]
-
Title: SmartInv: Multimodal Learning for Smart Contract Invariant InferenceSubjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Programming Languages (cs.PL)
Smart contracts are software programs that enable diverse business activities on the blockchain. Recent research has identified new classes of "machine un-auditable" bugs that arise from both transactional contexts and source code. Existing detection methods require human understanding of underlying transaction logic and manual reasoning across different sources of context (i.e. modalities), such as code, dynamic transaction executions, and natural language specifying the expected transaction behavior.
To automate the detection of ``machine un-auditable'' bugs, we present SmartInv, an accurate and fast smart contract invariant inference framework. Our key insight is that the expected behavior of smart contracts, as specified by invariants, relies on understanding and reasoning across multimodal information, such as source code and natural language. We propose a new prompting strategy to foundation models, Tier of Thought (ToT), to reason across multiple modalities of smart contracts and ultimately to generate invariants. By checking the violation of these generated invariants, SmartInv can identify potential vulnerabilities.
We evaluate SmartInv on real-world contracts and re-discover bugs that resulted in multi-million dollar losses over the past 2.5 years (from January 1, 2021 to May 31, 2023). Our extensive evaluation shows that SmartInv generates (3.5X) more bug-critical invariants and detects (4$\times$) more critical bugs compared to the state-of-the-art tools in significantly (150X) less time. \sys uncovers 119 zero-day vulnerabilities from the 89,621 real-world contracts. Among them, five are critical zero-day bugs confirmed by developers as ``high severity.'' - [132] arXiv:2411.09219 [pdf, html, other]
-
Title: Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary SegmentationComments: 12 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. While previous adaptations addressed spatial invariance semantic by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates Segment-Anything Model (SAM) to tackle the resolution issue since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. Besides, we propose a refinement strategy for CLIP's coarse segmentation outputs by transforming them into prompts for SAM, further enhancing the segmentation performance. Trident achieves a significant improvement in the mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to this http URL is available at this https URL.
- [133] arXiv:2411.09222 [pdf, html, other]
-
Title: Toward Democracy Levels for AIAviv Ovadya, Luke Thorburn, Kyle Redman, Flynn Devine, Smitha Milli, Manon Revel, Andrew Konya, Atoosa KasirzadehComments: 11 pages. Accepted to the Workshop on Pluralistic Alignment at NeurIPS 2024Subjects: Computers and Society (cs.CY)
There is increasing concern about the unilateral power of the organizations involved in the development, alignment, and governance of AI. Recent pilots - such as Meta's Community Forums and Anthropic's Collective Constitutional AI - have illustrated a promising direction, where democratic processes might be used to meaningfully improve public involvement and trust in critical decisions. However, there is no standard framework for evaluating such processes. In this paper, building on insights from the theory and practice of deliberative democracy, we provide a "Democracy Levels" framework for evaluating the degree to which decisions in a given domain are made democratically. The framework can be used (i) to define milestones in a roadmap for the democratic AI, pluralistic AI, and public AI ecosystems, (ii) to guide organizations that need to increase the legitimacy of their decisions on difficult AI governance questions, and (iii) as a rubric by those aiming to evaluate AI organizations and keep them accountable.
- [134] arXiv:2411.09224 [pdf, html, other]
-
Title: Programming with AI: Evaluating ChatGPT, Gemini, AlphaCode, and GitHub Copilot for ProgrammersComments: 8 pagesSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Our everyday lives now heavily rely on artificial intelligence (AI) powered large language models (LLMs). Like regular users, programmers are also benefiting from the newest large language models. In response to the critical role that AI models play in modern software development, this study presents a thorough evaluation of leading programming assistants, including ChatGPT, Gemini(Bard AI), AlphaCode, and GitHub Copilot. The evaluation is based on tasks like natural language processing and code generation accuracy in different programming languages like Java, Python and C++. Based on the results, it has emphasized their strengths and weaknesses and the importance of further modifications to increase the reliability and accuracy of the latest popular models. Although these AI assistants illustrate a high level of progress in language understanding and code generation, along with ethical considerations and responsible usage, they provoke a necessity for discussion. With time, developing more refined AI technology is essential for achieving advanced solutions in various fields, especially with the knowledge of the feature intricacies of these models and their implications. This study offers a comparison of different LLMs and provides essential feedback on the rapidly changing area of AI models. It also emphasizes the need for ethical developmental practices to actualize AI models' full potential.
- [135] arXiv:2411.09228 [pdf, html, other]
-
Title: Injection Attacks Against End-to-End Encrypted ApplicationsComments: Published in IEEE Security and Privacy 2024Subjects: Cryptography and Security (cs.CR)
We explore an emerging threat model for end-to-end (E2E) encrypted applications: an adversary sends chosen messages to a target client, thereby "injecting" adversarial content into the application state. Such state is subsequently encrypted and synchronized to an adversarially-visible storage. By observing the lengths of the resulting cloud-stored ciphertexts, the attacker backs out confidential information. We investigate this injection threat model in the context of state-of-the-art encrypted messaging applications that support E2E encrypted backups. We show proof-of-concept attacks that can recover information about E2E encrypted messages or attachments sent via WhatsApp, assuming the ability to compromise the target user's Google or Apple account (which gives access to encrypted backups). We also show weaknesses in Signal's encrypted backup design that would allow injection attacks to infer metadata including a target user's number of contacts and conversations, should the adversary somehow obtain access to the user's encrypted Signal backup. While we do not believe our results should be of immediate concern for users of these messaging applications, our results do suggest that more work is needed to build tools that enjoy strong E2E security guarantees.
- [136] arXiv:2411.09229 [pdf, html, other]
-
Title: Efficient and Secure Cross-Domain Data-Sharing for Resource-Constrained Internet of ThingsComments: 15 pages,10 figures, submitted to Transactions on Information Forensics & Security in 19-Sep-2024Subjects: Cryptography and Security (cs.CR)
The growing complexity of Internet of Things (IoT) environments, particularly in cross-domain data sharing, presents significant security challenges. Existing data-sharing schemes often rely on computationally expensive cryptographic operations and centralized key management, limiting their effectiveness for resource-constrained devices. To address these issues, we propose an efficient, secure blockchain-based data-sharing scheme. First, our scheme adopts a distributed key generation method, which avoids single point of failure. This method also allows independent pseudonym generation and key updates, enhancing authentication flexibility while reducing computational overhead. Additionally, the scheme provides a complete data-sharing process, covering data uploading, storage, and sharing, while ensuring data traceability, integrity, and privacy. Security analysis shows that the proposed scheme is theoretically secure and resistant to various attacks, while performance evaluations demonstrate lower computational and communication overhead compared to existing solutions, making it both secure and efficient for IoT applications.
- [137] arXiv:2411.09231 [pdf, html, other]
-
Title: AEAKA: An Adaptive and Efficient Authentication and Key Agreement Scheme for IoT in Cloud-Edge-Device Collaborative EnvironmentsComments: 17 pages,14 figures,submitted to Transactions on Dependable and Secure Computing in 30-May-2024Subjects: Cryptography and Security (cs.CR)
To meet the diverse needs of users, the rapid advancement of cloud-edge-device collaboration has become a standard practice. However, this complex environment, particularly in untrusted (non-collaborative) scenarios, presents numerous security challenges. Authentication acts as the first line of defense and is fundamental to addressing these issues. Although many authentication and key agreement schemes exist, they often face limitations, such as being tailored to overly specific scenarios where devices authenticate solely with either the edge or the cloud, or being unsuitable for resource-constrained devices. To address these challenges, we propose an adaptive and efficient authentication and key agreement scheme (AEAKA) for Cloud-Edge-Device IoT environments. This scheme is highly adaptive and scalable, capable of automatically and dynamically initiating different authentication methods based on device requirements. Additionally, it employs an edge-assisted authentication approach to reduce the load on third-party trust authorities. Furthermore, we introduce a hash-based algorithm for the authentication protocol, ensuring a lightweight method suitable for a wide range of resource-constrained devices while maintaining security. AEAKA ensures that entities use associated authentication credentials, enhancing the privacy of the authentication process. Security proofs and performance analyses demonstrate that AEAKA outperforms other methods in terms of security and authentication efficiency.
- [138] arXiv:2411.09237 [pdf, html, other]
-
Title: Unsupervised Physics-Informed Neural Network-based Nonlinear Observer design for autonomous systems using contraction analysisSubjects: Systems and Control (eess.SY)
Contraction analysis offers, through elegant mathematical developments, a unified way of designing observers for a general class of nonlinear systems, where the observer correction term is obtained by solving an infinite dimensional inequality that guarantees global exponential convergence. However, solving the matrix partial differential inequality involved in contraction analysis design is both analytically and numerically challenging and represents a long-lasting challenge that prevented its wide use. Therefore, the present paper proposes a novel approach that relies on an unsupervised Physics Informed Neural Network (PINN) to design the observer's correction term by enforcing the partial differential inequality in the loss function. The performance of the proposed PINN-based nonlinear observer is assessed in numerical simulation as well as its robustness to measurement noise and neural network approximation error.
- [139] arXiv:2411.09238 [pdf, html, other]
-
Title: Rethinking the "Heatmap + Monte Carlo Tree Search" Paradigm for Solving Large Scale TSPSubjects: Machine Learning (cs.LG)
The Travelling Salesman Problem (TSP) remains a fundamental challenge in combinatorial optimization, inspiring diverse algorithmic strategies. This paper revisits the "heatmap + Monte Carlo Tree Search (MCTS)" paradigm that has recently gained traction for learning-based TSP solutions. Within this framework, heatmaps encode the likelihood of edges forming part of the optimal tour, and MCTS refines this probabilistic guidance to discover optimal solutions. Contemporary approaches have predominantly emphasized the refinement of heatmap generation through sophisticated learning models, inadvertently sidelining the critical role of MCTS. Our extensive empirical analysis reveals two pivotal insights: 1) The configuration of MCTS strategies profoundly influences the solution quality, demanding meticulous tuning to leverage their full potential; 2) Our findings demonstrate that a rudimentary and parameter-free heatmap, derived from the intrinsic $k$-nearest nature of TSP, can rival or even surpass the performance of complicated heatmaps, with strong generalizability across various scales. Empirical evaluations across various TSP scales underscore the efficacy of our approach, achieving competitive results. These observations challenge the prevailing focus on heatmap sophistication, advocating a reevaluation of the paradigm to harness both components synergistically. Our code is available at: this https URL.
- [140] arXiv:2411.09240 [pdf, html, other]
-
Title: Cybersecurity Study Programs: What's in a Name?Comments: Published in ACM SIGCSE 2025 conference proceedings, see this https URLSubjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Improving cybersecurity education has become a priority for many countries and organizations worldwide. Computing societies and professional associations have recognized cybersecurity as a distinctive computing discipline and created specialized cybersecurity curricular guidelines. Higher education institutions are introducing new cybersecurity programs, attracting students to this expanding field. In this paper, we examined 101 study programs across 24 countries. Based on their analysis, we argue that top-ranked universities have not yet fully implemented the guidelines and offer programs that have "cyber" in their name but lack some essential elements of a cybersecurity program. In particular, most programs do not sufficiently cover non-technical components, such as law, policies, or risk management. Also, most programs teach knowledge and skills but do not expose students to experiential learning outside the traditional classroom (such as internships) to develop their competencies. As a result, graduates of these programs may not meet employer expectations and may require additional training. To help program directors and educators improve their programs and courses, this paper offers examples of effective practices from cybersecurity programs around the world and our teaching practice.
- [141] arXiv:2411.09241 [pdf, html, other]
-
Title: BlueME: Robust Underwater Robot-to-Robot Communication Using Compact Magnetoelectric AntennasSubjects: Robotics (cs.RO); Signal Processing (eess.SP)
We present the design, development, and experimental validation of BlueME, a compact magnetoelectric (ME) antenna array system for underwater robot-to-robot communication. BlueME employs ME antennas operating at their natural mechanical resonance frequency to efficiently transmit and receive very-low-frequency (VLF) electromagnetic signals underwater. To evaluate its performance, we deployed BlueME on an autonomous surface vehicle (ASV) and a remotely operated vehicle (ROV) in open-water field trials. Our tests demonstrate that BlueME maintains reliable signal transmission at distances beyond 200 meters while consuming only 1 watt of power. Field trials show that the system operates effectively in challenging underwater conditions such as turbidity, obstacles, and multipath interference -- that generally affect acoustics and optics. Our analysis also examines the impact of complete submersion on system performance and identifies key deployment considerations. This work represents the first practical underwater deployment of ME antennas outside the laboratory and implements the largest VLF ME array system to date. BlueME demonstrates significant potential for marine robotics and automation in multi-robot cooperative systems and remote sensor networks.
- [142] arXiv:2411.09242 [pdf, html, other]
-
Title: FluidML: Fast and Memory Efficient Inference OptimizationSubjects: Machine Learning (cs.LG)
Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices are not catching up with the ever-growing number of parameters in these models. As the models become bigger and more complicated, the novel yet sophisticated structure challenges the inference runtime optimization. We present FluidML, a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference. Evaluations across different platforms show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models and reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches. FluidML is of ~30K line of codes, built for general-purpose usage, and will be released as an open-source inference runtime optimization framework to the community.
- [143] arXiv:2411.09243 [pdf, html, other]
-
Title: Towards Unified Neural Decoding of Perceived, Spoken and Imagined Speech from EEG SignalsSubjects: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Brain signals accompany various information relevant to human actions and mental imagery, making them crucial to interpreting and understanding human intentions. Brain-computer interface technology leverages this brain activity to generate external commands for controlling the environment, offering critical advantages to individuals with paralysis or locked-in syndrome. Within the brain-computer interface domain, brain-to-speech research has gained attention, focusing on the direct synthesis of audible speech from brain signals. Most current studies decode speech from brain activity using invasive techniques and emphasize spoken speech data. However, humans express various speech states, and distinguishing these states through non-invasive approaches remains a significant yet challenging task. This research investigated the effectiveness of deep learning models for non-invasive-based neural signal decoding, with an emphasis on distinguishing between different speech paradigms, including perceived, overt, whispered, and imagined speech, across multiple frequency bands. The model utilizing the spatial conventional neural network module demonstrated superior performance compared to other models, especially in the gamma band. Additionally, imagined speech in the theta frequency band, where deep learning also showed strong effects, exhibited statistically significant differences compared to the other speech paradigms.
- [144] arXiv:2411.09244 [pdf, html, other]
-
Title: Parallel in time partially explicit splitting scheme for high contrast multiscale problemsSubjects: Numerical Analysis (math.NA)
Solving multiscale diffusion problems is often computationally expensive due to the spatial and temporal discretization challenges arising from high-contrast coefficients. To address this issue, a partially explicit temporal splitting scheme is proposed. By appropriately constructing multiscale spaces, the spatial multiscale property is effectively managed, and it has been demonstrated that the temporal step size is independent of the contrast. To enhance simulation speed, we propose a parallel algorithm for the multiscale flow problem that leverages the partially explicit temporal splitting scheme. The idea is first to evolve the partially explicit system using a coarse time step size, then correct the solution on each coarse time interval with a fine propagator, for which we consider both the sequential solver and all-at-once solver. This procedure is then performed iteratively till convergence. We analyze the stability and convergence of the proposed algorithm. The numerical experiments demonstrate that the proposed algorithm achieves high numerical accuracy for high-contrast problems and converges in a relatively small number of iterations. The number of iterations stays stable as the number of coarse intervals increases, thus significantly improving computational efficiency through parallel processing.
- [145] arXiv:2411.09249 [pdf, other]
-
Title: Enhancing Financial Domain Adaptation of Language Models via Model AugmentationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The domain adaptation of language models, including large language models (LLMs), has become increasingly important as the use of such models continues to expand. This study demonstrates the effectiveness of Composition to Augment Language Models (CALM) in adapting to the financial domain. CALM is a model to extend the capabilities of existing models by introducing cross-attention between two LLMs with different functions. In our experiments, we developed a CALM to enhance the financial performance of an LLM with strong response capabilities by leveraging a financial-specialized LLM. Notably, the CALM was trained using a financial dataset different from the one used to train the financial-specialized LLM, confirming CALM's ability to adapt to various datasets. The models were evaluated through quantitative Japanese financial benchmarks and qualitative response comparisons, demonstrating that CALM enables superior responses with higher scores than the original models and baselines. Additionally, comparative experiments on connection points revealed that connecting the middle layers of the models is most effective in facilitating adaptation to the financial domain. These findings confirm that CALM is a practical approach for adapting LLMs to the financial domain.
- [146] arXiv:2411.09250 [pdf, html, other]
-
Title: Embedding Space Allocation with Angle-Norm Joint Classifiers for Few-Shot Class-Incremental LearningSubjects: Computer Vision and Pattern Recognition (cs.CV)
Few-shot class-incremental learning (FSCIL) aims to continually learn new classes from only a few samples without forgetting previous ones, requiring intelligent agents to adapt to dynamic environments. FSCIL combines the characteristics and challenges of class-incremental learning and few-shot learning: (i) Current classes occupy the entire feature space, which is detrimental to learning new classes. (ii) The small number of samples in incremental rounds is insufficient for fully training. In existing mainstream virtual class methods, for addressing the challenge (i), they attempt to use virtual classes as placeholders. However, new classes may not necessarily align with the virtual classes. For the challenge (ii), they replace trainable fully connected layers with Nearest Class Mean (NCM) classifiers based on cosine similarity, but NCM classifiers do not account for sample imbalance issues. To address these issues in previous methods, we propose the class-center guided embedding Space Allocation with Angle-Norm joint classifiers (SAAN) learning framework, which provides balanced space for all classes and leverages norm differences caused by sample imbalance to enhance classification criteria. Specifically, for challenge (i), SAAN divides the feature space into multiple subspaces and allocates a dedicated subspace for each session by guiding samples with the pre-set category centers. For challenge (ii), SAAN establishes a norm distribution for each class and generates angle-norm joint logits. Experiments demonstrate that SAAN can achieve state-of-the-art performance and it can be directly embedded into other SOTA methods as a plug-in, further enhancing their performance.
- [147] arXiv:2411.09251 [pdf, html, other]
-
Title: Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow ForecastingSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Predicting spatio-temporal traffic flow presents significant challenges due to complex interactions between spatial and temporal factors. Existing approaches often address these dimensions in isolation, neglecting their critical interdependencies. In this paper, we introduce the Spatio-Temporal Unitized Model (STUM), a unified framework designed to capture both spatial and temporal dependencies while addressing spatio-temporal heterogeneity through techniques such as distribution alignment and feature fusion. It also ensures both predictive accuracy and computational efficiency. Central to STUM is the Adaptive Spatio-temporal Unitized Cell (ASTUC), which utilizes low-rank matrices to seamlessly store, update, and interact with space, time, as well as their correlations. Our framework is also modular, allowing it to integrate with various spatio-temporal graph neural networks through components such as backbone models, feature extractors, residual fusion blocks, and predictive modules to collectively enhance forecasting outcomes. Experimental results across multiple real-world datasets demonstrate that STUM consistently improves prediction performance with minimal computational cost. These findings are further supported by hyperparameter optimization, pre-training analysis, and result visualization. We provide our source code for reproducibility at this https URL.
- [148] arXiv:2411.09252 [pdf, html, other]
-
Title: Implementing an Optimized and Secured Multimedia Streaming Protocol in a Participatory Sensing ScenarioSubjects: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR)
Multimedia streaming protocols are becoming increasingly popular in Crowdsensing due to their ability to deliver high-quality video content over the internet in real-time. Streaming multimedia content, as in the context of live video streaming, requires high bandwidth and large storage capacity to ensure a sufficient throughput. Crowdsensing can distribute information about shared video contents among multiple users in network, reducing storage capacity and computational and bandwidth requirements. However, Crowdsensing introduces several security constraints that must be taken into account to ensure the confidentiality, integrity, and availability of the data. In the specific case of video streaming, commonly named as visual crowdsensing (VCS) within this context, data is transmitted over wireless networks, making it vulnerable to security breaches and susceptible to eavesdropping and interception by attackers. Multimedias often contains sensitive user data and may be subject to various privacy laws, including data protection laws and laws related to photography and video recording, based on local GDPR (General Data Protection Regulation). For this reason the realization of a secure protocol optimized for a distributed data streaming in real-time becomes increasingly important in crowdsensing and smart-enviroment context. In this article, we will discuss the use of a symmetric AES-CTR encryption based protocol for securing data streaming over a crowd-sensed network.
- [149] arXiv:2411.09254 [pdf, html, other]
-
Title: Are the flows of complex-valued Laplacians and their pseudoinverses related?Subjects: Systems and Control (eess.SY)
Laplacian flows model the rate of change of each node's state as being proportional to the difference between its value and that of its neighbors. Typically, these flows capture diffusion or synchronization dynamics and are well-studied. Expanding on these classical flows, we introduce a pseudoinverse Laplacian flow system, substituting the Laplacian with its pseudoinverse within complex-valued networks. Interestingly, for undirected graphs and unsigned weight-balanced digraphs, Laplacian and the pseudoinverse Laplacian flows exhibit an interdependence in terms of consensus. To show this relation, we first present the conditions for achieving consensus in the pseudoinverse Laplacian flow system using the property of real eventually exponentially positivity. Thereafter, we show that the pseudoinverse Laplacian flow system converges to consensus if and only if the Laplacian flow system achieves consensus in the above-mentioned networks. However, these are only the sufficient conditions for digraphs. Further, we illustrate the efficacy of the proposed approach through examples, focusing primarily on power networks.
- [150] arXiv:2411.09255 [pdf, html, other]
-
Title: DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in BiomedicineComments: EMNLP2024/FEVERSubjects: Computation and Language (cs.CL)
We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information. The accuracy of these responses is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucinations compared to previous methods that rely on multiple-choice tasks. We conduct experiments with 8 different models, finding that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy. The DAHL Score holds potential as an efficient alternative to human-annotated preference labels, being able to be expanded to other specialized domains. We release the dataset and code in public.
- [151] arXiv:2411.09259 [pdf, html, other]
-
Title: Jailbreak Attacks and Defenses against Multimodal Generative Models: A SurveyXuannan Liu, Xing Cui, Peipei Li, Zekun Li, Huaibo Huang, Shuhan Xia, Miaoxuan Zhang, Yueying Zou, Ran HeComments: ongoing workSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
The rapid evolution of multimodal foundation models has led to significant advancements in cross-modal understanding and generation across diverse modalities, including text, images, audio, and video. However, these models remain susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the production of potentially harmful content. Consequently, understanding the methods of jailbreak attacks and existing defense mechanisms is essential to ensure the safe deployment of multimodal generative models in real-world scenarios, particularly in security-sensitive applications. To provide comprehensive insight into this topic, this survey reviews jailbreak and defense in multimodal generative models. First, given the generalized lifecycle of multimodal jailbreak, we systematically explore attacks and corresponding defense strategies across four levels: input, encoder, generator, and output. Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models. Additionally, we cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems. Finally, we highlight current research challenges and propose potential directions for future this http URL open-source repository corresponding to this work can be found at this https URL.
- [152] arXiv:2411.09261 [pdf, html, other]
-
Title: Automating Autograding: Large Language Models as Test Suite Generators for Introductory ProgrammingComments: Submitted to Journal of Computer Assisted LearningSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Automatically graded programming assignments provide instant feedback to students and significantly reduce manual grading time for instructors. However, creating comprehensive suites of test cases for programming problems within automatic graders can be time-consuming and complex. The effort needed to define test suites may deter some instructors from creating additional problems or lead to inadequate test coverage, potentially resulting in misleading feedback on student solutions. Such limitations may reduce student access to the well-documented benefits of timely feedback when learning programming.
In this work, we evaluate the effectiveness of using Large Language Models (LLMs), as part of a larger workflow, to automatically generate test suites for CS1-level programming problems. Each problem's statement and reference solution are provided to GPT-4 to produce a test suite that can be used by an autograder. We evaluate our proposed approach using a sample of 26 problems, and more than 25,000 attempted solutions to those problems, submitted by students in an introductory programming course. We compare the performance of the LLM-generated test suites against the instructor-created test suites for each problem. Our findings reveal that LLM-generated test suites can correctly identify most valid solutions, and for most problems are at least as comprehensive as the instructor test suites. Additionally, the LLM-generated test suites exposed ambiguities in some problem statements, underscoring their potential to improve both autograding and instructional design. - [153] arXiv:2411.09263 [pdf, html, other]
-
Title: Rethinking Weight-Averaged Model-mergingSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Weight-averaged model-merging has emerged as a powerful approach in deep learning, capable of enhancing model performance without fine-tuning or retraining. However, the underlying mechanisms that explain its effectiveness remain largely unexplored. In this paper, we investigate this technique from three novel perspectives to provide deeper insights into how and why weight-averaged model-merging works: (1) we examine the intrinsic patterns captured by the learning of the model weights, through the visualizations of their patterns on several datasets, showing that these weights often encode structured and interpretable patterns; (2) we investigate model ensemble merging strategies based on averaging on weights versus averaging on features, providing detailed analyses across diverse architectures and datasets; and (3) we explore the impact on model-merging prediction stability in terms of changing the parameter magnitude, revealing insights into the way of weight averaging works as regularization by showing the robustness across different parameter scales. Our findings shed light on the "black box" of weight-averaged model-merging, offering valuable insights and practical recommendations that advance the model-merging process.
- [154] arXiv:2411.09265 [pdf, html, other]
-
Title: BEARD: Benchmarking the Adversarial Robustness for Dataset DistillationComments: 15 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Dataset Distillation (DD) is an emerging technique that compresses large-scale datasets into significantly smaller synthesized datasets while preserving high test performance and enabling the efficient training of large models. However, current research primarily focuses on enhancing evaluation accuracy under limited compression ratios, often overlooking critical security concerns such as adversarial robustness. A key challenge in evaluating this robustness lies in the complex interactions between distillation methods, model architectures, and adversarial attack strategies, which complicate standardized assessments. To address this, we introduce BEARD, an open and unified benchmark designed to systematically assess the adversarial robustness of DD methods, including DM, IDM, and BACON. BEARD encompasses a variety of adversarial attacks (e.g., FGSM, PGD, C&W) on distilled datasets like CIFAR-10/100 and TinyImageNet. Utilizing an adversarial game framework, it introduces three key metrics: Robustness Ratio (RR), Attack Efficiency Ratio (AE), and Comprehensive Robustness-Efficiency Index (CREI). Our analysis includes unified benchmarks, various Images Per Class (IPC) settings, and the effects of adversarial training. Results are available on the BEARD Leaderboard, along with a library providing model and dataset pools to support reproducible research. Access the code at BEARD.
- [155] arXiv:2411.09266 [pdf, html, other]
-
Title: How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human PerceptionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
Multimodal deepfakes involving audiovisual manipulations are a growing threat because they are difficult to detect with the naked eye or using unimodal deep learningbased forgery detection methods. Audiovisual forensic models, while more capable than unimodal models, require large training datasets and are computationally expensive for training and inference. Furthermore, these models lack interpretability and often do not generalize well to unseen manipulations. In this study, we examine the detection capabilities of a large language model (LLM) (i.e., ChatGPT) to identify and account for any possible visual and auditory artifacts and manipulations in audiovisual deepfake content. Extensive experiments are conducted on videos from a benchmark multimodal deepfake dataset to evaluate the detection performance of ChatGPT and compare it with the detection capabilities of state-of-the-art multimodal forensic models and humans. Experimental results demonstrate the importance of domain knowledge and prompt engineering for video forgery detection tasks using LLMs. Unlike approaches based on end-to-end learning, ChatGPT can account for spatial and spatiotemporal artifacts and inconsistencies that may exist within or across modalities. Additionally, we discuss the limitations of ChatGPT for multimedia forensic tasks.
- [156] arXiv:2411.09267 [pdf, html, other]
-
Title: Towards efficient compression and communication for prototype-based decentralized learningPablo Fernández-Piñeiro, Manuel Ferández-Veiga, Rebeca P. Díaz-Redondo, Ana Fernández-Vilas, Martín González-SotoComments: 15 pages, 2 tables, 7 figures, 6 algorithmsSubjects: Machine Learning (cs.LG)
In prototype-based federated learning, the exchange of model parameters between clients and the master server is replaced by transmission of prototypes or quantized versions of the data samples to the aggregation server. A fully decentralized deployment of prototype-based learning, without a central agregartor of prototypes, is more robust upon network failures and reacts faster to changes in the statistical distribution of the data, suggesting potential advantages and quick adaptation in dynamic learning tasks, e.g., when the data sources are IoT devices or when data is non-iid. In this paper, we consider the problem of designing a communication-efficient decentralized learning system based on prototypes. We address the challenge of prototype redundancy by leveraging on a twofold data compression technique, i.e., sending only update messages if the prototypes are informationtheoretically useful (via the Jensen-Shannon distance), and using clustering on the prototypes to compress the update messages used in the gossip protocol. We also use parallel instead of sequential gossiping, and present an analysis of its age-of-information (AoI). Our experimental results show that, with these improvements, the communications load can be substantially reduced without decreasing the convergence rate of the learning algorithm.
- [157] arXiv:2411.09268 [pdf, html, other]
-
Title: LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion SpaceSubjects: Computer Vision and Pattern Recognition (cs.CV)
While existing one-shot talking head generation models have achieved progress in coarse-grained emotion editing, there is still a lack of fine-grained emotion editing models with high interpretability. We argue that for an approach to be considered fine-grained, it needs to provide clear definitions and sufficiently detailed differentiation. We present LES-Talker, a novel one-shot talking head generation model with high interpretability, to achieve fine-grained emotion editing across emotion types, emotion levels, and facial units. We propose a Linear Emotion Space (LES) definition based on Facial Action Units to characterize emotion transformations as vector transformations. We design the Cross-Dimension Attention Net (CDAN) to deeply mine the correlation between LES representation and 3D model representation. Through mining multiple relationships across different feature and structure dimensions, we enable LES representation to guide the controllable deformation of 3D model. In order to adapt the multimodal data with deviations to the LES and enhance visual quality, we utilize specialized network design and training strategies. Experiments show that our method provides high visual quality along with multilevel and interpretable fine-grained emotion editing, outperforming mainstream methods.
- [158] arXiv:2411.09269 [pdf, html, other]
-
Title: Harnessing multiple LLMs for Information Retrieval: A case study on Deep Learning methodologies in Biodiversity publicationsSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Deep Learning (DL) techniques are increasingly applied in scientific studies across various domains to address complex research questions. However, the methodological details of these DL models are often hidden in the unstructured text. As a result, critical information about how these models are designed, trained, and evaluated is challenging to access and comprehend. To address this issue, in this work, we use five different open-source Large Language Models (LLMs): Llama-3 70B, Llama-3.1 70B, Mixtral-8x22B-Instruct-v0.1, Mixtral 8x7B, and Gemma 2 9B in combination with Retrieval-Augmented Generation (RAG) approach to extract and process DL methodological details from scientific publications automatically. We built a voting classifier from the outputs of five LLMs to accurately report DL methodological information. We tested our approach using biodiversity publications, building upon our previous research. To validate our pipeline, we employed two datasets of DL-related biodiversity publications: a curated set of 100 publications from our prior work and a set of 364 publications from the Ecological Informatics journal. Our results demonstrate that the multi-LLM, RAG-assisted pipeline enhances the retrieval of DL methodological information, achieving an accuracy of 69.5% (417 out of 600 comparisons) based solely on textual content from publications. This performance was assessed against human annotators who had access to code, figures, tables, and other supplementary information. Although demonstrated in biodiversity, our methodology is not limited to this field; it can be applied across other scientific domains where detailed methodological reporting is essential for advancing knowledge and ensuring reproducibility. This study presents a scalable and reliable approach for automating information extraction, facilitating better reproducibility and knowledge transfer across studies.
- [159] arXiv:2411.09273 [pdf, html, other]
-
Title: Cross-Modal Consistency in Multimodal Large Language ModelsXiang Zhang, Senyu Li, Ning Shi, Bradley Hauer, Zijun Wu, Grzegorz Kondrak, Muhammad Abdul-Mageed, Laks V.S. LakshmananSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Recent developments in multimodal methodologies have marked the beginning of an exciting era for models adept at processing diverse data types, encompassing text, audio, and visual content. Models like GPT-4V, which merge computer vision with advanced language processing, exhibit extraordinary proficiency in handling intricate tasks that require a simultaneous understanding of both textual and visual information. Prior research efforts have meticulously evaluated the efficacy of these Vision Large Language Models (VLLMs) in various domains, including object detection, image captioning, and other related fields. However, existing analyses have often suffered from limitations, primarily centering on the isolated evaluation of each modality's performance while neglecting to explore their intricate cross-modal interactions. Specifically, the question of whether these models achieve the same level of accuracy when confronted with identical task instances across different modalities remains unanswered. In this study, we take the initiative to delve into the interaction and comparison among these modalities of interest by introducing a novel concept termed cross-modal consistency. Furthermore, we propose a quantitative evaluation framework founded on this concept. Our experimental findings, drawn from a curated collection of parallel vision-language datasets developed by us, unveil a pronounced inconsistency between the vision and language modalities within GPT-4V, despite its portrayal as a unified multimodal model. Our research yields insights into the appropriate utilization of such models and hints at potential avenues for enhancing their design.
- [160] arXiv:2411.09275 [pdf, html, other]
-
Title: Pkd-tree: Parallel $k$d-tree with Batch UpdatesSubjects: Data Structures and Algorithms (cs.DS); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
The $k$d-tree is one of the most widely used data structures to manage multi-dimensional data. Due to the ever-growing data volume, it is imperative to consider parallelism in $k$d-trees. However, we observed challenges in existing parallel kd-tree implementations, for both constructions and updates.
The goal of this paper is to develop efficient in-memory $k$d-trees by supporting high parallelism and cache-efficiency. We propose the Pkd-tree (Parallel $k$d-tree), a parallel $k$d-tree that is efficient both in theory and in practice. The Pkd-tree supports parallel tree construction, batch update (insertion and deletion), and various queries including k-nearest neighbor search, range query, and range count. We proved that our algorithms have strong theoretical bounds in work (sequential time complexity), span (parallelism), and cache complexity. Our key techniques include 1) an efficient construction algorithm that optimizes work, span, and cache complexity simultaneously, and 2) reconstruction-based update algorithms that guarantee the tree to be weight-balanced. With the new algorithmic insights and careful engineering effort, we achieved a highly optimized implementation of the Pkd-tree.
We tested Pkd-tree with various synthetic and real-world datasets, including both uniform and highly skewed data. We compare the Pkd-tree with state-of-the-art parallel $k$d-tree implementations. In all tests, with better or competitive query performance, Pkd-tree is much faster in construction and updates consistently than all baselines. We released our code. - [161] arXiv:2411.09279 [pdf, other]
-
Title: A Comparative Analysis of Electricity Consumption Flexibility in Different Industrial Plant ConfigurationsSebastián Rojas-Innocenti, Enrique Baeyens, Alejandro Martín-Crespo, Sergio Saludes-Rodil, Fernando FrechosoSubjects: Systems and Control (eess.SY)
The flexibility of industrial power consumption plays a key role in the transition to renewable energy systems, contributing to grid stability, cost reduction and decarbonization efforts. This paper presents a novel methodology to quantify and optimize the flexibility of electricity consumption in manufacturing plants. The proposed model is applied to actual cement and steel plant configurations. Comparative simulations performed with the model reveal significant differences in flexibility and cost-effectiveness, driven by factors such as production capacity, downstream process demand, storage capacity, and operational constraints. A comprehensive sensitivity analysis further clarifies the impact of various parameters on production optimization and flexibility savings. Specifically, as demand approaches production levels, flexibility decreases. Although increasing storage capacity typically reduces production costs, the benefits diminish above a certain threshold. The results provide valuable information for industrial operators wishing to improve operational efficiency, reduce costs and increase the flexibility of their operations.
- [162] arXiv:2411.09285 [pdf, other]
-
Title: Existence of solutions to numerical schemes using regularization: application to two-phase flow in porous media schemesThomas Crozon (LMJL, Nantes Univ - ECN)Subjects: Numerical Analysis (math.NA)
The present document corresponds to the 4 th chapter of my thesis, the problem setting is not definitive, what matters most here are the mathematical results and the methodology of the existence proofs. In this work, we propose a framework and some tools for establishing the existence of solutions to numerical schemes in the case of the two-phase flow model. These schemes are sharing some key a priori mathematical properties. It applies to a large variety of continuous models. We propose the definition of a regularized scheme and show that if solutions exist to this regularization, then the existence of the initial one is ensured. This perturbation of the scheme facilitates the regularized existence. The main aim is to handle degenerate systems such as the two-phase Darcy flows in porous media. We illustrate the strength of our framework on two practical schemes, a finite volume one using the DDFV framework, and the other based on a Control Volume Finite Element (CVFE) method.
- [163] arXiv:2411.09286 [pdf, html, other]
-
Title: A Centralized-Distributed Transfer Model for Cross-Domain Recommendation Based on Multi-Source Heterogeneous Transfer LearningComments: Published in: 2022 IEEE International Conference on Data Mining (ICDM) (The authors were affiliated Hangzhou NetEase Cloud Music Technology Co., Ltd.)Subjects: Machine Learning (cs.LG)
Cross-domain recommendation (CDR) methods are proposed to tackle the sparsity problem in click through rate (CTR) estimation. Existing CDR methods directly transfer knowledge from the source domains to the target domain and ignore the heterogeneities among domains, including feature dimensional heterogeneity and latent space heterogeneity, which may lead to negative transfer. Besides, most of the existing methods are based on single-source transfer, which cannot simultaneously utilize knowledge from multiple source domains to further improve the model performance in the target domain. In this paper, we propose a centralized-distributed transfer model (CDTM) for CDR based on multi-source heterogeneous transfer learning. To address the issue of feature dimension heterogeneity, we build a dual embedding structure: domain specific embedding (DSE) and global shared embedding (GSE) to model the feature representation in the single domain and the commonalities in the global space,separately. To solve the latent space heterogeneity, the transfer matrix and attention mechanism are used to map and combine DSE and GSE adaptively. Extensive offline and online experiments demonstrate the effectiveness of our model.
- [164] arXiv:2411.09287 [pdf, html, other]
-
Title: The Communication-Friendly Privacy-Preserving Machine Learning against Malicious AdversariesSubjects: Cryptography and Security (cs.CR)
With the increasing emphasis on privacy regulations, such as GDPR, protecting individual privacy and ensuring compliance have become critical concerns for both individuals and organizations. Privacy-preserving machine learning (PPML) is an innovative approach that allows for secure data analysis while safeguarding sensitive information. It enables organizations to extract valuable insights from data without compromising privacy. Secure multi-party computation (MPC) is a key tool in PPML, as it allows multiple parties to jointly compute functions without revealing their private inputs, making it essential in multi-server environments. We address the performance overhead of existing maliciously secure protocols, particularly in finite rings like $\mathbb{Z}_{2^\ell}$, by introducing an efficient protocol for secure linear function evaluation. We implement our maliciously secure MPC protocol on GPUs, significantly improving its efficiency and scalability. We extend the protocol to handle linear and non-linear layers, ensuring compatibility with a wide range of machine-learning models. Finally, we comprehensively evaluate machine learning models by integrating our protocol into the workflow, enabling secure and efficient inference across simple and complex models, such as convolutional neural networks (CNNs).
- [165] arXiv:2411.09289 [pdf, html, other]
-
Title: StreamAdapter: Efficient Test Time Adaptation from Contextual StreamsDilxat Muhtar, Yelong Shen, Yaming Yang, Xiaodong Liu, Yadong Lu, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Xueliang Zhang, Jianfeng Gao, Weizhu Chen, Qi ZhangComments: 22 Pages, 9 FiguresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks directly from the given demonstrations without requiring gradient updates. While recent advances have expanded context windows to accommodate more demonstrations, this approach increases inference costs without necessarily improving performance. To mitigate these issues, We propose StreamAdapter, a novel approach that directly updates model parameters from context at test time, eliminating the need for explicit in-context demonstrations. StreamAdapter employs context mapping and weight absorption mechanisms to dynamically transform ICL demonstrations into parameter updates with minimal additional parameters. By reducing reliance on numerous in-context examples, StreamAdapter significantly reduce inference costs and allows for efficient inference with constant time complexity, regardless of demonstration count. Extensive experiments across diverse tasks and model architectures demonstrate that StreamAdapter achieves comparable or superior adaptation capability to ICL while requiring significantly fewer demonstrations. The superior task adaptation and context encoding capabilities of StreamAdapter on both language understanding and generation tasks provides a new perspective for adapting LLMs at test time using context, allowing for more efficient adaptation across scenarios and more cost-effective inference
- [166] arXiv:2411.09293 [pdf, html, other]
-
Title: LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolutionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Existing face super-resolution (FSR) methods have made significant advancements, but they primarily super-resolve face with limited visual information, original pixel-wise space in particular, commonly overlooking the pluralistic clues, like the higher-order depth and semantics, as well as non-visual inputs (text caption and description). Consequently, these methods struggle to produce a unified and meaningful representation from the input face. We suppose that introducing the language-vision pluralistic representation into unexplored potential embedding space could enhance FSR by encoding and exploiting the complementarity across language-vision prior. This motivates us to propose a new framework called LLV-FSR, which marries the power of large vision-language model and higher-order visual prior with the challenging task of FSR. Specifically, besides directly absorbing knowledge from original input, we introduce the pre-trained vision-language model to generate pluralistic priors, involving the image caption, descriptions, face semantic mask and depths. These priors are then employed to guide the more critical feature representation, facilitating realistic and high-quality face super-resolution. Experimental results demonstrate that our proposed framework significantly improves both the reconstruction quality and perceptual quality, surpassing the SOTA by 0.43dB in terms of PSNR on the MMCelebA-HQ dataset.
- [167] arXiv:2411.09294 [pdf, html, other]
-
Title: Learning Hand State Estimation for a Light ExoskeletonSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
We propose a machine learning-based estimator of the hand state for rehabilitation purposes, using light exoskeletons. These devices are easy to use and useful for delivering domestic and frequent therapies. We build a supervised approach using information from the muscular activity of the forearm and the motion of the exoskeleton to reconstruct the hand's opening degree and compliance level. Such information can be used to evaluate the therapy progress and develop adaptive control behaviors. Our approach is validated with a real light exoskeleton. The experiments demonstrate good predictive performance of our approach when trained on data coming from a single user and tested on the same user, even across different sessions. This generalization capability makes our system promising for practical use in real rehabilitation.
- [168] arXiv:2411.09297 [pdf, html, other]
-
Title: DTELS: Towards Dynamic Granularity of Timeline SummarizationComments: Under reviewSubjects: Computation and Language (cs.CL)
The rapid proliferation of online news has posed significant challenges in tracking the continuous development of news topics. Traditional timeline summarization constructs a chronological summary of the events but often lacks the flexibility to meet the diverse granularity needs. To overcome this limitation, we introduce a new paradigm, Dynamic-granularity TimELine Summarization, (DTELS), which aims to construct adaptive timelines based on user instructions or requirements. This paper establishes a comprehensive benchmark for DTLES that includes: (1) an evaluation framework grounded in journalistic standards to assess the timeline quality across four dimensions: Informativeness, Granular Consistency, Factuality, and Coherence; (2) a large-scale, multi-source dataset with multiple granularity timeline annotations based on a consensus process to facilitate authority; (3) extensive experiments and analysis with two proposed solutions based on Large Language Models (LLMs) and existing state-of-the-art TLS methods. The experimental results demonstrate the effectiveness of LLM-based solutions. However, even the most advanced LLMs struggle to consistently generate timelines that are both informative and granularly consistent, highlighting the challenges of the DTELS task.
- [169] arXiv:2411.09299 [pdf, html, other]
-
Title: Hearing the Robot's Mind: Sonification for Explicit Feedback in Human-Robot InteractionSubjects: Robotics (cs.RO)
Social robots are required not only to understand human intentions but also to effectively communicate their intentions or own internal states to users. This study explores the use of sonification to provide explicit auditory feedback, enhancing mutual understanding in HRI. We introduce a novel sonification approach that conveys the robot's internal state, linked to its perception of nearby individuals and their interaction intentions. The approach is evaluated through a two-fold user study: an online video-based survey with $26$ participants and live experiments with $10$ participants. Results indicate that while sonification improves the robot's expressivity and communication effectiveness, the design of the auditory feedback needs refinement to enhance user experience. Participants found the auditory cues useful but described the sounds as uninteresting and unpleasant. These findings underscore the importance of carefully designed auditory feedback in developing more effective and engaging HRI systems.
- [170] arXiv:2411.09301 [pdf, html, other]
-
Title: LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language InterpretationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Automatically and rapidly understanding Earth's surface is fundamental to our grasp of the living environment and informed decision-making. This underscores the need for a unified system with comprehensive capabilities in analyzing Earth's surface to address a wide range of human needs. The emergence of multimodal large language models (MLLMs) has great potential in boosting the efficiency and convenience of intelligent Earth observation. These models can engage in human-like conversations, serve as unified platforms for understanding images, follow diverse instructions, and provide insightful feedbacks. In this study, we introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images, designed to expertly perform a wide range of RS understanding tasks aligned with human instructions. LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment. To further enhance RS-oriented vision-language alignment, we propose a large-scale RS image-caption dataset, generated through feature-guided image recaptioning. Additionally, we introduce an instruction dataset specifically designed to improve spatial recognition abilities. Extensive experiments demonstrate superior performance of LHRS-Bot-Nova across various RS image understanding tasks. We also evaluate different MLLM performances in complex RS perception and instruction following using a complicated multi-choice question evaluation benchmark, providing a reliable guide for future model selection and improvement. Data, code, and models will be available at this https URL.
- [171] arXiv:2411.09302 [pdf, html, other]
-
Title: EEG-Based Speech Decoding: A Novel Approach Using Multi-Kernel Ensemble Diffusion ModelsSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
In this study, we propose an ensemble learning framework for electroencephalogram-based overt speech classification, leveraging denoising diffusion probabilistic models with varying convolutional kernel sizes. The ensemble comprises three models with kernel sizes of 51, 101, and 201, effectively capturing multi-scale temporal features inherent in signals. This approach improves the robustness and accuracy of speech decoding by accommodating the rich temporal complexity of neural signals. The ensemble models work in conjunction with conditional autoencoders that refine the reconstructed signals and maximize the useful information for downstream classification tasks. The results indicate that the proposed ensemble-based approach significantly outperforms individual models and existing state-of-the-art techniques. These findings demonstrate the potential of ensemble methods in advancing brain signal decoding, offering new possibilities for non-verbal communication applications, particularly in brain-computer interface systems aimed at aiding individuals with speech impairments.
- [172] arXiv:2411.09303 [pdf, other]
-
Title: Model-Guided Fieldwork: A Practical, Methodological and Philosophical Investigation in the use of Ethnomethodology for Engineering Software RequirementsComments: A thesis submitted for the degree of DPhilSubjects: Software Engineering (cs.SE)
Ethnomethodological fieldwork has long been acknowledged as a potentially valuable way of informing the design of technology. However, there is relatively little methodological support for this activity, particularly in relation to the systematic approaches to development advocated in mainstream software and requirements engineering. This thesis focuses on the use of ethnomethodological fieldwork for the engineering of software requirements. Firstly, it proposes an approach, dubbed "Model Guided Fieldwork," to support a fieldworker in making observations that may contribute to a technological development process. It does this by supplementing the normal debriefing sessions that a fieldworker and a technologist might have, with a lightweight iterative system modelling exercise, in such a way that the fieldwork and modelling can be mutually guiding. Secondly, the thesis presents an application of this approach in a high-profile e-Science project. This case study provides an opportunity to examine the relationship between ethnomethodological ethnography and requirements engineering empirically. Thirdly, the thesis addresses a number of theoretical and philosophical concerns relating to its project. This consists in a number of clarifications and counterarguments which aim to better situate ethnomethodological fieldwork as a method of requirements elicitation. In these three regards the thesis constitutes a practical methodological and philosophical investigation into the topic at hand.
- [173] arXiv:2411.09307 [pdf, html, other]
-
Title: Model-Based Event-Triggered Implementation of Hybrid Controllers Using Finite-Time Convergent ObserversSubjects: Systems and Control (eess.SY)
In this paper, we explore the conditions for asymptotic stability of the hybrid closed-loop system resulting from the interconnection of a nonlinear plant, an intelligent sensor that generates finite-time convergent estimates of the plant state, and a controller node that receives opportunistic samples from the sensor node when certain model-based event-triggering conditions are met. The proposed method is endowed with a degree of separation, in the sense that the controller design is independent of the sensor design. This is achieved under mild regularity conditions imposed on the hybrid closed-loop system and the existence of persistently flowing solutions. We demonstrate the versatility of the method by implementing it on: 1) a sampled-data controller for regulation of linear plants; 2) a synergistic controller for attitude stabilization of rigid bodies. The effectiveness of these novel controllers is demonstrated through numerical simulations.
- [174] arXiv:2411.09310 [pdf, html, other]
-
Title: Exploring Zero-Shot Anomaly Detection with CLIP in Medical Imaging: Are We There Yet?Comments: accepted at 3rd AIxIA Workshop on Artificial Intelligence for Healthcare and 5th Data4SmartHealthSubjects: Computer Vision and Pattern Recognition (cs.CV)
Zero-shot anomaly detection (ZSAD) offers potential for identifying anomalies in medical imaging without task-specific training. In this paper, we evaluate CLIP-based models, originally developed for industrial tasks, on brain tumor detection using the BraTS-MET dataset. Our analysis examines their ability to detect medical-specific anomalies with no or minimal supervision, addressing the challenges posed by limited data annotation. While these models show promise in transferring general knowledge to medical tasks, their performance falls short of the precision required for clinical use. Our findings highlight the need for further adaptation before CLIP-based models can be reliably applied to medical anomaly detection.
- [175] arXiv:2411.09311 [pdf, html, other]
-
Title: Compression Method for Solar Polarization Spectra Collected from Hinode SOT/SP ObservationsSubjects: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR)
The complex structure and extensive details of solar spectral data, combined with a recent surge in volume, present significant processing challenges. To address this, we propose a deep learning-based compression technique using deep autoencoder (DAE) and 1D-convolutional autoencoder (CAE) models developed with Hinode SOT/SP data. We focused on compressing Stokes I and V polarization spectra from the quiet Sun, as well as from active regions, providing a novel insight into comprehensive spectral analysis by incorporating spectra from extreme magnetic fields. The results indicate that the CAE model outperforms the DAE model in reconstructing Stokes profiles, demonstrating greater robustness and achieving reconstruction errors around the observational noise level. The proposed method has proven effective in compressing Stokes I and V spectra from both the quiet Sun and active regions, highlighting its potential for impactful applications in solar spectral analysis, such as detection of unusual spectral signals.
- [176] arXiv:2411.09312 [pdf, html, other]
-
Title: Approximate Probabilistic Inference forTime-Series Data A Robust Latent Gaussian Model With Temporal AwarenessSubjects: Machine Learning (cs.LG)
The development of robust generative models for highly varied non-stationary time series data is a complex yet important problem. Traditional models for time series data prediction, such as Long Short-Term Memory (LSTM), are inefficient and generalize poorly as they cannot capture complex temporal relationships. In this paper, we present a probabilistic generative model that can be trained to capture temporal information, and that is robust to data errors. We call it Time Deep Latent Gaussian Model (tDLGM). Its novel architecture is inspired by Deep Latent Gaussian Model (DLGM). Our model is trained to minimize a loss function based on the negative log loss. One contributing factor to Time Deep Latent Gaussian Model (tDLGM) robustness is our regularizer, which accounts for data trends. Experiments conducted show that tDLGM is able to reconstruct and generate complex time series data, and that it is robust against to noise and faulty data.
- [177] arXiv:2411.09313 [pdf, other]
-
Title: Socio-Economic Consequences of Generative AI: A Review of Methodological ApproachesSubjects: Computers and Society (cs.CY)
The widespread adoption of generative artificial intelligence (AI) has fundamentally transformed technological landscapes and societal structures in recent years. Our objective is to identify the primary methodologies that may be used to help predict the economic and social impacts of generative AI adoption. Through a comprehensive literature review, we uncover a range of methodologies poised to assess the multifaceted impacts of this technological revolution. We explore Agent-Based Simulation (ABS), Econometric Models, Input-Output Analysis, Reinforcement Learning (RL) for Decision-Making Agents, Surveys and Interviews, Scenario Analysis, Policy Analysis, and the Delphi Method. Our findings have allowed us to identify these approaches' main strengths and weaknesses and their adequacy in coping with uncertainty, robustness, and resource requirements.
- [178] arXiv:2411.09314 [pdf, other]
-
Title: Theory of the lattice Boltzmann method: discrete effects due to advectionSubjects: Numerical Analysis (math.NA)
Lattice Boltzmann models are briefly introduced together with references to methods used to predict their ability for simulations of systems described by partial differential equations that are first order in time and low order in space derivatives. Several previous works have been devoted to analyzing the accuracy of these models with special emphasis on deviations from pure Newtonian viscous behaviour, related to higher order space derivatives of even order. The presentcontribution concentrates on possible inaccuracies of the advection behaviour linked to space derivatives of odd order. Detailed properties of advection-diffusion and athermal fluids are presented for two-dimensional situations allowing to propose situations that are accurate to third order in space derivatives. Simulations of the advection of a gaussian dot or vortex are presented. Similar results are discussed in appendices for three-dimensional advection-diffusion.
- [179] arXiv:2411.09315 [pdf, html, other]
-
Title: Sustainable Hardware SpecializationSubjects: Hardware Architecture (cs.AR)
Hardware specialization is commonly viewed as a way to scale performance in the dark silicon era with modern-day SoCs featuring multiple tens of dedicated accelerators. By only powering on hardware circuitry when needed, accelerators fundamentally trade off chip area for power efficiency. Dark silicon however comes with a severe downside, namely its environmental footprint. While hardware specialization typically reduces the operational footprint through high energy efficiency, the embodied footprint incurred by integrating additional accelerators on chip leads to a net overall increase in environmental footprint, which has led prior work to conclude that dark silicon is not a sustainable design paradigm.
We explore sustainable hardware specialization through reconfigurable logic that has the potential to drastically reduce the environmental footprint compared to a sea of accelerators by amortizing its embodied footprint across multiple applications. We present an abstract analytical model that evaluates the sustainability implications of replacing dedicated accelerators with a reconfigurable accelerator. We derive hardware synthesis results on ASIC and CGRA (a representative reconfigurable fabric) for chip area and energy numbers for a wide variety of kernels. We input these results to the analytical model and conclude that reconfigurable fabric is more sustainable. We find that as few as a handful to a dozen accelerators can be replaced by a CGRA. Moreover, replacing a sea of accelerators with a CGRA leads to a drastically reduced environmental footprint (by a factor of $2.5 \times$ to $7.6 \times$). - [180] arXiv:2411.09317 [pdf, html, other]
-
Title: Pie: Pooling CPU Memory for LLM InferenceSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory swapping often results in higher latency and lower throughput.
This paper introduces Pie, an LLM inference framework that addresses these challenges with performance-transparent swapping and adaptive expansion. By leveraging predictable memory access patterns and the high bandwidth of modern hardware like the NVIDIA GH200 Grace Hopper Superchip, Pie enables concurrent data swapping without affecting foreground computation, expanding effective memory without added latency. Adaptive expansion dynamically adjusts CPU memory allocation based on real-time information, optimizing memory usage and performance under varying conditions.
Pie maintains low computation latency, high throughput, and high elasticity. Our experimental evaluation demonstrates that Pie achieves optimal swapping policy during cache warmup and effectively balances increased memory capacity with negligible impact on computation. With its extended capacity, Pie outperforms vLLM by up to 1.9X in throughput and 2X in latency. Additionally, Pie can reduce GPU memory usage by up to 1.67X while maintaining the same performance. Compared to FlexGen, an offline profiling-based swapping solution, Pie achieves magnitudes lower latency and 9.4X higher throughput. - [181] arXiv:2411.09318 [pdf, html, other]
-
Title: DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language ArchivesMohammadRifqi Farhansyah, Muhammad Zuhdi Fikri Johari, Afinzaki Amiral, Ayu Purwarianti, Kumara Ari Yuana, Derry Tanti WijayaComments: 12 pages, 3 figures, 6 tablesSubjects: Computation and Language (cs.CL)
Indonesia is one of the most diverse countries linguistically. However, despite this linguistic diversity, Indonesian languages remain underrepresented in Natural Language Processing (NLP) research and technologies. In the past two years, several efforts have been conducted to construct NLP resources for Indonesian languages. However, most of these efforts have been focused on creating manual resources thus difficult to scale to more languages. Although many Indonesian languages do not have a web presence, locally there are resources that document these languages well in printed forms such as books, magazines, and newspapers. Digitizing these existing resources will enable scaling of Indonesian language resource construction to many more languages. In this paper, we propose an alternative method of creating datasets by digitizing documents, which have not previously been used to build digital language resources in Indonesia. DriveThru is a platform for extracting document content utilizing Optical Character Recognition (OCR) techniques in its system to provide language resource building with less manual effort and cost. This paper also studies the utility of current state-of-the-art LLM for post-OCR correction to show the capability of increasing the character accuracy rate (CAR) and word accuracy rate (WAR) compared to off-the-shelf OCR.
- [182] arXiv:2411.09329 [pdf, html, other]
-
Title: Improving hp-Variational Physics-Informed Neural Networks for Steady-State Convection-Dominated ProblemsComments: 25 pages, 11 figures, 8 tablesSubjects: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
This paper proposes and studies two extensions of applying hp-variational physics-informed neural networks, more precisely the FastVPINNs framework, to convection-dominated convection-diffusion-reaction problems. First, a term in the spirit of a SUPG stabilization is included in the loss functional and a network architecture is proposed that predicts spatially varying stabilization parameters. Having observed that the selection of the indicator function in hard-constrained Dirichlet boundary conditions has a big impact on the accuracy of the computed solutions, the second novelty is the proposal of a network architecture that learns good parameters for a class of indicator functions. Numerical studies show that both proposals lead to noticeably more accurate results than approaches that can be found in the literature.
- [183] arXiv:2411.09335 [pdf, html, other]
-
Title: Experimental Demonstration of Remote Synchronization in Coupled Nonlinear OscillatorSubjects: Systems and Control (eess.SY)
This study investigates remote synchronization in scale-free networks of coupled nonlinear oscillators inspired by synchronization observed in the brain's cortical regions and power grid. We employ the Master Stability Function (MSF) approach to analyze network stability across various oscillator models. Synchronization results are obtained for a star network using linearization techniques and extended to arbitrary networks with benchmark oscillators, verifying consistent behavior. Stable synchronous solutions emerge as the Floquet multiplier decreases and the MSF becomes negative. Additionally, we demonstrate remote synchronization in a star network, where peripheral oscillators communicate exclusively through a central hub, drawing parallels to neuronal synchronization in the brain. Experimental validation is achieved through an electronic circuit testbed, supported by nonlinear ODE modeling and LTspice simulation. Future work will extend the investigation to arbitrary network topologies, further elucidating synchronization dynamics in complex systems.
- [184] arXiv:2411.09339 [pdf, html, other]
-
Title: Re-Parameterization of Lightweight Transformer for On-Device Speech Emotion RecognitionSubjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
With the increasing implementation of machine learning models on edge or Internet-of-Things (IoT) devices, deploying advanced models on resource-constrained IoT devices remains challenging. Transformer models, a currently dominant neural architecture, have achieved great success in broad domains but their complexity hinders its deployment on IoT devices with limited computation capability and storage size. Although many model compression approaches have been explored, they often suffer from notorious performance degradation. To address this issue, we introduce a new method, namely Transformer Re-parameterization, to boost the performance of lightweight Transformer models. It consists of two processes: the High-Rank Factorization (HRF) process in the training stage and the deHigh-Rank Factorization (deHRF) process in the inference stage. In the former process, we insert an additional linear layer before the Feed-Forward Network (FFN) of the lightweight Transformer. It is supposed that the inserted HRF layers can enhance the model learning capability. In the later process, the auxiliary HRF layer will be merged together with the following FFN layer into one linear layer and thus recover the original structure of the lightweight model. To examine the effectiveness of the proposed method, we evaluate it on three widely used Transformer variants, i.e., ConvTransformer, Conformer, and SpeechFormer networks, in the application of speech emotion recognition on the IEMOCAP, M3ED and DAIC-WOZ datasets. Experimental results show that our proposed method consistently improves the performance of lightweight Transformers, even making them comparable to large models. The proposed re-parameterization approach enables advanced Transformer models to be deployed on resource-constrained IoT devices.
- [185] arXiv:2411.09341 [pdf, html, other]
-
Title: Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model AlignmentSubjects: Machine Learning (cs.LG)
The alignment of large language models (LLMs) is crucial for generating helpful and harmless content. Existing approaches leverage preference-based human feedback data to learn the reward function and align the LLM with the feedback data. However, these approaches focus on modeling the reward difference between the chosen and rejected demonstrations, rather than directly modeling the true reward from each demonstration. Moreover, these approaches assume that the reward is only obtained at the end of the sentence, which overlooks the modeling of intermediate rewards. These issues lead to insufficient use of training signals in the feedback data, limiting the representation and generalization ability of the reward and potentially resulting in reward hacking. In this paper, we formulate LLM alignment as a Bayesian Inverse Reinforcement Learning (BIRL) problem and propose a novel training objective, Approximated Variational Alignment (AVA), to perform LLM alignment through Approximated Variational Reward Imitation Learning (AVRIL). The BIRL formulation facilitates intermediate reward modeling and direct reward modeling on each single demonstration, which enhances the utilization of training signals in the feedback data. Experiments show that AVA outperforms existing LLM alignment approaches in reward modeling, RL fine-tuning, and direct optimization.
- [186] arXiv:2411.09344 [pdf, html, other]
-
Title: Adaptively Augmented Consistency Learning: A Semi-supervised Segmentation Framework for Remote SensingJournal-ref: International Conference on Neural Information Processing 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Remote sensing (RS) involves the acquisition of data about objects or areas from a distance, primarily to monitor environmental changes, manage resources, and support planning and disaster response. A significant challenge in RS segmentation is the scarcity of high-quality labeled images due to the diversity and complexity of RS image, which makes pixel-level annotation difficult and hinders the development of effective supervised segmentation algorithms. To solve this problem, we propose Adaptively Augmented Consistency Learning (AACL), a semi-supervised segmentation framework designed to enhances RS segmentation accuracy under condictions of limited labeled data. AACL extracts additional information embedded in unlabeled images through the use of Uniform Strength Augmentation (USAug) and Adaptive Cut-Mix (AdaCM). Evaluations across various RS datasets demonstrate that AACL achieves competitive performance in semi-supervised segmentation, showing up to a 20% improvement in specific categories and 2% increase in overall performance compared to state-of-the-art frameworks.
- [187] arXiv:2411.09347 [pdf, other]
-
Title: The Denotational Semantics of SSAComments: 94 pages, 38 figures, mechanization available at this https URLSubjects: Programming Languages (cs.PL); Logic in Computer Science (cs.LO)
Static single assignment form, or SSA, has been the dominant compiler intermediate representation for decades. In this paper, we give a type theory for a variant of SSA, including its equational theory, which are strong enough to validate a variety of control and data flow transformations. We also give a categorical semantics for SSA, and show that the type theory is sound and complete with respect to the categorical axiomatization. We demonstrate the utility of our model by exhibiting a variety of concrete models satisfying our axioms, including in particular a model of TSO weak memory. The correctness of the syntactic metatheory, as well as the completeness proof has been mechanized in the Lean proof assistant.
- [188] arXiv:2411.09349 [pdf, html, other]
-
Title: ParaLBench: A Large-Scale Benchmark for Computational Paralinguistics over Acoustic Foundation ModelsZixing Zhang, Weixiang Xu, Zhongren Dong, Kanglin Wang, Yimeng Wu, Jing Peng, Runming Wang, Dong-Yan HuangSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Computational paralinguistics (ComParal) aims to develop algorithms and models to automatically detect, analyze, and interpret non-verbal information from speech communication, e. g., emotion, health state, age, and gender. Despite its rapid progress, it heavily depends on sophisticatedly designed models given specific paralinguistic tasks. Thus, the heterogeneity and diversity of ComParal models largely prevent the realistic implementation of ComParal models. Recently, with the advent of acoustic foundation models because of self-supervised learning, developing more generic models that can efficiently perceive a plethora of paralinguistic information has become an active topic in speech processing. However, it lacks a unified evaluation framework for a fair and consistent performance comparison. To bridge this gap, we conduct a large-scale benchmark, namely ParaLBench, which concentrates on standardizing the evaluation process of diverse paralinguistic tasks, including critical aspects of affective computing such as emotion recognition and emotion dimensions prediction, over different acoustic foundation models. This benchmark contains ten datasets with thirteen distinct paralinguistic tasks, covering short-, medium- and long-term characteristics. Each task is carried out on 14 acoustic foundation models under a unified evaluation framework, which allows for an unbiased methodological comparison and offers a grounded reference for the ComParal community. Based on the insights gained from ParaLBench, we also point out potential research directions, i.e., the cross-corpus generalizability, to propel ComParal research in the future. The code associated with this study will be available to foster the transparency and replicability of this work for succeeding researchers.
- [189] arXiv:2411.09355 [pdf, html, other]
-
Title: Prices, Bids, Values: Everything, Everywhere, All at OnceSubjects: Computer Science and Game Theory (cs.GT)
We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most important information from bidders to maximize efficiency. The SOTA ML-based algorithms elicit bidders' preferences via value queries (i.e., "What is your value for the bundle $\{A,B\}$?"). However, the most popular iterative combinatorial auction in practice elicits information via more practical \emph{demand queries} (i.e., "At prices $p$, what is your most preferred bundle of items?"). In this paper, we examine the advantages of value and demand queries from both an auction design and an ML perspective. We propose a novel ML algorithm that provably integrates the full information from both query types. As suggested by our theoretical analysis, our experimental results verify that combining demand and value queries results in significantly better learning performance. Building on these insights, we present MLHCA, the most efficient ICA ever designed. MLHCA substantially outperforms the previous SOTA in realistic auction settings, delivering large efficiency gains. Compared to the previous SOTA, MLHCA reduces efficiency loss by up to a factor of 10, and in the most challenging and realistic domain, MLHCA outperforms the previous SOTA using 30% fewer queries. Thus, MLHCA achieves efficiency improvements that translate to welfare gains of hundreds of millions of USD, while also reducing the cognitive load on the bidders, establishing a new benchmark both for practicability and for economic impact.
- [190] arXiv:2411.09356 [pdf, html, other]
-
Title: Multi-scale Generative Modeling for Fast SamplingXiongye Xiao, Shixuan Li, Luzhe Huang, Gengshuo Liu, Trung-Kien Nguyen, Yi Huang, Di Chang, Mykel J. Kochenderfer, Paul BogdanSubjects: Artificial Intelligence (cs.AI)
While working within the spatial domain can pose problems associated with ill-conditioned scores caused by power-law decay, recent advances in diffusion-based generative models have shown that transitioning to the wavelet domain offers a promising alternative. However, within the wavelet domain, we encounter unique challenges, especially the sparse representation of high-frequency coefficients, which deviates significantly from the Gaussian assumptions in the diffusion process. To this end, we propose a multi-scale generative modeling in the wavelet domain that employs distinct strategies for handling low and high-frequency bands. In the wavelet domain, we apply score-based generative modeling with well-conditioned scores for low-frequency bands, while utilizing a multi-scale generative adversarial learning for high-frequency bands. As supported by the theoretical analysis and experimental results, our model significantly improve performance and reduce the number of trainable parameters, sampling steps, and time.
- [191] arXiv:2411.09359 [pdf, html, other]
-
Title: Your Fixed Watermark is Fragile: Towards Semantic-Aware Watermark for EaaS Copyright ProtectionSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Embedding-as-a-Service (EaaS) has emerged as a successful business pattern but faces significant challenges related to various forms of copyright infringement, including API misuse and different attacks. Various studies have proposed backdoor-based watermarking schemes to protect the copyright of EaaS services. In this paper, we reveal that previous watermarking schemes possess semantic-independent characteristics and propose the Semantic Perturbation Attack (SPA). Our theoretical and experimental analyses demonstrate that this semantic-independent nature makes current watermarking schemes vulnerable to adaptive attacks that exploit semantic perturbations test to bypass watermark verification. To address this vulnerability, we propose the Semantic Aware Watermarking (SAW) scheme, a robust defense mechanism designed to resist SPA, by injecting a watermark that adapts to the text semantics. Extensive experimental results across multiple datasets demonstrate that the True Positive Rate (TPR) for detecting watermarked samples under SPA can reach up to more than 95%, rendering previous watermarks ineffective. Meanwhile, our watermarking scheme can resist such attack while ensuring the watermark verification capability. Our code is available at this https URL.
- [192] arXiv:2411.09360 [pdf, html, other]
-
Title: D4W: Dependable Data-Driven Dynamics for Wheeled RobotsComments: The Fifth International Conference on Distributed Artificial IntelligenceSubjects: Robotics (cs.RO)
Wheeled robots have gained significant attention due to their wide range of applications in manufacturing, logistics, and service industries. However, due to the difficulty of building a highly accurate dynamics model for wheeled robots, developing and testing control algorithms for them remains challenging and time-consuming, requiring extensive physical experimentation. To address this problem, we propose D4W, i.e., Dependable Data-Driven Dynamics for Wheeled Robots, a simulation framework incorporating data-driven methods to accelerate the development and evaluation of algorithms for wheeled robots. The key contribution of D4W is a solution that utilizes real-world sensor data to learn accurate models of robot dynamics. The learned dynamics can capture complex robot behaviors and interactions with the environment throughout simulations, surpassing the limitations of analytical methods, which only work in simplified scenarios. Experimental results show that D4W achieves the best simulation accuracy compared to traditional approaches, allowing for rapid iteration of wheel robot algorithms with less or no need for fine-tuning in reality. We further verify the usability and practicality of the proposed framework through integration with existing simulators and controllers.
- [193] arXiv:2411.09361 [pdf, html, other]
-
Title: Time-to-Event Pretraining for 3D Medical ImagingZepeng Huo, Jason Alan Fries, Alejandro Lozano, Jeya Maria Jose Valanarasu, Ethan Steinberg, Louis Blankemeier, Akshay S. Chaudhari, Curtis Langlotz, Nigam H. ShahComments: 34 pages, 19 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
With the rise of medical foundation models and the growing availability of imaging data, scalable pretraining techniques offer a promising way to identify imaging biomarkers predictive of future disease risk. While current self-supervised methods for 3D medical imaging models capture local structural features like organ morphology, they fail to link pixel biomarkers with long-term health outcomes due to a missing context problem. Current approaches lack the temporal context necessary to identify biomarkers correlated with disease progression, as they rely on supervision derived only from images and concurrent text descriptions. To address this, we introduce time-to-event pretraining, a pretraining framework for 3D medical imaging models that leverages large-scale temporal supervision from paired, longitudinal electronic health records (EHRs). Using a dataset of 18,945 CT scans (4.2 million 2D images) and time-to-event distributions across thousands of EHR-derived tasks, our method improves outcome prediction, achieving an average AUROC increase of 23.7% and a 29.4% gain in Harrell's C-index across 8 benchmark tasks. Importantly, these gains are achieved without sacrificing diagnostic classification performance. This study lays the foundation for integrating longitudinal EHR and 3D imaging data to advance clinical risk prediction.
- [194] arXiv:2411.09365 [pdf, other]
-
Title: Stability and Generalization for Distributed SGDASubjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Minimax optimization is gaining increasing attention in modern machine learning applications. Driven by large-scale models and massive volumes of data collected from edge devices, as well as the concern to preserve client privacy, communication-efficient distributed minimax optimization algorithms become popular, such as Local Stochastic Gradient Descent Ascent (Local-SGDA), and Local Decentralized SGDA (Local-DSGDA). While most existing research on distributed minimax algorithms focuses on convergence rates, computation complexity, and communication efficiency, the generalization performance remains underdeveloped, whereas generalization ability is a pivotal indicator for evaluating the holistic performance of a model when fed with unknown data. In this paper, we propose the stability-based generalization analytical framework for Distributed-SGDA, which unifies two popular distributed minimax algorithms including Local-SGDA and Local-DSGDA, and conduct a comprehensive analysis of stability error, generalization gap, and population risk across different metrics under various settings, e.g., (S)C-(S)C, PL-SC, and NC-NC cases. Our theoretical results reveal the trade-off between the generalization gap and optimization error and suggest hyperparameters choice to obtain the optimal population risk. Numerical experiments for Local-SGDA and Local-DSGDA validate the theoretical results.
- [195] arXiv:2411.09366 [pdf, html, other]
-
Title: LTLf+ and PPLTL+: Extending LTLf and PPLTL to Infinite TracesSubjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
We introduce LTLf+ and PPLTL+, two logics to express properties of infinite traces, that are based on the linear-time temporal logics LTLf and PPLTL on finite traces. LTLf+/PPLTL+ use levels of Manna and Pnueli's LTL safety-progress hierarchy, and thus have the same expressive power as LTL. However, they also retain a crucial characteristic of the reactive synthesis problem for the base logics: the game arena for strategy extraction can be derived from deterministic finite automata (DFA). Consequently, these logics circumvent the notorious difficulties associated with determinizing infinite trace automata, typical of LTL reactive synthesis. We present DFA-based synthesis techniques for LTLf+/PPLTL+, and show that synthesis is 2EXPTIME-complete for LTLf+ (matching LTLf) and EXPTIME-complete for PPLTL+ (matching PPLTL). Notably, while PPLTL+ retains the full expressive power of LTL, reactive synthesis is EXPTIME-complete instead of 2EXPTIME-complete. The techniques are also adapted to optimally solve satisfiability, validity, and model-checking, to get EXPSPACE-complete for LTLf+ (extending a recent result for the guarantee level using LTLf), and PSPACE-complete for PPLTL+.
- [196] arXiv:2411.09371 [pdf, html, other]
-
Title: DSCformer: A Dual-Branch Network Integrating Enhanced Dynamic Snake Convolution and SegFormer for Crack SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
In construction quality monitoring, accurately detecting and segmenting cracks in concrete structures is paramount for safety and maintenance. Current convolutional neural networks (CNNs) have demonstrated strong performance in crack segmentation tasks, yet they often struggle with complex backgrounds and fail to capture fine-grained tubular structures fully. In contrast, Transformers excel at capturing global context but lack precision in detailed feature extraction. We introduce DSCformer, a novel hybrid model that integrates an enhanced Dynamic Snake Convolution (DSConv) with a Transformer architecture for crack segmentation to address these challenges. Our key contributions include the enhanced DSConv through a pyramid kernel for adaptive offset computation and a simultaneous bi-directional learnable offset iteration, significantly improving the model's performance to capture intricate crack patterns. Additionally, we propose a Weighted Convolutional Attention Module (WCAM), which refines channel attention, allowing for more precise and adaptive feature attention. We evaluate DSCformer on the Crack3238 and FIND datasets, achieving IoUs of 59.22\% and 87.24\%, respectively. The experimental results suggest that our DSCformer outperforms state-of-the-art methods across different datasets.
- [197] arXiv:2411.09380 [pdf, html, other]
-
Title: Connecting the Unconnected: A DT Case Study of Nomadic Nodes Deployment in NepalComments: Accepted for publication at IEEE CCNC 2025Subjects: Networking and Internet Architecture (cs.NI)
This paper addresses the challenge of robust cellular connectivity in dense, underdeveloped urban environments, specifically focusing on Kathmandu, Nepal. As cities grow, existing cellular infrastructure struggles to meet the demand for reliable, high-throughput, and low-latency communication services. The lack of investment in new technologies and the intricacies of the cities' landscape pose even more difficulties for robust connectivity. This work addresses the above challenges in a cost-effective and flexible way. We investigate the deployment of LTE Nomadic Nodes (NNs) at scale in order to enhance network capacity and coverage. Utilising a Digital Twin (DT), we simulate and optimise NN placement, considering Kathmandu's physical and environmental characteristics. Our approach leverages the DRIVE DT framework, which enables the systemic evaluation of various network configurations and user mobility scenarios. The results demonstrate that NNs significantly improve signal strength and expected user datarates, presenting a viable solution for enhancing urban cellular connectivity.
- [198] arXiv:2411.09387 [pdf, html, other]
-
Title: Instruction-Driven Fusion of Infrared-Visible Images: Tailoring for Diverse Downstream TasksComments: 10 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
The primary value of infrared and visible image fusion technology lies in applying the fusion results to downstream tasks. However, existing methods face challenges such as increased training complexity and significantly compromised performance of individual tasks when addressing multiple downstream tasks simultaneously. To tackle this, we propose Task-Oriented Adaptive Regulation (T-OAR), an adaptive mechanism specifically designed for multi-task environments. Additionally, we introduce the Task-related Dynamic Prompt Injection (T-DPI) module, which generates task-specific dynamic prompts from user-input text instructions and integrates them into target representations. This guides the feature extraction module to produce representations that are more closely aligned with the specific requirements of downstream tasks. By incorporating the T-DPI module into the T-OAR framework, our approach generates fusion images tailored to task-specific requirements without the need for separate training or task-specific weights. This not only reduces computational costs but also enhances adaptability and performance across multiple tasks. Experimental results show that our method excels in object detection, semantic segmentation, and salient object detection, demonstrating its strong adaptability, flexibility, and task specificity. This provides an efficient solution for image fusion in multi-task environments, highlighting the technology's potential across diverse applications.
- [199] arXiv:2411.09388 [pdf, html, other]
-
Title: A survey of probabilistic generative frameworks for molecular simulationsSubjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Soft Condensed Matter (cond-mat.soft); Statistical Mechanics (cond-mat.stat-mech)
Generative artificial intelligence is now a widely used tool in molecular science. Despite the popularity of probabilistic generative models, numerical experiments benchmarking their performance on molecular data are lacking. In this work, we introduce and explain several classes of generative models, broadly sorted into two categories: flow-based models and diffusion models. We select three representative models: Neural Spline Flows, Conditional Flow Matching, and Denoising Diffusion Probabilistic Models, and examine their accuracy, computational cost, and generation speed across datasets with tunable dimensionality, complexity, and modal asymmetry. Our findings are varied, with no one framework being the best for all purposes. In a nutshell, (i) Neural Spline Flows do best at capturing mode asymmetry present in low-dimensional data, (ii) Conditional Flow Matching outperforms other models for high-dimensional data with low complexity, and (iii) Denoising Diffusion Probabilistic Models appears the best for low-dimensional data with high complexity. Our datasets include a Gaussian mixture model and the dihedral torsion angle distribution of the Aib\textsubscript{9} peptide, generated via a molecular dynamics simulation. We hope our taxonomy of probabilistic generative frameworks and numerical results may guide model selection for a wide range of molecular tasks.
- [200] arXiv:2411.09389 [pdf, html, other]
-
Title: Less is More: Unseen Domain Fake News Detection via Causal Propagation SubstructuresComments: 9 pages, 2 figures, 5 tablesSubjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
The spread of fake news on social media poses significant threats to individuals and society. Text-based and graph-based models have been employed for fake news detection by analysing news content and propagation networks, showing promising results in specific scenarios. However, these data-driven models heavily rely on pre-existing in-distribution data for training, limiting their performance when confronted with fake news from emerging or previously unseen domains, known as out-of-distribution (OOD) data. Tackling OOD fake news is a challenging yet critical task. In this paper, we introduce the Causal Subgraph-oriented Domain Adaptive Fake News Detection (CSDA) model, designed to enhance zero-shot fake news detection by extracting causal substructures from propagation graphs using in-distribution data and generalising this approach to OOD data. The model employs a graph neural network based mask generation process to identify dominant nodes and edges within the propagation graph, using these substructures for fake news detection. Additionally, the performance of CSDA is further improved through contrastive learning in few-shot scenarios, where a limited amount of OOD data is available for training. Extensive experiments on public social media datasets demonstrate that CSDA effectively handles OOD fake news detection, achieving a 7 to 16 percents accuracy improvement over other state-of-the-art models.
- [201] arXiv:2411.09391 [pdf, other]
-
Title: An Optimizing Just-In-Time Compiler for RotorComments: First International Conference of Innovative Views of .NET Technologies, June 2005Journal-ref: TProceedings of the First International Conference of Innovative Views of .NET Technologies, 107-121. Porto, Portugal. ISBN 972-8688-31-8 (2005)Subjects: Programming Languages (cs.PL)
The Shared Source CLI (SSCLI), also known as Rotor, is an implementation of the CLI released by Microsoft in source code. Rotor includes a single pass just-in-time compiler that generates non-optimized code for Intel IA-32 and IBM PowerPC processors. We extend Rotor with an optimizing just-in-time compiler for IA-32. This compiler has three passes: control flow graph generation, data dependence graph generation and final code generation. Dominance relations in the control flow graph are used to detect natural loops. A number of optimizations are performed during the generation of the data dependence graph. During native code generation, the rich address modes of IA-32 are used for instruction folding, reducing code size and usage of register names. Despite the overhead of three passes and optimizations, this compiler is only 1.4 to 1.9 times slower than the original SSCLI compiler and generates code that runs 6.4 to 10 times faster.
- [202] arXiv:2411.09393 [pdf, html, other]
-
Title: Inherently Interpretable and Uncertainty-Aware Models for Online Learning in Cyber-Security ProblemsSubjects: Machine Learning (cs.LG)
In this paper, we address the critical need for interpretable and uncertainty-aware machine learning models in the context of online learning for high-risk industries, particularly cyber-security. While deep learning and other complex models have demonstrated impressive predictive capabilities, their opacity and lack of uncertainty quantification present significant questions about their trustworthiness. We propose a novel pipeline for online supervised learning problems in cyber-security, that harnesses the inherent interpretability and uncertainty awareness of Additive Gaussian Processes (AGPs) models. Our approach aims to balance predictive performance with transparency while improving the scalability of AGPs, which represents their main drawback, potentially enabling security analysts to better validate threat detection, troubleshoot and reduce false positives, and generally make trustworthy, informed decisions. This work contributes to the growing field of interpretable AI by proposing a class of models that can be significantly beneficial for high-stake decision problems such as the ones typical of the cyber-security domain. The source code is available.
- [203] arXiv:2411.09400 [pdf, other]
-
Title: Imagined Speech and Visual Imagery as Intuitive Paradigms for Brain-Computer InterfacesComments: 4 pagesSubjects: Artificial Intelligence (cs.AI)
Recent advancements in brain-computer interface (BCI) technology have emphasized the promise of imagined speech and visual imagery as effective paradigms for intuitive communication. This study investigates the classification performance and brain connectivity patterns associated with these paradigms, focusing on decoding accuracy across selected word classes. Sixteen participants engaged in tasks involving thirteen imagined speech and visual imagery classes, revealing above-chance classification accuracy for both paradigms. Variability in classification accuracy across individual classes highlights the influence of sensory and motor associations in imagined speech and vivid visual associations in visual imagery. Connectivity analysis further demonstrated increased functional connectivity in language-related and sensory regions for imagined speech, whereas visual imagery activated spatial and visual processing networks. These findings suggest the potential of imagined speech and visual imagery as an intuitive and scalable paradigm for BCI communication when selecting optimal word classes. Further exploration of the decoding outcomes for these two paradigms could provide insights for practical BCI communication.
- [204] arXiv:2411.09410 [pdf, html, other]
-
Title: LLM-assisted Explicit and Implicit Multi-interest Learning Framework for Sequential RecommendationComments: 10 pagesSubjects: Information Retrieval (cs.IR)
Multi-interest modeling in current recommender systems (RS) is mainly based on user behavioral data, capturing user interest preferences from multiple dimensions. However, since behavioral data is implicit and often highly sparse, it is challenging to understand users' complex and diverse interests. Recent studies have shown that the rich semantic information in the text can effectively supplement the deficiencies of behavioral data. Despite this, it is still difficult for small models to directly extract semantic features associated with users' deep interests. That is, how to effectively align semantics with behavioral information to form a more comprehensive and accurate understanding of user interests has become a critical research this http URL address this, we propose an LLM-assisted explicit and implicit multi-interest learning framework (named EIMF) to model user interests on two levels: behavior and semantics. The framework consists of two parts: Implicit Behavioral Interest Module (IBIM) and Explicit Semantic Interest Module (ESIM). The traditional multi-interest RS model in IBIM can learn users' implicit behavioral interests from interactions with items. In ESIM, we first adopt a clustering algorithm to select typical samples and design a prompting strategy on LLM to obtain explicit semantic interests. Furthermore, in the training phase, the semantic interests of typical samples can enhance the representation learning of behavioral interests based on the multi-task learning on semantic prediction and modality alignment. Therefore, in the inference stage, accurate recommendations can be achieved with only the user's behavioral data. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed EIMF framework, which effectively and efficiently combines small models with LLM to improve the accuracy of multi-interest modeling.
- [205] arXiv:2411.09411 [pdf, html, other]
-
Title: Building Height Estimation Using Shadow Length in Satellite ImageryComments: 6 pages, 5 figures, 2 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Estimating building height from satellite imagery poses significant challenges, especially when monocular images are employed, resulting in a loss of essential 3D information during imaging. This loss of spatial depth further complicates the height estimation process. We addressed this issue by using shadow length as an additional cue to compensate for the loss of building height estimation using single-view imagery. We proposed a novel method that first localized a building and its shadow in the given satellite image. After localization, the shadow length is estimated using a regression model. To estimate the final height of each building, we utilize the principles of photogrammetry, specifically considering the relationship between the solar elevation angle, the vertical edge length of the building, and the length of the building's shadow. For the localization of buildings in our model, we utilized a modified YOLOv7 detector, and to regress the shadow length for each building we utilized the ResNet18 as backbone architecture. Finally, we estimated the associated building height using solar elevation with shadow length through analytical formulation. We evaluated our method on 42 different cities and the results showed that the proposed framework surpasses the state-of-the-art methods with a suitable margin.
- [206] arXiv:2411.09413 [pdf, html, other]
-
Title: Script-centric behavior understanding for assisted autism spectrum disorder diagnosisComments: 5 pages, 4 figures, submitted to ICASSP 2025Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Observing and analyzing children's social behaviors is crucial for the early diagnosis of Autism Spectrum Disorders (ASD). This work focuses on automatically detecting ASD using computer vision techniques and large language models (LLMs). Existing methods typically rely on supervised learning. However, the scarcity of ASD diagnostic datasets and the lack of interpretability in diagnostic results significantly limits its clinical application. To address these challenges, we introduce a novel unsupervised approach based on script-centric behavior understanding. Our pipeline converts video content into scripts that describe the behavior of characters, leveraging the generalizability of large language models to detect ASD in a zero-shot or few-shot manner. Specifically, we propose a scripts transcription module for multimodal behavior data textualization and a domain prompts module to bridge LLMs. Our method achieves an accuracy of 92.00\% in diagnosing ASD in children with an average age of 24 months, surpassing the performance of supervised learning methods by 3.58\% absolutely. Extensive experiments confirm the effectiveness of our approach and suggest its potential for advancing ASD research through LLMs.
- [207] arXiv:2411.09420 [pdf, html, other]
-
Title: SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision TransformersComments: 10 pages, 4 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Image classification is a computer vision task where a model analyzes an image to categorize it into a specific label. Vision Transformers (ViT) improve this task by leveraging self-attention to capture complex patterns and long range relationships between image patches. However, a key challenge for ViTs is efficiently incorporating multiscale feature representations, which is inherent in CNNs through their hierarchical structure. In this paper, we introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that addresses this challenge by integrating multi-scale features. Using EfficientNet as a backbone, the model extracts multi-scale feature maps, which are divided into patches to preserve semantic information. These patches are organized into a graph based on spatial and feature similarities, with a Graph Attention Network (GAT) refining the node embeddings. Finally, a Transformer encoder captures long-range dependencies and complex interactions. The SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in enhancing image classification performance.
- [208] arXiv:2411.09422 [pdf, other]
-
Title: An Adaptive Open-Source Dataset Generation Framework for Machine Learning Tasks in Logic SynthesisLiwei Ni, Rui Wang, Miao Liu, Xingyu Meng, Xiaoze Lin, Junfeng Liu, Guojie Luo, Zhufei Chu, Weikang Qian, Xiaoyan Yang, Biwei Xie, Xingquan Li, Huawei LiComments: 14 pagesSubjects: Artificial Intelligence (cs.AI)
This paper introduces an adaptive logic synthesis dataset generation framework designed to enhance machine learning applications within the logic synthesis process. Unlike previous dataset generation flows that were tailored for specific tasks or lacked integrated machine learning capabilities, the proposed framework supports a comprehensive range of machine learning tasks by encapsulating the three fundamental steps of logic synthesis: Boolean representation, logic optimization, and technology mapping. It preserves the original information in the intermediate files that can be stored in both Verilog and Graphmal format. Verilog files enable semi-customizability, allowing researchers to add steps and incrementally refine the generated dataset. The framework also includes an adaptive circuit engine to facilitate the loading of GraphML files for final dataset packaging and sub-dataset extraction. The generated OpenLS-D dataset comprises 46 combinational designs from established benchmarks, totaling over 966,000 Boolean circuits, with each design containing 21,000 circuits generated from 1000 synthesis recipes, including 7000 Boolean networks, 7000 ASIC netlists, and 7000 FPGA netlists. Furthermore, OpenLS-D supports integrating newly desired data features, making it more versatile for new challenges. The utility of OpenLS-D is demonstrated through four distinct downstream tasks: circuit classification, circuit ranking, quality of results (QoR) prediction, and probability prediction. Each task highlights different internal steps of logic synthesis, with the datasets extracted and relabeled from the OpenLS-D dataset using the circuit engine. The experimental results confirm the dataset's diversity and extensive applicability. The source code and datasets are available at this https URL.
- [209] arXiv:2411.09425 [pdf, html, other]
-
Title: MARM: Unlocking the Future of Recommendation Systems through Memory Augmentation and Scalable ComplexityXiao Lv, Jiangxia Cao, Shijie Guan, Xiaoyou Zhou, Zhiguang Qi, Yaqiang Zang, Ming Li, Ben Wang, Kun Gai, Guorui ZhouComments: Work in progressSubjects: Information Retrieval (cs.IR)
Scaling-law has guided the language model designing for past years, however, it is worth noting that the scaling laws of NLP cannot be directly applied to RecSys due to the following reasons: (1) The amount of training samples and model parameters is typically not the bottleneck for the model. Our recommendation system can generate over 50 billion user samples daily, and such a massive amount of training data can easily allow our model parameters to exceed 200 billion, surpassing many LLMs (about 100B). (2) To ensure the stability and robustness of the recommendation system, it is essential to control computational complexity FLOPs carefully. Considering the above differences with LLM, we can draw a conclusion that: for a RecSys model, compared to model parameters, the computational complexity FLOPs is a more expensive factor that requires careful control. In this paper, we propose our milestone work, MARM (Memory Augmented Recommendation Model), which explores a new cache scaling-laws successfully.
- [210] arXiv:2411.09431 [pdf, html, other]
-
Title: Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech DataComments: Accepted at ECML PKDD 2024, 4th Workshop on Bias and Fairness in AI (BIAS)Subjects: Computation and Language (cs.CL)
Recent research has shown that state-of-the-art (SotA) Automatic Speech Recognition (ASR) systems, such as Whisper, often exhibit predictive biases that disproportionately affect various demographic groups. This study focuses on identifying the performance disparities of Whisper models on Dutch speech data from the Common Voice dataset and the Dutch National Public Broadcasting organisation. We analyzed the word error rate, character error rate and a BERT-based semantic similarity across gender groups. We used the moral framework of Weerts et al. (2022) to assess quality of service harms and fairness, and to provide a nuanced discussion on the implications of these biases, particularly for automatic subtitling. Our findings reveal substantial disparities in word error rate (WER) among gender groups across all model sizes, with bias identified through statistical testing.
- [211] arXiv:2411.09433 [pdf, html, other]
-
Title: Field-based Security Testing of SDN configuration UpdatesComments: 9 Figures, 6 Tables, 1 AlgorithmSubjects: Software Engineering (cs.SE); Networking and Internet Architecture (cs.NI)
Software-defined systems revolutionized the management of hardware devices but introduced quality assurance challenges that remain to be tackled. For example, software defined networks (SDNs) became a key technology for the prompt reconfigurations of network services in many sectors including telecommunications, data centers, financial services, cloud providers, and manufacturing industry. Unfortunately, reconfigurations may lead to mistakes that compromise the dependability of the provided services. In this paper, we focus on the reconfigurations of network services in the satellite communication sector, and target security requirements, which are often hard to verify; for example, although connectivity may function properly, confidentiality may be broken by packets forwarded to a wrong destination. We propose an approach for FIeld-based Security Testing of SDN Configurations Updates (FISTS). First, it probes the network before and after configuration updates. Then, using the collected data, it relies on unsupervised machine learning algorithms to prioritize the inspection of suspicious node responses, after identifying the network nodes that likely match across the two configurations. Our empirical evaluation has been conducted with network data from simulated and real SDN configuration updates for our industry partner, a world-leading satellite operator. Our results show that, when combined with K-Nearest Neighbour, FISTS leads to best results (up to 0.95 precision and 1.00 recall). Further, we demonstrated its scalability.
- [212] arXiv:2411.09434 [pdf, html, other]
-
Title: Mediffusion: Joint Diffusion for Self-Explainable Semi-Supervised Classification and Medical Image GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
We introduce Mediffusion -- a new method for semi-supervised learning with explainable classification based on a joint diffusion model. The medical imaging domain faces unique challenges due to scarce data labelling -- insufficient for standard training, and critical nature of the applications that require high performance, confidence, and explainability of the models. In this work, we propose to tackle those challenges with a single model that combines standard classification with a diffusion-based generative task in a single shared parametrisation. By sharing representations, our model effectively learns from both labeled and unlabeled data while at the same time providing accurate explanations through counterfactual examples. In our experiments, we show that our Mediffusion achieves results comparable to recent semi-supervised methods while providing more reliable and precise explanations.
- [213] arXiv:2411.09435 [pdf, html, other]
-
Title: ReMP: Reusable Motion Prior for Multi-domain 3D Human Pose Estimation and Motion InbetweeningComments: 8 main pages, WACV 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present Reusable Motion prior (ReMP), an effective motion prior that can accurately track the temporal evolution of motion in various downstream tasks. Inspired by the success of foundation models, we argue that a robust spatio-temporal motion prior can encapsulate underlying 3D dynamics applicable to various sensor modalities. We learn the rich motion prior from a sequence of complete parametric models of posed human body shape. Our prior can easily estimate poses in missing frames or noisy measurements despite significant occlusion by employing a temporal attention mechanism. More interestingly, our prior can guide the system with incomplete and challenging input measurements to quickly extract critical information to estimate the sequence of poses, significantly improving the training efficiency for mesh sequence recovery. ReMP consistently outperforms the baseline method on diverse and practical 3D motion data, including depth point clouds, LiDAR scans, and IMU sensor data. Project page is available in this https URL.
- [214] arXiv:2411.09436 [pdf, html, other]
-
Title: Robot Tasks with Fuzzy Time Requirements from Natural Language InstructionsComments: 9 pages, 8 figures, to be published in 2024 IEEE International Conference on Robotic Computing (IRC)Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Natural language allows robot programming to be accessible to everyone. However, the inherent fuzziness in natural language poses challenges for inflexible, traditional robot systems. We focus on instructions with fuzzy time requirements (e.g., "start in a few minutes"). Building on previous robotics research, we introduce fuzzy skills. These define an execution by the robot with so-called satisfaction functions representing vague execution time requirements. Such functions express a user's satisfaction over potential starting times for skill execution. When the robot handles multiple fuzzy skills, the satisfaction function provides a temporal tolerance window for execution, thus, enabling optimal scheduling based on satisfaction. We generalized such functions based on individual user expectations with a user study. The participants rated their satisfaction with an instruction's execution at various times. Our investigations reveal that trapezoidal functions best approximate the users' satisfaction. Additionally, the results suggest that users are more lenient if the execution is specified further into the future.
- [215] arXiv:2411.09439 [pdf, html, other]
-
Title: Spider: Any-to-Many Multimodal LLMSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities 'Text + X' within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel efficient Any-to-Many Modalities Generation (AMMG) framework, which can generate an arbitrary combination of modalities 'Text + Xs', such as Text + {Image and Audio and Video}. To achieve efficient AMMG, our Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents, and an Any-to-Many Instruction Template designed for producing Xs signal prompts. To train Spider, we constructed a novel Text-formatted Many-Modal (TMM) dataset, which facilitates the learning of the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Ultimately, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, enhancing the potential for AMMG task in future research. Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field.
- [216] arXiv:2411.09440 [pdf, html, other]
-
Title: Evaluation of RIS-Enabled B5G/6G Indoor Positioning and Mapping using Ray Tracing ModelsDimitris Kompostiotis, Dimitris Vordonis, Vassilis Paliouras, George C. Alexandropoulos, Florin GrecComments: 6 pages, 4 figures, to be presented in an ESA workshopSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
A Reconfigurable Intelligent Surface (RIS) can significantly enhance network positioning and mapping, acting as an additional anchor point in the reference system and improving signal strength and measurement diversity through the generation of favorable scattering conditions and virtual line-of-sight paths. In this paper, we present a comprehensive framework aimed at user localization and scatterer position estimation in an indoor environment with multipath effects. Our approach leverages beam sweeping through codebook-based beamforming at an 1-bit RIS to scan the environment, applies signal component extraction mechanisms, and utilizes a super-resolution algorithm for angle-based positioning of both connected users and scatterers. To validate the system's effectiveness, accurate 3D ray tracing models are employed, ensuring the robustness and effectiveness of the proposed approach in practical scenarios.
- [217] arXiv:2411.09441 [pdf, html, other]
-
Title: A ROS~2-based Navigation and Simulation Stack for the RobotinoComments: Published at RoboCup 2024: Robot World Cup XXVII, Springer-Verlag, 2024Subjects: Robotics (cs.RO)
The Robotino, developed by Festo Didactic, serves as a versatile platform in education and research for mobile robotics tasks. However, there currently is no ROS2 integration for the Robotino available. In this paper, we describe our work on a Webots simulation environment for a Robotino platform extended by LIDAR sensors. A ROS2 integration and a pre-configured setup for localization and navigation using existing ROS packages from the Nav2 suite are provided. We validate our setup by comparing simulations with real-world experiments conducted by three Robotinos in a logistics environment in our lab. Additionally, we tested the setup using a ROS 2 hardware driver for the Robotino developed by team GRIPS of the RoboCup Logistics League. The results demonstrate the feasibility of using ROS2 and Nav2 for navigation tasks on the Robotino platform showing great consistency between simulation and real-world performance.
- [218] arXiv:2411.09444 [pdf, html, other]
-
Title: Learning efficient and provably convergent splitting methodsSubjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Splitting methods are widely used for solving initial value problems (IVPs) due to their ability to simplify complicated evolutions into more manageable subproblems which can be solved efficiently and accurately. Traditionally, these methods are derived using analytic and algebraic techniques from numerical analysis, including truncated Taylor series and their Lie algebraic analogue, the Baker--Campbell--Hausdorff formula. These tools enable the development of high-order numerical methods that provide exceptional accuracy for small timesteps. Moreover, these methods often (nearly) conserve important physical invariants, such as mass, unitarity, and energy. However, in many practical applications the computational resources are limited. Thus, it is crucial to identify methods that achieve the best accuracy within a fixed computational budget, which might require taking relatively large timesteps. In this regime, high-order methods derived with traditional methods often exhibit large errors since they are only designed to be asymptotically optimal. Machine Learning techniques offer a potential solution since they can be trained to efficiently solve a given IVP with less computational resources. However, they are often purely data-driven, come with limited convergence guarantees in the small-timestep regime and do not necessarily conserve physical invariants. In this work, we propose a framework for finding machine learned splitting methods that are computationally efficient for large timesteps and have provable convergence and conservation guarantees in the small-timestep limit. We demonstrate numerically that the learned methods, which by construction converge quadratically in the timestep size, can be significantly more efficient than established methods for the Schrödinger equation if the computational budget is limited.
- [219] arXiv:2411.09449 [pdf, html, other]
-
Title: Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Diffusion models have revitalized the image generation domain, playing crucial roles in both academic research and artistic expression. With the emergence of new diffusion models, assessing the performance of text-to-image models has become increasingly important. Current metrics focus on directly matching the input text with the generated image, but due to cross-modal information asymmetry, this leads to unreliable or incomplete assessment results. Motivated by this, we introduce the Image Regeneration task in this study to assess text-to-image models by tasking the T2I model with generating an image according to the reference image. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model, allowing T2I models to understand image content. This evaluation process is simplified as comparisons between the generated image and the reference image are straightforward. Two regeneration datasets spanning content-diverse and style-diverse evaluation dataset are introduced to evaluate the leading diffusion models currently available. Additionally, we present ImageRepainter framework to enhance the quality of generated images by improving content comprehension via MLLM guided iterative generation and revision. Our comprehensive experiments have showcased the effectiveness of this framework in assessing the generative capabilities of models. By leveraging MLLM, we have demonstrated that a robust T2M can produce images more closely resembling the reference image.
- [220] arXiv:2411.09451 [pdf, html, other]
-
Title: DiffRoad: Realistic and Diverse Road Scenario Generation for Autonomous Vehicle TestingComments: 14 pages, 9 figuresSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Generating realistic and diverse road scenarios is essential for autonomous vehicle testing and validation. Nevertheless, owing to the complexity and variability of real-world road environments, creating authentic and varied scenarios for intelligent driving testing is challenging. In this paper, we propose DiffRoad, a novel diffusion model designed to produce controllable and high-fidelity 3D road scenarios. DiffRoad leverages the generative capabilities of diffusion models to synthesize road layouts from white noise through an inverse denoising process, preserving real-world spatial features. To enhance the quality of generated scenarios, we design the Road-UNet architecture, optimizing the balance between backbone and skip connections for high-realism scenario generation. Furthermore, we introduce a road scenario evaluation module that screens adequate and reasonable scenarios for intelligent driving testing using two critical metrics: road continuity and road reasonableness. Experimental results on multiple real-world datasets demonstrate DiffRoad's ability to generate realistic and smooth road structures while maintaining the original distribution. Additionally, the generated scenarios can be fully automated into the OpenDRIVE format, facilitating generalized autonomous vehicle simulation testing. DiffRoad provides a rich and diverse scenario library for large-scale autonomous vehicle testing and offers valuable insights for future infrastructure designs that are better suited for autonomous vehicles.
- [221] arXiv:2411.09453 [pdf, html, other]
-
Title: Long-Tailed Object Detection Pre-training: Dynamic Rebalancing Contrastive Learning with Dual ReconstructionComments: Accepted by NeurIPS 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Pre-training plays a vital role in various vision tasks, such as object recognition and detection. Commonly used pre-training methods, which typically rely on randomized approaches like uniform or Gaussian distributions to initialize model parameters, often fall short when confronted with long-tailed distributions, especially in detection tasks. This is largely due to extreme data imbalance and the issue of simplicity bias. In this paper, we introduce a novel pre-training framework for object detection, called Dynamic Rebalancing Contrastive Learning with Dual Reconstruction (2DRCL). Our method builds on a Holistic-Local Contrastive Learning mechanism, which aligns pre-training with object detection by capturing both global contextual semantics and detailed local patterns. To tackle the imbalance inherent in long-tailed data, we design a dynamic rebalancing strategy that adjusts the sampling of underrepresented instances throughout the pre-training process, ensuring better representation of tail classes. Moreover, Dual Reconstruction addresses simplicity bias by enforcing a reconstruction task aligned with the self-consistency principle, specifically benefiting underrepresented tail classes. Experiments on COCO and LVIS v1.0 datasets demonstrate the effectiveness of our method, particularly in improving the mAP/AP scores for tail classes.
- [222] arXiv:2411.09459 [pdf, html, other]
-
Title: Caravan MultiMet: Extending Caravan with Multiple Weather Nowcasts and ForecastsSubjects: Machine Learning (cs.LG)
The Caravan large-sample hydrology dataset (Kratzert et al., 2023) was created to standardize and harmonize streamflow data from various regional datasets, combined with globally available meteorological forcing and catchment attributes. This community-driven project also allows researchers to conveniently extend the dataset for additional basins, as done 6 times to date (see this https URL). We present a novel extension to Caravan, focusing on enriching the meteorological forcing data. Our extension adds three precipitation nowcast products (CPC, IMERG v07 Early, and CHIRPS) and three weather forecast products (ECMWF IFS HRES, GraphCast, and CHIRPS-GEFS) to the existing ERA5-Land reanalysis data. The inclusion of diverse data sources, particularly weather forecasts, enables more robust evaluation and benchmarking of hydrological models, especially for real-time forecasting scenarios. To the best of our knowledge, this extension makes Caravan the first large-sample hydrology dataset to incorporate weather forecast data, significantly enhancing its capabilities and fostering advancements in hydrological research, benchmarking, and real-time hydrologic forecasting. The data is publicly available under a CC-BY-4.0 license on Zenodo in two parts (this https URL, this https URL) and on Google Cloud Platform (GCP) - see more under the Data Availability chapter.
- [223] arXiv:2411.09460 [pdf, html, other]
-
Title: Analysis Methodology for Age of Information under Sequence Based SchedulingSubjects: Information Theory (cs.IT)
We focus on the Age of Information (AoI) performance in a system where each user generates packets periodically to send to a common access point (AP) for status updating. To avoid heavy overhead, we assume that channel sensing, feedback information from the AP, and time synchronization are not available in the system. We adopt a multi-access scheme called the sequence scheme, where each user is assigned a periodic binary sequence to schedule their transmissions. In our previous work [18], we have thoroughly studied the AoI performance under sequence scheme when the period of schedule sequences, $L$, is equal to the status generating period, $T$. The results can be extended to the case where $T>L$. However, the case of $T<L$ is not covered by [18]. Therefore, in this paper, we concentrate on analyzing the AoI performance in the case of $T<L$, which is more challenging and requires different approaches. We conduct in-depth analysis on this case and develop a mathematical tool based on integer partitions to facilitate the analysis. We derive low-complexity closed-form expressions for two scenarios under $T<L$. Based on the obtained analytical results, we propose an algorithm to optimize the construction parameters of the sequence scheme. Finally, we compare our proposed sequence scheme with two commonly used baselines, and show that our proposed scheme outperforms the baselines in terms of AoI performance while consuming less energy.
- [224] arXiv:2411.09462 [pdf, html, other]
-
Title: SINETRA: a Versatile Framework for Evaluating Single Neuron Tracking in Behaving AnimalsComments: 5 pages, 3 figures, submitted at 2025 IEEE International Symposium on Biomedical Imaging (ISBI)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Accurately tracking neuronal activity in behaving animals presents significant challenges due to complex motions and background noise. The lack of annotated datasets limits the evaluation and improvement of such tracking algorithms. To address this, we developed SINETRA, a versatile simulator that generates synthetic tracking data for particles on a deformable background, closely mimicking live animal recordings. This simulator produces annotated 2D and 3D videos that reflect the intricate movements seen in behaving animals like Hydra Vulgaris. We evaluated four state-of-the-art tracking algorithms highlighting the current limitations of these methods in challenging scenarios and paving the way for improved cell tracking techniques in dynamic biological systems.
- [225] arXiv:2411.09463 [pdf, html, other]
-
Title: Teaching Program Decomposition in CS1: A Conceptual Framework for Improved Code QualityComments: SIGCSE 2025Subjects: Software Engineering (cs.SE)
Program decomposition is essential for developing maintainable and efficient software, yet it remains a challenging skill to teach and learn in introductory programming courses. What does program decomposition for procedural CS1 programs entail? How can CS1 students improve the decomposition of their programs? What scaffolded exercises can instructors use to teach program decomposition skills? We aim to answer all these questions by presenting a conceptual framework that (1) is grounded in the established code style principles, (2) provides a systematic approach that can be taught to students as an actionable strategy to improve the program decomposition of their programs, and (3) includes scaffolded exercises to be used in classroom activities. In addition, this systematic approach is automatable and can further be used to implement visualizers, automated feedback generators and digital tutors.
- [226] arXiv:2411.09467 [pdf, html, other]
-
Title: The Perceptions of Software Engineers Concerning the Utilization of Bots in the OSS Development Process: An Exploratory SurveyComments: 6 pages, 6 figures, 3 tablesSubjects: Software Engineering (cs.SE)
Software bots, extensively adopted by Open Source Software (OSS) projects, support developers across several activities, from automating predefined tasks to generating code that aids software engineers. However, with the growing prominence of bots, questions have emerged regarding the extension to which they truly assist or hinder software engineers in their routine tasks. To address this, an exploratory survey was conducted with 37 software engineers to gather insights into their views on the use of bots within the software development process. The findings suggest that bots are present across multiple phases of the software development lifecycle, providing daily support to professionals by enhancing productivity and facilitating task automation. Respondents stated that current bots are not sufficiently intelligent and raised new challenges and enhancements to aid bot designers in developing additional functionalities and integrations.
- [227] arXiv:2411.09468 [pdf, html, other]
-
Title: Harnessing Machine Learning for Single-Shot Measurement of Free Electron Laser Pulse PowerTill Korten (1), Vladimir Rybnikov (2), Mathias Vogt (2), Juliane Roensch-Schulenburg (2), Peter Steinbach (1), Najmeh Mirian (1) ((1) Helmholtz-Zentrum Dresden-Rossendorf HZDR, (2) Deutsches Elektronen-Synchrotron DESY)Comments: 10 pages, 4 figures, Machine Learning and the Physical Sciences Workshop, NeurIPS 2024 this https URLSubjects: Machine Learning (cs.LG); Accelerator Physics (physics.acc-ph)
Electron beam accelerators are essential in many scientific and technological fields. Their operation relies heavily on the stability and precision of the electron beam. Traditional diagnostic techniques encounter difficulties in addressing the complex and dynamic nature of electron beams. Particularly in the context of free-electron lasers (FELs), it is fundamentally impossible to measure the lasing-on and lasingoff electron power profiles for a single electron bunch. This is a crucial hurdle in the exact reconstruction of the photon pulse profile. To overcome this hurdle, we developed a machine learning model that predicts the temporal power profile of the electron bunch in the lasing-off regime using machine parameters that can be obtained when lasing is on. The model was statistically validated and showed superior predictions compared to the state-of-the-art batch calibrations. The work we present here is a critical element for a virtual pulse reconstruction diagnostic (VPRD) tool designed to reconstruct the power profile of individual photon pulses without requiring repeated measurements in the lasing-off regime. This promises to significantly enhance the diagnostic capabilities in FELs at large.
- [228] arXiv:2411.09471 [pdf, html, other]
-
Title: Renal Cell Carcinoma subtyping: learning from multi-resolution localizationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Renal Cell Carcinoma is typically asymptomatic at the early stages for many patients. This leads to a late diagnosis of the tumor, where the curability likelihood is lower, and makes the mortality rate of Renal Cell Carcinoma high, with respect to its incidence rate. To increase the survival chance, a fast and correct categorization of the tumor subtype is paramount. Nowadays, computerized methods, based on artificial intelligence, represent an interesting opportunity to improve the productivity and the objectivity of the microscopy-based Renal Cell Carcinoma diagnosis. Nonetheless, much of their exploitation is hampered by the paucity of annotated dataset, essential for a proficient training of supervised machine learning technologies. This study sets out to investigate a novel self supervised training strategy for machine learning diagnostic tools, based on the multi-resolution nature of the histological samples. We aim at reducing the need of annotated dataset, without significantly reducing the accuracy of the tool. We demonstrate the classification capability of our tool on a whole slide imaging dataset for Renal Cancer subtyping, and we compare our solution with several state-of-the-art classification counterparts.
- [229] arXiv:2411.09472 [pdf, html, other]
-
Title: An Algorithm for the Longest Common Subsequence and Substring Problem for Multiple StringsSubjects: Data Structures and Algorithms (cs.DS)
Let $X_1, X_2, ..., X_s$ and $Y_1, Y_2, ..., Y_t$ be strings over an alphabet $\Sigma$, where $s$ and $t$ are positive integers. The longest common subsequence and substring problem for multiple strings $X_1, X_2, ..., X_s$ and $Y_1, Y_2, ..., Y_t$ is to find the longest string which is a subsequence of $X_1, X_2, ..., X_s$ and a substring of $Y_1, Y_2, ..., Y_t$. In this paper, we propose an algorithm to solve the problem.
- [230] arXiv:2411.09473 [pdf, html, other]
-
Title: Enhancing Scalability and Performance in Influence Maximization with Optimized Parallel ProcessingHanjiang Wu, Huan Xu, Joongun Park, Jesmin Jahan Tithi, Fabio Checconi, Jordi Wolfson-Pou, Fabrizio Petrini, Tushar KrishnaSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
Influence Maximization (IM) is vital in viral marketing and biological network analysis for identifying key influencers. Given its NP-hard nature, approximate solutions are employed. This paper addresses scalability challenges in scale-out shared memory system by focusing on the state-of-the-art Influence Maximization via Martingales (IMM) benchmark. To enhance the work efficiency of the current IMM implementation, we propose EFFICIENTIMM with key strategies, including new parallelization scheme, NUMA-aware memory usage, dynamic load balancing and fine-grained adaptive data structures. Benchmarking on a 128-core CPU system with 8 NUMA nodes, EFFICIENTIMM demonstrated significant performance improvements, achieving an average 5.9x speedup over Ripples across 8 diverse SNAP datasets, when compared to the best execution times of the original Ripples framework. Additionally, on the Youtube graph, EFFICIENTIMM demonstrates a better memory access pattern with 357.4x reduction in L1+L2 cache misses as compared to Ripples.
- [231] arXiv:2411.09475 [pdf, html, other]
-
Title: ResidualDroppath: Enhancing Feature Reuse over Residual ConnectionsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Residual connections are one of the most important components in neural network architectures for mitigating the vanishing gradient problem and facilitating the training of much deeper networks. One possible explanation for how residual connections aid deeper network training is by promoting feature reuse. However, we identify and analyze the limitations of feature reuse with vanilla residual connections. To address these limitations, we propose modifications in training methods. Specifically, we provide an additional opportunity for the model to learn feature reuse with residual connections through two types of iterations during training. The first type of iteration involves using droppath, which enforces feature reuse by randomly dropping a subset of layers. The second type of iteration focuses on training the dropped parts of the model while freezing the undropped parts. As a result, the dropped parts learn in a way that encourages feature reuse, as the model relies on the undropped parts with feature reuse in mind. Overall, we demonstrated performance improvements in models with residual connections for image classification in certain cases.
- [232] arXiv:2411.09481 [pdf, other]
-
Title: What makes a good BIM design: quantitative linking between design behavior and qualitySubjects: Machine Learning (cs.LG)
In the Architecture Engineering & Construction (AEC) industry, how design behaviors impact design quality remains unclear. This study proposes a novel approach, which, for the first time, identifies and quantitatively describes the relationship between design behaviors and quality of design based on Building Information Modeling (BIM). Real-time collection and log mining are integrated to collect raw data of design behaviors. Feature engineering and various machine learning models are then utilized for quantitative modeling and interpretation. Results confirm an existing quantifiable relationship which can be learned by various models. The best-performing model using Extremely Random Trees achieved an R2 value of 0.88 on the test set. Behavioral features related to designer's skill level and changes of design intentions are identified to have significant impacts on design quality. These findings deepen our understanding of the design process and help forming BIM designs with better quality.
- [233] arXiv:2411.09484 [pdf, html, other]
-
Title: Image Matching Filtering and Refinement by Planes and BeyondComments: project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper introduces a modular, non-deep learning method for filtering and refining sparse correspondences in image matching. Assuming that motion flow within the scene can be approximated by local homography transformations, matches are aggregated into overlapping clusters corresponding to virtual planes using an iterative RANSAC-based approach, with non-conforming correspondences discarded. Moreover, the underlying planar structural design provides an explicit map between local patches associated with the matches, enabling optional refinement of keypoint positions through cross-correlation template matching after patch reprojection. Finally, to enhance robustness and fault-tolerance against violations of the piece-wise planar approximation assumption, a further strategy is designed for minimizing relative patch distortion in the plane reprojection by introducing an intermediate homography that projects both patches into a common plane. The proposed method is extensively evaluated on standard datasets and image matching pipelines, and compared with state-of-the-art approaches. Unlike other current comparisons, the proposed benchmark also takes into account the more general, real, and practical cases where camera intrinsics are unavailable. Experimental results demonstrate that our proposed non-deep learning, geometry-based approach achieves performances that are either superior to or on par with recent state-of-the-art deep learning methods. Finally, this study suggests that there are still development potential in actual image matching solutions in the considered research direction, which could be in the future incorporated in novel deep image matching architectures.
- [234] arXiv:2411.09485 [pdf, other]
-
Title: Exact Integration for singular Zienkiewicz and Guzman-Neilan Finite Elements with ImplementationSubjects: Numerical Analysis (math.NA)
We develop a recursive integration formula for a class of rational polynomials in 2D. Based on this, we present implementations of finite elements that have rational basis functions. Specifically, we provide simple Matlab implementations of the singular Zienkiewicz and the lowest-order Guzman-Neilan finite element in 2D.
- [235] arXiv:2411.09486 [pdf, other]
-
Title: An Approach to Twinning and Mining Collaborative Network of Construction ProjectsJournal-ref: Automation in Construction, 2021Subjects: Social and Information Networks (cs.SI)
Understanding complex collaboration processes is essential for the success of construction projects. However, there is still a lack of efficient methods for timely collection and analysis of collaborative networks. Therefore, an integrated framework consisting three parts, namely, system updating for data collection, data preprocessing, and social network analysis, is proposed for the twinning and mining collaborative network of a construction project. First, a system updating strategy for automatic data collection is introduced. Centrality measures are then utilized to identify key players, including hubs and brokers. Meanwhile, information sharing frequency (ISF) and association rule mining are introduced to discover collaborative patterns, that is, frequently collaborating users (FCUs) and associations between information flows and task levels. Finally, the proposed framework is validated and demonstrated in a large-scale project. The results show that key players, FCUs, and associations between information flows and task levels were successfully discovered, providing a deep understanding of collaboration and communication for decision-making processes. This research contributes to the body of knowledge by: 1) introducing ISF and Apriori-based association mining algorithm to identify FCUs and information flow patterns in collaboration; 2) establishing a new data-driven framework to map and analyze fine-grained collaborative networks automatically. It is also shown that people tend to form small groups to handle certain levels or types of tasks more efficiently. Other researchers and industrial practitioners may use this work as a foundation to further improve the efficiency of collaboration and communication.
- [236] arXiv:2411.09489 [pdf, other]
-
Title: Positive Focusing is Directly UsefulComments: Paper for the proceedings of MFPS 2024Subjects: Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
Recently, Miller and Wu introduced the positive $\lambda$-calculus, a call-by-value $\lambda$-calculus with sharing obtained by assigning proof terms to the positively polarized focused proofs for minimal intuitionistic logic. The positive $\lambda$-calculus stands out among $\lambda$-calculi with sharing for a compactness property related to the sharing of variables. We show that -- thanks to compactness -- the positive calculus neatly captures the core of useful sharing, a technique for the study of reasonable time cost models.
- [237] arXiv:2411.09492 [pdf, html, other]
-
Title: MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) excel in high-resource languages but face notable challenges in low-resource languages like Mongolian. This paper addresses these challenges by categorizing capabilities into language abilities (syntax and semantics) and cognitive abilities (knowledge and reasoning). To systematically evaluate these areas, we developed MM-Eval, a specialized dataset based on Modern Mongolian Language Textbook I and enriched with WebQSP and MGSM datasets.
Preliminary experiments on models including Qwen2-7B-Instruct, GLM4-9b-chat, Llama3.1-8B-Instruct, GPT-4, and DeepseekV2.5 revealed that: 1) all models performed better on syntactic tasks than semantic tasks, highlighting a gap in deeper language understanding; and 2) knowledge tasks showed a moderate decline, suggesting that models can transfer general knowledge from high-resource to low-resource contexts.
The release of MM-Eval, comprising 569 syntax, 677 semantics, 344 knowledge, and 250 reasoning tasks, offers valuable insights for advancing NLP and LLMs in low-resource languages like Mongolian. The dataset is available at this https URL. - [238] arXiv:2411.09493 [pdf, html, other]
-
Title: Strategic Sacrifice: Self-Organized Robot Swarm Localization for Inspection ProductivityComments: 14 pages, 10 figures, 17th International Symposium on Distributed Autonomous Robotic Systems (DARS'24)Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Robot swarms offer significant potential for inspecting diverse infrastructure, ranging from bridges to space stations. However, effective inspection requires accurate robot localization, which demands substantial computational resources and limits productivity. Inspired by biological systems, we introduce a novel cooperative localization mechanism that minimizes collective computation expenditure through self-organized sacrifice. Here, a few agents bear the computational burden of localization; through local interactions, they improve the inspection productivity of the swarm. Our approach adaptively maximizes inspection productivity for unconstrained trajectories in dynamic interaction and environmental settings. We demonstrate the optimality and robustness using mean-field analytical models, multi-agent simulations, and hardware experiments with metal climbing robots inspecting a 3D cylinder.
- [239] arXiv:2411.09495 [pdf, html, other]
-
Title: ISAC Super-Resolution Receiver via Lifted Atomic Norm MinimizationSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This paper introduces an off-the-grid estimator for integrated sensing and communication (ISAC) systems, utilizing lifted atomic norm minimization (LANM). The key challenge in this scenario is that neither the transmit signals nor the radar-and-communication channels are known. We prove that LANM can simultaneously achieve localization of radar targets and decoding of communication symbols, when the number of observations is proportional to the degrees of freedom in the ISAC systems. Despite the inherent ill-posed nature of the problem, we employ the lifting technique to initially encode the transmit signals. Then, we leverage the atomic norm to promote the structured low-rankness for the ISAC channel. We utilize a dual technique to transform the LANM into an infinite-dimensional search over the signal domain. Subsequently, we use semidefinite relaxation (SDR) to implement the dual problem.
We extend our approach to practical scenarios where received signals are contaminated by additive white Gaussian noise (AWGN) and jamming signals. Furthermore, we derive the computational complexity of the proposed estimator and demonstrate that it is equivalent to the conventional pilot-aided ANM for estimating the channel parameters. Our simulation experiments demonstrate the ability of the proposed LANM approach to estimate both communication data and target parameters with a performance comparable to traditional radar-only super-resolution techniques. - [240] arXiv:2411.09497 [pdf, html, other]
-
Title: The Use of Readability Metrics in Legal Text: A Systematic Literature ReviewSubjects: Computation and Language (cs.CL)
Understanding the text in legal documents can be challenging due to their complex structure and the inclusion of domain-specific jargon. Laws and regulations are often crafted in such a manner that engagement with them requires formal training, potentially leading to vastly different interpretations of the same texts. Linguistic complexity is an important contributor to the difficulties experienced by readers. Simplifying texts could enhance comprehension across a broader audience, not just among trained professionals. Various metrics have been developed to measure document readability. Therefore, we adopted a systematic review approach to examine the linguistic and readability metrics currently employed for legal and regulatory texts. A total of 3566 initial papers were screened, with 34 relevant studies found and further assessed. Our primary objective was to identify which current metrics were applied for evaluating readability within the legal field. Sixteen different metrics were identified, with the Flesch-Kincaid Grade Level being the most frequently used method. The majority of studies (73.5%) were found in the domain of "informed consent forms". From the analysis, it is clear that not all legal domains are well represented in terms of readability metrics and that there is a further need to develop more consensus on which metrics should be applied for legal documents.
- [241] arXiv:2411.09498 [pdf, html, other]
-
Title: Analysis and discretization of the Ohta-Kawasaki equation with forcing and degenerate mobilitySubjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)
The Ohta-Kawasaki equation models the mesoscopic phase separation of immiscible polymer chains that form diblock copolymers, with applications in directed self-assembly for lithography. We perform a mathematical analysis of this model under degenerate mobility and an external force, proving the existence of weak solutions via an approximation scheme for the mobility function. Additionally, we propose a fully discrete scheme for the system and demonstrate the existence and uniqueness of its discrete solution, showing that it inherits essential structural-preserving properties. Finally, we conduct numerical experiments to compare the Ohta-Kawasaki system with the classical Cahn-Hilliard model, highlighting the impact of the repulsion parameter on the phase separation dynamics.
- [242] arXiv:2411.09499 [pdf, html, other]
-
Title: Developement of Reinforcement Learning based Optimisation Method for Side-Sill DesignSubjects: Machine Learning (cs.LG)
Optimisation for crashworthiness is a critical part of the vehicle development process. Due to stringent regulations and increasing market demands, multiple factors must be considered within a limited timeframe. However, for optimal crashworthiness design, multiobjective optimisation is necessary, and for complex parts, multiple design parameters must be evaluated. This crashworthiness analysis requires computationally intensive finite element simulations. This challenge leads to the need for inverse multi-parameter multi-objective optimisation. This challenge leads to the need for multi-parameter, multi-objective inverse optimisation. This article investigates a machine learning-based method for this type of optimisation, focusing on the design optimisation of a multi-cell side sill to improve crashworthiness results. Furthermore, the optimiser is coupled with an FE solver to achieve improved results.
- [243] arXiv:2411.09502 [pdf, html, other]
-
Title: Golden Noise for Diffusion Models: A Learning FrameworkSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Text-to-image diffusion model is a popular paradigm that synthesizes personalized images by providing a text prompt and a random Gaussian noise. While people observe that some noises are ``golden noises'' that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those golden noises. To learn golden noises for diffusion sampling, we mainly make three contributions in this paper. First, we identify a new concept termed the \textit{noise prompt}, which aims at turning a random Gaussian noise into a golden noise by adding a small desirable perturbation derived from the text prompt. Following the concept, we first formulate the \textit{noise prompt learning} framework that systematically learns ``prompted'' golden noise associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale \textit{noise prompt dataset}~(NPD) that contains 100k pairs of random noises and golden noises with the associated text prompts. With the prepared NPD as the training dataset, we trained a small \textit{noise prompt network}~(NPNet) that can directly learn to transform a random noise into a golden noise. The learned golden noise perturbation can be considered as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the impressive effectiveness and generalization of NPNet on improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational costs, as it just provides a golden noise instead of a random noise without accessing the original pipeline.
- [244] arXiv:2411.09504 [pdf, html, other]
-
Title: Asymptotic Analysis of IMEX-RK Methods for ES-BGK Model at Navier-Stokes levelSubjects: Numerical Analysis (math.NA)
Implicit-explicit Runge-Kutta (IMEX-RK) time discretization methods are very popular when solving stiff kinetic equations. In [21], an asymptotic analysis shows that a specific class of high-order IMEX-RK schemes can accurately capture the Navier-Stokes limit without needing to resolve the small scales dictated by the Knudsen number. In this work, we extend the asymptotic analysis to general IMEX-RK schemes, known in literature as Type I and Type II. We further suggest some IMEX-RK methods developed in the literature to attain uniform accuracy in the wide range of Knudsen numbers. Several numerical examples are presented to verify the validity of the obtained theoretical results and the effectiveness of the methods.
- [245] arXiv:2411.09507 [pdf, other]
-
Title: Toward a Cohesive AI and Simulation Software Ecosystem for Scientific InnovationComments: 5 pagesSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
In this paper, we discuss the need for an integrated software stack that unites artificial intelligence (AI) and modeling and simulation (ModSim) tools to advance scientific discovery. The authors advocate for a unified AI/ModSim software ecosystem that ensures compatibility across a wide range of software on diverse high-performance computing systems, promoting ease of deployment, version management, and binary distribution. Key challenges highlighted include balancing the distinct needs of AI and ModSim, especially in terms of software build practices, dependency management, and compatibility. The document underscores the importance of continuous integration, community-driven stewardship, and collaboration with the Department of Energy (DOE) to develop a portable and cohesive scientific software ecosystem. Recommendations focus on supporting standardized environments through initiatives like the Extreme-scale Scientific Software Stack (E4S) and Spack to foster interdisciplinary innovation and facilitate new scientific advancements.
- [246] arXiv:2411.09509 [pdf, html, other]
-
Title: Enhanced HLLEM and HLL-CPS schemes for all Mach number flows based using anti-diffusion coefficientsSubjects: Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
This paper compares the HLLEM and HLL-CPS schemes for Euler equations and proposes improvements for all Mach number flows. Enhancements to the HLLEM scheme involve adding anti-diffusion terms in the face normal direction and modifying anti-diffusion coefficients for linearly degenerate waves near shocks. The HLL-CPS scheme is improved by adjusting anti-diffusion coefficients for the face normal direction and linearly degenerate waves. Matrix stability, linear perturbation, and asymptotic analyses demonstrate the robustness of the proposed schemes and their ability to capture low Mach flow features. Numerical tests confirm that the schemes are free from shock instabilities at high speeds and accurately resolve low Mach number flow features.
- [247] arXiv:2411.09510 [pdf, html, other]
-
Title: Communication Compression for Tensor Parallel LLM InferenceSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large Language Models (LLMs) have pushed the frontier of artificial intelligence but are comprised of hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators through various Model Parallelism strategies. Our paper looks into the details on one such strategy - Tensor Parallel - and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine grained quantization techniques to compress selected activations by 3.5 - 4.5x. Our proposed method leads up to 2x reduction of time-to-first-token (TTFT) with negligible model performance degradation.
- [248] arXiv:2411.09517 [pdf, html, other]
-
Title: Randomized Truthful Auctions with Learning AgentsSubjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
We study a setting where agents use no-regret learning algorithms to participate in repeated auctions. \citet{kolumbus2022auctions} showed, rather surprisingly, that when bidders participate in second-price auctions using no-regret bidding algorithms, no matter how large the number of interactions $T$ is, the runner-up bidder may not converge to bidding truthfully. Our first result shows that this holds for \emph{general deterministic} truthful auctions. We also show that the ratio of the learning rates of the bidders can \emph{qualitatively} affect the convergence of the bidders. Next, we consider the problem of revenue maximization in this environment. In the setting with fully rational bidders, \citet{myerson1981optimal} showed that revenue can be maximized by using a second-price auction with this http URL show that, in stark contrast, in our setting with learning bidders, \emph{randomized} auctions can have strictly better revenue guarantees than second-price auctions with reserves, when $T$ is large enough. Finally, we study revenue maximization in the non-asymptotic regime. We define a notion of {\em auctioneer regret} comparing the revenue generated to the revenue of a second price auction with truthful bids. When the auctioneer has to use the same auction throughout the interaction, we show an (almost) tight regret bound of $\smash{\widetilde \Theta(T^{3/4})}.$ If the auctioneer can change auctions during the interaction, but in a way that is oblivious to the bids, we show an (almost) tight bound of $\smash{\widetilde \Theta(\sqrt{T})}.$
- [249] arXiv:2411.09523 [pdf, html, other]
-
Title: Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based AgentsYuyou Gan, Yong Yang, Zhe Ma, Ping He, Rui Zeng, Yiming Wang, Qingming Li, Chunyi Zhou, Songze Li, Ting Wang, Yunjun Gao, Yingcai Wu, Shouling JiSubjects: Artificial Intelligence (cs.AI)
With the continuous development of large language models (LLMs), transformer-based models have made groundbreaking advances in numerous natural language processing (NLP) tasks, leading to the emergence of a series of agents that use LLMs as their control hub. While LLMs have achieved success in various tasks, they face numerous security and privacy threats, which become even more severe in the agent scenarios. To enhance the reliability of LLM-based applications, a range of research has emerged to assess and mitigate these risks from different perspectives.
To help researchers gain a comprehensive understanding of various risks, this survey collects and analyzes the different threats faced by these agents. To address the challenges posed by previous taxonomies in handling cross-module and cross-stage threats, we propose a novel taxonomy framework based on the sources and impacts. Additionally, we identify six key features of LLM-based agents, based on which we summarize the current research progress and analyze their limitations. Subsequently, we select four representative agents as case studies to analyze the risks they may face in practical use. Finally, based on the aforementioned analyses, we propose future research directions from the perspectives of data, methodology, and policy, respectively. - [250] arXiv:2411.09524 [pdf, html, other]
-
Title: FlowNav: Learning Efficient Navigation Policies via Conditional Flow MatchingComments: Accepted at CoRL 2024 workshop on Learning Effective Abstractions for Planning (LEAP) and workshop on Differentiable Optimization Everywhere: Simulation, Estimation, Learning, and Control. 7 pages + 2 pages of references, 7 figuresSubjects: Robotics (cs.RO)
Effective robot navigation in dynamic environments is a challenging task that depends on generating precise control actions at high frequencies. Recent advancements have framed navigation as a goal-conditioned control problem. Current state-of-the-art methods for goal-based navigation, such as diffusion policies, either generate sub-goal images or robot control actions to guide robots. However, despite their high accuracy, these methods incur substantial computational costs, which limits their practicality for real-time applications. Recently, Conditional Flow Matching(CFM) has emerged as a more efficient and robust generalization of diffusion. In this work we explore the use of CFM to learn action policies that help the robot navigate its environment. Our results demonstrate that CFM is able to generate highly accurate robot actions. CFM not only matches the accuracy of diffusion policies but also significantly improves runtime performance. This makes it particularly advantageous for real-time robot navigation, where swift, reliable action generation is vital for collision avoidance and smooth operation. By leveraging CFM, we provide a pathway to more scalable, responsive robot navigation systems capable of handling the demands of dynamic and unpredictable environments.
- [251] arXiv:2411.09525 [pdf, html, other]
-
Title: Data-driven parameterization refinement for the structural optimization of cruise ship hullsSubjects: Numerical Analysis (math.NA)
In this work, we focus on the early design phase of cruise ship hulls, where the designers are tasked with ensuring the structural resilience of the ship against extreme waves while reducing steel usage and respecting safety and manufacturing constraints. The ship's geometry is already finalized and the designer can choose the thickness of the primary structural elements, such as decks, bulkheads, and the shell. Reduced order modeling and black-box optimization techniques reduce the use of expensive finite element analysis to only validate the most promising configurations, thanks to the efficient exploration of the domain of decision variables. However, the quality of the results heavily relies on the problem formulation, and on how the structural elements are assigned to the decision variables. A parameterization that does not capture well the stress configuration of the model prevents the optimization procedure from achieving the most efficient allocation of the steel. To address this issue, we extended an existing pipeline for the structural optimization of cruise ships developed in collaboration with Fincantieri S.p.A. with a novel data-driven reparameterization procedure, based on the optimization of a series of sub-problems. Moreover, we implemented a multi-objective optimization module to provide the designers with insights into the efficient trade-offs between competing quantities of interest and enhanced the single-objective Bayesian optimization module. The new pipeline is tested on a simplified midship section and a full ship hull, comparing the automated reparameterization to a baseline model provided by the designers. The tests show that the iterative refinement outperforms the baseline on the more complex hull, proving that the pipeline streamlines the initial design phase, and helps the designers tackle more innovative projects.
- [252] arXiv:2411.09531 [pdf, other]
-
Title: Efficient top-down updates in AVL treesSubjects: Data Structures and Algorithms (cs.DS)
Since AVL trees were invented in 1962, two major open questions about rebalancing operations, which found positive answers in other balanced binary search trees, were left open: can these operations be performed top-down (with a fixed look-ahead), and can they use an amortised constant number of write operations per update? We propose an algorithm that answers both questions positively.
- [253] arXiv:2411.09538 [pdf, html, other]
-
Title: Marker-free Human Gait Analysis using a Smart Edge Sensor SystemComments: accepted for SII 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
The human gait is a complex interplay between the neuronal and the muscular systems, reflecting an individual's neurological and physiological condition. This makes gait analysis a valuable tool for biomechanics and medical experts. Traditional observational gait analysis is cost-effective but lacks reliability and accuracy, while instrumented gait analysis, particularly using marker-based optical systems, provides accurate data but is expensive and time-consuming. In this paper, we introduce a novel markerless approach for gait analysis using a multi-camera setup with smart edge sensors to estimate 3D body poses without fiducial markers. We propose a Siamese embedding network with triplet loss calculation to identify individuals by their gait pattern. This network effectively maps gait sequences to an embedding space that enables clustering sequences from the same individual or activity closely together while separating those of different ones. Our results demonstrate the potential of the proposed system for efficient automated gait analysis in diverse real-world environments, facilitating a wide range of applications.
- [254] arXiv:2411.09539 [pdf, html, other]
-
Title: A Practical Guide to Fine-tuning Language Models with Limited DataSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Employing pre-trained Large Language Models (LLMs) has become the de facto standard in Natural Language Processing (NLP) despite their extensive data requirements. Motivated by the recent surge in research focused on training LLMs with limited data, particularly in low-resource domains and languages, this paper surveys recent transfer learning approaches to optimize model performance in downstream tasks where data is scarce. We first address initial and continued pre-training strategies to better leverage prior knowledge in unseen domains and languages. We then examine how to maximize the utility of limited data during fine-tuning and few-shot learning. The final section takes a task-specific perspective, reviewing models and methods suited for different levels of data scarcity. Our goal is to provide practitioners with practical guidelines for overcoming the challenges posed by constrained data while also highlighting promising directions for future research.
- [255] arXiv:2411.09540 [pdf, html, other]
-
Title: Prompting the Unseen: Detecting Hidden Backdoors in Black-Box ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Visual prompting (VP) is a new technique that adapts well-trained frozen models for source domain tasks to target domain tasks. This study examines VP's benefits for black-box model-level backdoor detection. The visual prompt in VP maps class subspaces between source and target domains. We identify a misalignment, termed class subspace inconsistency, between clean and poisoned datasets. Based on this, we introduce \textsc{BProm}, a black-box model-level detection method to identify backdoors in suspicious models, if any. \textsc{BProm} leverages the low classification accuracy of prompted models when backdoors are present. Extensive experiments confirm \textsc{BProm}'s effectiveness.
- [256] arXiv:2411.09543 [pdf, other]
-
Title: OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory CouplingXiaoling Yi, Ryan Antonio, Joren Dumoulin, Jiacong Sun, Josse Van Delm, Guilherme Paim, Marian VerhelstSubjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system level efficiency and low utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58x to 16.40x speedup on normalized throughput across a wide variety ofGeMM workloads, while achieving 4.68 TOPS/W system efficiency.
- [257] arXiv:2411.09546 [pdf, html, other]
-
Title: Architectural Exploration of Application-Specific Resonant SRAM Compute-in-Memory (rCiM)Subjects: Hardware Architecture (cs.AR); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
While general-purpose computing follows Von Neumann's architecture, the data movement between memory and processor elements dictates the processor's performance. The evolving compute-in-memory (CiM) paradigm tackles this issue by facilitating simultaneous processing and storage within static random-access memory (SRAM) elements. Numerous design decisions taken at different levels of hierarchy affect the figure of merits (FoMs) of SRAM, such as power, performance, area, and yield. The absence of a rapid assessment mechanism for the impact of changes at different hierarchy levels on global FoMs poses a challenge to accurately evaluating innovative SRAM designs. This paper presents an automation tool designed to optimize the energy and latency of SRAM designs incorporating diverse implementation strategies for executing logic operations within the SRAM. The tool structure allows easy comparison across different array topologies and various design strategies to result in energy-efficient implementations. Our study involves a comprehensive comparison of over 6900+ distinct design implementation strategies for EPFL combinational benchmark circuits on the energy-recycling resonant compute-in-memory (rCiM) architecture designed using TSMC 28 nm technology. When provided with a combinational circuit, the tool aims to generate an energy-efficient implementation strategy tailored to the specified input memory and latency constraints. The tool reduces 80.9% of energy consumption on average across all benchmarks while using the six-topology implementation compared to baseline implementation of single-macro topology by considering the parallel processing capability of rCiM cache size ranging from 4KB to 192KB.
- [258] arXiv:2411.09547 [pdf, html, other]
-
Title: Piecing It All Together: Verifying Multi-Hop Multimodal ClaimsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 16k multi-hop claims paired with multimodal evidence, generated and refined using large language models, with additional input from human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV. We hope this dataset and its evaluation task will encourage future research in multimodal multi-hop claim verification.
- [259] arXiv:2411.09550 [pdf, html, other]
-
Title: A small-gain criterion for 2-contraction of large scale interconnected systemsSubjects: Systems and Control (eess.SY)
Despite modular conditions to guarantee stability for large-scale systems have been widely studied, few methods are available to tackle the case of networks with multiple equilibria. This paper introduces small-gain like sufficient conditions for 2-contraction of large-scale interconnected systems on the basis of a family of upper-bounds to the $L_2$ gains that arise from the gains computed on individual channels of the second additive variational equation. Such a condition guarantee the 2-additive compound of the system's Jacobian to be exponentially contractive, thus implying convergence towards equilibria of the system's solutions. The gains are obtained by solving suitable Linear Matrix Inequalities. Three interconnected Thomas' systems are considered in order to illustrate the application of the theory and the degree of conservatism.
- [260] arXiv:2411.09551 [pdf, html, other]
-
Title: MFTIQ: Multi-Flow Tracker with Independent Matching Quality EstimationComments: accepted to WACV 2025Subjects: Computer Vision and Pattern Recognition (cs.CV)
In this work, we present MFTIQ, a novel dense long-term tracking model that advances the Multi-Flow Tracker (MFT) framework to address challenges in point-level visual tracking in video sequences. MFTIQ builds upon the flow-chaining concepts of MFT, integrating an Independent Quality (IQ) module that separates correspondence quality estimation from optical flow computations. This decoupling significantly enhances the accuracy and flexibility of the tracking process, allowing MFTIQ to maintain reliable trajectory predictions even in scenarios of prolonged occlusions and complex dynamics. Designed to be "plug-and-play", MFTIQ can be employed with any off-the-shelf optical flow method without the need for fine-tuning or architectural modifications. Experimental validations on the TAP-Vid Davis dataset show that MFTIQ with RoMa optical flow not only surpasses MFT but also performs comparably to state-of-the-art trackers while having substantially faster processing speed. Code and models available at this https URL .
- [261] arXiv:2411.09552 [pdf, html, other]
-
Title: Faster Differentially Private Top-$k$ Selection: A Joint Exponential Mechanism with PruningComments: NeurIPS 2024Subjects: Cryptography and Security (cs.CR)
We study the differentially private top-$k$ selection problem, aiming to identify a sequence of $k$ items with approximately the highest scores from $d$ items. Recent work by Gillenwater et al. (ICML '22) employs a direct sampling approach from the vast collection of $d^{\,\Theta(k)}$ possible length-$k$ sequences, showing superior empirical accuracy compared to previous pure or approximate differentially private methods. Their algorithm has a time and space complexity of $\tilde{O}(dk)$.
In this paper, we present an improved algorithm with time and space complexity $O(d + k^2 / \epsilon \cdot \ln d)$, where $\epsilon$ denotes the privacy parameter. Experimental results show that our algorithm runs orders of magnitude faster than their approach, while achieving similar empirical accuracy. - [262] arXiv:2411.09553 [pdf, html, other]
-
Title: OOD-SEG: Out-Of-Distribution detection for image SEGmentation with sparse multi-class positive-only annotationsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Despite significant advancements, segmentation based on deep neural networks in medical and surgical imaging faces several challenges, two of which we aim to address in this work. First, acquiring complete pixel-level segmentation labels for medical images is time-consuming and requires domain expertise. Second, typical segmentation pipelines cannot detect out-of-distribution (OOD) pixels, leaving them prone to spurious outputs during deployment. In this work, we propose a novel segmentation approach exploiting OOD detection that learns only from sparsely annotated pixels from multiple positive-only classes. %but \emph{no background class} annotation. These multi-class positive annotations naturally fall within the in-distribution (ID) set. Unlabelled pixels may contain positive classes but also negative ones, including what is typically referred to as \emph{background} in standard segmentation formulations. Here, we forgo the need for background annotation and consider these together with any other unseen classes as part of the OOD set. Our framework can integrate, at a pixel-level, any OOD detection approaches designed for classification tasks. To address the lack of existing OOD datasets and established evaluation metric for medical image segmentation, we propose a cross-validation strategy that treats held-out labelled classes as OOD. Extensive experiments on both multi-class hyperspectral and RGB surgical imaging datasets demonstrate the robustness and generalisation capability of our proposed framework.
- [263] arXiv:2411.09555 [pdf, html, other]
-
Title: Image Processing for Motion MagnificationNadaniela Egidi, Josephin Giacomini, Paolo Leonesi, Pierluigi Maponi, Federico Mearelli, Edin TrebovicSubjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Motion Magnification (MM) is a collection of relative recent techniques within the realm of Image Processing. The main motivation of introducing these techniques in to support the human visual system to capture relevant displacements of an object of interest; these motions can be in object color and in object location. In fact, the goal is to opportunely process a video sequence to obtain as output a new video in which motions are magnified and visible to the viewer. We propose a numerical technique using the Phase-Based Motion Magnification which analyses the video sequence in the Fourier Domain and rely on the Fourier Shifting Property. We describe the mathematical foundation of this method and the corresponding implementation in a numerical algorithm. We present preliminary experiments, focusing on some basic test made up using synthetic images.
- [264] arXiv:2411.09558 [pdf, html, other]
-
Title: Adaptive Deviation Learning for Visual Anomaly Detection with Data ContaminationComments: Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Visual anomaly detection targets to detect images that notably differ from normal pattern, and it has found extensive application in identifying defective parts within the manufacturing industry. These anomaly detection paradigms predominantly focus on training detection models using only clean, unlabeled normal samples, assuming an absence of contamination; a condition often unmet in real-world scenarios. The performance of these methods significantly depends on the quality of the data and usually decreases when exposed to noise. We introduce a systematic adaptive method that employs deviation learning to compute anomaly scores end-to-end while addressing data contamination by assigning relative importance to the weights of individual instances. In this approach, the anomaly scores for normal instances are designed to approximate scalar scores obtained from the known prior distribution. Meanwhile, anomaly scores for anomaly examples are adjusted to exhibit statistically significant deviations from these reference scores. Our approach incorporates a constrained optimization problem within the deviation learning framework to update instance weights, resolving this problem for each mini-batch. Comprehensive experiments on the MVTec and VisA benchmark datasets indicate that our proposed method surpasses competing techniques and exhibits both stability and robustness in the presence of data contamination.
- [265] arXiv:2411.09565 [pdf, html, other]
-
Title: Vlimb: A Wire-Driven Wearable Robot for Bodily Extension, Balancing Powerfulness and ReachabilityShogo Sawaguchi, Temma Suzuki, Akihiro Miki, Kento Kawaharazuka, Sota Yuzaki, Shunnosuke Yoshimura, Yoshimoto Ribayashi, Kei Okada, Masayuki InabaSubjects: Robotics (cs.RO)
Numerous wearable robots have been developed to meet the demands of physical assistance and entertainment. These wearable robots range from body-enhancing types that assist human arms and legs to body-extending types that have extra arms. This study focuses specifically on wearable robots of the latter category, aimed at bodily extension. However, they have not yet achieved the level of powerfulness and reachability equivalent to that of human limbs, limiting their application to entertainment and manipulation tasks involving lightweight objects. Therefore, in this study, we develop an body-extending wearable robot, Vlimb, which has enough powerfulness to lift a human and can perform manipulation. Leveraging the advantages of tendon-driven mechanisms, Vlimb incorporates a wire routing mechanism capable of accommodating both delicate manipulations and robust lifting tasks. Moreover, by introducing a passive ring structure to overcome the limited reachability inherent in tendon-driven mechanisms, Vlimb achieves both the powerfulness and reachability comparable to that of humans. This paper outlines the design methodology of Vlimb, conducts preliminary manipulation and lifting tasks, and verifies its effectiveness.
- [266] arXiv:2411.09567 [pdf, html, other]
-
Title: VPBSD:Vessel-Pattern-Based Semi-Supervised Distillation for Efficient 3D Microscopic Cerebrovascular SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
3D microscopic cerebrovascular images are characterized by their high resolution, presenting significant annotation challenges, large data volumes, and intricate variations in detail. Together, these factors make achieving high-quality, efficient whole-brain segmentation particularly demanding. In this paper, we propose a novel Vessel-Pattern-Based Semi-Supervised Distillation pipeline (VpbSD) to address the challenges of 3D microscopic cerebrovascular segmentation. This pipeline initially constructs a vessel-pattern codebook that captures diverse vascular structures from unlabeled data during the teacher model's pretraining phase. In the knowledge distillation stage, the codebook facilitates the transfer of rich knowledge from a heterogeneous teacher model to a student model, while the semi-supervised approach further enhances the student model's exposure to diverse learning samples. Experimental results on real-world data, including comparisons with state-of-the-art methods and ablation studies, demonstrate that our pipeline and its individual components effectively address the challenges inherent in microscopic cerebrovascular segmentation.
- [267] arXiv:2411.09572 [pdf, html, other]
-
Title: Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact RepresentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
We present ViTaM-D, a novel visual-tactile framework for dynamic hand-object interaction reconstruction, integrating distributed tactile sensing for more accurate contact modeling. While existing methods focus primarily on visual inputs, they struggle with capturing detailed contact interactions such as object deformation. Our approach leverages distributed tactile sensors to address this limitation by introducing DF-Field. This distributed force-aware contact representation models both kinetic and potential energy in hand-object interaction. ViTaM-D first reconstructs hand-object interactions using a visual-only network, VDT-Net, and then refines contact details through a force-aware optimization (FO) process, enhancing object deformation modeling. To benchmark our approach, we introduce the HOT dataset, which features 600 sequences of hand-object interactions, including deformable objects, built in a high-precision simulation environment. Extensive experiments on both the DexYCB and HOT datasets demonstrate significant improvements in accuracy over previous state-of-the-art methods such as gSDF and HOTrack. Our results highlight the superior performance of ViTaM-D in both rigid and deformable object reconstruction, as well as the effectiveness of DF-Field in refining hand poses. This work offers a comprehensive solution to dynamic hand-object interaction reconstruction by seamlessly integrating visual and tactile data. Codes, models, and datasets will be available.
- [268] arXiv:2411.09576 [pdf, html, other]
-
Title: Automating Reformulation of Essence Specifications via Graph RewritingComments: Presented at the PTHG 2024 workshopSubjects: Artificial Intelligence (cs.AI)
Formulating an effective constraint model of a parameterised problem class is crucial to the efficiency with which instances of the class can subsequently be solved. It is difficult to know beforehand which of a set of candidate models will perform best in practice. This paper presents a system that employs graph rewriting to reformulate an input model for improved performance automatically. By situating our work in the Essence abstract constraint specification language, we can use the structure in its high level variable types to trigger rewrites directly. We implement our system via rewrite rules expressed in the Graph Programs 2 language, applied to the abstract syntax tree of an input specification. We show how to automatically translate the solution of the reformulated problem into a solution of the original problem for verification and presentation. We demonstrate the efficacy of our system with a detailed case study.
- [269] arXiv:2411.09577 [pdf, html, other]
-
Title: SimTube: Generating Simulated Video Comments through Multimodal AI and User PersonasYu-Kai Hung (1), Yun-Chien Huang (1), Ting-Yu Su (1), Yen-Ting Lin (1), Lung-Pan Cheng (1), Bryan Wang (2), Shao-Hua Sun (1) ((1) National Taiwan University, (2) Adobe Research)Subjects: Human-Computer Interaction (cs.HC)
Audience feedback is crucial for refining video content, yet it typically comes after publication, limiting creators' ability to make timely adjustments. To bridge this gap, we introduce SimTube, a generative AI system designed to simulate audience feedback in the form of video comments before a video's release. SimTube features a computational pipeline that integrates multimodal data from the video-such as visuals, audio, and metadata-with user personas derived from a broad and diverse corpus of audience demographics, generating varied and contextually relevant feedback. Furthermore, the system's UI allows creators to explore and customize the simulated comments. Through a comprehensive evaluation-comprising quantitative analysis, crowd-sourced assessments, and qualitative user studies-we show that SimTube's generated comments are not only relevant, believable, and diverse but often more detailed and informative than actual audience comments, highlighting its potential to help creators refine their content before release.
- [270] arXiv:2411.09580 [pdf, html, other]
-
Title: Software Performance Engineering for Foundation Model-Powered Software (FMware)Haoxiang Zhang, Shi Chang, Arthur Leung, Kishanthan Thangarajah, Boyuan Chen, Hanan Lutfiyya, Ahmed E. HassanSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The rise of Foundation Models (FMs) like Large Language Models (LLMs) is revolutionizing software development. Despite the impressive prototypes, transforming FMware into production-ready products demands complex engineering across various domains. A critical but overlooked aspect is performance engineering, which aims at ensuring FMware meets performance goals such as throughput and latency to avoid user dissatisfaction and financial loss. Often, performance considerations are an afterthought, leading to costly optimization efforts post-deployment. FMware's high computational resource demands highlight the need for efficient hardware use. Continuous performance engineering is essential to prevent degradation. This paper highlights the significance of Software Performance Engineering (SPE) in FMware, identifying four key challenges: cognitive architecture design, communication protocols, tuning and optimization, and deployment. These challenges are based on literature surveys and experiences from developing an in-house FMware system. We discuss problems, current practices, and innovative paths for the software engineering community.
- [271] arXiv:2411.09582 [pdf, html, other]
-
Title: Safety Filter for Robust Disturbance Rejection via Online OptimizationComments: Submitted to the 2025 European Control Conference. This paper builds on the work done in arXiv:2405.07037Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
Disturbance rejection in high-precision control applications can be significantly improved upon via online convex optimization (OCO). This includes classical techniques such as recursive least squares (RLS) and more recent, regret-based formulations. However, these methods can cause instabilities in the presence of model uncertainty. This paper introduces a safety filter for systems with OCO in the form of adaptive finite impulse response (FIR) filtering to ensure robust disturbance rejection. The safety filter enforces a robust stability constraint on the FIR coefficients while minimally altering the OCO command in the $\infty$-norm cost. Additionally, we show that the induced $\ell_\infty$-norm allows for easy online implementation of the safety filter by directly limiting the OCO command. The constraint can be tuned to trade off robustness and performance. We provide a simple example to demonstrate the safety filter.
- [272] arXiv:2411.09583 [pdf, other]
-
Title: A Nonuniform Fast Hankel TransformSubjects: Numerical Analysis (math.NA)
We describe a fast algorithm for computing discrete Hankel transforms of moderate orders from $n$ nonuniform points to $m$ nonuniform frequencies in $O((m+n)\log\min(n,m))$ operations. Our approach combines local and asymptotic Bessel function expansions with nonuniform fast Fourier transforms. The order of each expansion is adjusted automatically according to error analysis to obtain any desired precision $\varepsilon$. Several numerical examples are provided which demonstrate the speed and accuracy of the algorithm in multiple regimes and applications.
- [273] arXiv:2411.09584 [pdf, html, other]
-
Title: A Sylvester equation approach for the computation of zero-group-velocity points in waveguidesComments: 18 pagesSubjects: Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Eigenvalues of parameter-dependent quadratic eigenvalue problems form eigencurves. The critical points on these curves, where the derivative vanishes, are of practical interest.
A particular example is found in the dispersion curves of elastic waveguides, where such points are called zero-group-velocity (ZGV) points. Recently, it was revealed that the problem of computing ZGV points can be modeled as a multiparameter eigenvalue problem (MEP), and several numerical methods were devised. Due to their complexity, these methods are feasible only for problems involving small matrices. In this paper, we improve the efficiency of these methods by exploiting the link to the Sylvester equation. This approach enables the computation of ZGV points for problems with much larger matrices, such as multi-layered plates and three-dimensional structures of complex cross-sections. - [274] arXiv:2411.09585 [pdf, html, other]
-
Title: Backdoor Mitigation by Distance-Driven DetoxificationComments: Preprint versionSubjects: Cryptography and Security (cs.CR)
Backdoor attacks undermine the integrity of machine learning models by allowing attackers to manipulate predictions using poisoned training data. Such attacks lead to targeted misclassification when specific triggers are present, while the model behaves normally under other conditions. This paper considers a post-training backdoor defense task, aiming to detoxify the backdoors in pre-trained models. We begin by analyzing the underlying issues of vanilla fine-tuning and observe that it is often trapped in regions with low loss for both clean and poisoned samples. Motivated by such observations, we propose Distance-Driven Detoxification (D3), an innovative approach that reformulates backdoor defense as a constrained optimization problem. Specifically, D3 promotes the model's departure from the vicinity of its initial weights, effectively reducing the influence of backdoors. Extensive experiments on state-of-the-art (SOTA) backdoor attacks across various model architectures and datasets demonstrate that D3 not only matches but often surpasses the performance of existing SOTA post-training defense techniques.
- [275] arXiv:2411.09587 [pdf, html, other]
-
Title: BabyLM Challenge: Exploring the Effect of Variation Sets on Language Model Training EfficiencyComments: This paper accepted BabyLM challenge 2024 at CONLL 2024Subjects: Computation and Language (cs.CL)
While current large language models have achieved a remarkable success, their data efficiency remains a challenge to overcome. Recently it has been suggested that child-directed speech (CDS) can improve training data efficiency of modern language models based on Transformer neural networks. However, it is not yet understood which specific properties of CDS are effective for training these models. In the context of the BabyLM Challenge, we focus on Variation Sets (VSs), sets of consecutive utterances expressing a similar intent with slightly different words and structures, which are ubiquitous in CDS. To assess the impact of VSs on training data efficiency, we augment CDS data with different proportions of artificial VSs and use these datasets to train an auto-regressive model, GPT-2. We find that the best proportion of VSs depends on the evaluation benchmark: BLiMP and GLUE scores benefit from the presence of VSs, but EWOK scores do not. Additionally, the results vary depending on multiple factors such as the number of epochs and the order of utterance presentation. Taken together, these findings suggest that VSs can have a beneficial influence on language models, while leaving room for further investigation.
- [276] arXiv:2411.09590 [pdf, other]
-
Title: Adopting RAG for LLM-Aided Future Vehicle DesignComments: Conference paper accepted in IEEE FLLM 2024Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
In this paper, we explore the integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to enhance automated design and software development in the automotive industry. We present two case studies: a standardization compliance chatbot and a design copilot, both utilizing RAG to provide accurate, context-aware responses. We evaluate four LLMs-GPT-4o, LLAMA3, Mistral, and Mixtral -- comparing their answering accuracy and execution time. Our results demonstrate that while GPT-4 offers superior performance, LLAMA3 and Mistral also show promising capabilities for local deployment, addressing data privacy concerns in automotive applications. This study highlights the potential of RAG-augmented LLMs in improving design workflows and compliance in automotive engineering.
- [277] arXiv:2411.09591 [pdf, html, other]
-
Title: Expert Study on Interpretable Machine Learning Models with Missing DataComments: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 13 pagesSubjects: Machine Learning (cs.LG)
Inherently interpretable machine learning (IML) models provide valuable insights for clinical decision-making but face challenges when features have missing values. Classical solutions like imputation or excluding incomplete records are often unsuitable in applications where values are missing at test time. In this work, we conducted a survey with 71 clinicians from 29 trauma centers across France, including 20 complete responses to study the interaction between medical professionals and IML applied to data with missing values. This provided valuable insights into how missing data is interpreted in clinical machine learning. We used the prediction of hemorrhagic shock as a concrete example to gauge the willingness and readiness of the participants to adopt IML models from three classes of methods. Our findings show that, while clinicians value interpretability and are familiar with common IML methods, classical imputation techniques often misalign with their intuition, and that models that natively handle missing values are preferred. These results emphasize the need to integrate clinical intuition into future IML models for better human-computer interaction.
- [278] arXiv:2411.09595 [pdf, html, other]
-
Title: LLaMA-Mesh: Unifying 3D Mesh Generation with Language ModelsComments: See the project website at this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.
- [279] arXiv:2411.09597 [pdf, html, other]
-
Title: Rare-Case Hard Functions Against Various AdversariesComments: 25 pagesSubjects: Computational Complexity (cs.CC)
We say that a function is rare-case hard against a given class of algorithms (the adversary) if all algorithms in the class can compute the function only on an $o(1)$-fraction of instances of size $n$ for large enough $n$. Starting from any NP-complete language, for each $k > 0$, we construct a function that cannot be computed correctly on even a $1/n^k$-fraction of instances for polynomial-sized circuit families if NP $\not \subset$ P/POLY and by polynomial-time algorithms if NP $\not \subset$ BPP - functions that are rare-case hard against polynomial-time algorithms and polynomial-sized circuits. The constructed function is a number-theoretic polynomial evaluated over specific finite fields. For NP-complete languages that admit parsimonious reductions from all of NP (for example, SAT), the constructed functions are hard to compute on even a $1/n^k$-fraction of instances by polynomial-time algorithms and polynomial-sized circuit families simply if $P^{\#P} \not \subset$ BPP and $P^{\#P} \not \subset$ P/POLY, respectively. We also show that if the Randomized Exponential Time Hypothesis (RETH) is true, none of these constructed functions can be computed on even a $1/n^k$-fraction of instances in subexponential time. These functions are very hard, almost always.
While one may not be able to efficiently compute the values of these constructed functions themselves, in polynomial time, one can verify that the evaluation of a function, $s = f(x)$, is correct simply by asking a prover to compute $f(y)$ on targeted queries. - [280] arXiv:2411.09600 [pdf, html, other]
-
Title: Latency Optimization in LEO Satellite Communications with Hybrid Beam Pattern and Interference ControlSubjects: Information Theory (cs.IT); Machine Learning (cs.LG)
The rapid advancement of low Earth orbit (LEO) satellite communication systems has significantly enhanced global connectivity, offering high-capacity, low-latency services crucial for next-generation applications. However, the dense configuration of LEO constellations poses challenges in resource allocation optimization and interference management, complicating coexistence with other communication systems. To address these limitations, this paper proposes a novel framework for optimizing the beam scheduling and resource allocation in multi-beam LEO systems. To satisfy the uneven terrestrial traffic demand, a hybrid beam pattern is employed to enhance the downlink quality of service and minimize the transmission latency from LEO satellites to ground user terminals. Additionally, a dynamic co-channel interference (CCI) control mechanism is developed to mitigate inter-beam interference within the LEO constellation and limit cross-system interference affecting protected users from other networks. The problem of user-beam-frequency allocation with power optimization is formulated as a mixed-integer dynamic programming model and solved using a low-complexity neural network-based graph generation algorithm. Simulation results show that the proposed approach outperforms the baseline methods of full frequency reuse and single-channel transmission, and highlights the potential for further performance improvement with multi-user transmissions.
- [281] arXiv:2411.09601 [pdf, html, other]
-
Title: Accelerating Knowledge Graph and Ontology Engineering with Large Language ModelsSubjects: Artificial Intelligence (cs.AI)
Large Language Models bear the promise of significant acceleration of key Knowledge Graph and Ontology Engineering tasks, including ontology modeling, extension, modification, population, alignment, as well as entity disambiguation. We lay out LLM-based Knowledge Graph and Ontology Engineering as a new and coming area of research, and argue that modular approaches to ontologies will be of central importance.
- [282] arXiv:2411.09603 [pdf, html, other]
-
Title: Smart Automation in Luxury Leather Shoe Polishing: A Human Centric Robotic ApproachJournal-ref: International Journal of Computer Integrated Manufacturing, 2024, 1-15Subjects: Robotics (cs.RO)
The polishing of luxury leather shoes is a delicate, labor intensive process traditionally performed by skilled craftsmen. Footwear companies aim to automate parts of this process to enhance quality, productivity, and operator well-being, but the unique nature of luxury shoe production presents challenges. This paper introduces a solution involving a collaborative robotic cell to assist in shoe polishing. A collaborative robotic manipulator, equipped with a specialized tool and governed by force control, executes the polishing tasks. Key factors such as trajectory design, applied force, polishing speed, and polish amount were analyzed. Polishing trajectories are designed using CAM software and transferred to the robot control system. Human operators design the process, supervise the robot, and perform final finishing, ensuring their expertise is integral to achieving quality. Extensive testing on various shoe models showed significant improvements in quality and reliability, leading to successful implementation on an industrial production line.
- [283] arXiv:2411.09604 [pdf, html, other]
-
Title: Local-Global Attention: An Adaptive Mechanism for Multi-Scale Feature IntegrationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
In recent years, attention mechanisms have significantly enhanced the performance of object detection by focusing on key feature information. However, prevalent methods still encounter difficulties in effectively balancing local and global features. This imbalance hampers their ability to capture both fine-grained details and broader contextual information-two critical elements for achieving accurate object this http URL address these challenges, we propose a novel attention mechanism, termed Local-Global Attention, which is designed to better integrate both local and global contextual features. Specifically, our approach combines multi-scale convolutions with positional encoding, enabling the model to focus on local details while concurrently considering the broader global context. Additionally, we introduce a learnable parameters, which allow the model to dynamically adjust the relative importance of local and global attention, depending on the specific requirements of the task, thereby optimizing feature representations across multiple this http URL have thoroughly evaluated the Local-Global Attention mechanism on several widely used object detection and classification datasets. Our experimental results demonstrate that this approach significantly enhances the detection of objects at various scales, with particularly strong performance on multi-class and small object detection tasks. In comparison to existing attention mechanisms, Local-Global Attention consistently outperforms them across several key metrics, all while maintaining computational efficiency.
- [284] arXiv:2411.09605 [pdf, html, other]
-
Title: An explicit, energy-conserving particle-in-cell schemeSubjects: Numerical Analysis (math.NA); Plasma Physics (physics.plasm-ph)
We present an explicit temporal discretization of particle-in-cell schemes for the Vlasov equation that results in exact energy conservation when combined with an appropriate spatial discretization. The scheme is inspired by a simple, second-order explicit scheme that conserves energy exactly in the Eulerian context. We show that direct translation to particle-in-cell does not result in strict conservation, but derive a simple correction based on an analytically solvable optimization problem that recovers conservation. While this optimization problem is not guaranteed to have a real solution for every particle, we provide a correction that makes imaginary values extremely rare and still admits $\mathcal{O}(10^{-12})$ fractional errors in energy for practical simulation parameters. We present the scheme in both electrostatic -- where we use the Ampère formulation -- and electromagnetic contexts. With an electromagnetic field solve, the field update is most naturally linearly implicit, but the more computationally intensive particle update remains fully explicit. We also show how the scheme can be extended to use the fully explicit leapfrog and pseudospectral analytic time-domain (PSATD) field solvers. The scheme is tested on standard kinetic plasma problems, confirming its conservation properties.
- [285] arXiv:2411.09607 [pdf, html, other]
-
Title: Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer FrameworkSubjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
This report provides an initial look at partial results from the TREC 2024 Retrieval-Augmented Generation (RAG) Track. We have identified RAG evaluation as a barrier to continued progress in information access (and more broadly, natural language processing and artificial intelligence), and it is our hope that we can contribute to tackling the many challenges in this space. The central hypothesis we explore in this work is that the nugget evaluation methodology, originally developed for the TREC Question Answering Track in 2003, provides a solid foundation for evaluating RAG systems. As such, our efforts have focused on "refactoring" this methodology, specifically applying large language models to both automatically create nuggets and to automatically assign nuggets to system answers. We call this the AutoNuggetizer framework. Within the TREC setup, we are able to calibrate our fully automatic process against a manual process whereby nuggets are created by human assessors semi-manually and then assigned manually to system answers. Based on initial results across 21 topics from 45 runs, we observe a strong correlation between scores derived from a fully automatic nugget evaluation and a (mostly) manual nugget evaluation by human assessors. This suggests that our fully automatic evaluation process can be used to guide future iterations of RAG systems.
- [286] arXiv:2411.09612 [pdf, other]
-
Title: The Moral Foundations Weibo CorpusSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Moral sentiments expressed in natural language significantly influence both online and offline environments, shaping behavioral styles and interaction patterns, including social media selfpresentation, cyberbullying, adherence to social norms, and ethical decision-making. To effectively measure moral sentiments in natural language processing texts, it is crucial to utilize large, annotated datasets that provide nuanced understanding for accurate analysis and modeltraining. However, existing corpora, while valuable, often face linguistic limitations. To address this gap in the Chinese language domain,we introduce the Moral Foundation Weibo Corpus. This corpus consists of 25,671 Chinese comments on Weibo, encompassing six diverse topic areas. Each comment is manually annotated by at least three systematically trained annotators based on ten moral categories derived from a grounded theory of morality. To assess annotator reliability, we present the kappa testresults, a gold standard for measuring consistency. Additionally, we apply several the latest large language models to supplement the manual annotations, conducting analytical experiments to compare their performance and report baseline results for moral sentiment classification.
- [287] arXiv:2411.09613 [pdf, html, other]
-
Title: PTR: Precision-Driven Tool Recommendation for Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
By augmenting Large Language Models (LLMs) with external tools, their capacity to solve complex problems has been significantly enhanced. However, despite ongoing advancements in the parsing capabilities of LLMs, incorporating all available tools simultaneously in the prompt remains impractical due to the vast number of external tools. Consequently, it is essential to provide LLMs with a precise set of tools tailored to the specific task, considering both quantity and quality. Current tool retrieval methods primarily focus on refining the ranking list of tools and directly packaging a fixed number of top-ranked tools as the tool set. However, these approaches often fail to equip LLMs with the optimal set of tools prior to execution, since the optimal number of tools for different tasks could be different, resulting in inefficiencies such as redundant or unsuitable tools, which impede immediate access to the most relevant tools. This paper addresses the challenge of recommending precise toolsets for LLMs. We introduce the problem of tool recommendation, define its scope, and propose a novel Precision-driven Tool Recommendation (PTR) approach. PTR captures an initial, concise set of tools by leveraging historical tool bundle usage and dynamically adjusts the tool set by performing tool matching, culminating in a multi-view-based tool addition. Additionally, we present a new dataset, RecTools, and a metric, TRACC, designed to evaluate the effectiveness of tool recommendation for LLMs. We further validate our design choices through comprehensive experiments, demonstrating promising accuracy across two open benchmarks and our RecTools dataset.
- [288] arXiv:2411.09617 [pdf, other]
-
Title: Riemannian optimisation methods for ground states of multicomponent Bose-Einstein condensatesSubjects: Numerical Analysis (math.NA)
This paper addresses the computation of ground states of multicomponent Bose-Einstein condensates, defined as the global minimiser of an energy functional on an infinite-dimensional generalised oblique manifold. We establish the existence of the ground state, prove its uniqueness up to scaling, and characterise it as the solution to a coupled nonlinear eigenvector problem. By equipping the manifold with several Riemannian metrics, we introduce a suite of Riemannian gradient descent and Riemannian Newton methods. Metrics that incorporate first- or second-order information about the energy are particularly advantageous, effectively preconditioning the resulting methods. For a Riemannian gradient descent method with an energy-adaptive metric, we provide a qualitative global and quantitative local convergence analysis, confirming its reliability and robustness with respect to the choice of the spatial discretisation. Numerical experiments highlight the computational efficiency of both the Riemannian gradient descent and Newton methods.
- [289] arXiv:2411.09619 [pdf, html, other]
-
Title: Hardness Amplification via Group TheoryComments: 72 pagesSubjects: Computational Complexity (cs.CC)
We employ techniques from group theory to show that, in many cases, counting problems on graphs are almost as hard to solve in a small number of instances as they are in all instances. Specifically, we show the following results.
1. Goldreich (2020) asks if, for every constant $\delta < 1 / 2$, there is an $\tilde{O} \left( n^2 \right)$-time randomized reduction from computing the number of $k$-cliques modulo $2$ with a success probability of greater than $2 / 3$ to computing the number of $k$-cliques modulo $2$ with an error probability of at most $\delta$.
In this work, we show that for almost all choices of the $\delta 2^{n \choose 2}$ corrupt answers within the average-case solver, we have a reduction taking $\tilde{O} \left( n^2 \right)$-time and tolerating an error probability of $\delta$ in the average-case solver for any constant $\delta < 1 / 2$. By "almost all", we mean that if we choose, with equal probability, any subset $S \subset \{0,1\}^{n \choose 2}$ with $|S| = \delta2^{n \choose 2}$, then with a probability of $1-2^{-\Omega \left( n^2 \right)}$, we can use an average-case solver corrupt on $S$ to obtain a probabilistic algorithm.
2. Inspired by the work of Goldreich and Rothblum in FOCS 2018 to take the weighted versions of the graph counting problems, we prove that if the RETH is true, then for a prime $p = \Theta \left( 2^n \right)$, the problem of counting the number of unique Hamiltonian cycles modulo $p$ on $n$-vertex directed multigraphs and the problem of counting the number of unique half-cliques modulo $p$ on $n$-vertex undirected multigraphs, both require exponential time to compute correctly on even a $1 / 2^{n/\log n}$-fraction of instances. Meanwhile, simply printing $0$ on all inputs is correct on at least a $\Omega \left( 1 / 2^n \right)$-fraction of instances. - [290] arXiv:2411.09623 [pdf, html, other]
-
Title: Vision-based Manipulation of Transparent Plastic Bags in Industrial SetupsF. Adetunji (1 and 2), A. Karukayil (1 and 2), P. Samant (1 and 2), S. Shabana (1 and 2), F. Varghese (1 and 2), U. Upadhyay (1 and 2), R. A. Yadav (1 and 2), A. Partridge (2), E. Pendleton (2), R. Plant (2), Y. Petillot (1 and 2), M. Koskinopoulou (1 and 2) ((1) Heriot-Watt University, (2) The National Robotarium)Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
This paper addresses the challenges of vision-based manipulation for autonomous cutting and unpacking of transparent plastic bags in industrial setups, aligning with the Industry 4.0 paradigm. Industry 4.0, driven by data, connectivity, analytics, and robotics, promises enhanced accessibility and sustainability throughout the value chain. The integration of autonomous systems, including collaborative robots (cobots), into industrial processes is pivotal for efficiency and safety. The proposed solution employs advanced Machine Learning algorithms, particularly Convolutional Neural Networks (CNNs), to identify transparent plastic bags under varying lighting and background conditions. Tracking algorithms and depth sensing technologies are utilized for 3D spatial awareness during pick and placement. The system addresses challenges in grasping and manipulation, considering optimal points, compliance control with vacuum gripping technology, and real-time automation for safe interaction in dynamic environments. The system's successful testing and validation in the lab with the FRANKA robot arm, showcases its potential for widespread industrial applications, while demonstrating effectiveness in automating the unpacking and cutting of transparent plastic bags for an 8-stack bulk-loader based on specific requirements and rigorous testing.
- [291] arXiv:2411.09625 [pdf, html, other]
-
Title: Local deployment of large-scale music AI models on commodity hardwareComments: 2 pagesSubjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
We present the MIDInfinite, a web application capable of generating symbolic music using a large-scale generative AI model locally on commodity hardware. Creating this demo involved porting the Anticipatory Music Transformer, a large language model (LLM) pre-trained on the Lakh MIDI dataset, to the Machine Learning Compilation (MLC) framework. Once the model is ported, MLC facilitates inference on a variety of runtimes including C++, mobile, and the browser. We envision that MLC has the potential to bridge the gap between the landscape of increasingly capable music AI models and technology more familiar to music software developers. As a proof of concept, we build a web application that allows users to generate endless streams of multi-instrumental MIDI in the browser, either from scratch or conditioned on a prompt. On commodity hardware (an M3 Macbook Pro), our demo can generate 51 notes per second, which is faster than real-time playback for 72.9% of generations, and increases to 86.3% with 2 seconds of upfront buffering.
- [292] arXiv:2411.09627 [pdf, html, other]
-
Title: One-Shot Manipulation Strategy Learning by Making Contact AnalogiesComments: CoRL LEAP Workshop, 2024Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
We present a novel approach, MAGIC (manipulation analogies for generalizable intelligent contacts), for one-shot learning of manipulation strategies with fast and extensive generalization to novel objects. By leveraging a reference action trajectory, MAGIC effectively identifies similar contact points and sequences of actions on novel objects to replicate a demonstrated strategy, such as using different hooks to retrieve distant objects of different shapes and sizes. Our method is based on a two-stage contact-point matching process that combines global shape matching using pretrained neural features with local curvature analysis to ensure precise and physically plausible contact points. We experiment with three tasks including scooping, hanging, and hooking objects. MAGIC demonstrates superior performance over existing methods, achieving significant improvements in runtime speed and generalization to different object categories. Website: this https URL .
- [293] arXiv:2411.09639 [pdf, html, other]
-
Title: MCCE: Missingness-aware Causal Concept ExplainerSubjects: Machine Learning (cs.LG)
Causal concept effect estimation is gaining increasing interest in the field of interpretable machine learning. This general approach explains the behaviors of machine learning models by estimating the causal effect of human-understandable concepts, which represent high-level knowledge more comprehensibly than raw inputs like tokens. However, existing causal concept effect explanation methods assume complete observation of all concepts involved within the dataset, which can fail in practice due to incomplete annotations or missing concept data. We theoretically demonstrate that unobserved concepts can bias the estimation of the causal effects of observed concepts. To address this limitation, we introduce the Missingness-aware Causal Concept Explainer (MCCE), a novel framework specifically designed to estimate causal concept effects when not all concepts are observable. Our framework learns to account for residual bias resulting from missing concepts and utilizes a linear predictor to model the relationships between these concepts and the outputs of black-box machine learning models. It can offer explanations on both local and global levels. We conduct validations using a real-world dataset, demonstrating that MCCE achieves promising performance compared to state-of-the-art explanation methods in causal concept effect estimation.
- [294] arXiv:2411.09642 [pdf, html, other]
-
Title: On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode CollapseComments: Abstract shortened to fit arXiv limitSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Specifying all desirable properties of a language model is challenging, but certain requirements seem essential. Given samples from an unknown language, the trained model should produce valid strings not seen in training and be expressive enough to capture the language's full richness. Otherwise, outputting invalid strings constitutes "hallucination," and failing to capture the full range leads to "mode collapse." We ask if a language model can meet both requirements.
We investigate this within a statistical language generation setting building on Gold and Angluin. Here, the model receives random samples from a distribution over an unknown language K, which belongs to a possibly infinite collection of languages. The goal is to generate unseen strings from K. We say the model generates from K with consistency and breadth if, as training size increases, its output converges to all unseen strings in K.
Kleinberg and Mullainathan [KM24] asked if consistency and breadth in language generation are possible. We answer this negatively: for a large class of language models, including next-token prediction models, this is impossible for most collections of candidate languages. This contrasts with [KM24]'s result, showing consistent generation without breadth is possible for any countable collection of languages. Our finding highlights that generation with breadth fundamentally differs from generation without breadth.
As a byproduct, we establish near-tight bounds on the number of samples needed for generation with or without breadth.
Finally, our results offer hope: consistent generation with breadth is achievable for any countable collection of languages when negative examples (strings outside K) are available alongside positive ones. This suggests that post-training feedback, which encodes negative examples, can be crucial in reducing hallucinations while limiting mode collapse. - [295] arXiv:2411.09643 [pdf, html, other]
-
Title: Modular Fault Diagnosis Framework for Complex Autonomous Driving SystemsStefan Orf, Sven Ochs, Jens Doll, Albert Schotschneider, Marc Heinrich, Marc René Zofka, J. Marius ZöllnerComments: Accepted at 2024 IEEE 20th International Conference on Intelligent Computer Communication and Processing (ICCP 2024)Subjects: Robotics (cs.RO)
Fault diagnosis is crucial for complex autonomous mobile systems, especially for modern-day autonomous driving (AD). Different actors, numerous use cases, and complex heterogeneous components motivate a fault diagnosis of the system and overall system integrity. AD systems are composed of many heterogeneous components, each with different functionality and possibly using a different algorithm (e.g., rule-based vs. AI components). In addition, these components are subject to the vehicle's driving state and are highly dependent. This paper, therefore, faces this problem by presenting the concept of a modular fault diagnosis framework for AD systems. The concept suggests modular state monitoring and diagnosis elements, together with a state- and dependency-aware aggregation method. Our proposed classification scheme allows for the categorization of the fault diagnosis modules. The concept is implemented on AD shuttle buses and evaluated to demonstrate its capabilities.
- [296] arXiv:2411.09645 [pdf, other]
-
Title: How do Machine Learning Models Change?Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
The proliferation of Machine Learning (ML) models and their open-source implementations has transformed Artificial Intelligence research and applications. Platforms like Hugging Face (HF) enable the development, sharing, and deployment of these models, fostering an evolving ecosystem. While previous studies have examined aspects of models hosted on platforms like HF, a comprehensive longitudinal study of how these models change remains underexplored. This study addresses this gap by utilizing both repository mining and longitudinal analysis methods to examine over 200,000 commits and 1,200 releases from over 50,000 models on HF. We replicate and extend an ML change taxonomy for classifying commits and utilize Bayesian networks to uncover patterns in commit and release activities over time. Our findings indicate that commit activities align with established data science methodologies, such as CRISP-DM, emphasizing iterative refinement and continuous improvement. Additionally, release patterns tend to consolidate significant updates, particularly in documentation, distinguishing between granular changes and milestone-based releases. Furthermore, projects with higher popularity prioritize infrastructure enhancements early in their lifecycle, and those with intensive collaboration practices exhibit improved documentation standards. These and other insights enhance the understanding of model changes on community platforms and provide valuable guidance for best practices in model maintenance.
- [297] arXiv:2411.09648 [pdf, html, other]
-
Title: Med-Bot: An AI-Powered Assistant to Provide Accurate and Reliable Medical InformationComments: 3 figures, 5 pages Keywords-LLM, AI-powered healthcare, Medical chatbot, Context-based interaction, Llama-assisted data processing, AutoGPT-Q, PyTorch, TensorFlow, Reliable medical information, Machine learning in healthcare, Conversational AISubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
This paper introduces Med-Bot, an AI-powered chatbot designed to provide users with accurate and reliable medical information. Utilizing advanced libraries and frameworks such as PyTorch, Chromadb, Langchain and Autogptq, Med-Bot is built to handle the complexities of natural language understanding in a healthcare context. The integration of llamaassisted data processing and AutoGPT-Q provides enhanced performance in processing and responding to queries based on PDFs of medical literature, ensuring that users receive precise and trustworthy information. This research details the methodologies employed in developing Med-Bot and evaluates its effectiveness in disseminating healthcare information.
- [298] arXiv:2411.09653 [pdf, html, other]
-
Title: How to implement the Bayes' formula in the age of ML?Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
This chapter contains a self-contained introduction to the significance of Bayes' formula in the context of nonlinear filtering problems. Both discrete-time and continuous-time settings of the problem are considered in a unified manner. In control theory, the focus on optimization-based solution approaches is stressed together with a discussion of historical developments in this area (from 1960s onwards). The heart of this chapter contains a presentation of a novel optimal transportation formulation for the Bayes formula (developed recently by the first author) and its relationship to some of the prior joint work (feedback particle filter) from the authors. The presentation highlights how optimal transportation theory is leveraged to overcome some of the numerical challenges of implementing Bayes' law by enabling the use of machine learning (ML) tools.
- [299] arXiv:2411.09658 [pdf, html, other]
-
Title: Motion Before Action: Diffusing Object Motion as Manipulation ConditionSubjects: Robotics (cs.RO)
Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA (Motion Before Action), a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks.
- [300] arXiv:2411.09660 [pdf, html, other]
-
Title: Capacity and Power Consumption of Multi-Layer 6G Networks Using the Upper Mid-BandSubjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
This paper presents a new system model to evaluate the capacity and power consumption of multi-layer 6G networks utilising the upper mid-band (FR3). The model captures heterogeneous 4G, 5G, and 6G deployments, analyzing their performance under different deployment strategies. Our results show that strategic 6G deployments, non-co-located with existing 5G sites, significantly enhance throughput, with median and peak user rates of 300 Mbps and exceeding 1 Gbps, respectively. We also emphasize the importance of priority-based cell reselection and beam configuration to fully leverage 6G capabilities. While 6G implementation increases power consumption by 33%, non-colocated deployments strike a balance between performance and power consumption.
- [301] arXiv:2411.09661 [pdf, html, other]
-
Title: Adaptive Decoding via Latent Preference OptimizationShehzaad Dhuliawala, Ilia Kulikov, Ping Yu, Asli Celikyilmaz, Jason Weston, Sainbayar Sukhbaatar, Jack LanchantinSubjects: Computation and Language (cs.CL)
During language model decoding, it is known that using higher temperature sampling gives more creative responses, while lower temperatures are more factually accurate. However, such models are commonly applied to general instruction following, which involves both creative and fact seeking tasks, using a single fixed temperature across all examples and tokens. In this work, we introduce Adaptive Decoding, a layer added to the model to select the sampling temperature dynamically at inference time, at either the token or example level, in order to optimize performance. To learn its parameters we introduce Latent Preference Optimization (LPO) a general approach to train discrete latent variables such as choices of temperature. Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures, including UltraFeedback, Creative Story Writing, and GSM8K.
- [302] arXiv:2411.09666 [pdf, other]
-
Title: Evaluating 5G Networks for U-Space Applications: Insights from Dense Urban Measurement CampaignBarrios-Munoz Ricardo, Bernabe Matteo, Lopez-Perez David, Gomez-Barquero David, Quintanilla-Garcia IsraelSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This paper examines the communication performance of unmanned aerial vehicles (UAVs) in dense urban environments, specifically in Benidorm, Spain. Through a comprehensive measurement campaign, we assessed key performance indicators (KPIs) relating to received signal strength and quality as well as rate across various locations, altitudes, operators, technologies, and frequencies, using different measurement equipment. The results highlight significant challenges, primarily due to the lack of planning for aerial coverage and interference, revealing that current cellular networks may fall short in supporting U-space communication needs. The paper calls for network upgrades to ensure reliable UAV operations in urban airspace, contributing to the integration of UAVS in urban logistics and mobility.
- [303] arXiv:2411.09675 [pdf, html, other]
-
Title: Citation Sentiment Reflects Multiscale Sociocultural NormsComments: 16 pages, 8 figures in main; 13 pages, 3 figures in supplementSubjects: Social and Information Networks (cs.SI)
Modern science is formally structured around scholarly publication, where scientific knowledge is canonized through citation. Precisely how citations are given and accrued can provide information about the value of discovery, the history of scientific ideas, the structure of fields, and the space or scope of inquiry. Yet parsing this information has been challenging because citations are not simply present or absent; rather, they differ in purpose, function, and sentiment. In this paper, we investigate how critical and favorable sentiments are distributed across citations, and demonstrate that citation sentiment tracks sociocultural norms across scales of collaboration, discipline, and country. At the smallest scale of individuals, we find that researchers cite scholars they have collaborated with more favorably (and less critically) than scholars they have not collaborated with. Outside collaborative relationships, higher h-index scholars cite lower h-index scholars more critically. At the mesoscale of disciplines, we find that wetlab disciplines tend to be less critical than drylab disciplines, and disciplines that engage in more synthesis through publishing more review articles tend to be less critical. At the largest scale of countries, we find that greater individualism (and lesser acceptance of the unequal distribution of power) is associated with more critical sentiment. Collectively, our results demonstrate how sociocultural factors can explain variations in sentiment in scientific communication. As such, our study contributes to the broader understanding of how human factors influence the practice of science, and underscore the importance of considering the larger sociocultural contexts in which science progresses.
- [304] arXiv:2411.09678 [pdf, other]
-
Title: NeuralDEM -- Real-time Simulation of Industrial Particulate FlowsBenedikt Alkin, Tobias Kronlachner, Samuele Papa, Stefan Pirker, Thomas Lichtenegger, Johannes BrandstetterComments: Project page: this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Advancements in computing power have made it possible to numerically simulate large-scale fluid-mechanical and/or particulate systems, many of which are integral to core industrial processes. Among the different numerical methods available, the discrete element method (DEM) provides one of the most accurate representations of a wide range of physical systems involving granular and discontinuous materials. Consequently, DEM has become a widely accepted approach for tackling engineering problems connected to granular flows and powder mechanics. Additionally, DEM can be integrated with grid-based computational fluid dynamics (CFD) methods, enabling the simulation of chemical processes taking place, e.g., in fluidized beds. However, DEM is computationally intensive because of the intrinsic multiscale nature of particulate systems, restricting simulation duration or number of particles. Towards this end, NeuralDEM presents an end-to-end approach to replace slow numerical DEM routines with fast, adaptable deep learning surrogates. NeuralDEM is capable of picturing long-term transport processes across different regimes using macroscopic observables without any reference to microscopic model parameters. First, NeuralDEM treats the Lagrangian discretization of DEM as an underlying continuous field, while simultaneously modeling macroscopic behavior directly as additional auxiliary fields. Second, NeuralDEM introduces multi-branch neural operators scalable to real-time modeling of industrially-sized scenarios - from slow and pseudo-steady to fast and transient. Such scenarios have previously posed insurmountable challenges for deep learning models. Notably, NeuralDEM faithfully models coupled CFD-DEM fluidized bed reactors of 160k CFD cells and 500k DEM particles for trajectories of 28s. NeuralDEM will open many new doors to advanced engineering and much faster process cycles.
- [305] arXiv:2411.09683 [pdf, html, other]
-
Title: Towards a Classification of Open-Source ML Models and Datasets for Software EngineeringComments: 5 pages, 8 figuresSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Background: Open-Source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. Aims: We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time. Method: We conducted a repository mining study. We started with a systematically gathered database of PTMs and datasets from the HF API. Our selection was refined by analyzing model and dataset cards and metadata, such as tags, and confirming SE relevance using Gemini 1.5 Pro. All analyses are replicable, with a publicly accessible replication package. Results: The most common SE task among PTMs and datasets is code generation, with a primary focus on software development and limited attention to software management. Popular PTMs and datasets mainly target software development. Among ML tasks, text generation is the most common in SE PTMs and datasets. There has been a marked increase in PTMs for SE since 2023 Q2. Conclusions: This study underscores the need for broader task coverage to enhance the integration of ML within SE practices.
- [306] arXiv:2411.09688 [pdf, other]
-
Title: Squeezed Attention: Accelerating Long Context Length LLM InferenceColeman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir GholamiSubjects: Computation and Language (cs.CL)
Emerging Large Language Model (LLM) applications require long input prompts to perform complex downstream tasks like document analysis and code generation. For these long context length applications, the length of the input prompt poses a significant challenge in terms of inference efficiency since the inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, thereby providing the opportunity to perform offline optimizations to process user inputs quickly, as they are received. In this work, we propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. We first leverage K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During inference, we compare query tokens from the user input with the centroids to predict which of the keys from the fixed context are semantically relevant and need to be loaded during inference. We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs. We also extend our method to use a hierarchical centroid lookup to identify important keys, which can reduce the complexity of attention from linear to logarithmic with respect to the context length. We implement optimized Triton kernels for centroid comparison and sparse FlashAttention with important keys, achieving more than 4x speedups during both the prefill and generation phases for long-context inference. Furthermore, we have extensively evaluated our method on various long-context benchmarks including LongBench, where it achieves a 3x reduction in KV cache budget without accuracy loss and up to an 8x reduction with <0.5 point accuracy gap for various models.
- [307] arXiv:2411.09689 [pdf, html, other]
-
Title: LLM Hallucination Reasoning with Zero-shot Knowledge TestComments: 12 pages, 2 figuresSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
LLM hallucination, where LLMs occasionally generate unfaithful text, poses significant challenges for their practical applications. Most existing detection methods rely on external knowledge, LLM fine-tuning, or hallucination-labeled datasets, and they do not distinguish between different types of hallucinations, which are crucial for improving detection performance. We introduce a new task, Hallucination Reasoning, which classifies LLM-generated text into one of three categories: aligned, misaligned, and fabricated. Our novel zero-shot method assesses whether LLM has enough knowledge about a given prompt and text. Our experiments conducted on new datasets demonstrate the effectiveness of our method in hallucination reasoning and underscore its importance for enhancing detection performance.
- [308] arXiv:2411.09691 [pdf, html, other]
-
Title: Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal ModelsWei Wang, Zhaowei Li, Qi Xu, Linfeng Li, YiQing Cai, Botian Jiang, Hang Song, Xingcan Hu, Pengyu Wang, Li XiaoSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multi-modal large language models (MLLMs) have achieved remarkable success in fine-grained visual understanding across a range of tasks. However, they often encounter significant challenges due to inadequate alignment for fine-grained knowledge, which restricts their ability to accurately capture local details and attain a comprehensive global perception. While recent advancements have focused on aligning object expressions with grounding information, they typically lack explicit integration of object images, which contain affluent information beyond mere texts or coordinates. To bridge this gap, we introduce a novel fine-grained visual knowledge alignment method that effectively aligns and integrates multi-scale knowledge of objects, including texts, coordinates, and images. This innovative method is underpinned by our multi-scale fine-grained enhancement data synthesis pipeline, which provides over 300K essential training data to enhance alignment and improve overall performance. Furthermore, we present TinyGroundingGPT, a series of compact models optimized for high-level alignments. With a scale of approximately 3B parameters, TinyGroundingGPT achieves outstanding results in grounding tasks while delivering performance comparable to larger MLLMs in complex visual scenarios.
- [309] arXiv:2411.09693 [pdf, html, other]
-
Title: CropCraft: Inverse Procedural Modeling for 3D Reconstruction of Crop PlantsAlbert J. Zhai, Xinlei Wang, Kaiyuan Li, Zhao Jiang, Junxiong Zhou, Sheng Wang, Zhenong Jin, Kaiyu Guan, Shenlong WangComments: PreprintSubjects: Computer Vision and Pattern Recognition (cs.CV)
The ability to automatically build 3D digital twins of plants from images has countless applications in agriculture, environmental science, robotics, and other fields. However, current 3D reconstruction methods fail to recover complete shapes of plants due to heavy occlusion and complex geometries. In this work, we present a novel method for 3D reconstruction of agricultural crops based on optimizing a parametric model of plant morphology via inverse procedural modeling. Our method first estimates depth maps by fitting a neural radiance field and then employs Bayesian optimization to estimate plant morphological parameters that result in consistent depth renderings. The resulting 3D model is complete and biologically plausible. We validate our method on a dataset of real images of agricultural fields, and demonstrate that the reconstructions can be used for a variety of monitoring and simulation applications.
- [310] arXiv:2411.09694 [pdf, html, other]
-
Title: A Bayesian Optimization Approach to Machine Translation RerankingComments: v1: Preprint versionSubjects: Computation and Language (cs.CL)
Reranking a list of candidates from a machine translation system with an external scoring model and returning the highest-scoring candidate remains a simple and effective method for improving the overall output quality. Translation scoring models continue to grow in size, with the best models being comparable to generation models. Thus, reranking can add substantial computational cost to the translation pipeline. In this work, we pose reranking as a Bayesian optimization (BayesOpt) problem. By strategically selecting candidates to score based on a balance of exploration and exploitation, we show that it is possible to find top-scoring candidates when scoring only a fraction of the candidate list. For instance, our method achieves the same CometKiwi score using only 70 scoring evaluations compared a baseline system using 180. We present a multi-fidelity setting for BayesOpt, where the candidates are first scored with a cheaper but noisier proxy scoring model, which further improves the cost-performance tradeoff when using smaller but well-trained distilled proxy scorers.
- [311] arXiv:2411.09702 [pdf, html, other]
-
Title: On the Surprising Effectiveness of Attention Transfer for Vision TransformersComments: NeurIPS 2024. Code: this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher also further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning
- [312] arXiv:2411.09703 [pdf, html, other]
-
Title: MagicQuill: An Intelligent Interactive Image Editing SystemZichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun ShenComments: Code and demo available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas. Our system features a streamlined yet functionally robust interface, allowing for the articulation of editing operations (e.g., inserting elements, erasing objects, altering color) with minimal input. These interactions are monitored by a multimodal large language model (MLLM) to anticipate editing intentions in real time, bypassing the need for explicit prompt entry. Finally, we apply a powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process editing requests with precise control. Experimental results demonstrate the effectiveness of MagicQuill in achieving high-quality image edits. Please visit this https URL to try out our system.
New submissions (showing 312 of 312 entries)
- [313] arXiv:2411.08538 (cross-list from physics.app-ph) [pdf, other]
-
Title: Intelligent Adaptive Metasurface in Complex Wireless EnvironmentsHan Qing Yang, Jun Yan Dai, Hui Dong Li, Lijie Wu, Meng Zhen Zhang, Zi Hang Shen, Si Ran Wang, Zheng Xing Wang, Wankai Tang, Shi Jin, Jun Wei Wu, Qiang Cheng, Tie Jun CuiSubjects: Applied Physics (physics.app-ph); Systems and Control (eess.SY)
The programmable metasurface is regarded as one of the most promising transformative technologies for next-generation wireless system applications. Due to the lack of effective perception ability of the external electromagnetic environment, there are numerous challenges in the intelligent regulation of wireless channels, and it still relies on external sensors to reshape electromagnetic environment as desired. To address that problem, we propose an adaptive metasurface (AMS) which integrates the capabilities of acquiring wireless environment information and manipulating reflected electromagnetic (EM) waves in a programmable manner. The proposed design endows the metasurfaces with excellent capabilities to sense the complex electromagnetic field distributions around them and then dynamically manipulate the waves and signals in real time under the guidance of the sensed information, eliminating the need for prior knowledge or external inputs about the wireless environment. For verification, a prototype of the proposed AMS is constructed, and its dual capabilities of sensing and manipulation are experimentally validated. Additionally, different integrated sensing and communication (ISAC) scenarios with and without the aid of the AMS are established. The effectiveness of the AMS in enhancing communication quality is well demonstrated in complex electromagnetic environments, highlighting its beneficial application potential in future wireless systems.
- [314] arXiv:2411.08886 (cross-list from eess.SP) [pdf, html, other]
-
Title: Network scaling and scale-driven loss balancing for intelligent poroelastographySubjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
A deep learning framework is developed for multiscale characterization of poroelastic media from full waveform data which is known as poroelastography. Special attention is paid to heterogeneous environments whose multiphase properties may drastically change across several scales. Described in space-frequency, the data takes the form of focal solid displacement and pore pressure fields in various neighborhoods furnished either by reconstruction from remote data or direct measurements depending on the application. The objective is to simultaneously recover the six hydromechanical properties germane to Biot equations and their spatial distribution in a robust and efficient manner. Two major challenges impede direct application of existing state-of-the-art techniques for this purpose: (i) the sought-for properties belong to vastly different and potentially uncertain scales, and~(ii) the loss function is multi-objective and multi-scale (both in terms of its individual components and the total loss). To help bridge the gap, we propose the idea of \emph{network scaling} where the neural property maps are constructed by unit shape functions composed into a scaling layer. In this model, the unknown network parameters (weights and biases) remain of O(1) during training. This forms the basis for explicit scaling of the loss components and their derivatives with respect to the network parameters. Thereby, we propose the physics-based \emph{dynamic scaling} approach for adaptive loss balancing. The idea is first presented in a generic form for multi-physics and multi-scale PDE systems, and then applied through a set of numerical experiments to poroelastography. The results are presented along with reconstructions by way of gradient normalization (GradNorm) and Softmax adaptive weights (SoftAdapt) for loss balancing. A comparative analysis of the methods and corresponding results is provided.
- [315] arXiv:2411.08887 (cross-list from eess.SP) [pdf, html, other]
-
Title: Deep Learning-Based CKM Construction with Image Super-ResolutionSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Channel knowledge map (CKM) is a novel technique for achieving environment awareness, and thereby improving the communication and sensing performance for wireless systems. A fundamental problem associated with CKM is how to construct a complete CKM that provides channel knowledge for a large number of locations based solely on sparse data measurements. This problem bears similarities to the super-resolution (SR) problem in image processing. In this letter, we propose an effective deep learning-based CKM construction method that leverages the image SR network known as SRResNet. Unlike most existing studies, our approach does not require any additional input beyond the sparsely measured data. In addition to the conventional path loss map construction, our approach can also be applied to construct channel angle maps (CAMs), thanks to the use of a new dataset called CKMImageNet. The numerical results demonstrate that our method outperforms interpolation-based methods such as nearest neighbour and bicubic interpolation, as well as the SRGAN method in CKM construction. Furthermore, only 1/16 of the locations need to be measured in order to achieve a root mean square error (RMSE) of 1.1 dB in path loss.
- [316] arXiv:2411.08896 (cross-list from eess.SP) [pdf, html, other]
-
Title: Demand-Aware Beam Hopping and Power Allocation for Load Balancing in Digital Twin empowered LEO Satellite NetworksSubjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Low-Earth orbit (LEO) satellites utilizing beam hopping (BH) technology offer extensive coverage, low latency, high bandwidth, and significant flexibility. However, the uneven geographical distribution and temporal variability of ground traffic demands, combined with the high mobility of LEO satellites, present significant challenges for efficient beam resource utilization. Traditional BH methods based on GEO satellites fail to address issues such as satellite interference, overlapping coverage, and mobility. This paper explores a Digital Twin (DT)-based collaborative resource allocation network for multiple LEO satellites with overlapping coverage areas. A two-tier optimization problem, focusing on load balancing and cell service fairness, is proposed to maximize throughput and minimize inter-cell service delay. The DT layer optimizes the allocation of overlapping coverage cells by designing BH patterns for each satellite, while the LEO layer optimizes power allocation for each selected service cell. At the DT layer, an Actor-Critic network is deployed on each agent, with a global critic network in the cloud center. The A3C algorithm is employed to optimize the DT layer. Concurrently, the LEO layer optimization is performed using a Multi-Agent Reinforcement Learning algorithm, where each beam functions as an independent agent. The simulation results show that this method reduces satellite load disparity by about 72.5% and decreases the average delay to 12ms. Additionally, our approach outperforms other benchmarks in terms of throughput, ensuring a better alignment between offered and requested data.
- [317] arXiv:2411.08899 (cross-list from q-fin.TR) [pdf, html, other]
-
Title: FinVision: A Multi-Agent Framework for Stock Market PredictionComments: Accepted at ICAIF 2024Subjects: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI)
Financial trading has been a challenging task, as it requires the integration of vast amounts of data from various modalities. Traditional deep learning and reinforcement learning methods require large training data and often involve encoding various data types into numerical formats for model input, which limits the explainability of model behavior. Recently, LLM-based agents have demonstrated remarkable advancements in handling multi-modal data, enabling them to execute complex, multi-step decision-making tasks while providing insights into their thought processes. This research introduces a multi-modal multi-agent system designed specifically for financial trading tasks. Our framework employs a team of specialized LLM-based agents, each adept at processing and interpreting various forms of financial data, such as textual news reports, candlestick charts, and trading signal charts. A key feature of our approach is the integration of a reflection module, which conducts analyses of historical trading signals and their outcomes. This reflective process is instrumental in enhancing the decision-making capabilities of the system for future trading scenarios. Furthermore, the ablation studies indicate that the visual reflection module plays a crucial role in enhancing the decision-making capabilities of our framework.
- [318] arXiv:2411.08900 (cross-list from q-bio.GN) [pdf, html, other]
-
Title: RNA-GPT: Multimodal Generative System for RNA Sequence UnderstandingComments: Machine Learning for Structural Biology Workshop, NeurIPS 2024Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
RNAs are essential molecules that carry genetic information vital for life, with profound implications for drug development and biotechnology. Despite this importance, RNA research is often hindered by the vast literature available on the topic. To streamline this process, we introduce RNA-GPT, a multi-modal RNA chat model designed to simplify RNA discovery by leveraging extensive RNA literature. RNA-GPT integrates RNA sequence encoders with linear projection layers and state-of-the-art large language models (LLMs) for precise representation alignment, enabling it to process user-uploaded RNA sequences and deliver concise, accurate responses. Built on a scalable training pipeline, RNA-GPT utilizes RNA-QA, an automated system that gathers RNA annotations from RNACentral using a divide-and-conquer approach with GPT-4o and latent Dirichlet allocation (LDA) to efficiently handle large datasets and generate instruction-tuning samples. Our experiments indicate that RNA-GPT effectively addresses complex RNA queries, thereby facilitating RNA research. Additionally, we present RNA-QA, a dataset of 407,616 RNA samples for modality alignment and instruction tuning, further advancing the potential of RNA research tools.
- [319] arXiv:2411.08902 (cross-list from eess.SP) [pdf, html, other]
-
Title: A Range-Free Node Localization Method for Anisotropic Wireless Sensor Networks with Sparse AnchorsSubjects: Signal Processing (eess.SP); Networking and Internet Architecture (cs.NI)
In sensor networks characterized by irregular layouts and poor connectivity, anisotropic properties can significantly reduce the accuracy of distance estimation between nodes, consequently impairing the localization precision of unidentified nodes. Since distance estimation is contingent upon the multi-hop paths between anchor node pairs, assigning differential weights based on the reliability of these paths could enhance localization accuracy. To address this, we introduce an adaptive weighted method, termed AW-MinMax, for range-free node localization. This method involves constructing a weighted mean nodes localization model, where each multi-hop path weight is inversely proportional to the number of hops. Despite the model's inherent non-convexity and non-differentiability, it can be reformulated into an optimization model with convex objective functions and non-convex constraints through matrix transformations. To resolve these constraints, we employ a Sequential Convex Approximation (SCA) algorithm that utilizes first-order Taylor expansion for iterative refinement. Simulation results validate that our proposed algorithm substantially improves stability and accuracy in estimating range-free node locations.
- [320] arXiv:2411.08903 (cross-list from physics.geo-ph) [pdf, html, other]
-
Title: Turkey's Earthquakes: Damage Prediction and Feature Significance Using A Multivariate AnalysisSubjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Accurate damage prediction is crucial for disaster preparedness and response strategies, particularly given the frequent earthquakes in Turkey. Utilizing datasets on earthquake data, infrastructural quality metrics, and contemporary socioeconomic factors, we tested various machine-learning architectures to forecast death tolls and fatalities per affected population. Our findings indicate that the Random Forest model provides the most reliable predictions. The model highlights earthquake magnitude and building stability as the primary determinants of damage. This research contributes to the reduction of fatalities in future seismic events in Turkey.
- [321] arXiv:2411.08909 (cross-list from q-bio.BM) [pdf, html, other]
-
Title: Long-context Protein Language ModelYingheng Wang, Zichen Wang, Gil Sadeh, Luca Zancato, Alessandro Achille, George Karypis, Huzefa RangwalaComments: 32 pages, 17 figures, 11 tablesSubjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design. Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths. Such protein LMs cannot extrapolate to longer proteins and protein complexes well. They also fail to account for the underlying biological mechanisms carried out by biomolecular interactions and dynamics i.e., proteins often interact with other proteins, molecules, and pathways in complex biological systems. In this work, we propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built off selective structured state-space models, to learn high-quality universal protein representations at the amino acid token level using masked language modeling. We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction (PPI) graphs for a second stage of training. LC-PLM demonstrates favorable neural scaling laws, better length extrapolation capability, and a 7% to 34% improvement on protein downstream tasks than Transformer-based ESM-2. LC-PLM-G further trained within the context of PPI graphs shows promising results on protein structure and function prediction tasks. Our study demonstrates the benefit of increasing the context size with computationally efficient LM architecture (e.g. structured state space models) in learning universal protein representations and incorporating molecular interaction context contained in biological graphs.
- [322] arXiv:2411.08911 (cross-list from physics.comp-ph) [pdf, html, other]
-
Title: A Message Passing Neural Network Surrogate Model for Bond-Associated Peridynamic Material Correspondence FormulationComments: arXiv admin note: substantial text overlap with arXiv:2410.00934Subjects: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Machine Learning (stat.ML)
Peridynamics is a non-local continuum mechanics theory that offers unique advantages for modeling problems involving discontinuities and complex deformations. Within the peridynamic framework, various formulations exist, among which the material correspondence formulation stands out for its ability to directly incorporate traditional continuum material models, making it highly applicable to a range of engineering challenges. A notable advancement in this area is the bond-associated correspondence model, which not only resolves issues of material instability but also achieves high computational accuracy. However, the bond-associated model typically requires higher computational costs than FEA, which can limit its practical application. To address this computational challenge, we propose a novel surrogate model based on a message-passing neural network (MPNN) specifically designed for the bond-associated peridynamic material correspondence formulation. Leveraging the similarities between graph structure and the neighborhood connectivity inherent to peridynamics, we construct an MPNN that can transfers domain knowledge from peridynamics into a computational graph and shorten the computation time via GPU acceleration. Unlike conventional graph neural networks that focus on node features, our model emphasizes edge-based features, capturing the essential material point interactions in the formulation. A key advantage of this neural network approach is its flexibility: it does not require fixed neighborhood connectivity, making it adaptable across diverse configurations and scalable for complex systems. Furthermore, the model inherently possesses translational and rotational invariance, enabling it to maintain physical objectivity: a critical requirement for accurate mechanical modeling.
- [323] arXiv:2411.08919 (cross-list from eess.SP) [pdf, html, other]
-
Title: A Machine Learning based Hybrid Receiver for 5G NR PRACHComments: 6 pages, 9 figuresSubjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Random Access is a critical procedure using which a User Equipment (UE) identifies itself to a Base Station (BS). Random Access starts with the UE transmitting a random preamble on the Physical Random Access Channel (PRACH). In a conventional BS receiver, the UE's specific preamble is identified by correlation with all the possible preambles. The PRACH signal is also used to estimate the timing advance which is induced by propagation delay. Correlation-based receivers suffer from false peaks and missed detection in scenarios dominated by high fading and low signal-to-noise ratio. This paper describes the design of a hybrid receiver that consists of an AI/ML model for preamble detection followed by conventional peak detection for the Timing Advance estimation. The proposed receiver combines the Power Delay Profiles of correlation windows across multiple antennas and uses the combination as input to a Neural Network model. The model predicts the presence or absence of a user in a particular preamble window, after which the timing advance is estimated by peak detection. Results show superior performance of the hybrid receiver compared to conventional receivers both for simulated and real hardware-captured datasets.
- [324] arXiv:2411.08926 (cross-list from eess.IV) [pdf, html, other]
-
Title: DG-PPU: Dynamical Graphs based Post-processing of Point Clouds extracted from Knee UltrasoundsComments: This paper was submitted to the IEEE International Symposium on Biomedical Imaging (ISBI). This is a preprint version and may be subject to copyrightSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Patients undergoing total knee arthroplasty (TKA) often experience non-specific anterior knee pain, arising from abnormal patellofemoral joint (PFJ) instability. Tracking PFJ motion is challenging since static imaging modalities like CT and MRI are limited by field of view and metal artefact interference. Ultrasounds offer an alternative modality for dynamic musculoskeletal imaging. We aim to achieve accurate visualisation of patellar tracking and PFJ motion, using 3D registration of point clouds extracted from ultrasound scans across different angles of joint flexion. Ultrasound images containing soft tissue are often mislabeled as bone during segmentation, resulting in noisy 3D point clouds that hinder accurate registration of the bony joint anatomy. Machine learning the intrinsic geometry of the knee bone may help us eliminate these false positives. As the intrinsic geometry of the knee does not change during PFJ motion, one may expect this to be robust across multiple angles of joint flexion. Our dynamical graphs-based post-processing algorithm (DG-PPU) is able to achieve this, creating smoother point clouds that accurately represent bony knee anatomy across different angles. After inverting these point clouds back to their original ultrasound images, we evaluated that DG-PPU outperformed manual data cleaning done by our lab technician, deleting false positives and noise with 98.2% precision across three different angles of joint flexion. DG-PPU is the first algorithm to automate the post-processing of 3D point clouds extracted from ultrasound scans. With DG-PPU, we contribute towards the development of a novel patellar mal-tracking assessment system with ultrasound, which currently does not exist.
- [325] arXiv:2411.08936 (cross-list from eess.IV) [pdf, html, other]
-
Title: Clustered Patch Embeddings for Permutation-Invariant Classification of Whole Slide ImagesComments: arXiv admin note: text overlap with arXiv:2411.08530Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Whole Slide Imaging (WSI) is a cornerstone of digital pathology, offering detailed insights critical for diagnosis and research. Yet, the gigapixel size of WSIs imposes significant computational challenges, limiting their practical utility. Our novel approach addresses these challenges by leveraging various encoders for intelligent data reduction and employing a different classification model to ensure robust, permutation-invariant representations of WSIs. A key innovation of our method is the ability to distill the complex information of an entire WSI into a single vector, effectively capturing the essential features needed for accurate analysis. This approach significantly enhances the computational efficiency of WSI analysis, enabling more accurate pathological assessments without the need for extensive computational resources. This breakthrough equips us with the capability to effectively address the challenges posed by large image resolutions in whole-slide imaging, paving the way for more scalable and effective utilization of WSIs in medical diagnostics and research, marking a significant advancement in the field.
- [326] arXiv:2411.08975 (cross-list from eess.IV) [pdf, html, other]
-
Title: Fluoroformer: Scaling multiple instance learning to multiplexed images via attention-based channel fusionComments: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 14 pagesSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Though multiple instance learning (MIL) has been a foundational strategy in computational pathology for processing whole slide images (WSIs), current approaches are designed for traditional hematoxylin and eosin (H&E) slides rather than emerging multiplexed technologies. Here, we present an MIL strategy, the Fluoroformer module, that is specifically tailored to multiplexed WSIs by leveraging scaled dot-product attention (SDPA) to interpretably fuse information across disparate channels. On a cohort of 434 non-small cell lung cancer (NSCLC) samples, we show that the Fluoroformer both obtains strong prognostic performance and recapitulates immuno-oncological hallmarks of NSCLC. Our technique thereby provides a path for adapting state-of-the-art AI techniques to emerging spatial biology assays.
- [327] arXiv:2411.08987 (cross-list from math.OC) [pdf, html, other]
-
Title: Non-Euclidean High-Order Smooth Convex OptimizationSubjects: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
We develop algorithms for the optimization of convex objectives that have Hölder continuous $q$-th derivatives with respect to a $p$-norm by using a $q$-th order oracle, for $p, q \geq 1$. We can also optimize other structured functions. We do this by developing a non-Euclidean inexact accelerated proximal point method that makes use of an inexact uniformly convex regularizer. We also provide nearly matching lower bounds for any deterministic algorithm that interacts with the function via a local oracle.
- [328] arXiv:2411.08992 (cross-list from eess.IV) [pdf, html, other]
-
Title: IDCIA: Immunocytochemistry Dataset for Cellular Image AnalysisAbdurahman Ali Mohammed, Catherine Fonder, Donald S. Sakaguchi, Wallapak Tavanapong, Surya K. Mallapragada, Azeez IdrisSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
We present a new annotated microscopic cellular image dataset to improve the effectiveness of machine learning methods for cellular image analysis. Cell counting is an important step in cell analysis. Typically, domain experts manually count cells in a microscopic image. Automated cell counting can potentially eliminate this tedious, time-consuming process. However, a good, labeled dataset is required for training an accurate machine learning model. Our dataset includes microscopic images of cells, and for each image, the cell count and the location of individual cells. The data were collected as part of an ongoing study investigating the potential of electrical stimulation to modulate stem cell differentiation and possible applications for neural repair. Compared to existing publicly available datasets, our dataset has more images of cells stained with more variety of antibodies (protein components of immune responses against invaders) typically used for cell analysis. The experimental results on this dataset indicate that none of the five existing models under this study are able to achieve sufficiently accurate count to replace the manual methods. The dataset is available at this https URL.
- [329] arXiv:2411.08993 (cross-list from stat.ML) [pdf, html, other]
-
Title: Parameter Inference via Differentiable Diffusion Bridge Importance SamplingSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We introduce a methodology for performing parameter inference in high-dimensional, non-linear diffusion processes. We illustrate its applicability for obtaining insights into the evolution of and relationships between species, including ancestral state reconstruction. Estimation is performed by utilising score matching to approximate diffusion bridges, which are subsequently used in an importance sampler to estimate log-likelihoods. The entire setup is differentiable, allowing gradient ascent on approximated log-likelihoods. This allows both parameter inference and diffusion mean estimation. This novel, numerically stable, score matching-based parameter inference framework is presented and demonstrated on biological two- and three-dimensional morphometry data.
- [330] arXiv:2411.08998 (cross-list from stat.ML) [pdf, html, other]
-
Title: Microfoundation Inference for Strategic PredictionSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Often in prediction tasks, the predictive model itself can influence the distribution of the target variable, a phenomenon termed performative prediction. Generally, this influence stems from strategic actions taken by stakeholders with a vested interest in predictive models. A key challenge that hinders the widespread adaptation of performative prediction in machine learning is that practitioners are generally unaware of the social impacts of their predictions. To address this gap, we propose a methodology for learning the distribution map that encapsulates the long-term impacts of predictive models on the population. Specifically, we model agents' responses as a cost-adjusted utility maximization problem and propose estimates for said cost. Our approach leverages optimal transport to align pre-model exposure (ex ante) and post-model exposure (ex post) distributions. We provide a rate of convergence for this proposed estimate and assess its quality through empirical demonstrations on a credit-scoring dataset.
- [331] arXiv:2411.09038 (cross-list from quant-ph) [pdf, html, other]
-
Title: An Implementation of the Finite Element Method in Hybrid Classical/Quantum ComputersSubjects: Quantum Physics (quant-ph); Numerical Analysis (math.NA)
This manuscript presents the Quantum Finite Element Method (Q-FEM) developed for use in noisy intermediate-scale quantum (NISQ) computers, and employs the variational quantum linear solver (VQLS) algorithm. The proposed method leverages the classical FEM procedure to perform the unitary decomposition of the stiffness matrix and employs generator functions to design explicit quantum circuits corresponding to the unitaries. Q-FEM keeps the structure of the finite element discretization intact allowing for the use of variable element lengths and material coefficients in FEM discretization. The proposed method is tested on a steady-state heat equation discretized using linear and quadratic shape functions. Numerical verification studies demonstrate that Q-FEM is effective in converging to the correct solution for a variety of problems and model discretizations, including with different element lengths, variable coefficients, and different boundary conditions. The formalism developed herein is general and can be extended to problems with higher dimensions. However, numerical examples also demonstrate that the number of parameters for the variational ansatz scale exponentially with the number of qubits to increase the odds of convergence, and deterioration of system conditioning with problem size results in barren plateaus, and hence convergence difficulties.
- [332] arXiv:2411.09064 (cross-list from stat.ML) [pdf, html, other]
-
Title: Minimax Optimal Two-Sample Testing under Local Differential PrivacyComments: 59 pages, 5 figuresSubjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
We explore the trade-off between privacy and statistical utility in private two-sample testing under local differential privacy (LDP) for both multinomial and continuous data. We begin by addressing the multinomial case, where we introduce private permutation tests using practical privacy mechanisms such as Laplace, discrete Laplace, and Google's RAPPOR. We then extend our multinomial approach to continuous data via binning and study its uniform separation rates under LDP over Hölder and Besov smoothness classes. The proposed tests for both discrete and continuous cases rigorously control the type I error for any finite sample size, strictly adhere to LDP constraints, and achieve minimax separation rates under LDP. The attained minimax rates reveal inherent privacy-utility trade-offs that are unavoidable in private testing. To address scenarios with unknown smoothness parameters in density testing, we propose an adaptive test based on a Bonferroni-type approach that ensures robust performance without prior knowledge of the smoothness parameters. We validate our theoretical findings with extensive numerical experiments and demonstrate the practical relevance and effectiveness of our proposed methods.
- [333] arXiv:2411.09075 (cross-list from math.PR) [pdf, other]
-
Title: Weak Poincar\'e Inequalities, Simulated Annealing, and Sampling from Spherical Spin GlassesComments: 101 pagesSubjects: Probability (math.PR); Disordered Systems and Neural Networks (cond-mat.dis-nn); Data Structures and Algorithms (cs.DS); Mathematical Physics (math-ph)
There has been a recent surge of powerful tools to show rapid mixing of Markov chains, via functional inequalities such as Poincaré inequalities. In many situations, Markov chains fail to mix rapidly from a worst-case initialization, yet are expected to approximately sample from a random initialization. For example, this occurs if the target distribution has metastable states, small clusters accounting for a vanishing fraction of the mass that are essentially disconnected from the bulk of the measure. Under such conditions, a Poincaré inequality cannot hold, necessitating new tools to prove sampling guarantees. We develop a framework to analyze simulated annealing, based on establishing so-called weak Poincaré inequalities. These inequalities imply mixing from a suitably warm start, and simulated annealing provides a way to chain such warm starts together into a sampling algorithm. We further identify a local-to-global principle to prove weak Poincaré inequalities, mirroring the spectral independence and localization schemes frameworks for analyzing mixing times of Markov chains.
As our main application, we prove that simulated annealing samples from the Gibbs measure of a spherical spin glass for inverse temperatures up to a natural threshold, matching recent algorithms based on algorithmic stochastic localization. This provides the first Markov chain sampling guarantee that holds beyond the uniqueness threshold for spherical spin glasses, where mixing from a worst-case initialization is provably slow. As an ingredient in our proof, we prove bounds on the operator norm of the covariance matrix of spherical spin glasses in the full replica-symmetric regime.
Additionally, we resolve questions related to the mixing of Glauber dynamics in the ferromagnetic Potts model from a uniform monochromatic coloring, and sampling using data-based initializations. - [334] arXiv:2411.09118 (cross-list from math.OC) [pdf, html, other]
-
Title: FxTS-Net: Fixed-Time Stable Learning Framework for Neural ODEsSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Neural Ordinary Differential Equations (Neural ODEs), as a novel category of modeling big data methods, cleverly link traditional neural networks and dynamical systems. However, it is challenging to ensure the dynamics system reaches a correctly predicted state within a user-defined fixed time. To address this problem, we propose a new method for training Neural ODEs using fixed-time stability (FxTS) Lyapunov conditions. Our framework, called FxTS-Net, is based on the novel FxTS loss (FxTS-Loss) designed on Lyapunov functions, which aims to encourage convergence to accurate predictions in a user-defined fixed time. We also provide an innovative approach for constructing Lyapunov functions to meet various tasks and network architecture requirements, achieved by leveraging supervised information during training. By developing a more precise time upper bound estimation for bounded non-vanishingly perturbed systems, we demonstrate that minimizing FxTS-Loss not only guarantees FxTS behavior of the dynamics but also input perturbation robustness. For optimising FxTS-Loss, we also propose a learning algorithm, in which the simulated perturbation sampling method can capture sample points in critical regions to approximate FxTS-Loss. Experimentally, we find that FxTS-Net provides better prediction performance and better robustness under input perturbation.
- [335] arXiv:2411.09133 (cross-list from physics.optics) [pdf, html, other]
-
Title: Computational metaoptics for imagingSubjects: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
Metasurfaces -- ultrathin structures composed of subwavelength optical elements -- have revolutionized light manipulation by enabling precise control over electromagnetic waves' amplitude, phase, polarization, and spectral properties. Concurrently, computational imaging leverages algorithms to reconstruct images from optically processed signals, overcoming limitations of traditional imaging systems. This review explores the synergistic integration of metaoptics and computational imaging, "computational metaoptics," which combines the physical wavefront shaping ability of metasurfaces with advanced computational algorithms to enhance imaging performance beyond conventional limits. We discuss how computational metaoptics addresses the inherent limitations of single-layer metasurfaces in achieving multifunctionality without compromising efficiency. By treating metasurfaces as physical preconditioners and co-designing them with reconstruction algorithms through end-to-end (inverse) design, it is possible to jointly optimize the optical hardware and computational software. This holistic approach allows for the automatic discovery of optimal metasurface designs and reconstruction methods that significantly improve imaging capabilities. Advanced applications enabled by computational metaoptics are highlighted, including phase imaging and quantum state measurement, which benefit from the metasurfaces' ability to manipulate complex light fields and the computational algorithms' capacity to reconstruct high-dimensional information. We also examine performance evaluation challenges, emphasizing the need for new metrics that account for the combined optical and computational nature of these systems. Finally, we identify new frontiers in computational metaoptics which point toward a future where computational metaoptics may play a central role in advancing imaging science and technology.
- [336] arXiv:2411.09137 (cross-list from eess.IV) [pdf, html, other]
-
Title: Fast probabilistic snake algorithmJournal-ref: International Conference on Image Processing (ICIP), Vol.2, 405-408, Barcelona, Spain, Sept 2003Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Few people use the probability theory in order to achieve image segmentation with snake models. In this article, we are presenting an active contour algorithm based on a probability approach inspired by A. Blake work and P. R{é}fr{é}gier's team research in France. Our algorithm, both very fast and highly accurate as far as contour description is concerned, is easily adaptable to any specific application.
- [337] arXiv:2411.09173 (cross-list from quant-ph) [pdf, html, other]
-
Title: Correction of circuit faults in a stacked quantum memory using rank-metric codesComments: 6 pages, 1 figureSubjects: Quantum Physics (quant-ph); Information Theory (cs.IT)
We introduce a model for a stacked quantum memory made with multi-qubit cells, inspired by multi-level flash cells in classical solid-state drive, and we design quantum error correction codes for this model by generalizing rank-metric codes to the quantum setting. Rank-metric codes are used to correct faulty links in classical communication networks. We propose a quantum generalization of Gabidulin codes, which is one of the most popular family of rank-metric codes, and we design a protocol to correct faults in Clifford circuits applied to a stacked quantum memory based on these codes. We envision potential applications to the optimization of stabilizer states and magic states factories, and to variational quantum algorithms. Further work is needed to make this protocol practical. It requires a hardware platform capable of hosting multi-qubit cells with low crosstalk between cells, a fault-tolerant syndrome extraction circuit for rank-metric codes and an associated efficient decoder.
- [338] arXiv:2411.09175 (cross-list from stat.ML) [pdf, html, other]
-
Title: Hybrid deep additive neural networksComments: 29 pages, 13 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Traditional neural networks (multi-layer perceptrons) have become an important tool in data science due to their success across a wide range of tasks. However, their performance is sometimes unsatisfactory, and they often require a large number of parameters, primarily due to their reliance on the linear combination structure. Meanwhile, additive regression has been a popular alternative to linear regression in statistics. In this work, we introduce novel deep neural networks that incorporate the idea of additive regression. Our neural networks share architectural similarities with Kolmogorov-Arnold networks but are based on simpler yet flexible activation and basis functions. Additionally, we introduce several hybrid neural networks that combine this architecture with that of traditional neural networks. We derive their universal approximation properties and demonstrate their effectiveness through simulation studies and a real-data application. The numerical results indicate that our neural networks generally achieve better performance than traditional neural networks while using fewer parameters.
- [339] arXiv:2411.09191 (cross-list from econ.TH) [pdf, html, other]
-
Title: Informational PutsSubjects: Theoretical Economics (econ.TH); Multiagent Systems (cs.MA)
We fully characterize how dynamic information should be provided to uniquely implement the largest equilibrium in dynamic binary-action supermodular games. The designer offers an informational put: she stays silent in good times, but injects asymmetric and inconclusive public information if players lose faith. There is (i) no multiplicity gap: the largest (partially) implementable equilibrium can be implemented uniquely; and (ii) no intertemporal commitment gap: the policy is sequentially optimal. Our results have sharp implications for the design of policy in coordination environments.
- [340] arXiv:2411.09204 (cross-list from eess.IV) [pdf, html, other]
-
Title: RibCageImp: A Deep Learning Framework for 3D Ribcage Implant GenerationSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
The recovery of damaged or resected ribcage structures requires precise, custom-designed implants to restore the integrity and functionality of the thoracic cavity. Traditional implant design methods rely mainly on manual processes, making them time-consuming and susceptible to variability. In this work, we explore the feasibility of automated ribcage implant generation using deep learning. We present a framework based on 3D U-Net architecture that processes CT scans to generate patient-specific implant designs. To the best of our knowledge, this is the first investigation into automated thoracic implant generation using deep learning approaches. Our preliminary results, while moderate, highlight both the potential and the significant challenges in this complex domain. These findings establish a foundation for future research in automated ribcage reconstruction and identify key technical challenges that need to be addressed for practical implementation.
- [341] arXiv:2411.09210 (cross-list from quant-ph) [pdf, html, other]
-
Title: Classical Verification of Quantum Learning Advantages with NoisesComments: 13 pages 1 figureSubjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Classical verification of quantum learning allows classical clients to reliably leverage quantum computing advantages by interacting with untrusted quantum servers. Yet, current quantum devices available in practice suffers from a variety of noises and whether existed classical verification protocols carry over to noisy scenarios remains unclear. Here, we propose an efficient classical error rectification algorithm to reconstruct the noise-free results given by the quantum Fourier sampling circuit with practical constant-level noises. In particular, we prove that the error rectification algorithm can restore the heavy Fourier coefficients by using a small number of noisy samples that scales logarithmically with the problem size. We apply this algorithm to the agnostic parity learning task with uniform input marginal and prove that this task can be accomplished in an efficient way on noisy quantum devices with our algorithm. In addition, we prove that a classical client with access to the random example oracle can verify the agnostic parity learning results from the noisy quantum prover in an efficient way, under the condition that the Fourier coefficients are sparse. Our results demonstrate the feasibility of classical verification of quantum learning advantages with noises, which provide a valuable guide for both theoretical studies and practical applications with current noisy intermediate scale quantum devices.
- [342] arXiv:2411.09220 (cross-list from eess.AS) [pdf, html, other]
-
Title: Transferable Adversarial Attacks against ASRComments: IEEE SPLSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Given the extensive research and real-world applications of automatic speech recognition (ASR), ensuring the robustness of ASR models against minor input perturbations becomes a crucial consideration for maintaining their effectiveness in real-time scenarios. Previous explorations into ASR model robustness have predominantly revolved around evaluating accuracy on white-box settings with full access to ASR models. Nevertheless, full ASR model details are often not available in real-world applications. Therefore, evaluating the robustness of black-box ASR models is essential for a comprehensive understanding of ASR model resilience. In this regard, we thoroughly study the vulnerability of practical black-box attacks in cutting-edge ASR models and propose to employ two advanced time-domain-based transferable attacks alongside our differentiable feature extractor. We also propose a speech-aware gradient optimization approach (SAGO) for ASR, which forces mistranscription with minimal impact on human imperceptibility through voice activity detection rule and a speech-aware gradient-oriented optimizer. Our comprehensive experimental results reveal performance enhancements compared to baseline approaches across five models on two databases.
- [343] arXiv:2411.09283 (cross-list from eess.IV) [pdf, html, other]
-
Title: Leveraging Auxiliary Classification for Rib Fracture SegmentationComments: Accepted at ICVGIP'24Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Thoracic trauma often results in rib fractures, which demand swift and accurate diagnosis for effective treatment. However, detecting these fractures on rib CT scans poses considerable challenges, involving the analysis of many image slices in sequence. Despite notable advancements in algorithms for automated fracture segmentation, the persisting challenges stem from the diverse shapes and sizes of these fractures. To address these issues, this study introduces a sophisticated deep-learning model with an auxiliary classification task designed to enhance the accuracy of rib fracture segmentation. The auxiliary classification task is crucial in distinguishing between fractured ribs and negative regions, encompassing non-fractured ribs and surrounding tissues, from the patches obtained from CT scans. By leveraging this auxiliary task, the model aims to improve feature representation at the bottleneck layer by highlighting the regions of interest. Experimental results on the RibFrac dataset demonstrate significant improvement in segmentation performance.
- [344] arXiv:2411.09296 (cross-list from hep-ph) [pdf, html, other]
-
Title: Enhancing generalization in high energy physics using white-box adversarial attacksComments: 10 pages, 4 figures, 8 tables, 3 algorithms, to be published in Physical Review D (PRD), presented at the ML4Jets 2024 conferenceSubjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
Machine learning is becoming increasingly popular in the context of particle physics. Supervised learning, which uses labeled Monte Carlo (MC) simulations, remains one of the most widely used methods for discriminating signals beyond the Standard Model. However, this paper suggests that supervised models may depend excessively on artifacts and approximations from Monte Carlo simulations, potentially limiting their ability to generalize well to real data. This study aims to enhance the generalization properties of supervised models by reducing the sharpness of local minima. It reviews the application of four distinct white-box adversarial attacks in the context of classifying Higgs boson decay signals. The attacks are divided into weight space attacks, and feature space attacks. To study and quantify the sharpness of different local minima this paper presents two analysis methods: gradient ascent and reduced Hessian eigenvalue analysis. The results show that white-box adversarial attacks significantly improve generalization performance, albeit with increased computational complexity.
- [345] arXiv:2411.09308 (cross-list from eess.IV) [pdf, html, other]
-
Title: DT-JRD: Deep Transformer based Just Recognizable Difference Prediction Model for Video Coding for MachinesComments: Submitted to IEEE Transactions on MultimediaSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Just Recognizable Difference (JRD) represents the minimum visual difference that is detectable by machine vision, which can be exploited to promote machine vision oriented visual signal processing. In this paper, we propose a Deep Transformer based JRD (DT-JRD) prediction model for Video Coding for Machines (VCM), where the accurately predicted JRD can be used reduce the coding bit rate while maintaining the accuracy of machine tasks. Firstly, we model the JRD prediction as a multi-class classification and propose a DT-JRD prediction model that integrates an improved embedding, a content and distortion feature extraction, a multi-class classification and a novel learning strategy. Secondly, inspired by the perception property that machine vision exhibits a similar response to distortions near JRD, we propose an asymptotic JRD loss by using Gaussian Distribution-based Soft Labels (GDSL), which significantly extends the number of training labels and relaxes classification boundaries. Finally, we propose a DT-JRD based VCM to reduce the coding bits while maintaining the accuracy of object detection. Extensive experimental results demonstrate that the mean absolute error of the predicted JRD by the DT-JRD is 5.574, outperforming the state-of-the-art JRD prediction model by 13.1%. Coding experiments shows that comparing with the VVC, the DT-JRD based VCM achieves an average of 29.58% bit rate reduction while maintaining the object detection accuracy.
- [346] arXiv:2411.09320 (cross-list from quant-ph) [pdf, html, other]
-
Title: AMARETTO: Enabling Efficient Quantum Algorithm Emulation on Low-Tier FPGAsComments: paper accepted at the IEEE International Conference on Electronics Circuits and Systems 2024 conference, 4 pages, 6 figuresSubjects: Quantum Physics (quant-ph); Systems and Control (eess.SY)
Researchers and industries are increasingly drawn to quantum computing for its computational potential. However, validating new quantum algorithms is challenging due to the limitations of current quantum devices. Software simulators are time and memory-consuming, making hardware emulators an attractive alternative. This article introduces AMARETTO (quAntuM ARchitecture EmulaTion TechnOlogy), designed for quantum computing emulation on low-tier Field-Programmable gate arrays (FPGAs), supporting Clifford+T and rotational gate sets. It simplifies and accelerates the verification of quantum algorithms using a Reduced-Instruction-Set-Computer (RISC)-like structure and efficient handling of sparse quantum gates. A dedicated compiler translates OpenQASM 2.0 into RISC-like instructions. AMARETTO is validated against the Qiskit simulators. Our results show successful emulation of sixteen qubits on a AMD Kria KV260 SoM. This approach rivals other works in emulated qubit capacity on a smaller, more affordable FPGA
- [347] arXiv:2411.09373 (cross-list from eess.IV) [pdf, html, other]
-
Title: Are nuclear masks all you need for improved out-of-domain generalisation? A closer look at cancer classification in histopathologyComments: Poster at NeurIPS 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Domain generalisation in computational histopathology is challenging because the images are substantially affected by differences among hospitals due to factors like fixation and staining of tissue and imaging equipment. We hypothesise that focusing on nuclei can improve the out-of-domain (OOD) generalisation in cancer detection. We propose a simple approach to improve OOD generalisation for cancer detection by focusing on nuclear morphology and organisation, as these are domain-invariant features critical in cancer detection. Our approach integrates original images with nuclear segmentation masks during training, encouraging the model to prioritise nuclei and their spatial arrangement. Going beyond mere data augmentation, we introduce a regularisation technique that aligns the representations of masks and original images. We show, using multiple datasets, that our method improves OOD generalisation and also leads to increased robustness to image corruptions and adversarial attacks. The source code is available at this https URL
- [348] arXiv:2411.09402 (cross-list from eess.IV) [pdf, html, other]
-
Title: Automated Segmentation of Ischemic Stroke Lesions in Non-Contrast Computed Tomography Images for Enhanced Treatment and PrognosisComments: 7 pages, 3 figures, MICCAI Meets Africa WorkshopSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Stroke is the second leading cause of death worldwide, and is increasingly prevalent in low- and middle-income countries (LMICs). Timely interventions can significantly influence stroke survivability and the quality of life after treatment. However, the standard and most widely available imaging method for confirming strokes and their sub-types, the NCCT, is more challenging and time-consuming to employ in cases of ischemic stroke. For this reason, we developed an automated method for ischemic stroke lesion segmentation in NCCTs using the nnU-Net frame work, aimed at enhancing early treatment and improving the prognosis of ischemic stroke patients. We achieved Dice scores of 0.596 and Intersection over Union (IoU) scores of 0.501 on the sampled dataset. After adjusting for outliers, these scores improved to 0.752 for the Dice score and 0.643 for the IoU. Proper delineation of the region of infarction can help clinicians better assess the potential impact of the infarction, and guide treatment procedures.
- [349] arXiv:2411.09403 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum Machine Learning: An Interplay Between Quantum Computing and Machine LearningComments: In submissionSubjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Quantum machine learning (QML) is a rapidly growing field that combines quantum computing principles with traditional machine learning. It seeks to revolutionize machine learning by harnessing the unique capabilities of quantum mechanics and employs machine learning techniques to advance quantum computing research. This paper introduces quantum computing for the machine learning paradigm, where variational quantum circuits (VQC) are used to develop QML architectures on noisy intermediate-scale quantum (NISQ) devices. We discuss machine learning for the quantum computing paradigm, showcasing our recent theoretical and empirical findings. In particular, we delve into future directions for studying QML, exploring the potential industrial impacts of QML research.
- [350] arXiv:2411.09429 (cross-list from cond-mat.mtrl-sci) [pdf, html, other]
-
Title: AI-driven inverse design of materials: Past, present and futureXiao-Qi Han, Xin-De Wang, Meng-Yuan Xu, Zhen Feng, Bo-Wen Yao, Peng-Jie Guo, Ze-Feng Gao, Zhong-Yi LuComments: 43 pages, 5 figures, 2 tablesSubjects: Materials Science (cond-mat.mtrl-sci); Superconductivity (cond-mat.supr-con); Artificial Intelligence (cs.AI)
The discovery of advanced materials is the cornerstone of human technological development and progress. The structures of materials and their corresponding properties are essentially the result of a complex interplay of multiple degrees of freedom such as lattice, charge, spin, symmetry, and topology. This poses significant challenges for the inverse design methods of materials. Humans have long explored new materials through a large number of experiments and proposed corresponding theoretical systems to predict new material properties and structures. With the improvement of computational power, researchers have gradually developed various electronic structure calculation methods, particularly such as the one based density functional theory, as well as high-throughput computational methods. Recently, the rapid development of artificial intelligence technology in the field of computer science has enabled the effective characterization of the implicit association between material properties and structures, thus opening up an efficient paradigm for the inverse design of functional materials. A significant progress has been made in inverse design of materials based on generative and discriminative models, attracting widespread attention from researchers. Considering this rapid technological progress, in this survey, we look back on the latest advancements in AI-driven inverse design of materials by introducing the background, key findings, and mainstream technological development routes. In addition, we summarize the remaining issues for future directions. This survey provides the latest overview of AI-driven inverse design of materials, which can serve as a useful resource for researchers.
- [351] arXiv:2411.09469 (cross-list from eess.IV) [pdf, html, other]
-
Title: An Explainable Attention Model for Cervical Precancer Risk Classification using Colposcopic ImagesComments: 19 pages, 9 figure, and 7 tablesSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Cervical cancer remains a major worldwide health issue, with early identification and risk assessment playing critical roles in effective preventive interventions. This paper presents the Cervix-AID-Net model for cervical precancer risk classification. The study designs and evaluates the proposed Cervix-AID-Net model based on patients colposcopy images. The model comprises a Convolutional Block Attention Module (CBAM) and convolutional layers that extract interpretable and representative features of colposcopic images to distinguish high-risk and low-risk cervical precancer. In addition, the proposed Cervix-AID-Net model integrates four explainable techniques, namely gradient class activation maps, Local Interpretable Model-agnostic Explanations, CartoonX, and pixel rate distortion explanation based on output feature maps and input features. The evaluation using holdout and ten-fold cross-validation techniques yielded a classification accuracy of 99.33\% and 99.81\%. The analysis revealed that CartoonX provides meticulous explanations for the decision of the Cervix-AID-Net model due to its ability to provide the relevant piece-wise smooth part of the image. The effect of Gaussian noise and blur on the input shows that the performance remains unchanged up to Gaussian noise of 3\% and blur of 10\%, while the performance reduces thereafter. A comparison study of the proposed model's performance compared to other deep learning approaches highlights the Cervix-AID-Net model's potential as a supplemental tool for increasing the effectiveness of cervical precancer risk assessment. The proposed method, which incorporates the CBAM and explainable artificial integration, has the potential to influence cervical cancer prevention and early detection, improving patient outcomes and lowering the worldwide burden of this preventable disease.
- [352] arXiv:2411.09476 (cross-list from physics.flu-dyn) [pdf, html, other]
-
Title: Graph Neural Networks and Differential Equations: A hybrid approach for data assimilation of fluid flowsSubjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
This study presents a novel hybrid approach that combines Graph Neural Networks (GNNs) with Reynolds-Averaged Navier Stokes (RANS) equations to enhance the accuracy of mean flow reconstruction across a range of fluid dynamics applications. Traditional purely data-driven Neural Networks (NNs) models, often struggle maintaining physical consistency. Moreover, they typically require large datasets to achieve reliable performances. The GNN framework, which naturally handles unstructured data such as complex geometries in Computational Fluid Dynamics (CFD), is here integrated with RANS equations as a physical baseline model. The methodology leverages the adjoint method, enabling the use of RANS-derived gradients as optimization terms in the GNN training process. This ensures that the learned model adheres to the governing physics, maintaining physical consistency while improving the prediction accuracy. We test our approach on multiple CFD scenarios, including cases involving generalization with respect to the Reynolds number, sparse measurements, denoising and inpainting of missing portions of the mean flow. The results demonstrate significant improvements in the accuracy of the reconstructed mean flow compared to purely data-driven models, using limited amounts of data in the training dataset. The key strengths of this study are the integration of physical laws into the training process of the GNN, and the ability to achieve high-accuracy predictions with a limited amount of data, making this approach particularly valuable for applications in fluid dynamics where data is often scarce.
- [353] arXiv:2411.09483 (cross-list from stat.ML) [pdf, html, other]
-
Title: Sparse Bayesian Generative Modeling for Compressive SensingSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
This work addresses the fundamental linear inverse problem in compressive sensing (CS) by introducing a new type of regularizing generative prior. Our proposed method utilizes ideas from classical dictionary-based CS and, in particular, sparse Bayesian learning (SBL), to integrate a strong regularization towards sparse solutions. At the same time, by leveraging the notion of conditional Gaussianity, it also incorporates the adaptability from generative models to training data. However, unlike most state-of-the-art generative models, it is able to learn from a few compressed and noisy data samples and requires no optimization algorithm for solving the inverse problem. Additionally, similar to Dirichlet prior networks, our model parameterizes a conjugate prior enabling its application for uncertainty quantification. We support our approach theoretically through the concept of variational inference and validate it empirically using different types of compressible signals.
- [354] arXiv:2411.09511 (cross-list from math.AP) [pdf, html, other]
-
Title: Structure-informed operator learning for parabolic Partial Differential EquationsComments: 19 pagesSubjects: Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Probability (math.PR)
In this paper, we present a framework for learning the solution map of a backward parabolic Cauchy problem. The solution depends continuously but nonlinearly on the final data, source, and force terms, all residing in Banach spaces of functions. We utilize Fréchet space neural networks (Benth et al. (2023)) to address this operator learning problem. Our approach provides an alternative to Deep Operator Networks (DeepONets), using basis functions to span the relevant function spaces rather than relying on finite-dimensional approximations through censoring. With this method, structural information encoded in the basis coefficients is leveraged in the learning process. This results in a neural network designed to learn the mapping between infinite-dimensional function spaces. Our numerical proof-of-concept demonstrates the effectiveness of our method, highlighting some advantages over DeepONets.
- [355] arXiv:2411.09512 (cross-list from eess.IV) [pdf, other]
-
Title: GAN-Based Architecture for Low-dose Computed Tomography Imaging DenoisingSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Generative Adversarial Networks (GANs) have surfaced as a revolutionary element within the domain of low-dose computed tomography (LDCT) imaging, providing an advanced resolution to the enduring issue of reconciling radiation exposure with image quality. This comprehensive review synthesizes the rapid advancements in GAN-based LDCT denoising techniques, examining the evolution from foundational architectures to state-of-the-art models incorporating advanced features such as anatomical priors, perceptual loss functions, and innovative regularization strategies. We critically analyze various GAN architectures, including conditional GANs (cGANs), CycleGANs, and Super-Resolution GANs (SRGANs), elucidating their unique strengths and limitations in the context of LDCT denoising. The evaluation provides both qualitative and quantitative results related to the improvements in performance in benchmark and clinical datasets with metrics such as PSNR, SSIM, and LPIPS. After highlighting the positive results, we discuss some of the challenges preventing a wider clinical use, including the interpretability of the images generated by GANs, synthetic artifacts, and the need for clinically relevant metrics. The review concludes by highlighting the essential significance of GAN-based methodologies in the progression of precision medicine via tailored LDCT denoising models, underlining the transformative possibilities presented by artificial intelligence within contemporary radiological practice.
- [356] arXiv:2411.09530 (cross-list from math-ph) [pdf, html, other]
-
Title: Discrete Dirac structures and discrete Lagrange--Dirac dynamical systems in mechanicsSubjects: Mathematical Physics (math-ph); Differential Geometry (math.DG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
In this paper, we propose the concept of $(\pm)$-discrete Dirac structures over a manifold, where we define $(\pm)$-discrete two-forms on the manifold and incorporate discrete constraints using $(\pm)$-finite difference maps. Specifically, we develop $(\pm)$-discrete induced Dirac structures as discrete analogues of the induced Dirac structure on the cotangent bundle over a configuration manifold, as described by Yoshimura and Marsden (2006). We demonstrate that $(\pm)$-discrete Lagrange--Dirac systems can be naturally formulated in conjunction with the $(\pm)$-induced Dirac structure on the cotangent bundle. Furthermore, we show that the resulting equations of motion are equivalent to the $(\pm)$-discrete Lagrange--d'Alembert equations proposed in Cortés and Martínez (2001) and McLachlan and Perlmutter (2006). We also clarify the variational structures of the discrete Lagrange--Dirac dynamical systems within the framework of the $(\pm)$-discrete Lagrange--d'Alembert--Pontryagin principle. Finally, we validate the proposed discrete Lagrange--Dirac systems with some illustrative examples of nonholonomic systems through numerical tests.
- [357] arXiv:2411.09545 (cross-list from physics.flu-dyn) [pdf, html, other]
-
Title: Equation-informed data-driven identification of flow budgets and dynamicsNataliya Sevryugina, Serena Costanzo, Steve de Bruyn Kops, Colm-cille Caulfield, Iraj Mortazavi, Taraneh SayadiSubjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Computational Fluid Dynamics (CFD) is an indispensable method of fluid modelling in engineering applications, reducing the need for physical prototypes and testing for tasks such as design optimisation and performance analysis. Depending on the complexity of the system under consideration, models ranging from low to high fidelity can be used for prediction, allowing significant speed-up. However, the choice of model requires information about the actual dynamics of the flow regime. Correctly identifying the regions/clusters of flow that share the same dynamics has been a challenging research topic to date. In this study, we propose a novel hybrid approach to flow clustering. It consists of characterising each sample point of the system with equation-based features, i.e. features are budgets that represent the contribution of each term from the original governing equation to the local dynamics at each sample point. This was achieved by applying the Sparse Identification of Nonlinear Dynamical systems (SINDy) method pointwise to time evolution data. The method proceeds with equation-based clustering using the Girvan-Newman algorithm. This allows the detection of communities that share the same physical dynamics. The algorithm is implemented in both Eulerian and Lagrangian frameworks. In the Lagrangian, i.e. dynamic approach, the clustering is performed on the trajectory of each point, allowing the change of clusters to be represented also in time. The performance of the algorithm is first tested on a flow around a cylinder. The construction of the dynamic clusters in this test case clearly shows the evolution of the wake from the steady state solution through the transient to the oscillatory solution. Dynamic clustering was then successfully tested on turbulent flow data. Two distinct and well-defined clusters were identified and their temporal evolution was reconstructed.
- [358] arXiv:2411.09549 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum computing inspired paintings: reinterpreting classical masterpiecesComments: 10 pages, 8 figuresSubjects: Quantum Physics (quant-ph); Computers and Society (cs.CY)
We aim to apply a quantum computing technique to compose artworks. The main idea is to revisit three paintings of different styles and historical periods: ''Narciso'', painted circa 1597-1599 by Michelangelo Merisi (Caravaggio), ''Les fils de l'homme'', painted in 1964 by Rene Magritte and ''192 Farben'', painted in 1966 by Gerard Richter. We utilize the output of a quantum computation to change the composition in the paintings, leading to a paintings series titled ''Quantum Transformation I, II, III''. In particular, the figures are discretized into square lattices and the order of the pieces is changed according to the result of the quantum simulation. We consider an Ising Hamiltonian as the observable in the quantum computation and its time evolution as the final outcome. From a classical subject to abstract forms, we seek to combine classical and quantum aesthetics through these three art pieces. Besides experimenting with hardware runs and circuit noise, our goal is to reproduce these works as physical oil paintings on wooden panels. With this process, we complete a full circle between classical and quantum techniques and contribute to rethinking Art practice in the era of quantum computing technologies.
- [359] arXiv:2411.09593 (cross-list from eess.IV) [pdf, html, other]
-
Title: SMILE-UHURA Challenge -- Small Vessel Segmentation at Mesoscopic Scale from Ultra-High Resolution 7T Magnetic Resonance AngiogramsSoumick Chatterjee, Hendrik Mattern, Marc Dörner, Alessandro Sciarra, Florian Dubost, Hannes Schnurre, Rupali Khatun, Chun-Chih Yu, Tsung-Lin Hsieh, Yi-Shan Tsai, Yi-Zeng Fang, Yung-Ching Yang, Juinn-Dar Huang, Marshall Xu, Siyu Liu, Fernanda L. Ribeiro, Saskia Bollmann, Karthikesh Varma Chintalapati, Chethan Mysuru Radhakrishna, Sri Chandana Hudukula Ram Kumara, Raviteja Sutrave, Abdul Qayyum, Moona Mazher, Imran Razzak, Cristobal Rodero, Steven Niederren, Fengming Lin, Yan Xia, Jiacheng Wang, Riyu Qiu, Liansheng Wang, Arya Yazdan Panah, Rosana El Jurdi, Guanghui Fu, Janan Arslan, Ghislain Vaillant, Romain Valabregue, Didier Dormont, Bruno Stankoff, Olivier Colliot, Luisa Vargas, Isai Daniel Chacón, Ioannis Pitsiorlas, Pablo Arbeláez, Maria A. Zuluaga, Stefanie Schreiber, Oliver Speck, Andreas NürnbergerSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
The human brain receives nutrients and oxygen through an intricate network of blood vessels. Pathology affecting small vessels, at the mesoscopic scale, represents a critical vulnerability within the cerebral blood supply and can lead to severe conditions, such as Cerebral Small Vessel Diseases. The advent of 7 Tesla MRI systems has enabled the acquisition of higher spatial resolution images, making it possible to visualise such vessels in the brain. However, the lack of publicly available annotated datasets has impeded the development of robust, machine learning-driven segmentation algorithms. To address this, the SMILE-UHURA challenge was organised. This challenge, held in conjunction with the ISBI 2023, in Cartagena de Indias, Colombia, aimed to provide a platform for researchers working on related topics. The SMILE-UHURA challenge addresses the gap in publicly available annotated datasets by providing an annotated dataset of Time-of-Flight angiography acquired with 7T MRI. This dataset was created through a combination of automated pre-segmentation and extensive manual refinement. In this manuscript, sixteen submitted methods and two baseline methods are compared both quantitatively and qualitatively on two different datasets: held-out test MRAs from the same dataset as the training data (with labels kept secret) and a separate 7T ToF MRA dataset where both input volumes and labels are kept secret. The results demonstrate that most of the submitted deep learning methods, trained on the provided training dataset, achieved reliable segmentation performance. Dice scores reached up to 0.838 $\pm$ 0.066 and 0.716 $\pm$ 0.125 on the respective datasets, with an average performance of up to 0.804 $\pm$ 0.15.
- [360] arXiv:2411.09598 (cross-list from eess.IV) [pdf, html, other]
-
Title: Assessing the Performance of the DINOv2 Self-supervised Learning Vision Transformer Model for the Segmentation of the Left Atrium from MRI ImagesComments: 6 pages, 3 figures, SPIE Medical Imaging, 2025Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Accurate left atrium (LA) segmentation from pre-operative scans is crucial for diagnosing atrial fibrillation, treatment planning, and supporting surgical interventions. While deep learning models are key in medical image segmentation, they often require extensive manually annotated data. Foundation models trained on larger datasets have reduced this dependency, enhancing generalizability and robustness through transfer learning. We explore DINOv2, a self-supervised learning vision transformer trained on natural images, for LA segmentation using MRI. The challenges for LA's complex anatomy, thin boundaries, and limited annotated data make accurate segmentation difficult before & during the image-guided intervention. We demonstrate DINOv2's ability to provide accurate & consistent segmentation, achieving a mean Dice score of .871 & a Jaccard Index of .792 for end-to-end fine-tuning. Through few-shot learning across various data sizes & patient counts, DINOv2 consistently outperforms baseline models. These results suggest that DINOv2 effectively adapts to MRI with limited data, highlighting its potential as a competitive tool for segmentation & encouraging broader use in medical imaging.
- [361] arXiv:2411.09618 (cross-list from physics.med-ph) [pdf, html, other]
-
Title: MICCAI-CDMRI 2023 QuantConn Challenge Findings on Achieving Robust Quantitative Connectivity through Harmonized Preprocessing of Diffusion MRINancy R. Newlin, Kurt Schilling, Serge Koudoro, Bramsh Qamar Chandio, Praitayini Kanakaraj, Daniel Moyer, Claire E. Kelly, Sila Genc, Jian Chen, Joseph Yuan-Mou Yang, Ye Wu, Yifei He, Jiawei Zhang, Qingrun Zeng, Fan Zhang, Nagesh Adluru, Vishwesh Nath, Sudhir Pathak, Walter Schneider, Anurag Gade, Yogesh Rathi, Tom Hendriks, Anna Vilanova, Maxime Chamberland, Tomasz Pieciak, Dominika Ciupek, Antonio Tristán Vega, Santiago Aja-Fernández, Maciej Malawski, Gani Ouedraogo, Julia Machnio, Christian Ewert, Paul M. Thompson, Neda Jahanshad, Eleftherios Garyfallidis, Bennett A. LandmanComments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URLJournal-ref: Machine.Learning.for.Biomedical.Imaging. 2 (2024)Subjects: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
White matter alterations are increasingly implicated in neurological diseases and their progression. International-scale studies use diffusion-weighted magnetic resonance imaging (DW-MRI) to qualitatively identify changes in white matter microstructure and connectivity. Yet, quantitative analysis of DW-MRI data is hindered by inconsistencies stemming from varying acquisition protocols. There is a pressing need to harmonize the preprocessing of DW-MRI datasets to ensure the derivation of robust quantitative diffusion metrics across acquisitions. In the MICCAI-CDMRI 2023 QuantConn challenge, participants were provided raw data from the same individuals collected on the same scanner but with two different acquisitions and tasked with preprocessing the DW-MRI to minimize acquisition differences while retaining biological variation. Submissions are evaluated on the reproducibility and comparability of cross-acquisition bundle-wise microstructure measures, bundle shape features, and connectomics. The key innovations of the QuantConn challenge are that (1) we assess bundles and tractography in the context of harmonization for the first time, (2) we assess connectomics in the context of harmonization for the first time, and (3) we have 10x additional subjects over prior harmonization challenge, MUSHAC and 100x over SuperMUDI. We find that bundle surface area, fractional anisotropy, connectome assortativity, betweenness centrality, edge count, modularity, nodal strength, and participation coefficient measures are most biased by acquisition and that machine learning voxel-wise correction, RISH mapping, and NeSH methods effectively reduce these biases. In addition, microstructure measures AD, MD, RD, bundle length, connectome density, efficiency, and path length are least biased by these acquisition differences.
- [362] arXiv:2411.09635 (cross-list from stat.ML) [pdf, html, other]
-
Title: Counterfactual Uncertainty Quantification of Factual Estimand of Efficacy from Before-and-After Treatment Repeated Measures Randomized Controlled TrialsComments: 39 pages, 7 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The ideal estimand for comparing a new treatment $Rx$ with a control $C$ is the $\textit{counterfactual}$ efficacy $Rx:C$, the expected differential outcome between $Rx$ and $C$ if each patient were given $\textit{both}$. While counterfactual $\textit{point estimation}$ from $\textit{factual}$ Randomized Controlled Trials (RCTs) has been available, this article shows $\textit{counterfactual}$ uncertainty quantification (CUQ), quantifying uncertainty for factual point estimates but in a counterfactual setting, is surprisingly achievable. We achieve CUQ whose variability is typically smaller than factual UQ, by creating a new statistical modeling principle called ETZ which is applicable to RCTs with $\textit{Before-and-After}$ treatment Repeated Measures, common in many therapeutic areas.
We urge caution when estimate of the unobservable true condition of a patient before treatment has measurement error, because that violation of standard regression assumption can cause attenuation in estimating treatment effects. Fortunately, we prove that, for traditional medicine in general, and for targeted therapy with efficacy defined as averaged over the population, counterfactual point estimation is unbiased. However, for targeted therapy, both Real Human and Digital Twins approaches should respect this limitation, lest predicted treatment effect in $\textit{subgroups}$ will have bias. - [363] arXiv:2411.09636 (cross-list from math.OC) [pdf, html, other]
-
Title: Nash equilibrium seeking for a class of quadratic-bilinear Wasserstein distributionally robust gamesComments: 14 pages, 5 figuresSubjects: Optimization and Control (math.OC); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
We consider a class of Wasserstein distributionally robust Nash equilibrium problems, where agents construct heterogeneous data-driven Wasserstein ambiguity sets using private samples and radii, in line with their individual risk-averse behaviour. By leveraging relevant properties of this class of games, we show that equilibria of the original seemingly infinite-dimensional problem can be obtained as a solution to a finite-dimensional Nash equilibrium problem. We then reformulate the problem as a finite-dimensional variational inequality and establish the connection between the corresponding solution sets. Our reformulation has scalable behaviour with respect to the data size and maintains a fixed number of constraints, independently of the number of samples. To compute a solution, we leverage two algorithms, based on the golden ratio algorithm. The efficiency of both algorithmic schemes is corroborated through extensive simulation studies on an illustrative example and a stochastic portfolio allocation game, where behavioural coupling among investors is modeled.
- [364] arXiv:2411.09644 (cross-list from math.OC) [pdf, html, other]
-
Title: Neural Operators Can Play Dynamic Stackelberg GamesSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Computational Finance (q-fin.CP)
Dynamic Stackelberg games are a broad class of two-player games in which the leader acts first, and the follower chooses a response strategy to the leader's strategy. Unfortunately, only stylized Stackelberg games are explicitly solvable since the follower's best-response operator (as a function of the control of the leader) is typically analytically intractable. This paper addresses this issue by showing that the \textit{follower's best-response operator} can be approximately implemented by an \textit{attention-based neural operator}, uniformly on compact subsets of adapted open-loop controls for the leader. We further show that the value of the Stackelberg game where the follower uses the approximate best-response operator approximates the value of the original Stackelberg game. Our main result is obtained using our universal approximation theorem for attention-based neural operators between spaces of square-integrable adapted stochastic processes, as well as stability results for a general class of Stackelberg games.
- [365] arXiv:2411.09646 (cross-list from math.OC) [pdf, html, other]
-
Title: Reducing Stochastic Games to Semidefinite ProgrammingComments: 15 pages, 1 figureSubjects: Optimization and Control (math.OC); Computational Complexity (cs.CC); Computer Science and Game Theory (cs.GT)
We present a polynomial-time reduction from max-average constraints to the feasibility problem for semidefinite programs. This shows that Condon's simple stochastic games, stochastic mean payoff games, and in particular mean payoff games and parity games can all be reduced to semidefinite programming.
- [366] arXiv:2411.09686 (cross-list from stat.ML) [pdf, html, other]
-
Title: Conditional regression for the Nonlinear Single-Variable ModelComments: 55 pages, 10 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Several statistical models for regression of a function $F$ on $\mathbb{R}^d$ without the statistical and computational curse of dimensionality exist, for example by imposing and exploiting geometric assumptions on the distribution of the data (e.g. that its support is low-dimensional), or strong smoothness assumptions on $F$, or a special structure $F$. Among the latter, compositional models assume $F=f\circ g$ with $g$ mapping to $\mathbb{R}^r$ with $r\ll d$, have been studied, and include classical single- and multi-index models and recent works on neural networks. While the case where $g$ is linear is rather well-understood, much less is known when $g$ is nonlinear, and in particular for which $g$'s the curse of dimensionality in estimating $F$, or both $f$ and $g$, may be circumvented. In this paper, we consider a model $F(X):=f(\Pi_\gamma X) $ where $\Pi_\gamma:\mathbb{R}^d\to[0,\rm{len}_\gamma]$ is the closest-point projection onto the parameter of a regular curve $\gamma: [0,\rm{len}_\gamma]\to\mathbb{R}^d$ and $f:[0,\rm{len}_\gamma]\to\mathbb{R}^1$. The input data $X$ is not low-dimensional, far from $\gamma$, conditioned on $\Pi_\gamma(X)$ being well-defined. The distribution of the data, $\gamma$ and $f$ are unknown. This model is a natural nonlinear generalization of the single-index model, which corresponds to $\gamma$ being a line. We propose a nonparametric estimator, based on conditional regression, and show that under suitable assumptions, the strongest of which being that $f$ is coarsely monotone, it can achieve the $one$-$dimensional$ optimal min-max rate for non-parametric regression, up to the level of noise in the observations, and be constructed in time $\mathcal{O}(d^2n\log n)$. All the constants in the learning bounds, in the minimal number of samples required for our bounds to hold, and in the computational complexity are at most low-order polynomials in $d$.
Cross submissions (showing 54 of 54 entries)
- [367] arXiv:1803.04190 (replaced) [pdf, other]
-
Title: Counting of Shortest Paths in Cubic GridComments: 18 pages, 8 figuresJournal-ref: Acta Polytechnica Hungarica, Vol 21, No 6, pp 169 - 186, 2024Subjects: Discrete Mathematics (cs.DM); Computational Geometry (cs.CG)
The enumeration of shortest paths in cubic grid is presented herein, which could have importance in image processing and also in the network sciences. The cubic grid considers three neighborhoods - namely, 6-, 18- and 26-neighborhood related to face connectivity, edge connectivity and vertex connectivity, respectively. The formulation for distance metrics is given. L1, D18, and L_$\infty$ are the three metrics for 6-neighborhood, 18-neighborhood and 26-neighborhood. The task is to count the number of minimal paths, based on given neighborhood relations, from any given point to any other, in the three-dimensional cubic grid. Based on the coordinate triplets describing the grid, the formulations for the three neighborhoods are presented in this work. The problem both of theoretical importance and has several practical aspects.
- [368] arXiv:2012.08476 (replaced) [pdf, html, other]
-
Title: Simpler and Unified Recognition Algorithm for Path Graphs and Directed Path GraphsComments: 17 pages, 3 figuresSubjects: Data Structures and Algorithms (cs.DS)
A path graph is the intersection graph of paths in a tree. A directed path graph is the intersection graph of paths in a directed tree. Even if path graphs and directed path graphs are characterized very similarly, their recognition algorithms differ widely. We further unify these two graph classes by presenting the first recognition algorithm for both path graphs and directed path graphs. We deeply use a recent characterization of path graphs, and we extend it to directed path graphs. Our algorithm does not require complex data structures and has an easy and intuitive implementation, simplifying recognition algorithms for both graph classes.
- [369] arXiv:2107.09176 (replaced) [pdf, html, other]
-
Title: Relying on recent and temporally dispersed science predicts breakthrough inventionsSubjects: Digital Libraries (cs.DL); Computers and Society (cs.CY)
The development of inventions is theorized as a process of searching and recombining existing knowledge components. Previous studies under this theory have examined myriad characteristics of recombined knowledge and their performance implications. One such feature that has received much attention is technological knowledge age. Yet, little is known about how the age of scientific knowledge influences the impact of inventions, despite the widely known catalyzing role of science in the creation of new technologies. Here we use a large corpus of patents and derive features characterizing how patents temporally search in the scientific space. We find that patents that cite scientific papers have more citations and substantially more likely to become breakthroughs. Conditional on searching in the scientific space, referencing more recent papers increases the impact of patents and the likelihood of being breakthroughs. However, this positive effect can be offset if patents cite papers whose ages exhibit a low variance. These effects are consistent across technological fields.
- [370] arXiv:2112.14457 (replaced) [pdf, other]
-
Title: Online Stochastic Matching: A Polytope PerspectiveCéline Comte (TU/e, CNRS, LAAS-SARA), Fabien Mathieu (LINCS), Sushil Mahavir Varma, Ana Bušić (DI-ENS, LINCS, DYOGENE, ARGO)Subjects: Networking and Internet Architecture (cs.NI)
Stochastic dynamic matching problems have recently gained attention in the stochastic-modeling community due to their diverse applications, such as supply-chain management and kidney exchange programs. In this paper, we study a matching problem where items of different classes arrive according to independent Poisson processes. Unmatched items are stored in a queue, and compatibility between items is represented by a simple graph, where items can be matched if their classes are this http URL analyze matching policies in terms of stability, delay, and long-term matching rate optimization. Our approach relies on the conservation equation, which ensures a balance between arrivals and departures in any stable system. Our main contributions are as this http URL establish a link between the existence of stable policies, the dimensionality of the solution set of the conservation equation, and the compatibility graph's this http URL describe the convex polytope formed by non-negative solutions to the conservation equation, and we design policies that can achieve or closely approximate the vertices of this this http URL, we discuss potential extensions of our results beyond the main assumptions of this paper.
- [371] arXiv:2203.02458 (replaced) [pdf, html, other]
-
Title: Continuous Rating as Reliable Human Evaluation of Simultaneous Speech TranslationComments: Published at WMT 2022: this https URLSubjects: Computation and Language (cs.CL)
Simultaneous speech translation (SST) can be evaluated on simulated online events where human evaluators watch subtitled videos and continuously express their satisfaction by pressing buttons (so called Continuous Rating). Continuous Rating is easy to collect, but little is known about its reliability, or relation to comprehension of foreign language document by SST users. In this paper, we contrast Continuous Rating with factual questionnaires on judges with different levels of source language knowledge. Our results show that Continuous Rating is easy and reliable SST quality assessment if the judges have at least limited knowledge of the source language. Our study indicates users' preferences on subtitle layout and presentation style and, most importantly, provides a significant evidence that users with advanced source language knowledge prefer low latency over fewer re-translations.
- [372] arXiv:2204.01313 (replaced) [pdf, html, other]
-
Title: Towards Immersive Humanitarian VisualizationsPierre Dragicevic (Inria)Comments: 2024 updateSubjects: Human-Computer Interaction (cs.HC)
This paper introduces immersive humanitarian visualization as a promising research area in information visualization. Humanitarian visualizations are data visualizations designed to promote human welfare. This paper explains why immersive display technologies taken broadly (e.g, virtual reality, augmented reality, ambient displays and physical representations) open up a range of opportunities for humanitarian visualization. In particular, immersive displays offer ways to make remote and hidden human suffering more salient. They also offer ways to communicate quantitative facts together with qualitative information and visceral experiences, in order to provide a holistic understanding of humanitarian issues that could support more informed humanitarian decisions. But despite some promising preliminary work, immersive humanitarian visualization has not taken off as a research topic yet. The goal of this paper is to encourage, motivate, and inspire future research in this area.
- [373] arXiv:2205.08674 (replaced) [pdf, html, other]
-
Title: Budget Pacing in Repeated Auctions: Regret and Efficiency without ConvergenceComments: Preliminary version has appeared in ITCS 2023: the 14th Conference on Innovations in Theoretical Computer Science. Compared to the initial version, the current version features revised presentation and expanded discussions, adds numerical experiments (since Aug'24), and fixes an inaccuracy in the definition of the MBB property (since Sep'22)Subjects: Computer Science and Game Theory (cs.GT)
We study the aggregate welfare and individual regret guarantees of dynamic \emph{pacing algorithms} in the context of repeated auctions with budgets. Such algorithms are commonly used as bidding agents in Internet advertising platforms, adaptively learning to shade bids by a tunable linear multiplier in order to match a specified budget. We show that when agents simultaneously apply a natural form of gradient-based pacing, the liquid welfare obtained over the course of the learning dynamics is at least half the optimal expected liquid welfare obtainable by any allocation rule. Crucially, this result holds \emph{without requiring convergence of the dynamics}, allowing us to circumvent known complexity-theoretic obstacles of finding equilibria. This result is also robust to the correlation structure between agent valuations and holds for any \emph{core auction}, a broad class of auctions that includes first-price, second-price, and generalized second-price auctions as special cases. For individual guarantees, we further show such pacing algorithms enjoy \emph{dynamic regret} bounds for individual value maximization, with respect to the sequence of budget-pacing bids, for any auction satisfying a monotone bang-for-buck property. To complement our theoretical findings, we provide semi-synthetic numerical simulations based on auction data from the Bing Advertising platform.
- [374] arXiv:2206.04438 (replaced) [pdf, other]
-
Title: A taxonomy of explanations to support Explainability-by-DesignSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
As automated decision-making solutions are increasingly applied to all aspects of everyday life, capabilities to generate meaningful explanations for a variety of stakeholders (i.e., decision-makers, recipients of decisions, auditors, regulators...) become crucial. In this paper, we present a taxonomy of explanations that was developed as part of a holistic 'Explainability-by-Design' approach for the purposes of the project PLEAD. The taxonomy was built with a view to produce explanations for a wide range of requirements stemming from a variety of regulatory frameworks or policies set at the organizational level either to translate high-level compliance requirements or to meet business needs. The taxonomy comprises nine dimensions. It is used as a stand-alone classifier of explanations conceived as detective controls, in order to aid supportive automated compliance strategies. A machinereadable format of the taxonomy is provided in the form of a light ontology and the benefits of starting the Explainability-by-Design journey with such a taxonomy are demonstrated through a series of examples.
- [375] arXiv:2209.01418 (replaced) [pdf, html, other]
-
Title: Outsourcing Control requires Control ComplexitySubjects: Information Theory (cs.IT)
An embodied agent constantly influences its environment and is influenced by it. We use the sensorimotor loop to model these interactions and thereby we can quantify different information flows in the system by various information theoretic measures. This includes a measure for the interaction among the agent's body and its environment, called Morphological Computation. Additionally, we examine the controller complexity by two measures, one of which can be seen in the context of the Integrated Information Theory of consciousness. Applying this framework to an experimental setting with simulated agents allows us to analyze the interaction between an agent and its environment, as well as the complexity of its controller, the brain of the agent. Previous research reveals an antagonistic relationship between the controller complexity and Morphological Computation. A morphology adapted well to a task can reduce the necessary complexity of the controller significantly. This creates the problem that embodied intelligence is correlated with a reduced necessity of a controller, a brain. However, in order to interact well with their surroundings, the agents first have to understand the relevant dynamics of the environment. By analyzing learning agents we observe that an increased controller complexity can facilitate a better interaction between an agent's body and its environment. Hence, learning requires an increased controller complexity and the controller complexity and Morphological Computation influence each other.
- [376] arXiv:2212.09593 (replaced) [pdf, html, other]
-
Title: Unsupervised Summarization Re-rankingComments: 9 pages, 1 figure, 10 tables, 23 appendix pages, ACL Findings 2023Subjects: Computation and Language (cs.CL)
With the rise of task-specific pre-training objectives, abstractive summarization models like PEGASUS offer appealing zero-shot performance on downstream summarization tasks. However, the performance of such unsupervised models still lags significantly behind their supervised counterparts. Similarly to the supervised setup, we notice a very high variance in quality among summary candidates from these models while only one candidate is kept as the summary output. In this paper, we propose to re-rank summary candidates in an unsupervised manner, aiming to close the performance gap between unsupervised and supervised models. Our approach improves the unsupervised PEGASUS by up to 7.27% and ChatGPT by up to 6.86% relative mean ROUGE across four widely-adopted summarization benchmarks ; and achieves relative gains of 7.51% (up to 23.73% from XSum to WikiHow) averaged over 30 zero-shot transfer setups (finetuning on a dataset, evaluating on another).
- [377] arXiv:2301.13306 (replaced) [pdf, other]
-
Title: Autobidders with Budget and ROI Constraints: Efficiency, Regret, and Pacing DynamicsComments: Appeared at COLT 2024. Numerical experiments added since Jun'24 versionSubjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
We study a game between autobidding algorithms that compete in an online advertising platform. Each autobidder is tasked with maximizing its advertiser's total value over multiple rounds of a repeated auction, subject to budget and return-on-investment constraints. We propose a gradient-based learning algorithm that is guaranteed to satisfy all constraints and achieves vanishing individual regret. Our algorithm uses only bandit feedback and can be used with the first- or second-price auction, as well as with any "intermediate" auction format. Our main result is that when these autobidders play against each other, the resulting expected liquid welfare over all rounds is at least half of the expected optimal liquid welfare achieved by any allocation. This holds whether or not the bidding dynamics converges to an equilibrium.
- [378] arXiv:2305.08514 (replaced) [pdf, html, other]
-
Title: Generative Adversarial Networks for Spatio-Spectral Compression of Hyperspectral ImagesComments: Accepted at 14th IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
The development of deep learning-based models for the compression of hyperspectral images (HSIs) has recently attracted great attention in remote sensing due to the sharp growing of hyperspectral data archives. Most of the existing models achieve either spectral or spatial compression, and do not jointly consider the spatio-spectral redundancies present in HSIs. To address this problem, in this paper we focus our attention on the High Fidelity Compression (HiFiC) model (which is proven to be highly effective for spatial compression problems) and adapt it to perform spatio-spectral compression of HSIs. In detail, we introduce two new models: i) HiFiC using Squeeze and Excitation (SE) blocks (denoted as HiFiC$_{SE}$); and ii) HiFiC with 3D convolutions (denoted as HiFiC$_{3D}$) in the framework of compression of HSIs. We analyze the effectiveness of HiFiC$_{SE}$ and HiFiC$_{3D}$ in compressing the spatio-spectral redundancies with channel attention and inter-dependency analysis. Experimental results show the efficacy of the proposed models in performing spatio-spectral compression, while reconstructing images at reduced bitrates with higher reconstruction quality. The code of the proposed models is publicly available at this https URL .
- [379] arXiv:2306.05612 (replaced) [pdf, html, other]
-
Title: Spatial Re-parameterization for N:M SparsityComments: 11 pages, 4 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper presents a Spatial Re-parameterization (SpRe) method for the N:M sparsity in CNNs. SpRe is stemmed from an observation regarding the restricted variety in spatial sparsity present in N:M sparsity compared with unstructured sparsity. Particularly, N:M sparsity exhibits a fixed sparsity rate within the spatial domains due to its distinctive pattern that mandates N non-zero components among M successive weights in the input channel dimension of convolution filters. On the contrary, we observe that unstructured sparsity displays a substantial divergence in sparsity across the spatial domains, which we experimentally verified to be very crucial for its robust performance retention compared with N:M sparsity. Therefore, SpRe employs the spatial-sparsity distribution of unstructured sparsity to assign an extra branch in conjunction with the original N:M branch at training time, which allows the N:M sparse network to sustain a similar distribution of spatial sparsity with unstructured sparsity. During inference, the extra branch can be further re-parameterized into the main N:M branch, without exerting any distortion on the sparse pattern or additional computation costs. SpRe has achieved a commendable feat by matching the performance of N:M sparsity methods with state-of-the-art unstructured sparsity methods across various benchmarks. Code and models are anonymously available at \url{this https URL}.
- [380] arXiv:2306.06383 (replaced) [pdf, html, other]
-
Title: Using orthogonally structured positive bases for constructing positive $k$-spanning sets with cosine measure guaranteesComments: V4 fixes a typo in the arXiv abstractSubjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
Positive spanning sets span a given vector space by nonnegative linear combinations of their elements. These have attracted significant attention in recent years, owing to their extensive use in derivative-free optimization. In this setting, the quality of a positive spanning set is assessed through its cosine measure, a geometric quantity that expresses how well such a set covers the space of interest. In this paper, we investigate the construction of positive $k$-spanning sets with geometrical guarantees. Our results build on recently identified positive spanning sets, called orthogonally structured positive bases. We first describe how to identify such sets and compute their cosine measures efficiently. We then focus our study on positive $k$-spanning sets, for which we provide a complete description, as well as a new notion of cosine measure that accounts for the resilient nature of such sets. By combining our results, we are able to use orthogonally structured positive bases to create positive $k$-spanning sets with guarantees on the value of their cosine measures.
- [381] arXiv:2308.03648 (replaced) [pdf, other]
-
Title: Generative ForestsComments: NeurIPS'24Subjects: Machine Learning (cs.LG)
We focus on generative AI for a type of data that still represent one of the most prevalent form of data: tabular data. Our paper introduces two key contributions: a new powerful class of forest-based models fit for such tasks and a simple training algorithm with strong convergence guarantees in a boosting model that parallels that of the original weak / strong supervised learning setting. This algorithm can be implemented by a few tweaks to the most popular induction scheme for decision tree induction (i.e. supervised learning) with two classes. Experiments on the quality of generated data display substantial improvements compared to the state of the art. The losses our algorithm minimize and the structure of our models make them practical for related tasks that require fast estimation of a density given a generative model and an observation (even partially specified): such tasks include missing data imputation and density estimation. Additional experiments on these tasks reveal that our models can be notably good contenders to diverse state of the art methods, relying on models as diverse as (or mixing elements of) trees, neural nets, kernels or graphical models.
- [382] arXiv:2308.06064 (replaced) [pdf, html, other]
-
Title: Joint Beamforming Optimization for Active STAR-RIS-Assisted ISAC SystemsJournal-ref: Volume: 23, Issue: 11, Page(s): 15888 - 15902, November 2024Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In this paper, we investigate an active simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) assisted integrated sensing and communications (ISAC) system, where the dual-function base station (DFBS) operates in full-duplex (FD) mode to provide communication services and performs targets sensing simultaneously. Meanwhile, we consider multiple targets and multiple users scenario as well as the self-interference at the FD DFBS. Through jointly optimizing the DFBS and active STAR-RIS beamforming under different work modes, our purpose is to achieve the maximum communication sum-rate, while satisfying the minimum radar signal-to-interference-plus-noise ratio (SINR) constraint, the active STAR-RIS hardware constraints and the total power constraint of DFBS and active STAR-RIS. To tackle the complex non-convex optimization problem formulated, an efficient alternating optimization algorithm is proposed. Specifically, the fractional programming method is first leveraged to turn the original problem into a more tractable one, and subsequently the transformed problem is decomposed into several sub-problems. Next, we develop a derivation method to obtain the closed-form expression of the radar receiving beamforming, and then the DFBS transmit beamforming is optimized under the radar SINR requirement and total power constraints. After that, the active STAR-RIS reflection and transmission beamforming are optimized by majorization minimization, complex circle manifold and convex optimization techniques. Finally, the proposed schemes are conducted through numerical simulations to show their benefits and efficiency.
- [383] arXiv:2308.09500 (replaced) [pdf, other]
-
Title: Infinite chains in the tree of numerical semigroupsJournal-ref: Semigroup Forum (2024)Subjects: Discrete Mathematics (cs.DM); Commutative Algebra (math.AC)
One major problem in the study of numerical semigroups is determining the growth of the semigroup tree. In the present work, infinite chains of numerical semigroups in the semigroup tree, firstly introduced by Bras-Amorós and Bulygin (Semigroup Forum, 79:561--574, 2009), are studied. Computational results show that these chains are rare, but without them the tree would not be infinite. It is proved that for each genus $g\geq 5$ there are more semigroups of that genus not belonging to infinite chains than semigroups belonging. Bras-Amorós and Bulygin (Semigroup Forum, 79:561--574, 2009) presented a characterization of the semigroups that belong to infinite chains in terms of the coprimality of the left elements of the semigroup as well as a result on the cardinality of the set of infinite chains to which a numerical semigroup belongs in terms of the primality of the greatest common divisor of these left elements. We revisit these results and fix an imprecision on the cardinality of the set of infinite chains to which a semigroup belongs in the case when the greatest common divisor of the left elements is a prime number. We then look at infinite chains in subtrees with fixed multiplicity. When the multiplicity is a prime number there is only one infinite chain in the tree of semigroups with such multiplicity. When the multiplicity is $4$ or $6$ we prove a self-replication behavior in the subtree and prove a formula for the number of semigroups in infinite chains of a given genus and multiplicity $4$ and $6$, respectively.
- [384] arXiv:2308.11738 (replaced) [pdf, html, other]
-
Title: Lifted Inference beyond First-Order LogicComments: Under Review at the Artificial Intelligence Journal. Added two new lemmas for counting by splitting in the Main approach section. Added experiments with Markov this http URL admin note: text overlap with arXiv:2302.09830Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Combinatorics (math.CO)
Weighted First Order Model Counting (WFOMC) is fundamental to probabilistic inference in statistical relational learning models. As WFOMC is known to be intractable in general ($\#$P-complete), logical fragments that admit polynomial time WFOMC are of significant interest. Such fragments are called domain liftable. Recent works have shown that the two-variable fragment of first order logic extended with counting quantifiers ($\mathrm{C^2}$) is domain-liftable. However, many properties of real-world data, like acyclicity in citation networks and connectivity in social networks, cannot be modeled in $\mathrm{C^2}$, or first order logic in general. In this work, we expand the domain liftability of $\mathrm{C^2}$ with multiple such properties. We show that any $\mathrm{C^2}$ sentence remains domain liftable when one of its relations is restricted to represent a directed acyclic graph, a connected graph, a tree (resp. a directed tree) or a forest (resp. a directed forest). All our results rely on a novel and general methodology of "counting by splitting". Besides their application to probabilistic inference, our results provide a general framework for counting combinatorial structures. We expand a vast array of previous results in discrete mathematics literature on directed acyclic graphs, phylogenetic networks, etc.
- [385] arXiv:2308.12614 (replaced) [pdf, html, other]
-
Title: Obstruction characterization of co-TT graphsComments: arXiv admin note: substantial text overlap with arXiv:2206.05917Subjects: Discrete Mathematics (cs.DM); Combinatorics (math.CO)
Threshold tolerance graphs and their complement graphs, known as co-TT graphs, were introduced by Monma, Reed, and Trotter[24]. Building on this, Hell et al.[19] introduced the concept of negative interval. Then they proceeded to define signedinterval digraphs/ bigraphs, demonstrating their equivalence to several seemingly distinct classes of digraphs/ bigraphs. They also showed that co-TT graphs are equivalent to symmetric signed-interval digraphs, where some vertices of the digraphs have loops and others do not. We have showed that this actually solve the representation characterization problem of co-TT graphs posed by Monma, Reed and Trotter [24]. In this paper, we characterize signed-interval bigraphs and signed-interval graphs in terms of their biadjacency matrices and adjacency matrices, respectively. Moreover we emphasize on the geometric representation of signed-interval graphs, i.e. co-TT graphs. Finally, by utilizing the geometric representation of signed-interval graphs, we resolve the open problem of characterizing co-TT graphs in terms of minimal forbidden induced subgraphs, a problem initially posed by Monma, Reed, and Trotter in the same paper.
- [386] arXiv:2308.13303 (replaced) [pdf, html, other]
-
Title: Age of Information Diffusion on Social NetworksComments: Accepted by IEEE/ACM Transactions On Networking. A long abstract of this work appeared at MobiHoc 2023Subjects: Social and Information Networks (cs.SI); Discrete Mathematics (cs.DM); Information Theory (cs.IT)
To promote viral marketing, major social platforms (e.g., Facebook Marketplace and Pinduoduo) repeatedly select and invite different users (as seeds) in online social networks to share fresh information about a product or service with their friends. Thereby, we are motivated to optimize a multi-stage seeding process of viral marketing in social networks, and adopt the recent notions of the peak and the average age of information (AoI) to measure the timeliness of promotion information received by network users. Our problem is different from the literature on information diffusion in social networks, which limits to one-time seeding and overlooks AoI dynamics or information replacement over time. As a critical step, we manage to develop closed-form expressions that characterize and trace AoI dynamics over any social network. For the peak AoI problem, we first prove the NP-hardness of our multi-stage seeding problem by a highly non-straightforward reduction from the dominating set problem, and then present a new polynomial-time algorithm that achieves good approximation guarantees (e.g., less than 2 for linear network topology). To minimize the average AoI, we also prove that our problem is NP-hard by properly reducing it from the set cover problem. Benefiting from our two-sided bound analysis on the average AoI objective, we build up a new framework for approximation analysis and link our problem to a much simplified sum-distance minimization problem. This intriguing connection inspires us to develop another polynomial-time algorithm that achieves a good approximation guarantee. Additionally, our theoretical results are well corroborated by experiments on a real social network.
- [387] arXiv:2308.16426 (replaced) [pdf, html, other]
-
Title: Enumerating minimal vertex covers and dominating sets with capacity and/or connectivity constraintsComments: 17 pages. some results are improvedSubjects: Data Structures and Algorithms (cs.DS)
In this paper, we consider the problems of enumerating minimal vertex covers and minimal dominating sets with capacity and/or connectivity constraints. We develop polynomial-delay enumeration algorithms for these problems on bounded-degree graphs. For the case of minimal connected vertex covers, our algorithms run in polynomial delay even on the class of $d$-claw free graphs, extending the result on bounded-degree graphs, and in output quasi-polynomial time on general graphs. To complement these algorithmic results, we show that the problems of enumerating minimal connected vertex covers, minimal connected dominating sets, and minimal capacitated vertex covers in $2$-degenerated bipartite graphs are at least as hard as enumerating minimal transversals in hypergraphs.
- [388] arXiv:2309.00921 (replaced) [pdf, html, other]
-
Title: An iterative scheme for finite horizon model reduction of continuous-time linear time-varying systemsSubjects: Systems and Control (eess.SY)
In this paper, we obtain the functional derivatives of a finite horizon error norm between a full-order and a reduced-order continuous-time linear time-varying (LTV) system. Based on the functional derivatives, first-order necessary conditions for optimality of the error norm are derived, and a projection-based iterative scheme for model reduction is proposed. The iterative scheme upon convergence produces reduced-order models satisfying the optimality conditions. Finally, through a numerical example, we demonstrate the better performance of the proposed model reduction scheme in comparison to the finite horizon balanced truncation algorithm for continuous-time LTV systems.
- [389] arXiv:2309.11477 (replaced) [pdf, html, other]
-
Title: Multi-Agent Control Synthesis from Global Temporal Logic Tasks with Synchronous Satisfaction RequirementsComments: 10 pages, 4 figuresSubjects: Systems and Control (eess.SY)
This paper addresses the multi-agent control problem under global temporal logic tasks, considering agents with heterogeneous capabilities. These global tasks involve not only absolute and relative temporal and spatial constraints, but also group behaviors, including task completion times, agent capabilities, and task interdependencies such as the need for synchronous execution. The global tasks are formally formulated into global signal temporal logic (STL) formulae, and a synchronous robustness metric is designed to evaluate the synchronization quality with real values. A mixed-integer linear programming (MILP) encoding method is further proposed to realize task-satisfied motion planning with high synchronicity and minimum control efforts. The encoding method uses a logarithmic number of binary variables to fully capture synchronous robustness, leading to only linear computational complexity. Simulations are conducted to demonstrate the efficiency of the proposed control strategy.
- [390] arXiv:2309.17196 (replaced) [pdf, html, other]
-
Title: ResBit: Residual Bit Vector for Categorical ValuesComments: 25 pages, 29 tables, and 10 figuresSubjects: Machine Learning (cs.LG)
One-hot vectors, a common method for representing discrete/categorical data, in machine learning are widely used because of their simplicity and intuitiveness. However, one-hot vectors suffer from a linear increase in dimensionality, posing computational and memory challenges, especially when dealing with datasets containing numerous categories. In this paper, we focus on tabular data generation, and reveal the multinomial diffusion faces the mode collapse phenomenon when the cardinality is high. Moreover, due to the limitations of one-hot vectors, the training phase takes time longer in such a situation. To address these issues, we propose Residual Bit Vectors (ResBit), a technique for densely representing categorical data. ResBit is an extension of analog bits and overcomes limitations of analog bits when applied to tabular data generation. Our experiments demonstrate that ResBit not only accelerates training but also maintains performance when compared with the situations before applying ResBit. Furthermore, our results indicate that many existing methods struggle with high-cardinality data, underscoring the need for lower-dimensional representations, such as ResBit and latent vectors.
- [391] arXiv:2310.00976 (replaced) [pdf, html, other]
-
Title: Fair Division with Subjective DivisibilityComments: Appears in the 19th Conference on Web and Internet Economics (WINE), 2023Subjects: Computer Science and Game Theory (cs.GT)
The classic fair division problems assume the resources to be allocated are either divisible or indivisible, or contain a mixture of both, but the agents always have a predetermined and uncontroversial agreement on the (in)divisibility of the resources. In this paper, we propose and study a new model for fair division in which agents have their own subjective divisibility over the goods to be allocated. That is, some agents may find a good to be indivisible and get utilities only if they receive the whole good, while others may consider the same good to be divisible and thus can extract utilities according to the fraction of the good they receive. We investigate fairness properties that can be achieved when agents have subjective divisibility. First, we consider the maximin share (MMS) guarantee and show that the worst-case MMS approximation guarantee is at most $2/3$ for $n \geq 2$ agents and this ratio is tight in the two- and three-agent cases. This is in contrast to the classic fair division settings involving two or three agents. We also give an algorithm that produces a $1/2$-MMS allocation for an arbitrary number of agents. Second, we study a hierarchy of envy-freeness relaxations, including EF1M, EFM and EFXM, ordered by increasing strength. While EF1M is compatible with non-wastefulness (an economic efficiency notion), this is not the case for EFM, even for two agents. Nevertheless, an EFXM and non-wasteful allocation always exists for two agents if at most one good is discarded.
- [392] arXiv:2310.03525 (replaced) [pdf, html, other]
-
Title: V2X Cooperative Perception for Autonomous Driving: Recent Advances and ChallengesTao Huang, Jianan Liu, Xi Zhou, Dinh C. Nguyen, Mostafa Rahimi Azghadi, Yuxuan Xia, Qing-Long Han, Sumei SunSubjects: Computer Vision and Pattern Recognition (cs.CV)
Achieving fully autonomous driving with heightened safety and efficiency depends on vehicle-to-everything (V2X) cooperative perception (CP), which allows vehicles to share perception data, thereby enhancing situational awareness and overcoming the limitations of the sensing ability of individual vehicles. V2X CP is crucial for extending perception range, improving accuracy, and strengthening the decision-making and control capabilities of autonomous vehicles in complex environments. This paper provides a comprehensive survey of recent advances in V2X CP, introducing mathematical models of CP processes across various collaboration strategies. We examine essential techniques for reliable perception sharing, including agent selection, data alignment, and fusion methods. Key issues are analyzed, such as agent and model heterogeneity, perception uncertainty, and the impact of V2X communication constraints like delays and data loss on CP effectiveness. To inspire further advancements in V2X CP, we outline promising avenues, including privacy-preserving artificial intelligence (AI), collaborative AI, and integrated sensing frameworks, as pathways to enhance CP capabilities.
- [393] arXiv:2311.00534 (replaced) [pdf, html, other]
-
Title: Error analysis for a finite element approximation of the steady $p(\cdot)$-Navier-Stokes equationsComments: 37 pages, 4 figures, 4 tablesSubjects: Numerical Analysis (math.NA)
In this paper, we examine a finite element approximation of the steady $p(\cdot)$-Navier-Stokes equations ($p(\cdot)$ is variable dependent) and prove orders of convergence by assuming natural fractional regularity assumptions on the velocity vector field and the kinematic pressure. Compared to previous results, we treat the convective term and employ a more practicable discretization of the power-law index $p(\cdot)$. Numerical experiments confirm the quasi-optimality of the $\textit{a priori}$ error estimates (for the velocity) with respect to fractional regularity assumptions on the velocity vector field and the kinematic pressure.
- [394] arXiv:2311.10126 (replaced) [pdf, html, other]
-
Title: I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs QuantizationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Albeit the scalable performance of vision transformers (ViTs), the dense computational costs (training & inference) undermine their position in industrial applications. Post-training quantization (PTQ), tuning ViTs with a tiny dataset and running in a low-bit format, well addresses the cost issue but unluckily bears more performance drops in lower-bit cases. In this paper, we introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion. I&S-ViT first identifies two issues in the PTQ of ViTs: (1) Quantization inefficiency in the prevalent log2 quantizer for post-Softmax activations; (2) Rugged and magnified loss landscape in coarse-grained quantization granularity for post-LayerNorm activations. Then, I&S-ViT addresses these issues by introducing: (1) A novel shift-uniform-log2 quantizer (SULQ) that incorporates a shift mechanism followed by uniform quantization to achieve both an inclusive domain representation and accurate distribution approximation; (2) A three-stage smooth optimization strategy (SOS) that amalgamates the strengths of channel-wise and layer-wise quantization to enable stable learning. Comprehensive evaluations across diverse vision tasks validate I&S-ViT' superiority over existing PTQ of ViTs methods, particularly in low-bit scenarios. For instance, I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
- [395] arXiv:2312.01970 (replaced) [pdf, html, other]
-
Title: CaRL: Cascade Reinforcement Learning with State Space Splitting for O-RAN based Traffic SteeringComments: 9 pages, 8 figuresSubjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
The Open Radio Access Network (O-RAN) architecture empowers intelligent and automated optimization of the RAN through applications deployed on the RAN Intelligent Controller (RIC) platform, enabling capabilities beyond what is achievable with traditional RAN solutions. Within this paradigm, Traffic Steering (TS) emerges as a pivotal RIC application that focuses on optimizing cell-level mobility settings in near-real-time, aiming to significantly improve network spectral efficiency. In this paper, we design a novel TS algorithm based on a Cascade Reinforcement Learning (CaRL) framework. We propose state space factorization and policy decomposition to reduce the need for large models and well-labeled datasets. For each sub-state space, an RL sub-policy will be trained to learn an optimized mapping onto the action space. To apply CaRL on new network regions, we propose a knowledge transfer approach to initialize a new sub-policy based on knowledge learned by the trained policies. To evaluate CaRL, we build a data-driven and scalable RIC digital twin (DT) that is modeled using important real-world data, including network configuration, user geo-distribution, and traffic demand, among others, from a tier-1 mobile operator in the US. We evaluate CaRL on two DT scenarios representing two network clusters in two different cities and compare its performance with the business-as-usual (BAU) policy and other competing optimization approaches using heuristic and Q-table algorithms. Benchmarking results show that CaRL performs the best and improves the average cluster-aggregated downlink throughput over the BAU policy by 24% and 18% in these two scenarios, respectively.
- [396] arXiv:2312.02664 (replaced) [pdf, other]
-
Title: Domain-Specific Tensor LanguagesComments: 42 pages, 11 figures. Under review for Journal of Functional ProgrammingSubjects: Programming Languages (cs.PL)
The tensor notation used in several areas of mathematics is a useful one, but it is not widely available to the functional programming community. In a practical sense, the (embedded) domain-specific languages (DSLs) that are currently in use for tensor algebra are either 1. array-oriented languages that do not enforce or take advantage of tensor properties and algebraic structure or 2. follow the categorical structure of tensors but require the programmer to manipulate tensors in an unwieldy point-free notation. A deeper issue is that for tensor calculus, the dominant pedagogical paradigm assumes an audience which is either comfortable with notational liberties which programmers cannot afford, or focus on the applied mathematics of tensors, largely leaving their linguistic aspects (behaviour of variable binding, syntax and semantics, etc.) for the reader to figure out by themselves. This state of affairs is hardly surprising, because, as we highlight, several properties of standard tensor notation are somewhat exotic from the perspective of lambda calculi. We bridge the gap by defining a DSL, embedded in Haskell, whose syntax closely captures the index notation for tensors in wide use in the literature. The semantics of this EDSL is defined in terms of the algebraic structures which define tensors in their full generality. This way, we believe that our EDSL can be used both as a tool for scientific computing, but also as a vehicle to express and present the theory and applications of tensors.
- [397] arXiv:2312.06231 (replaced) [pdf, other]
-
Title: Uncovering communities of pipelines in the task-fMRI analytical spaceComments: Accepted at the 2024 IEEE International Conference on Image ProcessingJournal-ref: IEEE International Conference on Image Processing, Oct 2024, Abu Dhabi, United Arab Emirates. \&\#x27E8;10.1109/ICIP51287.2024.10647701\&\#x27E9Subjects: Artificial Intelligence (cs.AI)
Analytical workflows in functional magnetic resonance imaging are highly flexible with limited best practices as to how to choose a pipeline. While it has been shown that the use of different pipelines might lead to different results, there is still a lack of understanding of the factors that drive these differences and of the stability of these differences across contexts. We use community detection algorithms to explore the pipeline space and assess the stability of pipeline relationships across different contexts. We show that there are subsets of pipelines that give similar results, especially those sharing specific parameters (e.g. number of motion regressors, software packages, etc.). Those pipeline-to-pipeline patterns are stable across groups of participants but not across different tasks. By visualizing the differences between communities, we show that the pipeline space is mainly driven by the size of the activation area in the brain and the scale of statistic values in statistic maps.
- [398] arXiv:2312.10947 (replaced) [pdf, html, other]
-
Title: LabelCraft: Empowering Short Video Recommendations with Automated Label CraftingComments: Accepted by WSDM'24Subjects: Information Retrieval (cs.IR)
Short video recommendations often face limitations due to the quality of user feedback, which may not accurately depict user interests. To tackle this challenge, a new task has emerged: generating more dependable labels from original feedback. Existing label generation methods rely on manual rules, demanding substantial human effort and potentially misaligning with the desired objectives of the platform. To transcend these constraints, we introduce LabelCraft, a novel automated label generation method explicitly optimizing pivotal operational metrics for platform success. By formulating label generation as a higher-level optimization problem above recommender model optimization, LabelCraft introduces a trainable labeling model for automatic label mechanism modeling. Through meta-learning techniques, LabelCraft effectively addresses the bi-level optimization hurdle posed by the recommender and labeling models, enabling the automatic acquisition of intricate label generation mechanisms. Extensive experiments on real-world datasets corroborate LabelCraft's excellence across varied operational metrics, encompassing usage time, user engagement, and retention. Codes are available at this https URL.
- [399] arXiv:2312.11166 (replaced) [pdf, html, other]
-
Title: Volume-Preserving Transformers for Learning Time Series Data with StructureComments: Will be published as part of "Cemracs Proceedings 2023" (status: accepted)Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Two of the many trends in neural network research of the past few years have been (i) the learning of dynamical systems, especially with recurrent neural networks such as long short-term memory networks (LSTMs) and (ii) the introduction of transformer neural networks for natural language processing (NLP) tasks.
While some work has been performed on the intersection of these two trends, those efforts were largely limited to using the vanilla transformer directly without adjusting its architecture for the setting of a physical system.
In this work we develop a transformer-inspired neural network and use it to learn a dynamical system. We (for the first time) change the activation function of the attention layer to imbue the transformer with structure-preserving properties to improve long-term stability. This is shown to be of great advantage when applying the neural network to learning the trajectory of a rigid body. - [400] arXiv:2312.13175 (replaced) [pdf, html, other]
-
Title: Nonlinear moving horizon estimation for robust state and parameter estimation -- extended versionComments: Replaced by revised versionSubjects: Systems and Control (eess.SY)
We propose a moving horizon estimation scheme to estimate the states and the unknown constant parameters of general nonlinear uncertain discrete-time systems. The proposed framework and analysis explicitly do not involve the a priori verification of a particular excitation condition for the parameters. Instead, we use online information about the actual excitation of the parameters at any time during operation and ensure that the regularization term in the cost function is always automatically selected appropriately. This ensures that the state and parameter estimation error is bounded for all times, even if the parameters are never (or only rarely) excited during operation. Robust exponential stability of the state and parameter estimation error emerges under an additional uniform condition on the maximum duration of insufficient excitation. The theoretical results are illustrated by a numerical example.
- [401] arXiv:2312.17612 (replaced) [pdf, html, other]
-
Title: Bespoke Approximation of Multiplication-Accumulation and Activation Targeting Printed Multilayer PerceptronsFlorentia Afentaki, Gurol Saglam, Argyris Kokkinis, Kostas Siozios, Georgios Zervakis, Mehdi B TahooriComments: Accepted for publication at the 42th IEEE/ACM International Conference on Computer Aided Design (ICCAD) 2023, San Francisco, USASubjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Printed Electronics (PE) feature distinct and remarkable characteristics that make them a prominent technology for achieving true ubiquitous computing. This is particularly relevant in application domains that require conformal and ultra-low cost solutions, which have experienced limited penetration of computing until now. Unlike silicon-based technologies, PE offer unparalleled features such as non-recurring engineering costs, ultra-low manufacturing cost, and on-demand fabrication of conformal, flexible, non-toxic, and stretchable hardware. However, PE face certain limitations due to their large feature sizes, that impede the realization of complex circuits, such as machine learning classifiers. In this work, we address these limitations by leveraging the principles of Approximate Computing and Bespoke (fully-customized) design. We propose an automated framework for designing ultra-low power Multilayer Perceptron (MLP) classifiers which employs, for the first time, a holistic approach to approximate all functions of the MLP's neurons: multiplication, accumulation, and activation. Through comprehensive evaluation across various MLPs of varying size, our framework demonstrates the ability to enable battery-powered operation of even the most intricate MLP architecture examined, significantly surpassing the current state of the art.
- [402] arXiv:2401.01005 (replaced) [pdf, html, other]
-
Title: Edge AI Empowered Physical Layer Security for 6G NTN: Potential Threats and Future OpportunitiesHong-fu Chou, Sourabh Solanki, Vu Nguyen Ha, Lin Chen, Sean Longyu Ma, Hayder Al-Hraishawi, Geoffrey Eappen, Symeon ChatzinotasComments: 7 pages, 6 figures, magazineSubjects: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR)
Due to the enormous potential for economic profit offered by artificial intelligence (AI) servers, the field of cybersecurity has the potential to emerge as a prominent arena for competition among corporations and governments on a global scale. One of the prospective applications that stands to gain from the utilization of AI technology is the advancement in the field of cybersecurity. To this end, this paper provides an overview of the possible risks that the physical layer may encounter in the context of 6G Non-Terrestrial Networks (NTN). With the objective of showcasing the effectiveness of cutting-edge AI technologies in bolstering physical layer security, this study reviews the most foreseeable design strategies associated with the integration of edge AI in the realm of 6G NTN. The findings of this paper provide some insights and serve as a foundation for future investigations aimed at enhancing the physical layer security of edge servers/devices in the next generation of trustworthy 6G telecommunication networks.
- [403] arXiv:2401.02501 (replaced) [pdf, html, other]
-
Title: A metric embedding kernel for live cell microscopy signaling patternsLayton Aho, Mark Winter, Marc DeCarlo, Agne Frismantiene, Yannick Blum, Paolo Armando Gagliardi, Olivier Pertz, Andrew R. CohenSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Live cell microscopy captures 5-D $(x,y,z,channel,time)$ movies that display patterns of cellular motion and signaling dynamics. We present here a metric kernel function for spatiotemporal patterns of cell signaling dynamics in 5-D live cell microscopy movies unique in requiring no a priori knowledge of expected pattern dynamics, and no training data. The approach uses Kolmogorov complexity theory to compute a metric distance between movies and to measure the meaningful information among subsets of movies. Cell signaling kymographs store at each spatiotemporal cell centroid the cell signaling state, or a functional output such as velocity. Patterns of similarity are identified via the metric normalized compression distance (NCD). The NCD is a reproducing kernel for a Hilbert space that represents the input cell signaling kymographs as points in a low dimensional embedding that optimally captures the pattern similarity identified by the NCD throughout the space. The only parameter is the expected cell radii ($\mu m$). A new formulation of the cluster structure function optimally estimates the meaningful information captured by the embedding. Also presented is the cell signaling structure function (SSF), a Kolmogorov structure function that optimally measures cell signaling state as nuclear intensity w.r.t. surrounding cytoplasm, a significant improvement compared to the current state-of-the-art cytonuclear ratio. Results are presented quantifying the impact of ERK and AKT signaling between different oncogenic mutations, and by the relation between ERK signaling and cellular velocity patterns for movies of 2-D monolayers of human breast epithelial (MCF10A) cells, 3-D MCF10A spheroids under optogenetic manipulation of ERK, and human induced pluripotent stem cells.
- [404] arXiv:2401.03735 (replaced) [pdf, html, other]
-
Title: Language Models Encode the Value of Numbers LinearlyComments: The code and data are available at this https URLSubjects: Computation and Language (cs.CL)
Large language models (LLMs) have exhibited impressive competence in various tasks, but their internal mechanisms on mathematical problems are still under-explored. In this paper, we study a fundamental question: how language models encode the value of numbers, a basic element in math. To study the question, we construct a synthetic dataset comprising addition problems and utilize linear probes to read out input numbers from the hidden states. Experimental results support the existence of encoded number values in LLMs on different layers, and these values can be extracted via linear probes. Further experiments show that LLMs store their calculation results in a similar manner, and we can intervene the output via simple vector additions, proving the causal connection between encoded numbers and language model outputs. Our research provides evidence that LLMs encode the value of numbers linearly, offering insights for better exploring, designing, and utilizing numeric information in LLMs.
- [405] arXiv:2401.09417 (replaced) [pdf, html, other]
-
Title: Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space ModelComments: Vision Mamba (Vim) is accepted by ICML 2024. Code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248$\times$1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at this https URL.
- [406] arXiv:2401.13442 (replaced) [pdf, html, other]
-
Title: Finite-Precision Arithmetic Transceiver for Massive MIMO SystemsComments: 17 pages, 13 figures. Accepted by IEEE JSACSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Efficient implementation of massive multiple-input-multiple-output (MIMO) transceivers is essential for the next-generation wireless networks. To reduce the high computational complexity of the massive MIMO transceiver, in this paper, we propose a new massive MIMO architecture using finite-precision arithmetic. First, we conduct the rounding error analysis and derive the lower bound of the achievable rate for single-input-multiple-output (SIMO) using maximal ratio combining (MRC) and multiple-input-single-output (MISO) systems using maximal ratio transmission (MRT) with finite-precision arithmetic. Then, considering the multi-user scenario, the rounding error analysis of zero-forcing (ZF) detection and precoding is derived by using the normal equations (NE) method. The corresponding lower bounds of the achievable sum rate are also derived and asymptotic analyses are presented. Built upon insights from these analyses and lower bounds, we propose a mixed-precision architecture for massive MIMO systems to offset performance gaps due to finite-precision arithmetic. The corresponding analysis of rounding errors and computational costs is obtained. Simulation results validate the derived bounds and underscore the superiority of the proposed mixed-precision architecture to the conventional structure.
- [407] arXiv:2401.16109 (replaced) [pdf, html, other]
-
Title: Minimalistic System Modelling: Behaviours, Interfaces, and Local ReasoningComments: 24 pagesSubjects: Logic in Computer Science (cs.LO)
The infrastructure upon which the functioning of society depends is composed of complex ecosystems of systems. Consequently, we must reason about the properties of such ecosystems, which requires that we construct models of them. There are very many approaches to systems modelling, typically building on complex structural and dynamic frameworks. Our purpose here is to explore a modelling framework based on minimal assumptions, starting from a primitive notion of behaviour, and to show that such an approach allows the recovery of the key ideas, including a generalized CAP theorem, required for effective modelling of and reasoning about ecosystems of systems. We establish a logic of behaviours and use it to express local reasoning principles for the compositional structure of systems.
- [408] arXiv:2402.02681 (replaced) [pdf, other]
-
Title: Equivariant Symmetry Breaking SetsComments: 50 pages, 19 figures Published in Transactions on Machine Learning Research, October 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Equivariant neural networks (ENNs) have been shown to be extremely effective in applications involving underlying symmetries. By construction ENNs cannot produce lower symmetry outputs given a higher symmetry input. However, symmetry breaking occurs in many physical systems and we may obtain a less symmetric stable state from an initial highly symmetric one. Hence, it is imperative that we understand how to systematically break symmetry in ENNs. In this work, we propose a novel symmetry breaking framework that is fully equivariant and is the first which fully addresses spontaneous symmetry breaking. We emphasize that our approach is general and applicable to equivariance under any group. To achieve this, we introduce the idea of symmetry breaking sets (SBS). Rather than redesign existing networks, we design sets of symmetry breaking objects which we feed into our network based on the symmetry of our inputs and outputs. We show there is a natural way to define equivariance on these sets, which gives an additional constraint. Minimizing the size of these sets equates to data efficiency. We prove that minimizing these sets translates to a well studied group theory problem, and tabulate solutions to this problem for the point groups. Finally, we provide some examples of symmetry breaking to demonstrate how our approach works in practice. The code for these examples is available at \url{this https URL}.
- [409] arXiv:2402.02930 (replaced) [pdf, html, other]
-
Title: Embedding Hardware Approximations in Discrete Genetic-based Training for Printed MLPsComments: Accepted for publication at the 27th Design, Automation and Test in Europe Conference (DATE'24), Mar 25-27 2024, Valencia, SpainSubjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Printed Electronics (PE) stands out as a promisingtechnology for widespread computing due to its distinct attributes, such as low costs and flexible manufacturing. Unlike traditional silicon-based technologies, PE enables stretchable, conformal,and non-toxic hardware. However, PE are constrained by larger feature sizes, making it challenging to implement complex circuits such as machine learning (ML) classifiers. Approximate computing has been proven to reduce the hardware cost of ML circuits such as Multilayer Perceptrons (MLPs). In this paper, we maximize the benefits of approximate computing by integrating hardware approximation into the MLP training process. Due to the discrete nature of hardware approximation, we propose and implement a genetic-based, approximate, hardware-aware training approach specifically designed for printed MLPs. For a 5% accuracy loss, our MLPs achieve over 5x area and power reduction compared to the baseline while outperforming state of-the-art approximate and stochastic printed MLPs.
- [410] arXiv:2402.03017 (replaced) [pdf, html, other]
-
Title: Toward Green and Human-Like Artificial Intelligence: A Complete Survey on Contemporary Few-Shot Learning ApproachesComments: 35 pages, 9 figures. Submitted to ACM Computing SurveysSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Despite deep learning's widespread success, its data-hungry and computationally expensive nature makes it impractical for many data-constrained real-world applications. Few-Shot Learning (FSL) aims to address these limitations by enabling rapid adaptation to novel learning tasks, seeing significant growth in recent years. This survey provides a comprehensive overview of the field's latest advancements. Initially, FSL is formally defined, and its relationship with different learning fields is presented. A novel taxonomy is introduced, extending previously proposed ones, and real-world applications in classic and novel fields are described. Finally, recent trends shaping the field, outstanding challenges, and promising future research directions are discussed.
- [411] arXiv:2402.03227 (replaced) [pdf, html, other]
-
Title: IGUANe: a 3D generalizable CycleGAN for multicenter harmonization of brain MR imagesComments: 29 pages, 14 figuresJournal-ref: Medical Image Analysis 99 (2025)Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In MRI studies, the aggregation of imaging data from multiple acquisition sites enhances sample size but may introduce site-related variabilities that hinder consistency in subsequent analyses. Deep learning methods for image translation have emerged as a solution for harmonizing MR images across sites. In this study, we introduce IGUANe (Image Generation with Unified Adversarial Networks), an original 3D model that leverages the strengths of domain translation and straightforward application of style transfer methods for multicenter brain MR image harmonization. IGUANe extends CycleGAN by integrating an arbitrary number of domains for training through a many-to-one architecture. The framework based on domain pairs enables the implementation of sampling strategies that prevent confusion between site-related and biological variabilities. During inference, the model can be applied to any image, even from an unknown acquisition site, making it a universal generator for harmonization. Trained on a dataset comprising T1-weighted images from 11 different scanners, IGUANe was evaluated on data from unseen sites. The assessments included the transformation of MR images with traveling subjects, the preservation of pairwise distances between MR images within domains, the evolution of volumetric patterns related to age and Alzheimer$'$s disease (AD), and the performance in age regression and patient classification tasks. Comparisons with other harmonization and normalization methods suggest that IGUANe better preserves individual information in MR images and is more suitable for maintaining and reinforcing variabilities related to age and AD. Future studies may further assess IGUANe in other multicenter contexts, either using the same model or retraining it for applications to different image modalities. IGUANe is available at this https URL.
- [412] arXiv:2402.06900 (replaced) [pdf, html, other]
-
Title: Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity MetricComments: 8 page longJournal-ref: EMNLP2024 findingsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
In the pursuit of developing Large Language Models (LLMs) that adhere to societal standards, it is imperative to detect the toxicity in the generated text. The majority of existing toxicity metrics rely on encoder models trained on specific toxicity datasets, which are susceptible to out-of-distribution (OOD) problems and depend on the dataset's definition of toxicity. In this paper, we introduce a robust metric grounded on LLMs to flexibly measure toxicity according to the given definition. We first analyze the toxicity factors, followed by an examination of the intrinsic toxic attributes of LLMs to ascertain their suitability as evaluators. Finally, we evaluate the performance of our metric with detailed analysis. Our empirical results demonstrate outstanding performance in measuring toxicity within verified factors, improving on conventional metrics by 12 points in the F1 score. Our findings also indicate that upstream toxicity significantly influences downstream metrics, suggesting that LLMs are unsuitable for toxicity evaluations within unverified factors.
- [413] arXiv:2403.04783 (replaced) [pdf, html, other]
-
Title: AutoDefense: Multi-Agent LLM Defense against Jailbreak AttacksSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Despite extensive pre-training in moral alignment to prevent generating harmful information, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With the response-filtering mechanism, our framework is robust against different jailbreak attack prompts, and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division in tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense can effectively defense against different jailbreak attacks, while maintaining the performance at normal user request. For example, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b with a 3-agent system. Our code and data are publicly available at this https URL.
- [414] arXiv:2403.07032 (replaced) [pdf, other]
-
Title: STARFlow: Spatial Temporal Feature Re-embedding with Attentive Learning for Real-world Scene FlowComments: This paper was renamed to:"SSRFlow: Semantic-aware Fusion with Spatial Temporal Re-embedding for Real-world Scene Flow" [arXiv:2408.07825] and was accepted in 3DV 2025Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Scene flow prediction is a crucial underlying task in understanding dynamic scenes as it offers fundamental motion information. However, contemporary scene flow methods encounter three major challenges. Firstly, flow estimation solely based on local receptive fields lacks long-dependency matching of point pairs. To address this issue, we propose global attentive flow embedding to match all-to-all point pairs in both feature space and Euclidean space, providing global initialization before local refinement. Secondly, there are deformations existing in non-rigid objects after warping, which leads to variations in the spatiotemporal relation between the consecutive frames. For a more precise estimation of residual flow, a spatial temporal feature re-embedding module is devised to acquire the sequence features after deformation. Furthermore, previous methods perform poor generalization due to the significant domain gap between the synthesized and LiDAR-scanned datasets. We leverage novel domain adaptive losses to effectively bridge the gap of motion inference from synthetic to real-world. Experiments demonstrate that our approach achieves state-of-the-art performance across various datasets, with particularly outstanding results on real-world LiDAR-scanned datasets. Our code is available at this https URL.
- [415] arXiv:2403.07331 (replaced) [pdf, html, other]
-
Title: LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword QueriesComments: Accepted by VLDB JournalSubjects: Information Retrieval (cs.IR); Databases (cs.DB)
With the proliferation of spatio-textual data, Top-k KNN spatial keyword queries (TkQs), which return a list of objects based on a ranking function that considers both spatial and textual relevance, have found many real-life applications. To efficiently handle TkQs, many indexes have been developed, but the effectiveness of TkQ is limited. To improve effectiveness, several deep learning models have recently been proposed, but they suffer severe efficiency issues and there are no efficient indexes specifically designed to accelerate the top-k search process for these deep learning models. To tackle these issues, we consider embedding based spatial keyword queries, which capture the semantic meaning of query keywords and object descriptions in two separate embeddings to evaluate textual relevance. Although various models can be used to generate these embeddings, no indexes have been specifically designed for such queries. To fill this gap, we propose LIST, a novel machine learning based Approximate Nearest Neighbor Search index that Learns to Index the Spatio-Textual data. LIST utilizes a new learning-to-cluster technique to group relevant queries and objects together while separating irrelevant queries and objects. There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced clustering results. We develop a novel pseudo-label generation technique to address the two challenges. Additionally, we introduce a learning based spatial relevance model that can integrates with various text relevance models to form a lightweight yet effective relevance for reranking objects retrieved by LIST.
- [416] arXiv:2403.10873 (replaced) [pdf, html, other]
-
Title: CSI Transfer From Sub-6G to mmWave: Reduced-Overhead Multi-User Hybrid BeamformingComments: Accepted by IEEE JSAC NGATSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Hybrid beamforming is vital in modern wireless systems, especially for massive MIMO and millimeter-wave (mmWave) deployments, offering efficient directional transmission with reduced hardware complexity. However, effective beamforming in multi-user scenarios relies heavily on accurate channel state information, the acquisition of which often requires significant pilot overhead, degrading system performance. To address this and inspired by the spatial congruence between sub-6GHz (sub-6G) and mmWave channels, we propose a Sub-6G information Aided Multi-User Hybrid Beamforming (SA-MUHBF) framework, avoiding excessive use of pilots at mmWave. SA-MUHBF employs a convolutional neural network to predict mmWave beamspace from sub-6G channel estimate, followed by a novel multi-layer graph neural network for analog beam selection and a linear minimum mean-square error algorithm for digital beamforming. Numerical results demonstrate that SA-MUHBF efficiently predicts the mmWave beamspace representation and achieves superior spectrum efficiency over state-of-the-art benchmarks. Moreover, SA-MUHBF demonstrates robust performance across varied sub-6G system configurations and exhibits strong generalization to unseen scenarios.
- [417] arXiv:2403.12778 (replaced) [pdf, html, other]
-
Title: ViTGaze: Gaze Following with Interaction Features in Vision TransformersComments: 15 pages; Accepted by Visual IntelligenceJournal-ref: Visual Intelligence, 2024, Vol. 2, article no. 31Subjects: Computer Vision and Pattern Recognition (cs.CV)
Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework, whereby multi-modality information is extracted in the initial stage for gaze target prediction. Consequently, the efficacy of these methods highly depends on the precision of the preceding modality extraction. Others use a single-modality approach with complex decoders, increasing network computational load. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it creates a novel gaze following framework based mainly on powerful encoders (relative decoder parameters less than 1%). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Leveraging this presumption, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViT with self-supervised pre-training has an enhanced ability to extract correlation information. Many experiments have been conducted to demonstrate the performance of the proposed method. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (3.4% improvement in the area under curve (AUC) score, 5.1% improvement in the average precision (AP)) and very comparable performance against multi-modality methods with 59% number of parameters less.
- [418] arXiv:2403.13777 (replaced) [pdf, html, other]
-
Title: Embedding Pose Graph, Enabling 3D Foundation Model Capabilities with a Compact RepresentationSubjects: Robotics (cs.RO)
This paper presents the Embedding Pose Graph (EPG), an innovative method that combines the strengths of foundation models with a simple 3D representation suitable for robotics applications. Addressing the need for efficient spatial understanding in robotics, EPG provides a compact yet powerful approach by attaching foundation model features to the nodes of a pose graph. Unlike traditional methods that rely on bulky data formats like voxel grids or point clouds, EPG is lightweight and scalable. It facilitates a range of robotic tasks, including open-vocabulary querying, disambiguation, image-based querying, language-directed navigation, and re-localization in 3D environments. We showcase the effectiveness of EPG in handling these tasks, demonstrating its capacity to improve how robots interact with and navigate through complex spaces. Through both qualitative and quantitative assessments, we illustrate EPG's strong performance and its ability to outperform existing methods in re-localization. Our work introduces a crucial step forward in enabling robots to efficiently understand and operate within large-scale 3D spaces.
- [419] arXiv:2404.01901 (replaced) [pdf, html, other]
-
Title: Learning-based model augmentation with LFRsComments: Submitted for ECC 2025Subjects: Systems and Control (eess.SY)
Nonlinear system identification (NL-SI) has proven to be effective in obtaining accurate models for highly complex systems. Especially, recent encoder-based methods for artificial neural networks state-space (ANN-SS) models have achieved state-of-the-art performance on various benchmarks, while offering consistency and computational efficiency. The inclusion of prior knowledge of the system can be exploited to increase (i) estimation speed, (ii) accuracy, and (iii) interpretability of the resulting models. This paper proposes an encoder based model augmentation method incorporating prior knowledge from first-principles (FP) models. We introduce a novel linear-fractional-representation (LFR) model structure that allows for the unified representation of various augmentation structures including the ones that are commonly used in the literature, and an identification algorithm for estimating the proposed structure together with appropriate initialization methods. The performance and generalization capabilities of the proposed method are demonstrated on a hardening mass-spring-damper simulation.
- [420] arXiv:2404.03862 (replaced) [pdf, other]
-
Title: Verifiable by Design: Aligning Language Models to Quote from Pre-Training DataSubjects: Computation and Language (cs.CL)
To trust the fluent generations of large language models (LLMs), humans must be able to verify their correctness against trusted, external sources. Recent efforts, such as providing citations via retrieved documents or post-hoc provenance, enhance verifiability but provide no guarantees on their correctness. To address these limitations, we tackle the verifiability goal with a different philosophy: trivializing the verification process by developing models that quote verbatim statements from trusted sources in their pre-training data. We propose Quote-Tuning, which demonstrates the feasibility of aligning models to quote. The core of Quote-Tuning is a fast membership inference function that efficiently verifies text against trusted corpora. We leverage this tool to design a reward function to quantify quotes in model responses, and curate datasets for preference learning. Experiments show that Quote-Tuning significantly increases verbatim quotes from high-quality documents by up to 130% relative to base models while maintaining response quality. Quote-Tuning is applicable in different tasks, generalizes to out-of-domain data and diverse model families, and provides additional benefits to truthfulness. Our method not only serves as a hassle-free method to increase quoting but also opens up avenues for improving LLM trustworthiness through better verifiability.
- [421] arXiv:2404.05953 (replaced) [pdf, html, other]
-
Title: 3D Branch Point Cloud Completion for Robotic Pruning in Apple OrchardsComments: Accepted by IROS 2024Subjects: Robotics (cs.RO)
Robotic branch pruning is a significantly growing research area to cope with the shortage of labor force in the context of agriculture. One fundamental requirement in robotic pruning is the perception of detailed geometry and topology of branches. However, the point clouds obtained in agricultural settings often exhibit incompleteness due to several constraints, thereby restricting the accuracy of downstream robotic pruning. In this work, we addressed the issue of point cloud quality through a simulation-based deep neural network, leveraging a Real-to-Simulation (Real2Sim) data generation pipeline that not only eliminates the need for manual parameterization but also guarantees the realism of simulated data. The simulation-based neural network was applied to jointly perform point cloud completion and skeletonization on real-world partial branches, without additional real-world training. The Sim2Real qualitative completion and skeletonization results showed the model's remarkable capability for geometry reconstruction and topology prediction. Additionally, we quantitatively evaluated the Sim2Real performance by comparing branch-level trait characterization errors using raw incomplete data and complete data. The Mean Absolute Error (MAE) reduced by 75% and 8% for branch diameter and branch angle estimation, respectively, using the best complete data, which indicates the effectiveness of the Real2Sim data in a zero-shot generalization setting. The characterization improvements contributed to the precision and efficacy of robotic branch pruning.
- [422] arXiv:2404.06815 (replaced) [pdf, html, other]
-
Title: Security Assessment of the LG CryptosystemComments: Accepted for publication to the journal AAECCSubjects: Cryptography and Security (cs.CR)
The LG cryptosystem is a public-key encryption scheme in the rank metric using the recent family of $\lambdav-$Gabidulin codes and introduced in 2019 by Lau and Tan. In this paper, we present a cryptanalysis showing that the security of several parameters of the scheme have been overestimated. We also show the existence of some weak keys allowing an attacker to find in polynomial time an alternative private key.
- [423] arXiv:2404.07878 (replaced) [pdf, html, other]
-
Title: LeapFrog: The Rowhammer Instruction Skip AttackComments: Accepted at this http URL 2024Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
Since its inception, Rowhammer exploits have rapidly evolved into increasingly sophisticated threats compromising data integrity and the control flow integrity of victim processes. Nevertheless, it remains a challenge for an attacker to identify vulnerable targets (i.e., Rowhammer gadgets), understand the outcome of the attempted fault, and formulate an attack that yields useful results.
In this paper, we present a new type of Rowhammer gadget, called a LeapFrog gadget, which, when present in the victim code, allows an adversary to subvert code execution to bypass a critical piece of code (e.g., authentication check logic, encryption rounds, padding in security protocols). The LeapFrog gadget manifests when the victim code stores the Program Counter (PC) value in the user or kernel stack (e.g., a return address during a function call) which, when tampered with, repositions the return address to a location that bypasses a security-critical code pattern.
This research also presents a systematic process to identify LeapFrog gadgets. This methodology enables the automated detection of susceptible targets and the determination of optimal attack parameters. We first show the attack on a decision tree algorithm to show the potential implications. Secondly, we employ the attack on OpenSSL to bypass the encryption and reveal the plaintext. We then use our tools to scan the Open Quantum Safe library and report on the number of LeapFrog gadgets in the code. Lastly, we demonstrate this new attack vector through a practical demonstration in a client/server TLS handshake scenario, successfully inducing an instruction skip in a client application. Our findings extend the impact of Rowhammer attacks on control flow and contribute to developing more robust defenses against these increasingly sophisticated threats. - [424] arXiv:2404.07940 (replaced) [pdf, other]
-
Title: InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language ModelsLinyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, Hongxia YangComments: 31 pages. Appear at NeurIPS 2024 Datasets and Benchmarks track. Project website: this https URLSubjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs with a particular focus on code generation tasks. However, they are insufficient to cover the full range of expected capabilities of code LLMs, which span beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, the first large-scale freeform question-answering (QA) benchmark for code to our knowledge, comprising 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness where domain experts carefully concretize the criterion for each question. We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source at this https URL and continuously expanding to foster more scientific and systematic practices for code LLM evaluation.
- [425] arXiv:2404.08434 (replaced) [pdf, html, other]
-
Title: An improved tabular data generator with VAE-GMM integrationComments: 7 pages, 3 figuresJournal-ref: P. A. Apellaniz, J. Parras and S. Zazo, "An Improved Tabular Data Generator with VAE-GMM Integration," 2024 32nd European Signal Processing Conference (EUSIPCO), Lyon, France, 2024, pp. 1886-1890Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The rising use of machine learning in various fields requires robust methods to create synthetic tabular data. Data should preserve key characteristics while addressing data scarcity challenges. Current approaches based on Generative Adversarial Networks, such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data. These data often contain both continuous and discrete features with non-Gaussian distributions. Therefore, we propose a novel Variational Autoencoder (VAE)-based model that addresses these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space, allowing for a more accurate representation of the underlying data distribution during data generation. Furthermore, our model offers enhanced flexibility by allowing the use of various differentiable distributions for individual features, making it possible to handle both continuous and discrete data types. We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones, based on their resemblance and utility. This evaluation demonstrates significant outperformance against CTGAN and TVAE, establishing its potential as a valuable tool for generating synthetic tabular data in various domains, particularly in healthcare.
- [426] arXiv:2404.14955 (replaced) [pdf, html, other]
-
Title: A Comprehensive Survey for Hyperspectral Image Classification: The Evolution from Conventional to Transformers and Mamba ModelsMuhammad Ahmad, Salvatore Distifano, Adil Mehmood Khan, Manuel Mazzara, Chenyu Li, Hao Li, Jagannath Aryal, Yao Ding, Gemine Vivone, Danfeng HongSubjects: Computer Vision and Pattern Recognition (cs.CV)
Hyperspectral Image Classification (HSC) presents significant challenges owing to the high dimensionality and intricate nature of Hyperspectral (HS) data. While traditional Machine Learning (TML) approaches have demonstrated effectiveness, they often encounter substantial obstacles in real-world applications, including the variability of optimal feature sets, subjectivity in human-driven design, inherent biases, and methodological limitations. Specifically, TML suffers from the curse of dimensionality, difficulties in feature selection and extraction, insufficient consideration of spatial information, limited robustness against noise, scalability issues, and inadequate adaptability to complex data distributions. In recent years, Deep Learning (DL) techniques have emerged as robust solutions to address these challenges. This survey offers a comprehensive overview of current trends and future prospects in HSC, emphasizing advancements from DL models to the increasing adoption of Transformer and Mamba Model architectures. We systematically review key concepts, methodologies, and state-of-the-art approaches in DL for HSC. Furthermore, we investigate the potential of Transformer-based models and the Mamba Model in HSC, detailing their advantages and challenges. Emerging trends in HSC are explored, including in-depth discussions on Explainable AI and Interoperability concepts, alongside Diffusion Models for image denoising, feature extraction, and image fusion. Comprehensive experimental results were conducted on three HS datasets to substantiate the efficacy of various conventional DL models and Transformers. Additionally, we identify several open challenges and pertinent research questions in the field of HSC. Finally, we outline future research directions and potential applications aimed at enhancing the accuracy and efficiency of HSC.
- [427] arXiv:2404.16824 (replaced) [pdf, html, other]
-
Title: V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright ProtectionComments: Accepted by ACM MM 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
AI-generated video has revolutionized short video production, filmmaking, and personalized media, making video local editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To solve this urgent issue, V2A-Mark is proposed to address the limitations of current video tampering forensics, such as poor generalizability, singular function, and single modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance the localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, crucial for the sustainable development of video editing in the AIGC video era.
- [428] arXiv:2404.18069 (replaced) [pdf, other]
-
Title: Local Discontinuous Galerkin method for fractional Korteweg-de Vries equationComments: There are some incomplete estimates and require to work on it. It may lead to incorrect justificationSubjects: Numerical Analysis (math.NA)
We propose a local discontinuous Galerkin (LDG) method for fractional Korteweg-de Vries equation involving the fractional Laplacian with exponent $\alpha\in (1,2)$ in one and two space dimensions. By decomposing the fractional Laplacian into a first order derivative and a fractional integral, we prove $L^2$-stability of the semi-discrete LDG scheme incorporating suitable interface and boundary fluxes. We analyze the error estimate by considering linear convection term and utilizing the estimate, we derive the error estimate for general nonlinear flux and demonstrate an order of convergence $\mathcal{O}(h^{k+1/2})$. Moreover, the stability and error analysis have been extended to multiple space dimensional case. Numerical illustrations are shown to demonstrate the efficiency of the scheme by obtaining an optimal order of convergence.
- [429] arXiv:2404.18373 (replaced) [pdf, html, other]
-
Title: 6G comprehensive intelligence: network operations and optimization based on Large Language ModelsComments: 8 pages, 5 figures, 15 preferencesSubjects: Networking and Internet Architecture (cs.NI)
The sixth generation mobile communication standard (6G) can promote the development of Industrial Internet and Internet of Things (IoT). To achieve comprehensive intelligent development of the network and provide customers with higher quality personalized services. This paper proposes a network performance optimization and intelligent operation network architecture based on Large Language Model (LLM), aiming to build a comprehensive intelligent 6G network system. The Large Language Model, with more parameters and stronger learning ability, can more accurately capture patterns and features in data, which can achieve more accurate content output and high intelligence and provide strong support for related research such as network data security, privacy protection, and health assessment. This paper also presents the design framework of a network health assessment system based on LLM and focuses on its potential application value, through the case of network health management system, it is fully demonstrated that the 6G intelligent network system based on LLM has important practical significance for the comprehensive realization of intelligence.
- [430] arXiv:2404.19513 (replaced) [pdf, html, other]
-
Title: A Smartphone-Based Method for Assessing Tomato Nutrient Status through Trichome Density MeasurementSubjects: Computer Vision and Pattern Recognition (cs.CV)
Early detection of fertilizer-induced stress in tomato plants is crucial for optimizing crop yield through timely management interventions. While conventional optical methods struggle to detect fertilizer stress in young leaves, these leaves contain valuable diagnostic information through their microscopic hair-like structures, particularly trichomes, which existing approaches have overlooked. This study introduces a smartphone-based noninvasive technique that leverages mobile computing and digital imaging capabilities to quantify trichome density on young leaves with superior detection latency. Our method uniquely combines augmented reality technology with image processing algorithms to analyze trichomes transferred onto specialized measurement paper. A robust automated pipeline processes these images through region extraction, perspective transformation, and illumination correction to precisely quantify trichome density. Validation experiments on hydroponically grown tomatoes under varying fertilizer conditions demonstrated the method's effectiveness. Leave-one-out cross-validation revealed strong predictive performance with the area under the precision-recall curve (PR-AUC: 0.82) and area under the receiver operating characteristic curve (ROC-AUC: 0.64), while the predicted and observed trichome densities exhibited high correlation ($r = 0.79$). This innovative approach transforms smartphones into precise diagnostic tools for plant nutrition assessment, offering a practical, cost-effective solution for precision agriculture.
- [431] arXiv:2405.07587 (replaced) [pdf, html, other]
-
Title: Structure-Preserving Model Order Reduction for Nonlinear DAE Models of Power NetworksSubjects: Systems and Control (eess.SY)
This paper deals with the joint reduction of the number of dynamic and algebraic states of a nonlinear differential-algebraic equation (NDAE) model of a power network. The dynamic states depict the internal states of generators, loads, renewables, whereas the algebraic ones define network states such as voltages and phase angles. In the current literature of power system model order reduction (MOR), the algebraic constraints are usually neglected and the power network is commonly modeled via a set of ordinary differential equations (ODEs) instead of NDAEs. Thus, reduction is usually carried out for the dynamic states only and the algebraic variables are kept intact. This leaves a significant part of the system's size and complexity unreduced. This paper addresses this aforementioned limitation by jointly reducing both dynamic and algebraic variables. As compared to the literature the proposed MOR techniques are endowed with the following features: (i) no system linearization is required, (ii) require no transformation to an equivalent or approximate ODE representation, (iii) guarantee that the reduced order model to be NDAE-structured and thus preserves the differential-algebraic structure of original power system model, and (iv) can seamlessly reduce both dynamic and algebraic variables while maintaining high accuracy. Case studies performed on a 2000-bus power system reveal that the proposed MOR techniques are able to reduce system order while maintaining accuracy.
- [432] arXiv:2405.08318 (replaced) [pdf, html, other]
-
Title: No-Regret Learning of Nash Equilibrium for Black-Box Games via Gaussian ProcessesComments: 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024)Subjects: Machine Learning (cs.LG)
This paper investigates the challenge of learning in black-box games, where the underlying utility function is unknown to any of the agents. While there is an extensive body of literature on the theoretical analysis of algorithms for computing the Nash equilibrium with complete information about the game, studies on Nash equilibrium in black-box games are less common. In this paper, we focus on learning the Nash equilibrium when the only available information about an agent's payoff comes in the form of empirical queries. We provide a no-regret learning algorithm that utilizes Gaussian processes to identify the equilibrium in such games. Our approach not only ensures a theoretical convergence rate but also demonstrates effectiveness across a variety collection of games through experimental validation.
- [433] arXiv:2405.09276 (replaced) [pdf, html, other]
-
Title: Dual-Segment Clustering Strategy for Hierarchical Federated Learning in Heterogeneous Wireless EnvironmentsPengcheng Sun, Erwu Liu, Wei Ni, Kanglei Yu, Xinyu Qu, Rui Wang, Yanlong Bi, Chuanchun Zhang, Abbas JamalipourSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Non-independent and identically distributed (Non- IID) data adversely affects federated learning (FL) while heterogeneity in communication quality can undermine the reliability of model parameter transmission, potentially degrading wireless FL convergence. This paper proposes a novel dual-segment clustering (DSC) strategy that jointly addresses communication and data heterogeneity in FL. This is achieved by defining a new signal-to-noise ratio (SNR) matrix and information quantity matrix to capture the communication and data heterogeneity, respectively. The celebrated affinity propagation algorithm is leveraged to iteratively refine the clustering of clients based on the newly defined matrices effectively enhancing model aggregation in heterogeneous environments. The convergence analysis and experimental results show that the DSC strategy can improve the convergence rate of wireless FL and demonstrate superior accuracy in heterogeneous environments compared to classical clustering methods.
- [434] arXiv:2405.09336 (replaced) [pdf, html, other]
-
Title: Analytical Characterization of the Operational Diversity Order in Fading ChannelsSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
We introduce and characterize the operational diversity order (ODO) in fading channels, as a proxy to the classical notion of diversity order at any arbitrary operational signal-to-noise ratio (SNR). Thanks to this definition, relevant insights are brought up in a number of cases: (i) We quantify that in dominant line-of-sight scenarios an increased diversity order is attainable compared to that achieved asymptotically, even in the single-antenna case; (ii) this effect is attenuated, but still visible, in the presence of an additional dominant specular component; (iii) the decay slope in Rayleigh product channels increases very slowly, never fully achieving unitary slope for a finite SNR.
- [435] arXiv:2405.09596 (replaced) [pdf, html, other]
-
Title: Enhancing Maritime Trajectory Forecasting via H3 Index and Causal Language Modelling (CLM)Comments: 28 pages, 18 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
The prediction of ship trajectories is a growing field of study in artificial intelligence. Traditional methods rely on the use of LSTM, GRU networks, and even Transformer architectures for the prediction of spatio-temporal series. This study proposes a viable alternative for predicting these trajectories using only GNSS positions. It considers this spatio-temporal problem as a natural language processing problem. The latitude/longitude coordinates of AIS messages are transformed into cell identifiers using the H3 index. Thanks to the pseudo-octal representation, it becomes easier for language models to learn the spatial hierarchy of the H3 index. The method is compared with a classical Kalman filter, widely used in the maritime domain, and introduces the Fréchet distance as the main evaluation metric. We show that it is possible to predict ship trajectories quite precisely up to 8 hours ahead with 30 minutes of context, using solely GNSS positions, without relying on any additional information such as speed, course, or external conditions - unlike many traditional methods. We demonstrate that this alternative works well enough to predict trajectories worldwide.
- [436] arXiv:2405.14325 (replaced) [pdf, html, other]
-
Title: Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent studies highlighted a practical setting of unsupervised anomaly detection (UAD) that builds a unified model for multi-class images. Despite various advancements addressing this challenging task, the detection performance under the multi-class setting still lags far behind state-of-the-art class-separated models. Our research aims to bridge this substantial performance gap. In this paper, we introduce a minimalistic reconstruction-based anomaly detection framework, namely Dinomaly, which leverages pure Transformer architectures without relying on complex designs, additional modules, or specialized tricks. Given this powerful framework consisted of only Attentions and MLPs, we found four simple components that are essential to multi-class anomaly detection: (1) Foundation Transformers that extracts universal and discriminative features, (2) Noisy Bottleneck where pre-existing Dropouts do all the noise injection tricks, (3) Linear Attention that naturally cannot focus, and (4) Loose Reconstruction that does not force layer-to-layer and point-by-point reconstruction. Extensive experiments are conducted across popular anomaly detection benchmarks including MVTec-AD, VisA, and Real-IAD. Our proposed Dinomaly achieves impressive image-level AUROC of 99.6%, 98.7%, and 89.3% on the three datasets respectively, which is not only superior to state-of-the-art multi-class UAD methods, but also achieves the most advanced class-separated UAD records.
- [437] arXiv:2405.17724 (replaced) [pdf, html, other]
-
Title: ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion ModelsSubjects: Artificial Intelligence (cs.AI)
Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce
$\textbf{C}luster$ $\textbf{La}tent$ $\textbf{Va}riable$ $guided$ $\textbf{D}enoising$ $\textbf{D}iffusion$ $\textbf{P}robabilistic$ $\textbf{M}odels$ (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively.
Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data. - [438] arXiv:2405.18684 (replaced) [pdf, html, other]
-
Title: Learning Diffeomorphism for Image Registration with Time-Continuous Networks using Semigroup RegularizationComments: 20 pages, 9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Diffeomorphic image registration (DIR) is a critical task in 3D medical image analysis, aimed at finding topology preserving deformations between pairs of images. Focusing on the solution of the flow map differential equation as the diffeomorphic deformation, recent methods use discrete timesteps along with various regularization terms to penalize the negative determinant of Jacobian and impose smoothness of the solution vector field. In this paper, we propose a novel learning-based approach for diffeomorphic 3D-image registration which finds the diffeomorphisms in the time continuum with only a single regularization term and no additional integration. As one of the fundamental properties of flow maps, we exploit the semigroup property as the only form of regularization, ensuring temporally continuous diffeomorphic flows between pairs of images. Leveraging this property, our method alleviates the need for additional regularization terms and scaling and squaring integration during both training and evaluation. To achieve time-continuous diffeomorphisms, we employ time-embedded UNets, an architecture commonly utilized in diffusion models. The proposed method reveals that ensuring diffeomorphism in a continuous time interval leads to better registration results. Experimental results on four public datasets demonstrate the superiority of our model over both learning-based and optimization-based methods.
- [439] arXiv:2405.18839 (replaced) [pdf, html, other]
-
Title: MEGA: Masked Generative Autoencoder for Human Mesh RecoverySubjects: Computer Vision and Pattern Recognition (cs.CV)
Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this issue and make a single prediction without accounting for this ambiguity. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.
- [440] arXiv:2405.20155 (replaced) [pdf, html, other]
-
Title: MotionDreamer: Exploring Semantic Video Diffusion features for Zero-Shot 3D Mesh AnimationSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Animation techniques bring digital 3D worlds and characters to life. However, manual animation is tedious and automated techniques are often specialized to narrow shape classes. In our work, we propose a technique for automatic re-animation of various 3D shapes based on a motion prior extracted from a video diffusion model. Unlike existing 4D generation methods, we focus solely on the motion, and we leverage an explicit mesh-based representation compatible with existing computer-graphics pipelines. Furthermore, our utilization of diffusion features enhances accuracy of our motion fitting. We analyze efficacy of these features for animation fitting and we experimentally validate our approach for two different diffusion models and four animation models. Finally, we demonstrate that our time-efficient zero-shot method achieves a superior performance re-animating a diverse set of 3D shapes when compared to existing techniques in a user study. The project website is located at this https URL.
- [441] arXiv:2406.01395 (replaced) [pdf, html, other]
-
Title: TE-NeXt: A LiDAR-Based 3D Sparse Convolutional Network for Traversability EstimationComments: This work has been submitted to the Expert Systems With applicationsSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper presents TE-NeXt, a novel and efficient architecture for Traversability Estimation (TE) from sparse LiDAR point clouds based on a residual convolution block. TE-NeXt block fuses notions of current trends such as attention mechanisms and 3D sparse convolutions. TE-NeXt aims to demonstrate high capacity for generalisation in a variety of urban and natural environments, using well-known and accessible datasets such as SemanticKITTI, Rellis-3D and SemanticUSL. Thus, the designed architecture ouperforms state-of-the-art methods in the problem of semantic segmentation, demonstrating better results in unstructured environments and maintaining high reliability and robustness in urbans environments, which leads to better abstraction. Implementation is available in a open repository to the scientific community with the aim of ensuring the reproducibility of results.
- [442] arXiv:2406.01767 (replaced) [pdf, html, other]
-
Title: Region-aware Grasp Framework with Normalized Grasp Space for Efficient 6-DoF GraspingComments: Accepted by CoRL2024, final camera-ready version will be updated soonSubjects: Robotics (cs.RO)
A series of region-based methods succeed in extracting regional features and enhancing grasp detection quality. However, faced with a cluttered scene with potential collision, the definition of the grasp-relevant region stays inconsistent, and the relationship between grasps and regional spaces remains incompletely investigated. In this paper, we propose Normalized Grasp Space (NGS) from a novel region-aware viewpoint, unifying the grasp representation within a normalized regional space and benefiting the generalizability of methods. Leveraging the NGS, we find that CNNs are underestimated for 3D feature extraction and 6-DoF grasp detection in clutter scenes and build a highly efficient Region-aware Normalized Grasp Network (RNGNet). Experiments on the public benchmark show that our method achieves significant >20% performance gains while attaining a real-time inference speed of approximately 50 FPS. Real-world cluttered scene clearance experiments underscore the effectiveness of our method. Further, human-to-robot handover and dynamic object grasping experiments demonstrate the potential of our proposed method for closed-loop grasping in dynamic scenarios.
- [443] arXiv:2406.02458 (replaced) [pdf, other]
-
Title: Deep Block Proximal Linearised Minimisation Algorithm for Non-convex Inverse ProblemsChaoyan Huang, Zhongming Wu, Yanqi Cheng, Tieyong Zeng, Carola-Bibiane Schönlieb, Angelica I. Aviles-RiveroComments: 6 figures, 3 tablesSubjects: Numerical Analysis (math.NA)
Image restoration is typically addressed through non-convex inverse problems, which are often solved using first-order block-wise splitting methods. In this paper, we consider a general type of non-convex optimisation model that captures many inverse image problems and present an inertial block proximal linearised minimisation (iBPLM) algorithm. Our new method unifies the Jacobi-type parallel and the Gauss-Seidel-type alternating update rules, and extends beyond these approaches. The inertial technique is also incorporated into each block-wise subproblem update, which can accelerate numerical convergence. Furthermore, we extend this framework with a plug-and-play variant (PnP-iBPLM) that integrates deep gradient denoisers, offering a flexible and robust solution for complex imaging tasks. We provide comprehensive theoretical analysis, demonstrating both subsequential and global convergence of the proposed algorithms. To validate our methods, we apply them to multi-block dictionary learning problems in image denoising and deblurring. Experimental results show that both iBPLM and PnP-iBPLM significantly enhance numerical performance and robustness in these applications.
- [444] arXiv:2406.03230 (replaced) [pdf, html, other]
-
Title: Defending Large Language Models Against Attacks With Residual Stream Activation AnalysisSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
The widespread adoption of Large Language Models (LLMs), exemplified by OpenAI's ChatGPT, brings to the forefront the imperative to defend against adversarial threats on these models. These attacks, which manipulate an LLM's output by introducing malicious inputs, undermine the model's integrity and the trust users place in its outputs. In response to this challenge, our paper presents an innovative defensive strategy, given white box access to an LLM, that harnesses residual activation analysis between transformer layers of the LLM. We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification. We curate multiple datasets to demonstrate how this method of classification has high accuracy across multiple types of attack scenarios, including our newly-created attack dataset. Furthermore, we enhance the model's resilience by integrating safety fine-tuning techniques for LLMs in order to measure its effect on our capability to detect attacks. The results underscore the effectiveness of our approach in enhancing the detection and mitigation of adversarial inputs, advancing the security framework within which LLMs operate.
- [445] arXiv:2406.03578 (replaced) [pdf, html, other]
-
Title: Two-dimensional Kripke Semantics II: Stability and CompletenessComments: Accepted at MFPS 2024Subjects: Logic in Computer Science (cs.LO); Category Theory (math.CT); Logic (math.LO)
We revisit the duality between Kripke and algebraic semantics of intuitionistic and intuitionistic modal logic. We find that there is a certain mismatch between the two semantics, which means that not all algebraic models can be embedded into a Kripke model. This leads to an alternative proposal for a relational semantics, the stable semantics. Instead of an arbitrary partial order, the stable semantics requires a distributive lattice of worlds. We constructively show that the stable semantics is exactly as complete as the algebraic semantics. Categorifying these results leads to a 2-duality between two-dimensional stable semantics and categories of product-preserving presheaves, i.e. models of algebraic theories in the style of Lawvere.
- [446] arXiv:2406.06526 (replaced) [pdf, html, other]
-
Title: GaussianCity: Generative Gaussian Splatting for Unbounded 3D City GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV)
3D city generation with NeRF-based methods shows promising generation results but is computationally inefficient. Recently 3D Gaussian Splatting (3D-GS) has emerged as a highly efficient alternative for object-level 3D generation. However, adapting 3D-GS from finite-scale 3D objects and humans to infinite-scale 3D cities is non-trivial. Unbounded 3D city generation entails significant storage overhead (out-of-memory issues), arising from the need to expand points to billions, often demanding hundreds of Gigabytes of VRAM for a city scene spanning 10km^2. In this paper, we propose GaussianCity, a generative Gaussian Splatting framework dedicated to efficiently synthesizing unbounded 3D cities with a single feed-forward pass. Our key insights are two-fold: 1) Compact 3D Scene Representation: We introduce BEV-Point as a highly compact intermediate representation, ensuring that the growth in VRAM usage for unbounded scenes remains constant, thus enabling unbounded city generation. 2) Spatial-aware Gaussian Attribute Decoder: We present spatial-aware BEV-Point decoder to produce 3D Gaussian attributes, which leverages Point Serializer to integrate the structural and contextual characteristics of BEV points. Extensive experiments demonstrate that GaussianCity achieves state-of-the-art results in both drone-view and street-view 3D city generation. Notably, compared to CityDreamer, GaussianCity exhibits superior performance with a speedup of 60 times (10.72 FPS v.s. 0.18 FPS).
- [447] arXiv:2406.07473 (replaced) [pdf, other]
-
Title: Choreographing the Rhythms of Observation: Dynamics for Ranged Observer Bipartite-Unipartite SpatioTemporal (ROBUST) NetworksComments: 459 pages, PhD thesisSubjects: Multiagent Systems (cs.MA)
Existing network analysis methods struggle to optimize observer placements in dynamic environments with limited visibility. This dissertation introduces the novel ROBUST (Ranged Observer Bipartite-Unipartite SpatioTemporal) framework, offering a significant advancement in modeling, analyzing, and optimizing observer networks within complex spatiotemporal domains. ROBUST leverages a unique bipartite-unipartite approach, distinguishing between observer and observable entities while incorporating spatial constraints and temporal dynamics.
This research extends spatiotemporal network theory by introducing novel graph-based measures, including myopic degree, spatial closeness centrality, and edge length proportion. These measures, coupled with advanced clustering techniques like Proximal Recurrence, provide insights into network structure, resilience, and the effectiveness of observer placements. The ROBUST framework demonstrates superior resource allocation and strategic responsiveness compared to conventional models. Case studies in oceanographic monitoring, urban safety networks, and multi-agent path planning showcases its practical applicability and adaptability. Results demonstrate significant improvements in coverage, response times, and overall network efficiency.
This work paves the way for future research in incorporating imperfect knowledge, refining temporal pathing methodologies, and expanding the scope of applications. By bridging theoretical advancements with practical solutions, ROBUST stands as a significant contribution to the field, promising to inform and inspire ongoing and future endeavors in network optimization and multi-agent system planning. - [448] arXiv:2406.10880 (replaced) [pdf, html, other]
-
Title: Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASRComments: Accepted to EMNLP 2024 FindingsSubjects: Computation and Language (cs.CL)
Recent advancements in multimodal large language models (MLLMs) have made significant progress in integrating information across various modalities, yet real-world applications in educational and scientific domains remain challenging. This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by leveraging visual information from slides to enhance the accuracy of technical terminologies. Realized that traditional metrics like WER fall short in assessing performance accurately, prompting the proposal of severity-aware WER (SWER) that considers the content type and severity of ASR errors. We propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, enabling MLLMs to improve transcript quality through post-editing. Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.
- [449] arXiv:2406.11214 (replaced) [pdf, html, other]
-
Title: Problematic Tokens: Tokenizer Bias in Large Language ModelsComments: 11th IEEE Special session on Privacy and Security of Big Data (PSBD 2024)Subjects: Computation and Language (cs.CL)
Recent advancements in large language models(LLMs), such as GPT-4 and GPT-4o, have shown exceptional performance, especially in languages with abundant resources like English, thanks to extensive datasets that ensure robust training. Conversely, these models exhibit limitations when processing under-resourced languages such as Chinese and Korean, where issues including hallucinatory responses remain prevalent. This paper traces the roots of these disparities to the tokenization process inherent to these models. Specifically, it explores how the tokenizers vocabulary, often used to speed up the tokenization process and reduce tokens but constructed independently of the actual model training data, inadequately represents non-English languages. This misrepresentation results in the propagation of under-trained or untrained tokens, which perpetuate biases and pose serious concerns related to data security and ethical standards. We aim to dissect the tokenization mechanics of GPT-4o, illustrating how its simplified token-handling methods amplify these risks and offer strategic solutions to mitigate associated security and ethical issues. Through this study, we emphasize the critical need to rethink tokenization frameworks to foster more equitable and secure AI technologies. The code and data are available at this https URL
- [450] arXiv:2406.12382 (replaced) [pdf, html, other]
-
Title: From Instance Training to Instruction Learning: Task Adapters Generation from InstructionsComments: accepted to NeurIPS 2024Subjects: Computation and Language (cs.CL)
Large language models (LLMs) have acquired the ability to solve general tasks by utilizing instruction finetuning (IFT). However, IFT still relies heavily on instance training of extensive task data, which greatly limits the adaptability of LLMs to real-world scenarios where labeled task instances are scarce and broader task generalization becomes paramount. Contrary to LLMs, humans acquire skills and complete tasks not merely through repeated practice but also by understanding and following instructional guidelines. This paper is dedicated to simulating human learning to address the shortcomings of instance training, focusing on instruction learning to enhance cross-task generalization. Within this context, we introduce Task Adapters Generation from Instructions (TAGI), which automatically constructs the task-specific model in a parameter generation manner based on the given task instructions without retraining for unseen tasks. Specifically, we utilize knowledge distillation to enhance the consistency between TAGI developed through Learning with Instruction and task-specific models developed through Training with Instance, by aligning the labels, output logits, and adapter parameters between them. TAGI is endowed with cross-task generalization capabilities through a two-stage training process that includes hypernetwork pretraining and finetuning. We evaluate TAGI on the Super-Natural Instructions and P3 datasets. The experimental results demonstrate that TAGI can match or even outperform traditional meta-trained models and other hypernetwork models, while significantly reducing computational requirements.
- [451] arXiv:2406.15104 (replaced) [pdf, html, other]
-
Title: Deciphering the Definition of Adversarial Robustness for post-hoc OOD DetectorsSubjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Detecting out-of-distribution (OOD) inputs is critical for safely deploying deep learning models in real-world scenarios. In recent years, many OOD detectors have been developed, and even the benchmarking has been standardized, i.e. OpenOOD. The number of post-hoc detectors is growing fast. They are showing an option to protect a pre-trained classifier against natural distribution shifts and claim to be ready for real-world scenarios. However, its effectiveness in dealing with adversarial examples (AdEx) has been neglected in most studies. In cases where an OOD detector includes AdEx in its experiments, the lack of uniform parameters for AdEx makes it difficult to accurately evaluate the performance of the OOD detector. This paper investigates the adversarial robustness of 16 post-hoc detectors against various evasion attacks. It also discusses a roadmap for adversarial defense in OOD detectors that would help adversarial robustness. We believe that level 1 (AdEx on a unified dataset) should be added to any OOD detector to see the limitations. The last level in the roadmap (defense against adaptive attacks) we added for integrity from an adversarial machine learning (AML) point of view, which we do not believe is the ultimate goal for OOD detectors.
- [452] arXiv:2406.15955 (replaced) [pdf, html, other]
-
Title: Beyond the Doors of Perception: Vision Transformers Represent Relations Between ObjectsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This begs the question: how do ViTs attempt to perform tasks that require computing visual relations between objects? Prior efforts to interpret ViTs tend to focus on characterizing relevant low-level visual features. In contrast, we adopt methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use to perform abstract visual reasoning. We present a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often exhibit two qualitatively different stages of processing despite having no obvious inductive biases to do so: 1) a perceptual stage wherein local object features are extracted and stored in a disentangled representation, and 2) a relational stage wherein object representations are compared. In the second stage, we find evidence that ViTs can learn to represent somewhat abstract visual relations, a capability that has long been considered out of reach for artificial neural networks. Finally, we demonstrate that failures at either stage can prevent a model from learning a generalizable solution to our fairly simple tasks. By understanding ViTs in terms of discrete processing stages, one can more precisely diagnose and rectify shortcomings of existing and future models.
- [453] arXiv:2406.16713 (replaced) [pdf, html, other]
-
Title: ShanghaiTech Mapping Robot is All You Need: Robot System for Collecting Universal Ground Vehicle DatasetsComments: 19 pages, 27 figures. Submitted to IEEE Transactions on RoboticsSubjects: Robotics (cs.RO)
This paper presents the ShanghaiTech Mapping Robot, a state-of-the-art unmanned ground vehicle (UGV) designed for collecting comprehensive multi-sensor datasets to support research in robotics, Simultaneous Localization and Mapping (SLAM), computer vision, and autonomous driving. The robot is equipped with a wide array of sensors including RGB cameras, RGB-D cameras, event-based cameras, IR cameras, LiDARs, mmWave radars, IMUs, ultrasonic range finders, and a GNSS RTK receiver. The sensor suite is integrated onto a specially designed mechanical structure with a centralized power system and a synchronization mechanism to ensure spatial and temporal alignment of the sensor data. A 16-node on-board computing cluster handles sensor control, data collection, and storage. We describe the hardware and software architecture of the robot in detail and discuss the calibration procedures for the various sensors and investigate the interference for LiDAR and RGB-D sensors. The capabilities of the platform are demonstrated through an extensive outdoor dataset collected in a diverse campus environment. Experiments with two LiDAR-based and two RGB-based SLAM approaches showcase the potential of the dataset to support development and benchmarking for robotics. To facilitate research, we make the dataset publicly available along with the associated robot sensor calibration data: this https URL
- [454] arXiv:2406.17586 (replaced) [pdf, html, other]
-
Title: Benchmarking SLAM Algorithms in the Cloud: The SLAM Hive Benchmarking SuiteComments: arXiv admin note: text overlap with arXiv:2303.11854Subjects: Robotics (cs.RO)
Evaluating the performance of Simultaneous Localization and Mapping (SLAM) algorithms is essential for scientists and users of robotic systems alike. But there are a multitude of different permutations of possible options of hardware setups and algorithm configurations, as well as different datasets and algorithms, such that it was previously infeasible to thoroughly compare SLAM systems against the full state of the art. To solve that we present the SLAM Hive Benchmarking Suite, which is able to analyze SLAM algorithms in 1000's of mapping runs, through its utilization of container technology and deployment in the cloud. This paper presents the architecture and open source implementation of SLAM Hive and compares it to existing efforts on SLAM evaluation. We perform mapping runs with popular visual, RGBD and LiDAR based SLAM algorithms against commonly used datasets and show how SLAM Hive can be used to conveniently analyze the results against various aspects. Through this we envision that SLAM Hive can become an essential tool for proper comparisons and evaluations of SLAM algorithms and thus drive the scientific development in the research on SLAM. The open source software as well as a demo to show the live analysis of 1000's of mapping runs can be found on our SLAM Hive website.
- [455] arXiv:2406.17736 (replaced) [pdf, html, other]
-
Title: Fairness in Social Influence Maximization via Optimal TransportSubjects: Social and Information Networks (cs.SI); Multiagent Systems (cs.MA)
We study fairness in social influence maximization, whereby one seeks to select seeds that spread a given information throughout a network, ensuring balanced outreach among different communities (e.g. demographic groups). In the literature, fairness is often quantified in terms of the expected outreach within individual communities. In this paper, we demonstrate that such fairness metrics can be misleading since they overlook the stochastic nature of information diffusion processes. When information diffusion occurs in a probabilistic manner, multiple outreach scenarios can occur. As such, outcomes such as ``In 50\% of the cases, no one in group 1 gets the information, while everyone in group 2 does, and in the other 50%, it is the opposite'', which always results in largely unfair outcomes, are classified as fair by a variety of fairness metrics in the literature. We tackle this problem by designing a new fairness metric, mutual fairness, that captures variability in outreach through optimal transport theory. We propose a new seed-selection algorithm that optimizes both outreach and mutual fairness, and we show its efficacy on several real datasets. We find that our algorithm increases fairness with only a minor decrease (and at times, even an increase) in efficiency.
- [456] arXiv:2406.18406 (replaced) [pdf, html, other]
-
Title: IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware NeuronsComments: NeurIPS 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
It is widely acknowledged that large language models (LLMs) encode a vast reservoir of knowledge after being trained on mass data. Recent studies disclose knowledge conflicts in LLM generation, wherein outdated or incorrect parametric knowledge (i.e., encoded knowledge) contradicts new knowledge provided in the context. To mitigate such knowledge conflicts, we propose a novel framework, IRCAN (Identifying and Reweighting Context-Aware Neurons) to capitalize on neurons that are crucial in processing contextual cues. Specifically, IRCAN first identifies neurons that significantly contribute to context processing, utilizing a context-aware attribution score derived from integrated gradients. Subsequently, the identified context-aware neurons are strengthened via reweighting. In doing so, we steer LLMs to generate context-sensitive outputs with respect to the new knowledge provided in the context. Extensive experiments conducted across a variety of models and tasks demonstrate that IRCAN not only achieves remarkable improvements in handling knowledge conflicts but also offers a scalable, plug-and-play solution that can be integrated seamlessly with existing models. Our codes are released at this https URL.
- [457] arXiv:2407.00476 (replaced) [pdf, html, other]
-
Title: Large Language Models for Power Scheduling: A User-Centric ApproachThomas Mongaillard, Samson Lasaulce, Othman Hicheur, Chao Zhang, Lina Bariah, Vineeth S. Varma, Hang Zou, Qiyang Zhao, Merouane DebbahSubjects: Computation and Language (cs.CL); Systems and Control (eess.SY)
While traditional optimization and scheduling schemes are designed to meet fixed, predefined system requirements, future systems are moving toward user-driven approaches and personalized services, aiming to achieve high quality-of-experience (QoE) and flexibility. This challenge is particularly pronounced in wireless and digitalized energy networks, where users' requirements have largely not been taken into consideration due to the lack of a common language between users and machines. The emergence of powerful large language models (LLMs) marks a radical departure from traditional system-centric methods into more advanced user-centric approaches by providing a natural communication interface between users and devices. In this paper, for the first time, we introduce a novel architecture for resource scheduling problems by constructing three LLM agents to convert an arbitrary user's voice request (VRQ) into a resource allocation vector. Specifically, we design an LLM intent recognition agent to translate the request into an optimization problem (OP), an LLM OP parameter identification agent, and an LLM OP solving agent. To evaluate system performance, we construct a database of typical VRQs in the context of electric vehicle (EV) charging. As a proof of concept, we primarily use Llama 3 8B. Through testing with different prompt engineering scenarios, the obtained results demonstrate the efficiency of the proposed architecture. The conducted performance analysis allows key insights to be extracted. For instance, having a larger set of candidate OPs to model the real-world problem might degrade the final performance because of a higher recognition/OP classification noise level. All results and codes are open source.
- [458] arXiv:2407.00996 (replaced) [pdf, html, other]
-
Title: Can Small Language Models Learn, Unlearn, and Retain Noise Patterns?Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Small Language Models (SLMs) are generally considered more compact versions of large language models (LLMs). This study investigates the ability of SLMs with parameters between 1 and 3 billion to learn, retain, and subsequently eliminate different types of noise present in the data. Four pre-trained SLMs were utilized for this: Olmo 1B, Qwen1.5 1.8B, Gemma 2B, and Phi2 2.7B. The models were instruction-tuned on noise-free data and tested using in-context examples to determine if they could learn noise through examples. Subsequently, noise patterns were introduced in instruction tuning to evaluate the noise learning, unlearning, and retention capabilities of the models. Olmo, the smallest model, was highly sensitive to noise, quickly adapting to noisy patterns. Phi2 resisted learning character-level and transliteration noise, likely due to its carefully curated, structured, and high-quality pretraining data. Gemma excelled with transliteration noise, likely benefiting from its multilingual pretraining. The findings can be used to develop robust training strategies for SLMs.
- [459] arXiv:2407.02072 (replaced) [pdf, other]
-
Title: Component based model order reduction with mortar tied contact for nonlinear quasi-static mechanical problemsSubjects: Computational Engineering, Finance, and Science (cs.CE)
In this work, we present a model order reduction technique for nonlinear structures assembled from this http URL reduced order model is constructed by reducing the substructures with proper orthogonal decomposition and connecting them by a mortar-tied contact formulation. The snapshots for the substructure projection matrices are computed on the substructure level by the proper orthogonal decomposition (POD) method. The snapshots are computed using a random sampling procedure based on a parametrization of boundary conditions. To reduce the computational effort of the snapshot computation full-order simulations of the substructures are only computed when the error of the reduced solution is above a threshold. In numerical examples, we show the accuracy and efficiency of the method for nonlinear problems involving material and geometric nonlinearity as well as non-matching meshes. We are able to predict solutions of systems that we did not compute in our snapshots.
- [460] arXiv:2407.02279 (replaced) [pdf, other]
-
Title: How to Boost Any Loss FunctionComments: NeurIPS'24Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Boosting is a highly successful ML-born optimization setting in which one is required to computationally efficiently learn arbitrarily good models based on the access to a weak learner oracle, providing classifiers performing at least slightly differently from random guessing. A key difference with gradient-based optimization is that boosting's original model does not requires access to first order information about a loss, yet the decades long history of boosting has quickly evolved it into a first order optimization setting -- sometimes even wrongfully defining it as such. Owing to recent progress extending gradient-based optimization to use only a loss' zeroth ($0^{th}$) order information to learn, this begs the question: what loss functions can be efficiently optimized with boosting and what is the information really needed for boosting to meet the original boosting blueprint's requirements?
We provide a constructive formal answer essentially showing that any loss function can be optimized with boosting and thus boosting can achieve a feat not yet known to be possible in the classical $0^{th}$ order setting, since loss functions are not required to be be convex, nor differentiable or Lipschitz -- and in fact not required to be continuous either. Some tools we use are rooted in quantum calculus, the mathematical field -- not to be confounded with quantum computation -- that studies calculus without passing to the limit, and thus without using first order information. - [461] arXiv:2407.04573 (replaced) [pdf, html, other]
-
Title: VRSD: Rethinking Similarity and Diversity for Retrieval in Large Language ModelsSubjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Vector retrieval algorithms are essential for semantic queries within the rapidly evolving landscape of Large Language Models (LLMs). The ability to retrieve vectors that satisfy both similarity and diversity criteria substantially enhances the performance of LLMs. Although Maximal Marginal Relevance (MMR) is widely employed in retrieval scenarios requiring relevance and diversity, variations in the parameter $\lambda$ lead to fluctuations that complicate the optimization trajectory in vector spaces. This obscures the direction of improvement and highlights the lack of a robust theoretical analysis regarding similarity and diversity constraints in retrieval processes. To address these challenges, this paper introduces a novel approach that characterizes both constraints through the relationship between the sum vector and the query vector. The proximity of these vectors ensures the similarity constraint, while requiring individual vectors within the sum vector to diverge in their alignment with the query vector satisfies the diversity constraint. We first formulate a new combinatorial optimization problem, selecting k vectors from a candidate set such that their sum vector maximally aligns with the query vector, and demonstrate that this problem is NP-complete. This result underscores the inherent difficulty of simultaneously achieving similarity and diversity in vector retrieval, thereby providing a theoretical foundation for future research. Subsequently, we present the heuristic algorithm Vectors Retrieval with Similarity and Diversity, VRSD, which features a clear optimization objective and eliminates the need for preset parameters. VRSD also achieves a modest reduction in time complexity compared to MMR. Empirical validation confirms that VRSD significantly outperforms MMR across various datasets.
- [462] arXiv:2407.05280 (replaced) [pdf, html, other]
-
Title: Perpetual Exploration of a Ring in Presence of Byzantine Black HoleSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Perpetual exploration is a fundamental problem in the domain of mobile agents, where an agent needs to visit each node infinitely often. This issue has received lot of attention, mainly for ring topologies, presence of black holes adds more complexity. A black hole can destroy any incoming agent without any observable trace. In \cite{BampasImprovedPeriodicDataRetrieval,KralovivcPeriodicDataRetrievalFirst}, the authors considered this problem in the context of \textit{ Periodic data retrieval}. They introduced a variant of black hole called gray hole (where the adversary chooses whether to destroy an agent or let it pass) among others and showed that 4 asynchronous and co-located agents are essential to solve this problem (hence perpetual exploration) in presence of such a gray hole if each node of the ring has a whiteboard. This paper investigates the exploration of a ring in presence of a ``byzantine black hole''. In addition to the capabilities of a gray hole, in this variant, the adversary chooses whether to erase any previously stored information on that node. Previously, one particular initial scenario (i.e., agents are co-located) and one particular communication model (i.e., whiteboard) are investigated. Now, there can be other initial scenarios where all agents may not be co-located. Also, there are many weaker models of communications (i.e., Face-to-Face, Pebble) where this problem is yet to be investigated. The agents are synchronous. The main results focus on minimizing the agent number while ensuring that perpetual exploration is achieved even in presence of such a node under various communication models and starting positions. Further, we achieved a better upper and lower bound result (i.e., 3 agents) for this problem (where the malicious node is a generalized version of a gray hole), by trading-off scheduler capability, for co-located and in presence of a whiteboard.
- [463] arXiv:2407.10959 (replaced) [pdf, other]
-
Title: A Unified Probabilistic Approach to Traffic Conflict DetectionComments: 21 pages, 10 figures, under revisionSubjects: Robotics (cs.RO); Machine Learning (stat.ML)
Traffic conflict detection is essential for proactive road safety by identifying potential collisions before they occur. Existing methods rely on surrogate safety measures tailored to specific interactions (e.g., car-following, side-swiping, or path-crossing) and require varying thresholds in different traffic conditions. This variation leads to inconsistencies and limited adaptability of conflict detection in evolving traffic environments. Consequently, a need persists for consistent detection of traffic conflicts across interaction contexts. To address this need, this study proposes a unified probabilistic approach. The proposed approach establishes a unified framework of traffic conflict detection, where traffic conflicts are formulated as context-dependent extreme events of road user interactions. The detection of conflicts is then decomposed into a series of statistical learning tasks: representing interaction contexts, inferring proximity distributions, and assessing extreme collision risk. The unified formulation accommodates diverse hypotheses of traffic conflicts and the learning tasks enable data-driven analysis of factors such as motion states of road users, environment conditions, and participant characteristics. Jointly, this approach supports consistent and comprehensive evaluation of the collision risk emerging in road user interactions. Our experiments using real-world trajectory data show that the approach provides effective collision warnings, generalises across distinct datasets and traffic environments, covers a broad range of conflict types, and captures a long-tailed distribution of conflict intensity. The findings highlight its potential to enhance the safety assessment of traffic infrastructures and policies, improve collision warning systems for autonomous driving, and deepen the understanding of road user behaviour in safety-critical interactions.
- [464] arXiv:2407.11420 (replaced) [pdf, html, other]
-
Title: iKalibr: Unified Targetless Spatiotemporal Calibration for Resilient Integrated Inertial SystemsSubjects: Robotics (cs.RO)
The integrated inertial system, typically integrating an IMU and an exteroceptive sensor such as radar, LiDAR, and camera, has been widely accepted and applied in modern robotic applications for ego-motion estimation, motion control, or autonomous exploration. To improve system accuracy, robustness, and further usability, both multiple and various sensors are generally resiliently integrated, which benefits the system performance regarding failure tolerance, perception capability, and environment compatibility. For such systems, accurate and consistent spatiotemporal calibration is required to maintain a unique spatiotemporal framework for multi-sensor fusion. Considering most existing calibration methods (i) are generally oriented to specific integrated inertial systems, (ii) often only focus on spatial determination, (iii) usually require artificial targets, lacking convenience and usability, we propose iKalibr: a unified targetless spatiotemporal calibration framework for resilient integrated inertial systems, which overcomes the above issues, and enables both accurate and consistent calibration. Altogether four commonly employed sensors are supported in iKalibr currently, namely IMU, radar, LiDAR, and camera. The proposed method starts with a rigorous and efficient dynamic initialization, where all parameters in the estimator would be accurately recovered. Subsequently, several continuous-time batch optimizations are conducted to refine the initialized parameters toward better states. Sufficient real-world experiments were conducted to verify the feasibility and evaluate the calibration performance of iKalibr. The results demonstrate that iKalibr can achieve accurate resilient spatiotemporal calibration. We open-source our implementations at (this https URL) to benefit the research community.
- [465] arXiv:2407.12471 (replaced) [pdf, html, other]
-
Title: Characterization of Political Polarized Users Attacked by Language Toxicity on TwitterJournal-ref: In Companion Publication of the 2024 Conference on Computer-Supported Cooperative Work and Social Computing (CSCW Companion 24). Association for Computing Machinery, New York, NY, USA, 185-189Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Understanding the dynamics of language toxicity on social media is important for us to investigate the propagation of misinformation and the development of echo chambers for political scenarios such as U.S. presidential elections. Recent research has used large-scale data to investigate the dynamics across social media platforms. However, research on the toxicity dynamics is not enough. This study aims to provide a first exploration of the potential language toxicity flow among Left, Right and Center users. Specifically, we aim to examine whether Left users were easier to be attacked by language toxicity. In this study, more than 500M Twitter posts were examined. It was discovered that Left users received much more toxic replies than Right and Center users.
- [466] arXiv:2407.15053 (replaced) [pdf, html, other]
-
Title: Stacked Intelligent Metasurfaces for Task-Oriented Semantic CommunicationsComments: 5 pages, 4 figures, accepted by IEEE Wireless Communications LettersSubjects: Information Theory (cs.IT)
Semantic communication (SemCom) leveraging advanced deep learning (DL) technologies enhances the efficiency and reliability of information transmission. Emerging stacked intelligent metasurface (SIM) with an electromagnetic neural network (EMNN) architecture enables complex computations at the speed of light. In this letter, we introduce an innovative SIM-aided SemCom system for image recognition tasks, where a SIM is positioned in front of the transmitting antenna. In contrast to conventional communication systems that transmit modulated signals carrying the image information or compressed semantic information, the carrier EM wave is directly transmitted from the source. The input layer of the SIM performs source encoding, while the remaining multi-layer architecture constitutes an EMNN for semantic encoding, transforming signals into a unique beam towards a receiving antenna corresponding to the image class. Remarkably, both the source and semantic encoding occur naturally as the EM waves propagate through the SIM. At the receiver, the image is recognized by probing the received signal magnitude across the receiving array. To this end, we utilize an efficient mini-batch gradient descent algorithm to train the transmission coefficients of SIM's meta-atoms to learn the semantic representation of the image. Extensive numerical results verify the effectiveness of utilizing the SIM-based EMNN for image recognition task-oriented SemComs, achieving more than 90\% recognition accuracy.
- [467] arXiv:2407.16677 (replaced) [pdf, html, other]
-
Title: From Imitation to Refinement -- Residual RL for Precise AssemblyComments: Project website: this https URLSubjects: Robotics (cs.RO); Machine Learning (cs.LG)
Advances in behavior cloning (BC), like action-chunking and diffusion, have enabled impressive capabilities. Still, imitation alone remains insufficient for learning reliable policies for tasks requiring precise aligning and inserting of objects, like assembly. Our key insight is that chunked BC policies effectively function as trajectory planners, enabling long-horizon tasks. Conversely, as they execute action chunks open-loop, they lack the fine-grained reactivity necessary for reliable execution. Further, we find that the performance of BC policies saturates despite increasing data. Reinforcement learning (RL) is a natural way to overcome BC's limitations, but it is not straightforward to apply directly to action-chunked models like diffusion policies. We present a simple yet effective method, ResiP (Residual for Precise Manipulation), that sidesteps these challenges by augmenting a frozen, chunked BC model with a fully closed-loop residual policy trained with RL. The residual policy is trained via on-policy RL, addressing distribution shifts and introducing reactive control without altering the BC trajectory planner. Evaluation on high-precision manipulation tasks demonstrates strong performance of ResiP over BC methods and direct RL fine-tuning. Videos, code, and data are available at this https URL.
- [468] arXiv:2407.16725 (replaced) [pdf, html, other]
-
Title: Category-Extensible Out-of-Distribution Detection via Hierarchical Context DescriptionsComments: Accepted by 37th Conference on Neural Information Processing Systems (NeurIPS 2023). Code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
The key to OOD detection has two aspects: generalized feature representation and precise category description. Recently, vision-language models such as CLIP provide significant advances in both two issues, but constructing precise category descriptions is still in its infancy due to the absence of unseen categories. This work introduces two hierarchical contexts, namely perceptual context and spurious context, to carefully describe the precise category boundary through automatic prompt tuning. Specifically, perceptual contexts perceive the inter-category difference (e.g., cats vs apples) for current classification tasks, while spurious contexts further identify spurious (similar but exactly not) OOD samples for every single category (e.g., cats vs panthers, apples vs peaches). The two contexts hierarchically construct the precise description for a certain category, which is, first roughly classifying a sample to the predicted category and then delicately identifying whether it is truly an ID sample or actually OOD. Moreover, the precise descriptions for those categories within the vision-language framework present a novel application: CATegory-EXtensible OOD detection (CATEX). One can efficiently extend the set of recognizable categories by simply merging the hierarchical contexts learned under different sub-task settings. And extensive experiments are conducted to demonstrate CATEX's effectiveness, robustness, and category-extensibility. For instance, CATEX consistently surpasses the rivals by a large margin with several protocols on the challenging ImageNet-1K dataset. In addition, we offer new insights on how to efficiently scale up the prompt engineering in vision-language models to recognize thousands of object categories, as well as how to incorporate large language models (like GPT-3) to boost zero-shot applications. Code is publicly available at this https URL.
- [469] arXiv:2408.00341 (replaced) [pdf, html, other]
-
Title: Enhancing Attack Resilience in Real-Time Systems through Variable Control Task Sampling RatesComments: 12 pages including references, Total 10 figures (with 3 having subfigures)Subjects: Systems and Control (eess.SY); Cryptography and Security (cs.CR); Operating Systems (cs.OS)
Cyber-physical systems (CPSs) in modern real-time applications integrate numerous control units linked through communication networks, each responsible for executing a mix of real-time safety-critical and non-critical tasks. To ensure predictable timing behaviour, most safety-critical tasks are scheduled with fixed sampling periods, which supports rigorous safety and performance analyses. However, this deterministic execution can be exploited by attackers to launch inference-based attacks on safety-critical tasks. This paper addresses the challenge of preventing such timing inference or schedule-based attacks by dynamically adjusting the execution rates of safety-critical tasks while maintaining their performance. We propose a novel schedule vulnerability analysis methodology, enabling runtime switching between valid schedules for various control task sampling rates. Leveraging this approach, we present the Multi-Rate Attack-Aware Randomized Scheduling (MAARS) framework for preemptive fixed-priority schedulers, designed to reduce the success rate of timing inference attacks on real-time systems. To our knowledge, this is the first method that combines attack-aware schedule randomization with preserved control and scheduling integrity. The framework's efficacy in attack prevention is evaluated on automotive benchmarks using a Hardware-in-the-Loop (HiL) setup.
- [470] arXiv:2408.02407 (replaced) [pdf, html, other]
-
Title: Terracorder: Sense Long and ProsperComments: PreprintSubjects: Machine Learning (cs.LG)
In-situ sensing devices need to be deployed in remote environments for long periods of time; minimizing their power consumption is vital for maximising both their operational lifetime and coverage. We introduce Terracorder -- a versatile multi-sensor device -- and showcase its exceptionally low power consumption using an on-device reinforcement learning scheduler. We prototype a unique device setup for biodiversity monitoring and compare its battery life using our scheduler against a number of fixed schedules; the scheduler captures more than 80% of events at less than 50% of the number of activations of the best-performing fixed schedule. We then explore how a collaborative scheduler can maximise the useful operation of a network of devices, improving overall network power consumption and robustness.
- [471] arXiv:2408.07691 (replaced) [pdf, html, other]
-
Title: A family of high-order accurate contour integral methods for strongly continuous semigroupsSubjects: Numerical Analysis (math.NA)
Exponential integrators based on contour integral representations lead to powerful numerical solvers for a variety of ODEs, PDEs, and other time-evolution equations. They are embarrassingly parallelizable and lead to global-in-time approximations that can be efficiently evaluated anywhere within a finite time horizon. However, there are core theoretical challenges that restrict their use cases to analytic semigroups, e.g., parabolic equations. In this article, we use carefully regularized contour integral representations to construct a family of new high-order quadrature schemes for the larger, less regular, class of strongly continuous semigroups. Our algorithms are accompanied by explicit high-order error bounds and near-optimal parameter selection. We demonstrate key features of the schemes on singular first-order PDEs from Koopman operator theory.
- [472] arXiv:2408.08121 (replaced) [pdf, other]
-
Title: Optimizing Highway Ramp Merge Safety and Efficiency via Spatio-Temporal Cooperative Control and Vehicle-Road CoordinationSubjects: Systems and Control (eess.SY)
In view of existing automatic driving is difficult to accurately and timely obtain the status and driving intention of other vehicles and the safety risk and urgency of autonomous vehicles in the absence of collision are evaluated. As a result, while vehicles generally maintain safe distances, accidents still frequently occur, particularly in merging areas. To ensure safety, improve road efficiency, this paper presents a pre-programmed technique for managing vehicles' spatiotemporal trajectories to proactively mitigate conflicts among vehicles. Firstly, the study focuses on the calculation of safe distances under varying spatiotemporal conditions, taking into account differences in vehicle speed. Subsequently, an evaluation model for vehicle conflict risk is developed, which incorporates critical parameters such as collision acceleration and emergency acceleration. The methodology further identifies the main line vehicles that are potentially in conflict with on-ramp vehicles and determines the target gap for the latter. Based on this selected target gap, a cooperative control method is formulated, enabling the pre-programming of vehicle trajectories. Using highway ramp merging as a case study, the paper introduces a mainline priority spatiotemporal cooperative control method and validates its efficacy through rigorous simulations. The analysis indicates that the average delay time can be reduced by 97.96%, and fuel consumption by 6.01%. The mainline priority strategy demonstrates increased speed, low latency and low fuel consumption.
- [473] arXiv:2408.08700 (replaced) [pdf, html, other]
-
Title: HyCoT: A Transformer-Based Autoencoder for Hyperspectral Image CompressionComments: Accepted at 14th IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
The development of learning-based hyperspectral image (HSI) compression models has recently attracted significant interest. Existing models predominantly utilize convolutional filters, which capture only local dependencies. Furthermore,they often incur high training costs and exhibit substantial computational complexity. To address these limitations, in this paper we propose Hyperspectral Compression Transformer (HyCoT) that is a transformer-based autoencoder for pixelwise HSI compression. Additionally, we apply a simple yet effective training set reduction approach to accelerate the training process. Experimental results on the HySpecNet-11k dataset demonstrate that HyCoT surpasses the state of the art across various compression ratios by over 1 dB of PSNR with significantly reduced computational requirements. Our code and pre-trained weights are publicly available at this https URL .
- [474] arXiv:2408.11212 (replaced) [pdf, html, other]
-
Title: How many autonomous vehicles are required to stabilize traffic flow?Subjects: Systems and Control (eess.SY)
The collective behavior of human-driven vehicles (HVs) produces the well-known stop-and-go waves potentially leading to higher fuel consumption and emissions. This paper investigates the stabilization of traffic flow via a minimum number of autonomous vehicles (AVs) subject to constraints on the control parameters aiming to reduce the number of vehicles on the road while achieving lower fuel consumption and emissions. The unconstrained scenario has been well-studied in recent studies. The main motivation to investigate the constrained scenario is that, in realistic engineering applications, lower and upper bounds exist on the control parameters. For the constrained scenario, we optimally find the minimum number of required AVs (via computing the optimal lower bound on the AV penetration rate) to stabilize traffic flow for a given number of HVs. As an immediate consequence, we conclude that for a given number of AVs, the number of HVs in the stabilized traffic flow may not be arbitrarily large in the constrained scenario unlike the unconstrained scenario studied in the literature. We systematically propose a procedure to compute the optimal lower bound on the AV penetration rate using nonlinear optimization techniques. Finally, we validate the theoretical results via numerical simulations. Numerical simulations suggest that enlarging the constraint intervals makes a smaller optimal lower bound on the AV penetration rate attainable. However, it leads to a slower transient response due to a dominant pole closer to the origin.
- [475] arXiv:2408.15696 (replaced) [pdf, html, other]
-
Title: Comparing diversity, negativity, and stereotypes in Chinese-language AI technologies: a case study on Baidu, Ernie and QwenSubjects: Computers and Society (cs.CY)
Large Language Models (LLMs) and search engines have the potential to perpetuate biases and stereotypes by amplifying existing prejudices in their training data and algorithmic processes, thereby influencing public perception and decision-making. While most work has focused on Western-centric AI technologies, we study Chinese-based tools by investigating social biases embedded in the major Chinese search engine, Baidu, and two leading LLMs, Ernie and Qwen. Leveraging a dataset of 240 social groups across 13 categories describing Chinese society, we collect over 30k views encoded in the aforementioned tools by prompting them for candidate words describing such groups. We find that language models exhibit a larger variety of embedded views compared to the search engine, although Baidu and Qwen generate negative content more often than Ernie. We also find a moderate prevalence of stereotypes embedded in the language models, many of which potentially promote offensive and derogatory views. Our work highlights the importance of promoting fairness and inclusivity in AI technologies with a global perspective.
- [476] arXiv:2408.17042 (replaced) [pdf, html, other]
-
Title: E-Graphs as Circuits, and Optimal Extraction via TreewidthComments: Edits for clarity, additional references, and grant support informationSubjects: Data Structures and Algorithms (cs.DS)
We demonstrate a new connection between e-graphs and Boolean circuits. This allows us to adapt existing literature on circuits to easily arrive at an algorithm for optimal e-graph extraction, parameterized by treewidth, which runs in $2^{O(w^2)}\text{poly}(w, n)$ time, where $w$ is the treewidth of the e-graph. Additionally, we show how the circuit view of e-graphs allows us to apply powerful simplification techniques, and we analyze a dataset of e-graphs to show that these techniques can reduce e-graph size and treewidth by 40-80% in many cases. While the core parameterized algorithm may be adapted to work directly on e-graphs, the primary value of the circuit view is in allowing the transfer of ideas from the well-established field of circuits to e-graphs.
- [477] arXiv:2409.01083 (replaced) [pdf, html, other]
-
Title: Affordance-based Robot Manipulation with Flow MatchingSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
We present a framework for assistive robot manipulation, which focuses on two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot trajectories by grounding the visual affordance model. We tackle the first challenge by employing a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. Then we propose to learn robot trajectories guided by affordances in a supervised Flow Matching method. Flow matching represents a robot visuomotor policy as a conditional process of flowing random waypoints to desired robot trajectories. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our framework. Our extensive evaluation highlights that the proposed prompt tuning method for learning manipulation affordance with language prompter achieves competitive performance and even outperforms other finetuning protocols across data scales, while satisfying parameter efficiency. Learning multi-task robot trajectories with flow matching policy also leads to consistently better generalization performance and faster inference than alternative behavior cloning methods, especially given multimodal robot action distributions. Our framework seamlessly unifies affordance model learning and trajectory generation with flow matching for robot manipulation.
- [478] arXiv:2409.02562 (replaced) [pdf, html, other]
-
Title: One Homography is All You Need: IMM-based Joint Homography and Multiple Object State EstimationComments: Preprint submitted to Information FusionSubjects: Computer Vision and Pattern Recognition (cs.CV)
A novel online MOT algorithm, IMM Joint Homography State Estimation (IMM-JHSE), is proposed. IMM-JHSE uses an initial homography estimate as the only additional 3D information, whereas other 3D MOT methods use regular 3D measurements. By jointly modelling the homography matrix and its dynamics as part of track state vectors, IMM-JHSE removes the explicit influence of camera motion compensation techniques on predicted track position states, which was prevalent in previous approaches. Expanding upon this, static and dynamic camera motion models are combined using an IMM filter. A simple bounding box motion model is used to predict bounding box positions to incorporate image plane information. In addition to applying an IMM to camera motion, a non-standard IMM approach is applied where bounding-box-based BIoU scores are mixed with ground-plane-based Mahalanobis distances in an IMM-like fashion to perform association only, making IMM-JHSE robust to motion away from the ground plane. Finally, IMM-JHSE makes use of dynamic process and measurement noise estimation techniques. IMM-JHSE improves upon related techniques, including UCMCTrack, OC-SORT, C-BIoU and ByteTrack on the DanceTrack and KITTI-car datasets, increasing HOTA by 2.64 and 2.11, respectively, while offering competitive performance on the MOT17, MOT20 and KITTI-pedestrian datasets. Using publicly available detections, IMM-JHSE outperforms almost all other 2D MOT methods and is outperformed only by 3D MOT methods -- some of which are offline -- on the KITTI-car dataset. Compared to tracking-by-attention methods, IMM-JHSE shows remarkably similar performance on the DanceTrack dataset and outperforms them on the MOT17 dataset. The code is publicly available: \url{this https URL}.
- [479] arXiv:2409.07271 (replaced) [pdf, html, other]
-
Title: CFCPalsy: Facial Image Synthesis with Cross-Fusion Cycle Diffusion Model for Facial Paralysis IndividualsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Currently, the diagnosis of facial paralysis remains a challenging task, often relying heavily on the subjective judgment and experience of clinicians, which can introduce variability and uncertainty in the assessment process. One promising application in real-life situations is the automatic estimation of facial paralysis. However, the scarcity of facial paralysis datasets limits the development of robust machine learning models for automated diagnosis and therapeutic interventions. To this end, this study aims to synthesize a high-quality facial paralysis dataset to address this gap, enabling more accurate and efficient algorithm training. Specifically, a novel Cross-Fusion Cycle Palsy Expression Generative Model (CFCPalsy) based on the diffusion model is proposed to combine different features of facial information and enhance the visual details of facial appearance and texture in facial regions, thus creating synthetic facial images that accurately represent various degrees and types of facial paralysis. We have qualitatively and quantitatively evaluated the proposed method on the commonly used public clinical datasets of facial paralysis to demonstrate its effectiveness. Experimental results indicate that the proposed method surpasses state-of-the-art methods, generating more realistic facial images and maintaining identity consistency.
- [480] arXiv:2409.07293 (replaced) [pdf, html, other]
-
Title: Electrokinetic Propulsion for Electronically Integrated Microscopic RobotsSubjects: Robotics (cs.RO)
Semiconductor microelectronics are emerging as a powerful tool for building smart, autonomous robots too small to see with the naked eye. Yet a number of existing microrobot platforms, despite significant advantages in speed, robustness, power consumption, or ease of fabrication, have no clear path towards electronics integration, limiting their intelligence and sophistication when compared to electronic cousins. Here, we show how to upgrade a self-propelled particle into an an electronically integrated microrobot, reaping the best of both in a single design. Inspired by electrokinetic micromotors, these robots generate electric fields in a surrounding fluid, and by extension propulsive electrokinetic flows. The underlying physics is captured by a model in which robot speed is proportional to applied current, making design and control straightforward. As proof, we build basic robots that use on-board circuits and a closed-loop optical control scheme to navigate waypoints and move in coordinated swarms at speeds of up to one body length per second. Broadly, the unification of micromotor propulsion with on-robot electronics clears the way for robust, fast, easy to manufacture, electronically programmable microrobots that operate reliably over months to years.
- [481] arXiv:2409.12514 (replaced) [pdf, html, other]
-
Title: TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic ManipulationJunjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian TangComments: add more citationsSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this paper, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive evaluations of TinyVLA in both simulation and on real robots, demonstrating that our approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in terms of speed and data efficiency, while delivering comparable or superior performance. Additionally, TinyVLA exhibits strong generalization capabilities across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding the performance of OpenVLA. We believe that \methodname offers an interesting perspective on utilizing pre-trained multimodal models for policy learning. Our project is at this https URL.
- [482] arXiv:2409.14055 (replaced) [pdf, html, other]
-
Title: Monitoring Human Dependence On AI Systems With Reliance DrillsSubjects: Computers and Society (cs.CY)
AI systems are assisting humans with increasingly diverse intellectual tasks but are still prone to mistakes. Humans are over-reliant on this assistance if they trust AI-generated advice, even though they would make a better decision on their own. To identify such instances of over-reliance, this paper proposes the reliance drill: an exercise that tests whether a human can recognise mistakes in AI-generated advice. Our paper examines the reasons why an organisation might choose to implement reliance drills and the doubts they may have about doing so. As an example, we consider the benefits and risks that could arise when using these drills to detect over-reliance on AI in healthcare professionals. We conclude by arguing that reliance drills should become a standard risk management practice for ensuring humans remain appropriately involved in the oversight of AI-assisted decisions.
- [483] arXiv:2409.14289 (replaced) [pdf, other]
-
Title: Deep Learning Technology for Face Forgery Detection: A SurveyComments: The paper "Deep Learning Technology for Face Forgery Detection: A Survey" is hereby formally withdrawn. The reason for this withdrawal is that I did not adequately consult and obtain proper authorization from the corresponding author during the submission process. I sincerely apologize for any inconvenience this may have caused the journal, reviewers, and readersSubjects: Computer Vision and Pattern Recognition (cs.CV)
Currently, the rapid development of computer vision and deep learning has enabled the creation or manipulation of high-fidelity facial images and videos via deep generative approaches. This technology, also known as deepfake, has achieved dramatic progress and become increasingly popular in social media. However, the technology can generate threats to personal privacy and national security by spreading misinformation. To diminish the risks of deepfake, it is desirable to develop powerful forgery detection methods to distinguish fake faces from real faces. This paper presents a comprehensive survey of recent deep learning-based approaches for facial forgery detection. We attempt to provide the reader with a deeper understanding of the current advances as well as the major challenges for deepfake detection based on deep learning. We present an overview of deepfake techniques and analyse the characteristics of various deepfake datasets. We then provide a systematic review of different categories of deepfake detection and state-of-the-art deepfake detection methods. The drawbacks of existing detection methods are analyzed, and future research directions are discussed to address the challenges in improving both the performance and generalization of deepfake detection.
- [484] arXiv:2409.14411 (replaced) [pdf, html, other]
-
Title: Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic ManipulationMinjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian TangSubjects: Robotics (cs.RO)
Diffusion Policy is a powerful technique tool for learning end-to-end visuomotor robot control. It is expected that Diffusion Policy possesses scalability, a key attribute for deep neural networks, typically suggesting that increasing model size would lead to enhanced performance. However, our observations indicate that Diffusion Policy in transformer architecture (\DP) struggles to scale effectively; even minor additions of layers can deteriorate training outcomes. To address this issue, we introduce Scalable Diffusion Transformer Policy for visuomotor learning. Our proposed method, namely \textbf{\methodname}, introduces two modules that improve the training dynamic of Diffusion Policy and allow the network to better handle multimodal action distribution. First, we identify that \DP~suffers from large gradient issues, making the optimization of Diffusion Policy unstable. To resolve this issue, we factorize the feature embedding of observation into multiple affine layers, and integrate it into the transformer blocks. Additionally, our utilize non-causal attention which allows the policy network to \enquote{see} future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales the Diffusion Policy from 10 million to 1 billion parameters. This new model, named \methodname, can effectively scale up the model size with improved performance and generalization. We benchmark \methodname~across 50 different tasks from MetaWorld and find that our largest \methodname~outperforms \DP~with an average improvement of 21.6\%. Across 7 real-world robot tasks, our ScaleDP demonstrates an average improvement of 36.25\% over DP-T on four single-arm tasks and 75\% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning. The project page is available at this http URL.
- [485] arXiv:2409.15750 (replaced) [pdf, html, other]
-
Title: The Roles of Generative Artificial Intelligence in Internet of Electric VehiclesHanwen Zhang, Dusit Niyato, Wei Zhang, Changyuan Zhao, Hongyang Du, Abbas Jamalipour, Sumei Sun, Yiyang PeiComments: 25 PagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
With the advancements of generative artificial intelligence (GenAI) models, their capabilities are expanding significantly beyond content generation and the models are increasingly being used across diverse applications. Particularly, GenAI shows great potential in addressing challenges in the electric vehicle (EV) ecosystem ranging from charging management to cyber-attack prevention. In this paper, we specifically consider Internet of electric vehicles (IoEV) and we categorize GenAI for IoEV into four different layers namely, EV's battery layer, individual EV layer, smart grid layer, and security layer. We introduce various GenAI techniques used in each layer of IoEV applications. Subsequently, public datasets available for training the GenAI models are summarized. Finally, we provide recommendations for future directions. This survey not only categorizes the applications of GenAI in IoEV across different layers but also serves as a valuable resource for researchers and practitioners by highlighting the design and implementation challenges within each layer. Furthermore, it provides a roadmap for future research directions, enabling the development of more robust and efficient IoEV systems through the integration of advanced GenAI techniques.
- [486] arXiv:2409.15933 (replaced) [pdf, html, other]
-
Title: SLIMER-IT: Zero-Shot NER on Italian LanguageSubjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Traditional approaches to Named Entity Recognition (NER) frame the task into a BIO sequence labeling problem. Although these systems often excel in the downstream task at hand, they require extensive annotated data and struggle to generalize to out-of-distribution input domains and unseen entity types. On the contrary, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities. While several works address Zero-Shot NER in English, little has been done in other languages. In this paper, we define an evaluation framework for Zero-Shot NER, applying it to the Italian language. Furthermore, we introduce SLIMER-IT, the Italian version of SLIMER, an instruction-tuning approach for zero-shot NER leveraging prompts enriched with definition and guidelines. Comparisons with other state-of-the-art models, demonstrate the superiority of SLIMER-IT on never-seen-before entity tags.
- [487] arXiv:2409.19601 (replaced) [pdf, html, other]
-
Title: Infighting in the Dark: Multi-Labels Backdoor Attack in Federated LearningComments: 11 pages, 7 figuresSubjects: Cryptography and Security (cs.CR)
Federated Learning (FL), a privacy-preserving decentralized machine learning framework, has been shown to be vulnerable to backdoor attacks. Current research primarily focuses on the Single-Label Backdoor Attack (SBA), wherein adversaries share a consistent target. However, a critical fact is overlooked: adversaries may be non-cooperative, have distinct targets, and operate independently, which exhibits a more practical scenario called Multi-Label Backdoor Attack (MBA). Unfortunately, prior works are ineffective in MBA scenario since non-cooperative attackers exclude each other. In this work, we conduct an in-depth investigation to uncover the inherent constraints of the exclusion: similar backdoor mappings are constructed for different targets, resulting in conflicts among backdoor functions. To address this limitation, we propose Mirage, the first non-cooperative MBA strategy in FL that allows attackers to inject effective and persistent backdoors into the global model without collusion by constructing in-distribution (ID) backdoor mapping. Specifically, we introduce an adversarial adaptation method to bridge the backdoor features and the target distribution in an ID manner. Additionally, we further leverage a constrained optimization method to ensure the ID mapping survives in the global training dynamics. Extensive evaluations demonstrate that Mirage outperforms various state-of-the-art attacks and bypasses existing defenses, achieving an average ASR greater than 97\% and maintaining over 90\% after 900 rounds. This work aims to alert researchers to this potential threat and inspire the design of effective defense mechanisms. Code has been made open-source.
- [488] arXiv:2410.01440 (replaced) [pdf, html, other]
-
Title: Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence ModelingSubjects: Robotics (cs.RO); Machine Learning (cs.LG)
In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high-level task descriptions into long-horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self-refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end-to-end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self-refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed-loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome-Env benchmark, showing advanced performance with better scaling for inference computation. Code is available at this https URL.
- [489] arXiv:2410.02367 (replaced) [pdf, html, other]
-
Title: SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference AccelerationSubjects: Machine Learning (cs.LG)
The transformer architecture predominates across various models. As the heart of the transformer, attention has a computational complexity of O(N^2), compared to O(N) for linear transformations. When handling large sequence lengths, attention becomes the primary time-consuming component. Although quantization has proven to be an effective method for accelerating model inference, existing quantization methods primarily focus on optimizing the linear layer. In response, we first analyze the feasibility of quantization in attention detailedly. Following that, we propose SageAttention, a highly efficient and accurate quantization method for attention. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1 times and 2.7 times, respectively. SageAttention also achieves superior accuracy performance over FlashAttention3. Comprehensive experiments confirm that our approach incurs almost no end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation. The codes are available at this https URL.
- [490] arXiv:2410.03979 (replaced) [pdf, html, other]
-
Title: Improving Arabic Multi-Label Emotion Classification using Stacked Embeddings and Hybrid Loss FunctionMuhammad Azeem Aslam, Wang Jun, Nisar Ahmed, Muhammad Imran Zaman, Li Yanan, Hu Hongfei, Wang Shiyu, Xin LiuComments: The paper is submitted in Scientific Reports and is currently under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
In multi-label emotion classification, particularly for low-resource languages like Arabic, the challenges of class imbalance and label correlation hinder model performance, especially in accurately predicting minority emotions. To address these issues, this study proposes a novel approach that combines stacked embeddings, meta-learning, and a hybrid loss function to enhance multi-label emotion classification for the Arabic language. The study extracts contextual embeddings from three fine-tuned language models-ArabicBERT, MarBERT, and AraBERT-which are then stacked to form enriched embeddings. A meta-learner is trained on these stacked embeddings, and the resulting concatenated representations are provided as input to a Bi-LSTM model, followed by a fully connected neural network for multi-label classification. To further improve performance, a hybrid loss function is introduced, incorporating class weighting, label correlation matrix, and contrastive learning, effectively addressing class imbalances and improving the handling of label correlations. Extensive experiments validate the proposed model's performance across key metrics such as Precision, Recall, F1-Score, Jaccard Accuracy, and Hamming Loss. The class-wise performance analysis demonstrates the hybrid loss function's ability to significantly reduce disparities between majority and minority classes, resulting in a more balanced emotion classification. An ablation study highlights the contribution of each component, showing the superiority of the model compared to baseline approaches and other loss functions. This study not only advances multi-label emotion classification for Arabic but also presents a generalizable framework that can be adapted to other languages and domains, providing a significant step forward in addressing the challenges of low-resource emotion classification tasks.
- [491] arXiv:2410.05767 (replaced) [pdf, html, other]
-
Title: Grounding is All You Need? Dual Temporal Grounding for Video DialogSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
In the realm of video dialog response generation, the understanding of video content and the temporal nuances of conversation history are paramount. While a segment of current research leans heavily on large-scale pretrained visual-language models and often overlooks temporal dynamics, another delves deep into spatial-temporal relationships within videos but demands intricate object trajectory pre-extractions and sidelines dialog temporal dynamics. This paper introduces the Dual Temporal Grounding-enhanced Video Dialog model (DTGVD), strategically designed to merge the strengths of both dominant approaches. It emphasizes dual temporal relationships by predicting dialog turn-specific temporal regions, filtering video content accordingly, and grounding responses in both video and dialog contexts. One standout feature of DTGVD is its heightened attention to chronological interplay. By recognizing and acting upon the dependencies between different dialog turns, it captures more nuanced conversational dynamics. To further bolster the alignment between video and dialog temporal dynamics, we've implemented a list-wise contrastive learning strategy. Within this framework, accurately grounded turn-clip pairings are designated as positive samples, while less precise pairings are categorized as negative. This refined classification is then funneled into our holistic end-to-end response generation mechanism. Evaluations using AVSD@DSTC-7 and AVSD@DSTC-8 datasets underscore the superiority of our methodology.
- [492] arXiv:2410.06865 (replaced) [pdf, html, other]
-
Title: Students' Perceptions and Use of Generative AI Tools for Programming Across Different Computing CoursesHieke Keuning, Isaac Alpizar-Chacon, Ioanna Lykourentzou, Lauren Beehler, Christian Köppe, Imke de Jong, Sergey SosnovskyComments: Accepted to Koli Calling 24. Numbers in Table 1, row 1 updatedSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Investigation of students' perceptions and opinions on the use of generative artificial intelligence (GenAI) in education is a topic gaining much interest. Studies addressing this are typically conducted with large heterogeneous groups, at one moment in time. However, how students perceive and use GenAI tools can potentially depend on many factors, including their background knowledge, familiarity with the tools, and the learning goals and policies of the courses they are taking.
In this study we explore how students following computing courses use GenAI for programming-related tasks across different programs and courses: Bachelor and Master, in courses in which learning programming is the learning goal, courses that require programming as a means to achieve another goal, and in courses in which programming is optional, but can be useful. We are also interested in changes over time, since GenAI capabilities are changing at a fast pace, and users are adopting GenAI increasingly.
We conducted three consecutive surveys (fall `23, winter `23, and spring `24) among students of all computing programs of a large European research university. We asked questions on the use in education, ethics, and job prospects, and we included specific questions on the (dis)allowed use of GenAI tools in the courses they were taking at the time.
We received 264 responses, which we quantitatively and qualitatively analyzed, to find out how students have employed GenAI tools across 59 different computing courses, and whether the opinion of an average student about these tools evolves over time. Our study contributes to the emerging discussion of how to differentiate GenAI use across different courses, and how to align its use with the learning goals of a computing course. - [493] arXiv:2410.07974 (replaced) [pdf, html, other]
-
Title: Doob's Lagrangian: A Sample-Efficient Variational Approach to Transition Path SamplingYuanqi Du, Michael Plainer, Rob Brekelmans, Chenru Duan, Frank Noé, Carla P. Gomes, Alán Aspuru-Guzik, Kirill NeklyudovComments: Accepted as Spotlight at Conference on Neural Information Processing Systems (NeurIPS 2024); Alanine dipeptide results updated after fixing unphysical parameterizationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
Rare event sampling in dynamical systems is a fundamental problem arising in the natural sciences, which poses significant computational challenges due to an exponentially large space of trajectories. For settings where the dynamical system of interest follows a Brownian motion with known drift, the question of conditioning the process to reach a given endpoint or desired rare event is definitively answered by Doob's h-transform. However, the naive estimation of this transform is infeasible, as it requires simulating sufficiently many forward trajectories to estimate rare event probabilities. In this work, we propose a variational formulation of Doob's h-transform as an optimization problem over trajectories between a given initial point and the desired ending point. To solve this optimization, we propose a simulation-free training objective with a model parameterization that imposes the desired boundary conditions by design. Our approach significantly reduces the search space over trajectories and avoids expensive trajectory simulation and inefficient importance sampling estimators which are required in existing methods. We demonstrate the ability of our method to find feasible transition paths on real-world molecular simulation and protein folding tasks.
- [494] arXiv:2410.12649 (replaced) [pdf, html, other]
-
Title: Faster Algorithms for Growing Collision-Free Convex Polytopes in Robot Configuration SpaceComments: 16 pages, 6 figures, accepted for publication in the proceedings of the International Symposium for Robotics Research 2024Subjects: Robotics (cs.RO); Computational Geometry (cs.CG)
We propose two novel algorithms for constructing convex collision-free polytopes in robot configuration space. Finding these polytopes enables the application of stronger motion-planning frameworks such as trajectory optimization with Graphs of Convex Sets [1] and is currently a major roadblock in the adoption of these approaches. In this paper, we build upon IRIS-NP (Iterative Regional Inflation by Semidefinite & Nonlinear Programming) [2] to significantly improve tunability, runtimes, and scaling to complex environments. IRIS-NP uses nonlinear programming paired with uniform random initialization to find configurations on the boundary of the free configuration space. Our key insight is that finding near-by configuration-space obstacles using sampling is inexpensive and greatly accelerates region generation. We propose two algorithms using such samples to either employ nonlinear programming more efficiently (IRIS-NP2 ) or circumvent it altogether using a massively-parallel zero-order optimization strategy (IRIS-ZO). We also propose a termination condition that controls the probability of exceeding a user-specified permissible fraction-in-collision, eliminating a significant source of tuning difficulty in IRIS-NP. We compare performance across eight robot environments, showing that IRIS-ZO achieves an order-of-magnitude speed advantage over IRIS-NP. IRISNP2, also significantly faster than IRIS-NP, builds larger polytopes using fewer hyperplanes, enabling faster downstream computation. Website: this https URL
- [495] arXiv:2410.14116 (replaced) [pdf, html, other]
-
Title: Robustness to Model Approximation, Empirical Model Learning, and Sample Complexity in Wasserstein Regular MDPsComments: 35 pagesSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
The paper studies the robustness properties of discrete-time stochastic optimal control under Wasserstein model approximation for both discounted cost and average cost criteria. Specifically, we study the performance loss when applying an optimal policy designed for an approximate model to the true dynamics compared with the optimal cost for the true model under the sup-norm-induced metric, and relate it to the Wasserstein-1 distance between the approximate and true transition kernels. A primary motivation of this analysis is empirical model learning, as well as empirical noise distribution learning, where Wasserstein convergence holds under mild conditions but stronger convergence criteria, such as total variation, may not. We discuss applications of the results to the disturbance estimation problem, where sample complexity bounds are given, and also to a general empirical model learning approach, obtained under either Markov or i.i.d.~learning settings. Further applications regarding the continuity of invariant probability measures with respect to transition kernels are also discussed.
- [496] arXiv:2410.14193 (replaced) [pdf, html, other]
-
Title: xPerT: Extended Persistence TransformerSubjects: Machine Learning (cs.LG); Algebraic Topology (math.AT)
A persistence diagram provides a compact summary of persistent homology, which captures the topological features of a space at different scales. However, due to its nature as a set, incorporating it as a feature into a machine learning framework is challenging. Several methods have been proposed to use persistence diagrams as input for machine learning models, but they often require complex preprocessing steps and extensive hyperparameter tuning. In this paper, we propose a novel transformer architecture called the \textit{Extended Persistence Transformer (xPerT)}, which is highly scalable than the compared to Persformer, an existing transformer for persistence diagrams. xPerT reduces GPU memory usage by over 90\% and improves accuracy on multiple datasets. Additionally, xPerT does not require complex preprocessing steps or extensive hyperparameter tuning, making it easy to use in practice. Our code is available at this https URL.
- [497] arXiv:2410.14979 (replaced) [pdf, html, other]
-
Title: Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive PsychologySubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
The cognitive mechanism by which Large Language Models (LLMs) solve mathematical problems remains a widely debated and unresolved issue. Currently, there is little interpretable experimental evidence that connects LLMs' problem-solving with human cognitive this http URL determine if LLMs possess human-like mathematical reasoning, we modified the problems used in the human Cognitive Reflection Test (CRT). Our results show that, even with the use of Chains of Thought (CoT) prompts, mainstream LLMs, including the latest o1 model (noted for its reasoning capabilities), have a high error rate when solving these modified CRT problems. Specifically, the average accuracy rate dropped by up to 50% compared to the original this http URL analysis of LLMs' incorrect answers suggests that they primarily rely on pattern matching from their training data, which aligns more with human intuition (System 1 thinking) rather than with human-like reasoning (System 2 thinking). This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans. As a result, this work may adjust overly optimistic views on LLMs' progress towards artificial general intelligence.
- [498] arXiv:2410.17283 (replaced) [pdf, html, other]
-
Title: Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement TechniquesSubjects: Artificial Intelligence (cs.AI)
Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in visual language models (VLMs) have pushed this enthusiasm to new heights. Differring from previous AI approaches that generally formulated different tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling the handling of more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this new trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLM, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they addressed. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods. A project associated with this review has been created at this https URL.
- [499] arXiv:2410.17856 (replaced) [pdf, html, other]
-
Title: ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context PromptingSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real-time object tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to achieve previously unattainable tasks, with a $\mathbf{76}\%$ absolute improvement in open-world interaction performance. Codes and demos are now available on the project page: this https URL.
- [500] arXiv:2410.17897 (replaced) [pdf, html, other]
-
Title: Value Residual Learning For Alleviating Attention Concentration In TransformersSubjects: Computation and Language (cs.CL)
Transformers can capture long-range dependencies using self-attention, allowing tokens to attend to all others directly. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers. However, this approach is computationally expensive. To address this problem, we propose Transformer with residual value (ResFormer) which approximates cross-layer attention through adding a residual connection from the values of the the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), where all layers share the same value embedding from first layer, reducing the $KV$ cache by nearly 50\%. Comprehensive empirical evidence demonstrates that ResFormer mitigates attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as downstream tasks. Further visualization results suggest that Resformer alleviates attention sinks through avoiding value-state drains. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods like GQA and CLA, with performance influenced by sequence length and cumulative learning rate.
- [501] arXiv:2410.18958 (replaced) [pdf, html, other]
-
Title: Stable Consistency Tuning: Understanding and Improving Consistency ModelsComments: Code is available at this https URLSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new generative family, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as the value estimation through Temporal Difference~(TD) Learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Built upon Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID 1.55, a new SoTA for consistency models.
- [502] arXiv:2410.19258 (replaced) [pdf, html, other]
-
Title: Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and ReasoningComments: 18pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation, proposing layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context abilities tests demonstrate that our head-level KV cache compression significantly outperforms strong baselines, particularly in low-resource settings (KV size = 64 & 128). Notably, our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering this http URL are available at this https URL
- [503] arXiv:2410.20549 (replaced) [pdf, html, other]
-
Title: Comparing the Consistency of User Studies Conducted in Simulations and Laboratory SettingsComments: Accepted for presentation at 2024 IEEE International Conference on Robotic Computing (IRC)Subjects: Robotics (cs.RO)
Human-robot collaboration enables highly adaptive co-working. The variety of resulting workflows makes it difficult to measure metrics as, e.g. makespans or idle times for multiple systems and tasks in a comparable manner. This issue can be addressed with virtual commissioning, where arbitrary numbers of non-deterministic human-robot workflows in assembly tasks can be simulated. To this end, data-driven models of human decisions are needed. Gathering the required large corpus of data with on-site user studies is quite time-consuming. In comparison, simulation-based studies (e.g., by crowdsourcing) would allow us to access a large pool of study participants with less effort. To rely on respective study results, human action sequences observed in a browser-based simulation environment must be shown to match those gathered in a laboratory setting. To this end, this work aims to understand to what extent cooperative assembly work in a simulated environment differs from that in an on-site laboratory setting. We show how a simulation environment can be aligned with a laboratory setting in which a robot and a human perform pick-and-place tasks together. A user study (N=29) indicates that participants' assembly decisions and perception of the situation are consistent across these different environments.
- [504] arXiv:2410.20675 (replaced) [pdf, other]
-
Title: Impact of Translation and Viewpoint Transition Methods in VR on Spatial Learning and CybersicknessComments: The work needs revision and will be updated laterSubjects: Human-Computer Interaction (cs.HC)
Virtual locomotion technique (VLT) is a fundamental component of virtual reality (VR) systems that translates physical and controller inputs into virtual translational movements and viewpoint transitions. Poorly designed VLTs can result in discomfort, nausea, and reductions in task performance. Understanding the effectiveness of VLTs across various levels of interaction fidelity is crucial to enhance user experience and spatial awareness. The current study addressed a significant gap in VR design research and practice, as few previous efforts have been made to comprehensively evaluate the effectiveness of controller-based VLTs in virtual indoor environments. We conducted a user study in which participants navigated through two complex virtual environments, one focusing on exploratory tasks and the other on goal-oriented navigation. The findings offer insights into the trade-offs among spatial knowledge acquisition, wayfinding performance, cybersickness, and sense of presence, and have design implications for future VR interfaces.
- [505] arXiv:2410.21991 (replaced) [pdf, other]
-
Title: From Explicit Rules to Implicit Reasoning in an Interpretable Violence Monitoring SystemComments: 12 pages,7 figures IEEE TSMCA (Under review)Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Recently, research based on pre-trained models has demonstrated outstanding performance in violence surveillance tasks. However, most of them were black-box systems which faced challenges regarding explainability during training and inference processes. An important question is how to incorporate explicit knowledge into these implicit models, thereby designing expertdriven and interpretable violence surveillance systems. This paper proposes a new paradigm for weakly supervised violence monitoring (WSVM) called Rule base Violence Monitoring (RuleVM). The proposed RuleVM uses a dual-branch structure with different designs for images and text. One of the branches is called the implicit branch, which uses only visual features for coarse-grained binary classification. In this branch, image feature extraction is divided into two channels: one responsible for extracting scene frames and the other focusing on extracting actions. The other branch is called the explicit branch, which utilizes language-image alignment to perform fine-grained classification. For the language channel design in the explicit branch, the proposed RuleVM uses the state-of-the-art YOLOWorld model to detect objects in video frames, and association rules are identified through data mining methods as descriptions of the video. Leveraging the dual-branch architecture, RuleVM achieves interpretable coarse-grained and fine-grained violence surveillance. Extensive experiments were conducted on two commonly used benchmarks, and the results show that RuleVM achieved the best performance in both coarse-grained and finegrained monitoring, significantly outperforming existing state-ofthe-art methods. Moreover, interpretability experiments uncovered some interesting rules, such as the observation that as the number of people increases, the risk level of violent behavior also rises.
- [506] arXiv:2410.22980 (replaced) [pdf, html, other]
-
Title: Efficient End-to-End 6-Dof Grasp Detection Framework for Edge Devices with Hierarchical Heatmaps and Feature PropagationSubjects: Robotics (cs.RO)
6-DoF grasp detection is critically important for the advancement of intelligent embodied systems, as it provides feasible robot poses for object grasping. Various methods have been proposed to detect 6-DoF grasps through the extraction of 3D geometric features from RGBD or point cloud data. However, most of these approaches encounter challenges during real robot deployment due to their significant computational demands, which can be particularly problematic for mobile robot platforms, especially those reliant on edge computing devices. This paper presents an Efficient End-to-End Grasp Detection Network (E3GNet) for 6-DoF grasp detection utilizing hierarchical heatmap representations. E3GNet effectively identifies high-quality and diverse grasps in cluttered real-world environments. Benefiting from our end-to-end methodology and efficient network design, our approach surpasses previous methods in model inference efficiency and achieves real-time 6-Dof grasp detection on edge devices. Furthermore, real-world experiments validate the effectiveness of our method, achieving a satisfactory 94% object grasping success rate.
- [507] arXiv:2410.23061 (replaced) [pdf, html, other]
-
Title: Learned RESESOP for solving inverse problems with inexact forward operatorComments: 21 pages, 7 figures, 4 tablesSubjects: Numerical Analysis (math.NA)
When solving inverse problems, one has to deal with numerous potential sources of model inexactnesses, like object motion, calibration errors, or simplified data models. Regularized Sequential Subspace Optimization (ReSeSOp) allows to compensate for such inaccuracies within the reconstruction step by employing consecutive projections onto suitably defined subspaces. However, this approach relies on a priori estimates for the model inexactness levels which are typically unknown. In dynamic imaging applications, where inaccuracies arise from the unpredictable dynamics of the object, these estimates are particularly challenging to determine in advance. To overcome this limitation, we propose a learned version of ReSeSOp which allows to approximate inexactness levels on the fly. The proposed framework generalizes established unrolled iterative reconstruction schemes to inexact forward operators and is particularly tailored to the structure of dynamic problems. We also present a comprehensive mathematical analysis regarding the effect of dependencies within the forward problem, clarifying when and why dividing the overall problem into subproblems is essential. The proposed method is evaluated on various examples from dynamic imaging, including datasets from a rheological CT experiment, brain MRI, and real-time cardiac MRI. The respective results emphasize improvements in reconstruction quality while ensuring adequate data consistency.
- [508] arXiv:2410.23463 (replaced) [pdf, html, other]
-
Title: MDCure: A Scalable Pipeline for Multi-Document Instruction-FollowingSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Multi-document (MD) processing is crucial for LLMs to handle real-world tasks such as summarization and question-answering across large sets of documents. While LLMs have improved at processing long inputs, MD contexts still present challenges, such as managing inter-document dependencies, redundancy, and incoherent structures. We introduce MDCure, a scalable and effective fine-tuning pipeline to enhance the MD capabilities of LLMs without the computational cost of pre-training or reliance on human annotated data. MDCure is based on generation of high-quality synthetic MD instruction data from sets of related articles via targeted prompts. We further introduce MDCureRM, a multi-objective reward model which filters generated data based on their training utility for MD settings. With MDCure, we fine-tune a variety of LLMs, from the FlanT5, Qwen2, and LLAMA3.1 model families, up to 70B parameters in size. Extensive evaluations on a wide range of MD and long-context benchmarks spanning various tasks show MDCure consistently improves performance over pre-trained baselines and over corresponding base models by up to 75.5%. Our code, datasets, and models are available at this https URL.
- [509] arXiv:2410.24006 (replaced) [pdf, html, other]
-
Title: DiffPAD: Denoising Diffusion-based Adversarial Patch DecontaminationComments: Accepted to 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In the ever-evolving adversarial machine learning landscape, developing effective defenses against patch attacks has become a critical challenge, necessitating reliable solutions to safeguard real-world AI systems. Although diffusion models have shown remarkable capacity in image synthesis and have been recently utilized to counter $\ell_p$-norm bounded attacks, their potential in mitigating localized patch attacks remains largely underexplored. In this work, we propose DiffPAD, a novel framework that harnesses the power of diffusion models for adversarial patch decontamination. DiffPAD first performs super-resolution restoration on downsampled input images, then adopts binarization, dynamic thresholding scheme and sliding window for effective localization of adversarial patches. Such a design is inspired by the theoretically derived correlation between patch size and diffusion restoration error that is generalized across diverse patch attack scenarios. Finally, DiffPAD applies inpainting techniques to the original input images with the estimated patch region being masked. By integrating closed-form solutions for super-resolution restoration and image inpainting into the conditional reverse sampling process of a pre-trained diffusion model, DiffPAD obviates the need for text guidance or fine-tuning. Through comprehensive experiments, we demonstrate that DiffPAD not only achieves state-of-the-art adversarial robustness against patch attacks but also excels in recovering naturalistic images without patch remnants. The source code is available at this https URL.
- [510] arXiv:2411.00374 (replaced) [pdf, html, other]
-
Title: RSRP Measurement Based Channel Autocorrelation Estimation for IRS-Aided Wideband CommunicationSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
The passive and frequency-flat reflection of IRS, as well as the high-dimensional IRS-reflected channels, have posed significant challenges for efficient IRS channel estimation, especially in wideband communication systems with significant multi-path channel delay spread. To address these challenges, we propose a novel neural network (NN)-empowered framework for IRS channel autocorrelation matrix estimation in wideband orthogonal frequency division multiplexing (OFDM) systems. This framework relies only on the easily accessible reference signal received power (RSRP) measurements at users in existing wideband communication systems, without requiring additional pilot transmission. Based on the estimates of channel autocorrelation matrix, the passive reflection of IRS is optimized to maximize the average user received signal-to-noise ratio (SNR) over all subcarriers in the OFDM system. Numerical results verify that the proposed algorithm significantly outperforms existing powermeasurement-based IRS reflection designs in wideband channels.
- [511] arXiv:2411.00759 (replaced) [pdf, html, other]
-
Title: Minibatch Optimal Transport and Perplexity Bound Estimation in Discrete Flow MatchingSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Outperforming autoregressive models on categorical data distributions, such as textual data, remains challenging for continuous diffusion and flow models. Discrete flow matching, a recent framework for modeling categorical data, has shown competitive performance with autoregressive models. Despite its similarities with continuous flow matching, the rectification strategy applied in the continuous version does not directly extend to the discrete one due to the inherent stochasticity of discrete paths. This limitation necessitates exploring alternative methods to minimize state transitions during generation. To address this, we propose a dynamic-optimal-transport-like minimization objective for discrete flows with convex interpolants and derive its equivalent Kantorovich formulation. The latter defines transport cost solely in terms of inter-state similarity and is optimized using a minibatch strategy. Another limitation we address in the discrete flow framework is model evaluation. Unlike continuous flows, wherein the instantaneous change of variables enables density estimation, discrete models lack a similar mechanism due to the inherent non-determinism and discontinuity of their paths. To alleviate this issue, we propose an upper bound on the perplexity of discrete flow models, enabling performance evaluation and comparison with other methods.
- [512] arXiv:2411.01013 (replaced) [pdf, html, other]
-
Title: A Similarity-Based Oversampling Method for Multi-label Imbalanced Text DataSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
In real-world applications, as data availability increases, obtaining labeled data for machine learning (ML) projects remains challenging due to the high costs and intensive efforts required for data annotation. Many ML projects, particularly those focused on multi-label classification, also grapple with data imbalance issues, where certain classes may lack sufficient data to train effective classifiers. This study introduces and examines a novel oversampling method for multi-label text classification, designed to address performance challenges associated with data imbalance. The proposed method identifies potential new samples from unlabeled data by leveraging similarity measures between instances. By iteratively searching the unlabeled dataset, the method locates instances similar to those in underrepresented classes and evaluates their contribution to classifier performance enhancement. Instances that demonstrate performance improvement are then added to the labeled dataset. Experimental results indicate that the proposed approach effectively enhances classifier performance post-oversampling.
- [513] arXiv:2411.01075 (replaced) [pdf, other]
-
Title: Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer ModelsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Training transformer models requires substantial GPU compute and memory resources. In homogeneous clusters, distributed strategies allocate resources evenly, but this approach is inefficient for heterogeneous clusters, where GPUs differ in power and memory. As high-end GPUs are costly and limited in availability, heterogeneous clusters with diverse GPU types are becoming more common. Existing methods attempt to balance compute across GPUs based on capacity but often underutilize compute due to memory constraints. We present Cephalo, a system that optimizes compute and memory usage by decoupling compute distribution from training state assignment. Cephalo outperforms state-of-the-art methods by achieving significantly higher training throughput while supporting larger models and batch sizes.
- [514] arXiv:2411.01881 (replaced) [pdf, html, other]
-
Title: Causal Discovery and Classification Using Lempel-Ziv ComplexityComments: 17 pages, 8 figures, 5 tablesSubjects: Machine Learning (cs.LG); Methodology (stat.ME)
Inferring causal relationships in the decision-making processes of machine learning algorithms is a crucial step toward achieving explainable Artificial Intelligence (AI). In this research, we introduce a novel causality measure and a distance metric derived from Lempel-Ziv (LZ) complexity. We explore how the proposed causality measure can be used in decision trees by enabling splits based on features that most strongly \textit{cause} the outcome. We further evaluate the effectiveness of the causality-based decision tree and the distance-based decision tree in comparison to a traditional decision tree using Gini impurity. While the proposed methods demonstrate comparable classification performance overall, the causality-based decision tree significantly outperforms both the distance-based decision tree and the Gini-based decision tree on datasets generated from causal models. This result indicates that the proposed approach can capture insights beyond those of classical decision trees, especially in causally structured data. Based on the features used in the LZ causal measure based decision tree, we introduce a causal strength for each features in the dataset so as to infer the predominant causal variables for the occurrence of the outcome.
- [515] arXiv:2411.03839 (replaced) [pdf, html, other]
-
Title: Noisy Linear Group Testing: Exact Thresholds and Efficient AlgorithmsSubjects: Discrete Mathematics (cs.DM); Information Theory (cs.IT); Combinatorics (math.CO)
In group testing, the task is to identify defective items by testing groups of them together using as few tests as possible. We consider the setting where each item is defective with a constant probability $\alpha$, independent of all other items. In the (over-)idealized noiseless setting, tests are positive exactly if any of the tested items are defective. We study a more realistic model in which observed test results are subject to noise, i.e., tests can display false positive or false negative results with constant positive probabilities. We determine precise constants $c$ such that $cn\log n$ tests are required to recover the infection status of every individual for both adaptive and non-adaptive group testing: in the former, the selection of groups to test can depend on previously observed test results, whereas it cannot in the latter. Additionally, for both settings, we provide efficient algorithms that identify all defective items with the optimal amount of tests with high probability. Thus, we completely solve the problem of binary noisy group testing in the studied setting.
- [516] arXiv:2411.04204 (replaced) [pdf, html, other]
-
Title: Online Budgeted Matching with General BidsComments: Accepted by NeurIPS 2024Subjects: Computer Science and Game Theory (cs.GT); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Online Budgeted Matching (OBM) is a classic problem with important applications in online advertising, online service matching, revenue management, and beyond. Traditional online algorithms typically assume a small bid setting, where the maximum bid-to-budget ratio (\kappa) is infinitesimally small. While recent algorithms have tried to address scenarios with non-small or general bids, they often rely on the Fractional Last Matching (FLM) assumption, which allows for accepting partial bids when the remaining budget is insufficient. This assumption, however, does not hold for many applications with indivisible bids. In this paper, we remove the FLM assumption and tackle the open problem of OBM with general bids. We first establish an upper bound of 1-\kappa on the competitive ratio for any deterministic online algorithm. We then propose a novel meta algorithm, called MetaAd, which reduces to different algorithms with first known provable competitive ratios parameterized by the maximum bid-to-budget ratio \kappa \in [0, 1]. As a by-product, we extend MetaAd to the FLM setting and get provable competitive algorithms. Finally, we apply our competitive analysis to the design learning-augmented algorithms.
- [517] arXiv:2411.04872 (replaced) [pdf, html, other]
-
Title: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AIElliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, Mark WildonSubjects: Artificial Intelligence (cs.AI)
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
- [518] arXiv:2411.04997 (replaced) [pdf, html, other]
-
Title: LLM2CLIP: Powerful Language Model Unlocks Richer Visual RepresentationWeiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili QiuSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
CLIP is one of the most important multimodal foundational models today. What powers CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape a powerful cross-modal representation space. However, with the rapid advancements in large language models LLMs like GPT-4 and LLaMA, the boundaries of language comprehension and generation are continually being pushed. This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning? The potential benefits of incorporating LLMs into CLIP are clear. LLMs' strong textual understanding can fundamentally improve CLIP's ability to handle image captions, drastically enhancing its ability to process long and complex texts, a well-known limitation of vanilla CLIP. Moreover, LLMs are trained on a vast corpus of text, possessing open-world knowledge. This allows them to expand on caption information during training, increasing the efficiency of the learning process. In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer's textual discriminability. We then design an efficient training process where the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM's presence, we can now incorporate longer and more complex captions without being restricted by vanilla CLIP's text encoder's context window and ability limitations. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
- [519] arXiv:2411.05196 (replaced) [pdf, html, other]
-
Title: Explainable AI through a Democratic Lens: DhondtXAI for Proportional Feature Importance Using the D'Hondt MethodSubjects: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
In democratic societies, electoral systems play a crucial role in translating public preferences into political representation. Among these, the D'Hondt method is widely used to ensure proportional representation, balancing fair representation with governmental stability. Recently, there has been a growing interest in applying similar principles of proportional representation to enhance interpretability in machine learning, specifically in Explainable AI (XAI). This study investigates the integration of D'Hondt-based voting principles in the DhondtXAI method, which leverages resource allocation concepts to interpret feature importance within AI models. Through a comparison of SHAP (Shapley Additive Explanations) and DhondtXAI, we evaluate their effectiveness in feature attribution within CatBoost and XGBoost models for breast cancer and diabetes prediction, respectively. The DhondtXAI approach allows for alliance formation and thresholding to enhance interpretability, representing feature importance as seats in a parliamentary view. Statistical correlation analyses between SHAP values and DhondtXAI allocations support the consistency of interpretations, demonstrating DhondtXAI's potential as a complementary tool for understanding feature importance in AI models. The results highlight that integrating electoral principles, such as proportional representation and alliances, into AI explainability can improve user understanding, especially in high-stakes fields like healthcare.
- [520] arXiv:2411.05521 (replaced) [pdf, html, other]
-
Title: SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query BenchmarkComments: NeurIPS 2024 Track Datasets and BenchmarksSubjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Electronic health records (EHRs) are stored in various database systems with different database models on heterogeneous storage architectures, such as relational databases, document stores, or graph databases. These different database models have a big impact on query complexity and performance. While this has been a known fact in database research, its implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far. In this paper, we present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark based on synthetic patient data from Synthea, following the SNOMED-CT taxonomy -- a widely used knowledge graph ontology covering medical terminology. SM3-Text-to-Query provides data representations for relational databases (PostgreSQL), document stores (MongoDB), and graph databases (Neo4j and GraphDB (RDF)), allowing the evaluation across four popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically and manually develop 408 template questions, which we augment to construct a benchmark of 10K diverse natural language question/query pairs for these four query languages (40K pairs overall). On our dataset, we evaluate several common in-context-learning (ICL) approaches for a set of representative closed and open-source LLMs. Our evaluation sheds light on the trade-offs between database models and query languages for different ICL strategies and LLMs. Last, SM3-Text-to-Query is easily extendable to additional query languages or real, standard-based patient databases.
- [521] arXiv:2411.05757 (replaced) [pdf, html, other]
-
Title: Tract-RLFormer: A Tract-Specific RL policy based Decoder-only Transformer NetworkAnkita Joshi, Ashutosh Sharma, Anoushkrit Goel, Ranjeet Ranjan Jha, Chirag Ahuja, Arnav Bhavsar, Aditya NigamComments: Accepted at 27th International Conference on Pattern Recognition (ICPR), 2024Subjects: Machine Learning (cs.LG)
Fiber tractography is a cornerstone of neuroimaging, enabling the detailed mapping of the brain's white matter pathways through diffusion MRI. This is crucial for understanding brain connectivity and function, making it a valuable tool in neurological applications. Despite its importance, tractography faces challenges due to its complexity and susceptibility to false positives, misrepresenting vital pathways. To address these issues, recent strategies have shifted towards deep learning, utilizing supervised learning, which depends on precise ground truth, or reinforcement learning, which operates without it. In this work, we propose Tract-RLFormer, a network utilizing both supervised and reinforcement learning, in a two-stage policy refinement process that markedly improves the accuracy and generalizability across various data-sets. By employing a tract-specific approach, our network directly delineates the tracts of interest, bypassing the traditional segmentation process. Through rigorous validation on datasets such as TractoInferno, HCP, and ISMRM-2015, our methodology demonstrates a leap forward in tractography, showcasing its ability to accurately map the brain's white matter tracts.
- [522] arXiv:2411.05777 (replaced) [pdf, html, other]
-
Title: Quantitative Assessment of Intersectional Empathetic Bias and UnderstandingSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
A growing amount of literature critiques the current operationalizations of empathy based on loose definitions of the construct. Such definitions negatively affect dataset quality, model robustness, and evaluation reliability. We propose an empathy evaluation framework that operationalizes empathy close to its psychological origins. The framework measures the variance in responses of LLMs to prompts using existing metrics for empathy and emotional valence. The variance is introduced through the controlled generation of the prompts by varying social biases affecting context understanding, thus impacting empathetic understanding. The control over generation ensures high theoretical validity of the constructs in the prompt dataset. Also, it makes high-quality translation, especially into languages that currently have little-to-no way of evaluating empathy or bias, such as the Slavonic family, more manageable. Using chosen LLMs and various prompt types, we demonstrate the empathy evaluation with the framework, including multiple-choice answers and free generation. The variance in our initial evaluation sample is small and we were unable to measure convincing differences between the empathetic understanding in contexts given by different social groups. However, the results are promising because the models showed significant alterations their reasoning chains needed to capture the relatively subtle changes in the prompts. This provides the basis for future research into the construction of the evaluation sample and statistical methods for measuring the results.
- [523] arXiv:2411.05836 (replaced) [pdf, other]
-
Title: Prion-ViT: Prions-Inspired Vision Transformers for Temperature prediction with SpecklegramsSubjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Optics (physics.optics)
Fiber Specklegram Sensors (FSS) are vital for environmental monitoring due to their high temperature sensitivity, but their complex data poses challeng-es for predictive models. This study introduces Prion-ViT, a prion-inspired Vision Transformer model, inspired by biological prion memory mecha-nisms, to improve long-term dependency modeling and temperature prediction accuracy using FSS data. Prion-ViT leverages a persistent memory state to retain and propagate key features across layers, reducing mean absolute error (MAE) to 0.52°C and outperforming models like ResNet, Inception Net V2, and standard vision transformers. This work highlights Prion-ViT's potential for real-time industrial temperature monitoring and broader optical sensing applications.
- [524] arXiv:2411.06383 (replaced) [pdf, html, other]
-
Title: Program Analysis via Multiple Context Free Language ReachabilityComments: Accepted at POPL 2024Subjects: Programming Languages (cs.PL); Computational Complexity (cs.CC); Formal Languages and Automata Theory (cs.FL)
Context-free language (CFL) reachability is a standard approach in static analyses, where the analysis question is phrased as a language reachability problem on a graph $G$ wrt a CFL L. While CFLs lack the expressiveness needed for high precision, common formalisms for context-sensitive languages are such that the corresponding reachability problem is undecidable. Are there useful context-sensitive language-reachability models for static analysis?
In this paper, we introduce Multiple Context-Free Language (MCFL) reachability as an expressive yet tractable model for static program analysis. MCFLs form an infinite hierarchy of mildly context sensitive languages parameterized by a dimension $d$ and a rank $r$. We show the utility of MCFL reachability by developing a family of MCFLs that approximate interleaved Dyck reachability, a common but undecidable static analysis problem.
We show that MCFL reachability be computed in $O(n^{2d+1})$ time on a graph of $n$ nodes when $r=1$, and $O(n^{d(r+1)})$ time when $r>1$. Moreover, we show that when $r=1$, the membership problem has a lower bound of $n^{2d}$ based on the Strong Exponential Time Hypothesis, while reachability for $d=1$ has a lower bound of $n^{3}$ based on the combinatorial Boolean Matrix Multiplication Hypothesis. Thus, for $r=1$, our algorithm is optimal within a factor $n$ for all levels of the hierarchy based on $d$.
We implement our MCFL reachability algorithm and evaluate it by underapproximating interleaved Dyck reachability for a standard taint analysis for Android. Used alongside existing overapproximate methods, MCFL reachability discovers all tainted information on 8 out of 11 benchmarks, and confirms $94.3\%$ of the reachable pairs reported by the overapproximation on the remaining 3. To our knowledge, this is the first report of high and provable coverage for this challenging benchmark set. - [525] arXiv:2411.06493 (replaced) [pdf, html, other]
-
Title: LProtector: An LLM-driven Vulnerability Detection SystemComments: 5 pages, 4 figures. This is a preprint version of the article. The final version will be published in the proceedings of the IEEE conferenceSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
This paper presents LProtector, an automated vulnerability detection system for C/C++ codebases driven by the large language model (LLM) GPT-4o and Retrieval-Augmented Generation (RAG). As software complexity grows, traditional methods face challenges in detecting vulnerabilities effectively. LProtector leverages GPT-4o's powerful code comprehension and generation capabilities to perform binary classification and identify vulnerabilities within target codebases. We conducted experiments on the Big-Vul dataset, showing that LProtector outperforms two state-of-the-art baselines in terms of F1 score, demonstrating the potential of integrating LLMs with vulnerability detection.
- [526] arXiv:2411.06503 (replaced) [pdf, html, other]
-
Title: Diffusion Sampling Correction via Approximately 10 ParametersSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Diffusion Probabilistic Models (DPMs) have demonstrated exceptional performance in generative tasks, but this comes at the expense of sampling efficiency. To enhance sampling speed without sacrificing quality, various distillation-based accelerated sampling algorithms have been recently proposed. However, they typically require significant additional training costs and model parameter storage, which limit their practical application. In this work, we propose PCA-based Adaptive Search (PAS), which optimizes existing solvers for DPMs with minimal learnable parameters and training costs. Specifically, we first employ PCA to obtain a few orthogonal unit basis vectors to span the high-dimensional sampling space, which enables us to learn just a set of coordinates to correct the sampling direction; furthermore, based on the observation that the cumulative truncation error exhibits an ``S''-shape, we design an adaptive search strategy that further enhances the sampling efficiency and reduces the number of stored parameters to approximately 10. Extensive experiments demonstrate that PAS can significantly enhance existing fast solvers in a plug-and-play manner with negligible costs. For instance, on CIFAR10, PAS requires only 12 parameters and less than 1 minute of training on a single NVIDIA A100 GPU to optimize the DDIM from 15.69 FID (NFE=10) to 4.37.
- [527] arXiv:2411.06542 (replaced) [pdf, html, other]
-
Title: Is Linear Feedback on Smoothed Dynamics Sufficient for Stabilizing Contact-Rich Plans?Yuki Shirai, Tong Zhao, H.J. Terry Suh, Huaijiang Zhu, Xinpei Ni, Jiuguang Wang, Max Simchowitz, Tao PangComments: Under review for ICRA2025Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Designing planners and controllers for contact-rich manipulation is extremely challenging as contact violates the smoothness conditions that many gradient-based controller synthesis tools assume. Contact smoothing approximates a non-smooth system with a smooth one, allowing one to use these synthesis tools more effectively. However, applying classical control synthesis methods to smoothed contact dynamics remains relatively under-explored. This paper analyzes the efficacy of linear controller synthesis using differential simulators based on contact smoothing. We introduce natural baselines for leveraging contact smoothing to compute (a) open-loop plans robust to uncertain conditions and/or dynamics, and (b) feedback gains to stabilize around open-loop plans. Using robotic bimanual whole-body manipulation as a testbed, we perform extensive empirical experiments on over 300 trajectories and analyze why LQR seems insufficient for stabilizing contact-rich plans. The video summarizing this paper and hardware experiments is found here: this https URL.
- [528] arXiv:2411.06635 (replaced) [pdf, other]
-
Title: Mixed Effects Deep Learning for the interpretable analysis of single cell RNA sequencing data by quantifying and visualizing batch effectsComments: Main manuscript: 29 pages, including 10 figures and 8 tables. Supplemental material: 17 pagesSubjects: Machine Learning (cs.LG); Genomics (q-bio.GN)
Single-cell RNA sequencing (scRNA-seq) data are often confounded by technical or biological batch effects. Existing deep learning models mitigate these effects but often discard batch-specific information, potentially losing valuable biological insights. We propose a Mixed Effects Deep Learning (MEDL) autoencoder framework that separately models batch-invariant (fixed effects) and batch-specific (random effects) components. By decoupling batch-invariant biological states from batch variations, our framework integrates both into predictive models. Our approach also generates 2D visualizations of how the same cell appears across batches, enhancing interpretability. Retaining both fixed and random effect latent spaces improves classification accuracy.
We applied our framework to three datasets spanning the cardiovascular system (Healthy Heart), Autism Spectrum Disorder (ASD), and Acute Myeloid Leukemia (AML). With 147 batches in the Healthy Heart dataset, far exceeding typical numbers, we tested our framework's ability to handle many batches. In the ASD dataset, our approach captured donor heterogeneity between autistic and healthy individuals. In the AML dataset, it distinguished donor heterogeneity despite missing cell types and diseased donors exhibiting both healthy and malignant cells. These results highlight our framework's ability to characterize fixed and random effects, enhance batch effect visualization, and improve prediction accuracy across diverse datasets. - [529] arXiv:2411.06651 (replaced) [pdf, html, other]
-
Title: Machine learning-enabled velocity model building with uncertainty quantificationSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Accurately characterizing migration velocity models is crucial for a wide range of geophysical applications, from hydrocarbon exploration to monitoring of CO2 sequestration projects. Traditional velocity model building methods such as Full-Waveform Inversion (FWI) are powerful but often struggle with the inherent complexities of the inverse problem, including noise, limited bandwidth, receiver aperture and computational constraints. To address these challenges, we propose a scalable methodology that integrates generative modeling, in the form of Diffusion networks, with physics-informed summary statistics, making it suitable for complicated imaging problems including field datasets. By defining these summary statistics in terms of subsurface-offset image volumes for poor initial velocity models, our approach allows for computationally efficient generation of Bayesian posterior samples for migration velocity models that offer a useful assessment of uncertainty. To validate our approach, we introduce a battery of tests that measure the quality of the inferred velocity models, as well as the quality of the inferred uncertainties. With modern synthetic datasets, we reconfirm gains from using subsurface-image gathers as the conditioning observable. For complex velocity model building involving salt, we propose a new iterative workflow that refines amortized posterior approximations with salt flooding and demonstrate how the uncertainty in the velocity model can be propagated to the final product reverse time migrated images. Finally, we present a proof of concept on field datasets to show that our method can scale to industry-sized problems.
- [530] arXiv:2411.06727 (replaced) [pdf, html, other]
-
Title: Can KAN Work? Exploring the Potential of Kolmogorov-Arnold Networks in Computer VisionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Kolmogorov-Arnold Networks(KANs), as a theoretically efficient neural network architecture, have garnered attention for their potential in capturing complex patterns. However, their application in computer vision remains relatively unexplored. This study first analyzes the potential of KAN in computer vision tasks, evaluating the performance of KAN and its convolutional variants in image classification and semantic segmentation. The focus is placed on examining their characteristics across varying data scales and noise levels. Results indicate that while KAN exhibits stronger fitting capabilities, it is highly sensitive to noise, limiting its robustness. To address this challenge, we propose a smoothness regularization method and introduce a Segment Deactivation technique. Both approaches enhance KAN's stability and generalization, demonstrating its potential in handling complex visual data tasks.
- [531] arXiv:2411.07104 (replaced) [pdf, html, other]
-
Title: Learning Multi-Agent Loco-Manipulation for Long-Horizon Quadrupedal PushingYuming Feng, Chuye Hong, Yaru Niu, Shiqi Liu, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Jie Tan, Ding ZhaoSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Recently, quadrupedal locomotion has achieved significant success, but their manipulation capabilities, particularly in handling large objects, remain limited, restricting their usefulness in demanding real-world applications such as search and rescue, construction, industrial automation, and room organization. This paper tackles the task of obstacle-aware, long-horizon pushing by multiple quadrupedal robots. We propose a hierarchical multi-agent reinforcement learning framework with three levels of control. The high-level controller integrates an RRT planner and a centralized adaptive policy to generate subgoals, while the mid-level controller uses a decentralized goal-conditioned policy to guide the robots toward these sub-goals. A pre-trained low-level locomotion policy executes the movement commands. We evaluate our method against several baselines in simulation, demonstrating significant improvements over baseline approaches, with 36.0% higher success rates and 24.5% reduction in completion time than the best baseline. Our framework successfully enables long-horizon, obstacle-aware manipulation tasks like Push-Cuboid and Push-T on Go1 robots in the real world.
- [532] arXiv:2411.07106 (replaced) [pdf, html, other]
-
Title: Topological Characterization of Stabilizing ConsensusSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); General Topology (math.GN)
We provide a complete characterization of the solvability/impossibility of deterministic stabilizing consensus in any computing model with benign process and communication faults using point-set topology. Relying on the topologies for infinite executions introduced by Nowak, Schmid and Winkler (JACM, 2024) for terminating consensus, we prove that semi-open decision sets and semi-continuous decision functions as introduced by Levin (AMM, 1963) are the appropriate means for this characterization: Unlike the decision functions for terminating consensus, which are continuous, semi-continuous functions do not require the inverse image of an open set to be open and hence allow to map a connected space to a disconnected one. We also show that multi-valued stabilizing consensus with weak and strong validity are equivalent, as is the case for terminating consensus. By applying our results to (variants of) all the known possibilities/impossibilities for stabilizing consensus, we easily provide a topological explanation of these results.
- [533] arXiv:2411.07176 (replaced) [pdf, html, other]
-
Title: More Expressive Attention with Negative WeightsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention can shift the token deletion and copying function from a static OV matrix to dynamic QK inner products, with the OV matrix now focusing more on refinement or modification. The attention head can simultaneously delete, copy, or retain tokens by assigning them negative, positive, or minimal attention weights, respectively. As a result, a single attention head becomes more flexible and expressive. (2) Cog Attention improves the model's robustness against representational collapse, which can occur when earlier tokens are over-squashed into later positions, leading to homogeneous representations. Negative weights reduce effective information paths from earlier to later tokens, helping to mitigate this issue. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
- [534] arXiv:2411.07211 (replaced) [pdf, other]
-
Title: Exo 2: Growing a Scheduling LanguageComments: To appear in ASPLOS 2025. The arXiv version contains full appendicesSubjects: Programming Languages (cs.PL)
User-schedulable languages (USLs) help programmers productively optimize programs by providing safe means of transforming them. Current USLs are designed to give programmers exactly the control they want, while automating all other concerns. However, there is no universal answer for what performance-conscious programmers want to control, how they want to control it, and what they want to automate, even in relatively narrow domains.
We claim that USLs should, instead, be designed to grow. We present Exo 2, a scheduling language that enables users to define new scheduling operations externally to the compiler. By composing a set of trusted, fine-grained primitives, users can safely write their own scheduling library to build up desired automation. We identify actions (ways of modifying code), inspection (ways of interrogating code), and references (ways of pointing to code) as essential for any user-extensible USL.
We fuse these ideas into a new mechanism called Cursors that enables the creation of scheduling libraries in user code. We demonstrate libraries that amortize scheduling effort across more than 80 high-performance kernels, reducing total scheduling code by an order of magnitude and delivering performance competitive with state-of-the-art implementations on three different platforms. - [535] arXiv:2411.07217 (replaced) [pdf, html, other]
-
Title: Feature Selection Based on Wasserstein DistanceSubjects: Machine Learning (cs.LG)
This paper presents a novel feature selection method leveraging the Wasserstein distance to improve feature selection in machine learning. Unlike traditional methods based on correlation or Kullback-Leibler (KL) divergence, our approach uses the Wasserstein distance to assess feature similarity, inherently capturing class relationships and making it robust to noisy labels. We introduce a Markov blanket-based feature selection algorithm and demonstrate its effectiveness. Our analysis shows that the Wasserstein distance-based feature selection method effectively reduces the impact of noisy labels without relying on specific noise models. We provide a lower bound on its effectiveness, which remains meaningful even in the presence of noise. Experimental results across multiple datasets demonstrate that our approach consistently outperforms traditional methods, particularly in noisy settings.
- [536] arXiv:2411.07579 (replaced) [pdf, html, other]
-
Title: Projecting Gaussian Ellipsoids While Avoiding Affine Projection ApproximationSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Recently, 3D Gaussian Splatting has dominated novel-view synthesis with its real-time rendering speed and state-of-the-art rendering quality. However, during the rendering process, the use of the Jacobian of the affine approximation of the projection transformation leads to inevitable errors, resulting in blurriness, artifacts and a lack of scene consistency in the final rendered images. To address this issue, we introduce an ellipsoid-based projection method to calculate the projection of Gaussian ellipsoid onto the image plane, which is the primitive of 3D Gaussian Splatting. As our proposed ellipsoid-based projection method cannot handle Gaussian ellipsoids with camera origins inside them or parts lying below $z=0$ plane in the camera space, we designed a pre-filtering strategy. Experiments over multiple widely adopted benchmark datasets show that our ellipsoid-based projection method can enhance the rendering quality of 3D Gaussian Splatting and its extensions.
- [537] arXiv:2411.07635 (replaced) [pdf, html, other]
-
Title: Breaking the Low-Rank Dilemma of Linear AttentionSubjects: Computer Vision and Pattern Recognition (cs.CV)
The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far more efficient solution by reducing the complexity to linear levels. However, compared to Softmax attention, linear attention often experiences significant performance degradation. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map, which hinders its ability to adequately model complex spatial information. In this paper, to break the low-rank dilemma of linear attention, we conduct rank analysis from two perspectives: the KV buffer and the output features. Consequently, we introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency. Based on RALA, we construct the Rank-Augmented Vision Linear Transformer (RAVLT). Extensive experiments demonstrate that RAVLT achieves excellent performance across various vision tasks. Specifically, without using any additional labels, data, or supervision during training, RAVLT achieves an 84.4% Top-1 accuracy on ImageNet-1k with only 26M parameters and 4.6G FLOPs. This result significantly surpasses previous linear attention mechanisms, fully illustrating the potential of RALA. Code will be available at this https URL.
- [538] arXiv:2411.07668 (replaced) [pdf, html, other]
-
Title: Towards Evaluation Guidelines for Empirical Studies involving LLMsComments: 4 pagesSubjects: Software Engineering (cs.SE)
In the short period since the release of ChatGPT in November 2022, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process (e.g., for data annotation) or studies that evaluate existing or new tools that are based on LLMs. This paper contributes the first set of guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of what our community standards are for high-quality empirical studies involving LLMs.
- [539] arXiv:2411.07870 (replaced) [pdf, html, other]
-
Title: Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual DecodersJournal-ref: EMNLP CustomNLP4U 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Although people are impressed by the content generation skills of large language models, the use of LLMs, such as ChatGPT, is limited by the domain grounding of the content. The correctness and groundedness of the generated content need to be based on a verified context, such as results from Retrieval-Augmented Generation (RAG). One important issue when adapting LLMs to a customized domain is that the generated responses are often incomplete, or the additions are not verified and may even be hallucinated. Prior studies on hallucination detection have focused on evaluation metrics, which are not easily adaptable to dynamic domains and can be vulnerable to attacks like jail-breaking. In this work, we propose 1) a post-processing algorithm that leverages knowledge triplets in RAG context to correct hallucinations and 2) a dual-decoder model that fuses RAG context to guide the generation process.
- [540] arXiv:2411.08278 (replaced) [pdf, html, other]
-
Title: Knowledge Bases in Support of Large Language Models for Processing Web NewsComments: 10 pages, 5 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have received considerable interest in wide applications lately. During pre-training via massive datasets, such a model implicitly memorizes the factual knowledge of trained datasets in its hidden parameters. However, knowledge held implicitly in parameters often makes its use by downstream applications ineffective due to the lack of common-sense reasoning. In this article, we introduce a general framework that permits to build knowledge bases with an aid of LLMs, tailored for processing Web news. The framework applies a rule-based News Information Extractor (NewsIE) to news items for extracting their relational tuples, referred to as knowledge bases, which are then graph-convoluted with the implicit knowledge facts of news items obtained by LLMs, for their classification. It involves two lightweight components: 1) NewsIE: for extracting the structural information of every news item, in the form of relational tuples; 2) BERTGraph: for graph convoluting the implicit knowledge facts with relational tuples extracted by NewsIE. We have evaluated our framework under different news-related datasets for news category classification, with promising experimental results.
- [541] arXiv:2411.08411 (replaced) [pdf, html, other]
-
Title: Modeling and Optimization for Rotatable Antenna Enabled Wireless CommunicationComments: 7 pages, 6 figuresSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Fluid antenna system (FAS)/movable antenna (MA) has emerged as a promising technology to fully exploit the spatial degrees of freedom (DoFs). In this paper, we propose a new rotatable antenna (RA) model, as a simplified implementation of six-dimensional movable antenna (6DMA), to improve the performance of wireless communication systems. Different from conventional fixed-position antenna (FPA), the proposed RA system can independently and flexibly change the three-dimensional (3D) orientation of each antenna by adjusting its declination angles to achieve desired channel realizations. Specifically, we study an RA-enabled uplink communication system, where the receive beamforming and the declination angles of all RAs are jointly optimized to maximize the minimum signal-to-interference-plus-noise ratio (SINR) among all the users. In the special single-user and free-space propagation setup, the optimal declination angles are derived in closed form with the maximum-ratio combining (MRC) beamformer applied at the base station (BS). In the general multi-user and multi-path setup, we propose an alternating optimization (AO) algorithm to alternately optimize the receive beamforming and the declination angles in an iterative manner. Simulation results are provided to demonstrate that the proposed RA-enabled system can significantly outperform other benchmark schemes.
- [542] arXiv:2411.08504 (replaced) [pdf, html, other]
-
Title: Towards Objective and Unbiased Decision Assessments with LLM-Enhanced Hierarchical Attention NetworksComments: Source code is available at: this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
How objective and unbiased are we while making decisions? This work investigates cognitive bias identification in high-stake decision making process by human experts, questioning its effectiveness in real-world settings, such as candidates assessments for university admission. We begin with a statistical analysis assessing correlations among different decision points among in the current process, which discovers discrepancies that imply cognitive bias and inconsistency in decisions. This motivates our exploration of bias-aware AI-augmented workflow that surpass human judgment. We propose BGM-HAN, an enhanced Hierarchical Attention Network with Byte-Pair Encoding, Gated Residual Connections and Multi-Head Attention. Using it as a backbone model, we further propose a Shortlist-Analyse-Recommend (SAR) agentic workflow, which simulate real-world decision-making. In our experiments, both the proposed model and the agentic workflow significantly improves on both human judgment and alternative models, validated with real-world data.
- [543] arXiv:2411.08509 (replaced) [pdf, html, other]
-
Title: Sum Rate Maximization for Movable Antenna-Aided Downlink RSMA SystemsSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Rate splitting multiple access (RSMA) is regarded as a crucial and powerful physical layer (PHY) paradigm for next-generation communication systems. Particularly, users employ successive interference cancellation (SIC) to decode part of the interference while treating the remainder as noise. However, conventional RSMA systems rely on fixed-position antenna arrays, limiting their ability to fully exploit spatial diversity. This constraint reduces beamforming gain and significantly impairs RSMA performance. To address this problem, we propose a movable antenna (MA)-aided RSMA scheme that allows the antennas at the base station (BS) to dynamically adjust their positions. Our objective is to maximize the system sum rate of common and private messages by jointly optimizing the MA positions, beamforming matrix, and common rate allocation. To tackle the formulated non-convex problem, we apply fractional programming (FP) and develop an efficient two-stage, coarse-to-fine-grained searching (CFGS) algorithm to obtain high-quality solutions. Numerical results demonstrate that, with optimized antenna adjustments, the MA-enabled system achieves substantial performance and reliability improvements in RSMA over fixed-position antenna setups.
- [544] arXiv:2411.08586 (replaced) [pdf, other]
-
Title: Optimizing Automatic Summarization of Long Clinical Records Using Dynamic Context Extension:Testing and Evaluation of the NBCE MethodSubjects: Artificial Intelligence (cs.AI)
Summarizing patient clinical notes is vital for reducing documentation burdens. Current manual summarization makes medical staff struggle. We propose an automatic method using LLMs, but long inputs cause LLMs to lose context, reducing output quality especially in small size model. We used a 7B model, open-calm-7b, enhanced with Native Bayes Context Extend and a redesigned decoding mechanism to reference one sentence at a time, keeping inputs within context windows, 2048 tokens. Our improved model achieved near parity with Google's over 175B Gemini on ROUGE-L metrics with 200 samples, indicating strong performance using less resources, enhancing automated EMR summarization feasibility.
- [545] arXiv:2411.08656 (replaced) [pdf, html, other]
-
Title: MikuDance: Animating Character Art with Mixed Motion DynamicsSubjects: Computer Vision and Pattern Recognition (cs.CV)
We propose MikuDance, a diffusion-based pipeline incorporating mixed motion dynamics to animate stylized character art. MikuDance consists of two key techniques: Mixed Motion Modeling and Mixed-Control Diffusion, to address the challenges of high-dynamic motion and reference-guidance misalignment in character art animation. Specifically, a Scene Motion Tracking strategy is presented to explicitly model the dynamic camera in pixel-wise space, enabling unified character-scene motion modeling. Building on this, the Mixed-Control Diffusion implicitly aligns the scale and body shape of diverse characters with motion guidance, allowing flexible control of local character motion. Subsequently, a Motion-Adaptive Normalization module is incorporated to effectively inject global scene motion, paving the way for comprehensive character art animation. Through extensive experiments, we demonstrate the effectiveness and generalizability of MikuDance across various character art and motion guidance, consistently producing high-quality animations with remarkable motion dynamics.
- [546] arXiv:2411.08674 (replaced) [pdf, html, other]
-
Title: Reducing ADC Front-end Costs During Training of On-sensor Printed Multilayer PerceptronsComments: This article is accepted for publication in IEEE Embedded Systems LettersSubjects: Hardware Architecture (cs.AR); Neural and Evolutionary Computing (cs.NE)
Printed electronics technology offers a cost-effectiveand fully-customizable solution to computational needs beyondthe capabilities of traditional silicon technologies, offering ad-vantages such as on-demand manufacturing and conformal, low-cost hardware. However, the low-resolution fabrication of printedelectronics, which results in large feature sizes, poses a challengefor integrating complex designs like those of machine learn-ing (ML) classification systems. Current literature optimizes onlythe Multilayer Perceptron (MLP) circuit within the classificationsystem, while the cost of analog-to-digital converters (ADCs)is overlooked. Printed applications frequently require on-sensorprocessing, yet while the digital classifier has been extensivelyoptimized, the analog-to-digital interfacing, specifically the ADCs,dominates the total area and energy consumption. In this work,we target digital printed MLP classifiers and we propose thedesign of customized ADCs per MLP's input which involvesminimizing the distinct represented numbers for each input,simplifying thus the ADC's circuitry. Incorporating this ADCoptimization in the MLP training, enables eliminating ADC levelsand the respective comparators, while still maintaining highclassification accuracy. Our approach achieves 11.2x lower ADCarea for less than 5% accuracy drop across varying MLPs.
- [547] arXiv:2411.08702 (replaced) [pdf, html, other]
-
Title: A Deep Uzawa-Lagrange Multiplier Approach for Boundary Conditions in PINNs and Deep Ritz MethodsComments: 19 pages, 13 figures; updated address of authorSubjects: Numerical Analysis (math.NA)
We introduce a deep learning-based framework for weakly enforcing boundary conditions in the numerical approximation of partial differential equations. Building on existing physics-informed neural network and deep Ritz methods, we propose the Deep Uzawa algorithm, which incorporates Lagrange multipliers to handle boundary conditions effectively. This modification requires only a minor computational adjustment but ensures enhanced convergence properties and provably accurate enforcement of boundary conditions, even for singularly perturbed problems.
We provide a comprehensive mathematical analysis demonstrating the convergence of the scheme and validate the effectiveness of the Deep Uzawa algorithm through numerical experiments, including high-dimensional, singularly perturbed problems and those posed over non-convex domains. - [548] arXiv:2411.08733 (replaced) [pdf, html, other]
-
Title: Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language ModelsComments: EMNLP 2024 MainSubjects: Computation and Language (cs.CL)
Aligning Large Language Models (LLMs) traditionally relies on costly training and human preference annotations. Self-alignment seeks to reduce these expenses by enabling models to align themselves. To further lower costs and achieve alignment without any expensive tuning or annotations, we introduce a new tuning-free approach for self-alignment, Dynamic Rewarding with Prompt Optimization (DRPO). Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft the optimal alignment instructions, all without additional training or human intervention. The core of DRPO is a dynamic rewarding mechanism, which identifies and rectifies model-specific alignment weaknesses, allowing LLMs to adapt efficiently to diverse alignment challenges. Empirical evaluations on eight recent LLMs, both open- and closed-sourced, demonstrate that DRPO significantly enhances alignment performance, with base models outperforming their SFT/RLHF-tuned counterparts. Moreover, the prompts automatically optimized by DRPO surpass those curated by human experts, further validating the effectiveness of our approach. Our findings highlight the great potential of current LLMs to achieve adaptive self-alignment through inference-time optimization, complementing tuning-based alignment methods.
- [549] arXiv:2411.08756 (replaced) [pdf, html, other]
-
Title: Masked Image Modeling Boosting Semi-Supervised Semantic SegmentationComments: 13 pages. This work has been submitted to the IEEE for possible publicationSubjects: Computer Vision and Pattern Recognition (cs.CV)
In view of the fact that semi- and self-supervised learning share a fundamental principle, effectively modeling knowledge from unlabeled data, various semi-supervised semantic segmentation methods have integrated representative self-supervised learning paradigms for further regularization. However, the potential of the state-of-the-art generative self-supervised paradigm, masked image modeling, has been scarcely studied. This paradigm learns the knowledge through establishing connections between the masked and visible parts of masked image, during the pixel reconstruction process. By inheriting and extending this insight, we successfully leverage masked image modeling to boost semi-supervised semantic segmentation. Specifically, we introduce a novel class-wise masked image modeling that independently reconstructs different image regions according to their respective classes. In this way, the mask-induced connections are established within each class, mitigating the semantic confusion that arises from plainly reconstructing images in basic masked image modeling. To strengthen these intra-class connections, we further develop a feature aggregation strategy that minimizes the distances between features corresponding to the masked and visible parts within the same class. Additionally, in semantic space, we explore the application of masked image modeling to enhance regularization. Extensive experiments conducted on well-known benchmarks demonstrate that our approach achieves state-of-the-art performance. The code will be available at this https URL.
- [550] arXiv:2208.14424 (replaced) [pdf, html, other]
-
Title: Inevitability of knowing less than nothingComments: v3: final version accepted for publication in Quantum JournalSubjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Mathematical Physics (math-ph)
A colloquial interpretation of entropy is that it is the knowledge gained upon learning the outcome of a random experiment. Conditional entropy is then interpreted as the knowledge gained upon learning the outcome of one random experiment after learning the outcome of another, possibly statistically dependent, random experiment. In the classical world, entropy and conditional entropy take only non-negative values, consistent with the intuition that one has regarding the aforementioned interpretations. However, for certain entangled states, one obtains negative values when evaluating commonly accepted and information-theoretically justified formulas for the quantum conditional entropy, leading to the confounding conclusion that one can know less than nothing in the quantum world. Here, we introduce a physically motivated framework for defining quantum conditional entropy, based on two simple postulates inspired by the second law of thermodynamics (non-decrease of entropy) and extensivity of entropy, and we argue that all plausible definitions of quantum conditional entropy should respect these two postulates. We then prove that all plausible quantum conditional entropies take on negative values for certain entangled states, so that it is inevitable that one can know less than nothing in the quantum world. All of our arguments are based on constructions of physical processes that respect the first postulate, the one inspired by the second law of thermodynamics.
- [551] arXiv:2209.15051 (replaced) [pdf, html, other]
-
Title: Circularity of Thermodynamical Material Networks: Indicators, Examples, and AlgorithmsComments: To be submittedSubjects: Dynamical Systems (math.DS); Computational Engineering, Finance, and Science (cs.CE)
The transition towards a circular economy has gained importance over the last years since the traditional linear take-make-dispose paradigm is not sustainable in the long term. Recently, thermodynamical material networks (TMNs) [1] have been proposed as an approach to design material flows based on the idea that any supply chain can be seen as a set of thermodynamic compartments that can be added, removed, modified or connected differently. Compared to the well-established material flow analysis (MFA), TMNs leverage dynamical energy balances and ordinary differential equations along with the usual mass balances, thus tackling circular economy as a material network design problem analogous to traditional engineering design approaches (e.g., design of thermodynamic cycles, electrical and hydraulic networks) rather than as an analysis of stock-and-flow data. Hence, TMNs allow the depiction of highly dynamic material stocks and flows whose variations can occur in less than 1 minute; achieving such modelling accuracy with MFA would be more data intensive. In this paper, we first develop several circularity indicators of TMNs using a graph-based formalism. Then, we illustrate their calculation using two numerical examples for the case of fluid materials and one numerical example for the case of solid materials, for which the detailed hybrid dynamical equations and simulation outputs are provided. The paper source code is publicly available.
- [552] arXiv:2210.14980 (replaced) [pdf, html, other]
-
Title: Interstellar Object Accessibility and Mission DesignBenjamin P. S. Donitz, Declan Mages, Hiroyasu Tsukamoto, Peter Dixon, Damon Landau, Soon-Jo Chung, Erica Bufanda, Michel Ingham, Julie Castillo-RogezComments: IEEE Aerospace Conference, Preprint Version, Accepted: November 2022Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Interstellar objects (ISOs) represent a compelling and under-explored category of celestial bodies, providing physical laboratories to understand the formation of our solar system and probe the composition and properties of material formed in exoplanetary systems. In this work, we investigate existing approaches to designing successful flyby missions to ISOs, including a deep learning-driven guidance and control algorithm for ISOs traveling at velocities over 60 km/s. We have generated spacecraft trajectories to a series of synthetic representative ISOs, simulating a ground campaign to observe the target and resolve its state, thereby determining the cruise and close approach delta-Vs required for the encounter. We discuss the accessibility of and mission design to ISOs with varying characteristics, with special focuses on 1) state covariance estimation throughout the cruise, 2) handoffs from traditional navigation approaches to novel autonomous navigation for fast flyby regimes, and 3) overall recommendations about preparing for the future in situ exploration of these targets. The lessons learned also apply to the fast flyby of other small bodies, e.g., long-period comets and potentially hazardous asteroids, which also require tactical responses with similar characteristics.
- [553] arXiv:2308.08539 (replaced) [pdf, html, other]
-
Title: Constant-depth circuits for Boolean functions and quantum memory devices using multi-qubit gatesComments: 54 pages, 11 figures. v2: corrected typos, added one figure and references; v3: published version in Quantum Journal, added more references, included a table of related results, improved the ancillary complexity for both QRAM and QRAG, changed the titleSubjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Emerging Technologies (cs.ET)
We explore the power of the unbounded Fan-Out gate and the Global Tunable gates generated by Ising-type Hamiltonians in constructing constant-depth quantum circuits, with particular attention to quantum memory devices. We propose two types of constant-depth constructions for implementing Uniformly Controlled Gates. These gates include the Fan-In gates defined by $|x\rangle|b\rangle\mapsto |x\rangle|b\oplus f(x)\rangle$ for $x\in\{0,1\}^n$ and $b\in\{0,1\}$, where $f$ is a Boolean function. The first of our constructions is based on computing the one-hot encoding of the control register $|x\rangle$, while the second is based on Boolean analysis and exploits different representations of $f$ such as its Fourier expansion. Via these constructions, we obtain constant-depth circuits for the quantum counterparts of read-only and read-write memory devices -- Quantum Random Access Memory (QRAM) and Quantum Random Access Gate (QRAG) -- of memory size $n$. The implementation based on one-hot encoding requires either $O(n\log^{(d)}{n}\log^{(d+1)}{n})$ ancillae and $O(n\log^{(d)}{n})$ Fan-Out gates or $O(n\log^{(d)}{n})$ ancillae and $16d-10$ Global Tunable gates, where $d$ is any positive integer and $\log^{(d)}{n} = \log\cdots \log{n}$ is the $d$-times iterated logarithm. On the other hand, the implementation based on Boolean analysis requires $8d-6$ Global Tunable gates at the expense of $O(n^{1/(1-2^{-d})})$ ancillae.
- [554] arXiv:2310.00289 (replaced) [pdf, html, other]
-
Title: Pubic Symphysis-Fetal Head Segmentation Using Pure Transformer with Bi-level Routing AttentionSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In this paper, we propose a method, named BRAU-Net, to solve the pubic symphysis-fetal head segmentation task. The method adopts a U-Net-like pure Transformer architecture with bi-level routing attention and skip connections, which effectively learns local-global semantic information. The proposed BRAU-Net was evaluated on transperineal Ultrasound images dataset from the pubic symphysis-fetal head segmentation and angle of progression (FH-PS-AOP) challenge. The results demonstrate that the proposed BRAU-Net achieves comparable a final score. The codes will be available at this https URL.
- [555] arXiv:2311.07189 (replaced) [pdf, html, other]
-
Title: $\Pi_{2}$-Rule Systems and Inductive Classes of G\"{o}del AlgebrasComments: 29 pagesSubjects: Logic (math.LO); Logic in Computer Science (cs.LO)
In this paper we present a general theory of $\Pi_{2}$-rules for systems of intuitionistic and modal logic. We introduce the notions of $\Pi_{2}$-rule system and of an Inductive Class, and provide model-theoretic and algebraic completeness theorems, which serve as our basic tools. As an illustration of the general theory, we analyse the structure of inductive classes of Gödel algebras, from a structure theoretic and logical point of view. We show that unlike other well-studied settings (such as logics, or single-conclusion rule systems), there are continuum many $\Pi_{2}$-rule systems extending $\mathsf{LC}=\mathsf{IPC}+(p\rightarrow q)\vee (q\rightarrow p)$, and show how our methods allow easy proofs of the admissibility of the well-known Takeuti-Titani rule. Our final results concern general questions admissibility in $\mathsf{LC}$: (1) we present a full classification of those inductive classes which are inductively complete, i.e., where all $\Pi_{2}$-rules which are admissible are derivable, and (2) show that the problem of admissibility of $\Pi_{2}$-rules over $\mathsf{LC}$ is decidable.
- [556] arXiv:2401.03060 (replaced) [pdf, other]
-
Title: Super-resolution multi-contrast unbiased eye atlases with deep probabilistic refinementHo Hin Lee, Adam M. Saunders, Michael E. Kim, Samuel W. Remedios, Lucas W. Remedios, Yucheng Tang, Qi Yang, Xin Yu, Shunxing Bao, Chloe Cho, Louise A. Mawn, Tonia S. Rex, Kevin L. Schey, Blake E. Dewey, Jeffrey M. Spraggins, Jerry L. Prince, Yuankai Huo, Bennett A. LandmanComments: Published in SPIE Journal of Medical Imaging (this https URL). 27 pages, 6 figuresJournal-ref: J. Med. Imag. 11(6), 064004 (2024)Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Purpose: Eye morphology varies significantly across the population, especially for the orbit and optic nerve. These variations limit the feasibility and robustness of generalizing population-wise features of eye organs to an unbiased spatial reference.
Approach: To tackle these limitations, we propose a process for creating high-resolution unbiased eye atlases. First, to restore spatial details from scans with a low through-plane resolution compared to a high in-plane resolution, we apply a deep learning-based super-resolution algorithm. Then, we generate an initial unbiased reference with an iterative metric-based registration using a small portion of subject scans. We register the remaining scans to this template and refine the template using an unsupervised deep probabilistic approach that generates a more expansive deformation field to enhance the organ boundary alignment. We demonstrate this framework using magnetic resonance images across four different tissue contrasts, generating four atlases in separate spatial alignments.
Results: For each tissue contrast, we find a significant improvement using the Wilcoxon signed-rank test in the average Dice score across four labeled regions compared to a standard registration framework consisting of rigid, affine, and deformable transformations. These results highlight the effective alignment of eye organs and boundaries using our proposed process.
Conclusions: By combining super-resolution preprocessing and deep probabilistic models, we address the challenge of generating an eye atlas to serve as a standardized reference across a largely variable population. - [557] arXiv:2401.07506 (replaced) [pdf, html, other]
-
Title: SeMaScore : a new evaluation metric for automatic speech recognition tasksComments: Accepted at Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
In this study, we present SeMaScore, generated using a segment-wise mapping and scoring algorithm that serves as an evaluation metric for automatic speech recognition tasks. SeMaScore leverages both the error rate and a more robust similarity score. We show that our algorithm's score generation improves upon the state-of-the-art BERTScore. Our experimental results show that SeMaScore corresponds well with expert human assessments, signal-to-noise ratio levels, and other natural language metrics. We outperform BERTScore by 41x in metric computation speed. Overall, we demonstrate that SeMaScore serves as a more dependable evaluation metric, particularly in real-world situations involving atypical speech patterns.
- [558] arXiv:2401.16922 (replaced) [pdf, html, other]
-
Title: Learning Properties of Quantum States Without the I.I.D. AssumptionComments: 36+10 pages, 7 Figures. Close to the published versionJournal-ref: Nature Communications, 15, Article number: 9677 (2024)Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Probability (math.PR); Statistics Theory (math.ST)
We develop a framework for learning properties of quantum states beyond the assumption of independent and identically distributed (i.i.d.) input states. We prove that, given any learning problem (under reasonable assumptions), an algorithm designed for i.i.d. input states can be adapted to handle input states of any nature, albeit at the expense of a polynomial increase in training data size (aka sample complexity). Importantly, this polynomial increase in sample complexity can be substantially improved to polylogarithmic if the learning algorithm in question only requires non-adaptive, single-copy measurements. Among other applications, this allows us to generalize the classical shadow framework to the non-i.i.d. setting while only incurring a comparatively small loss in sample efficiency. We use rigorous quantum information theory to prove our main results. In particular, we leverage permutation invariance and randomized single-copy measurements to derive a new quantum de Finetti theorem that mainly addresses measurement outcome statistics and, in turn, scales much more favorably in Hilbert space dimension.
- [559] arXiv:2404.04391 (replaced) [pdf, html, other]
-
Title: Adaptive Power Flow Approximations with Second-Order Sensitivity InsightsSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
The power flow equations are fundamental to power system planning, analysis, and control. However, the inherent non-linearity and non-convexity of these equations present formidable obstacles in problem-solving processes. To mitigate these challenges, recent research has proposed adaptive power flow linearizations that aim to achieve accuracy over wide operating ranges. The accuracy of these approximations inherently depends on the curvature of the power flow equations within these ranges, which necessitates considering second-order sensitivities. In this paper, we leverage second-order sensitivities to both analyze and improve power flow approximations. We evaluate the curvature across broad operational ranges and subsequently utilize this information to inform the computation of various sample-based power flow approximation techniques. Additionally, we leverage second-order sensitivities to guide the development of rational approximations that yield linear constraints in optimization problems. This approach is extended to enhance accuracy beyond the limitations of linear functions across varied operational scenarios.
- [560] arXiv:2405.06446 (replaced) [pdf, html, other]
-
Title: Recoloring via modular decompositionComments: 13 pagesSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
The reconfiguration graph of the $k$-colorings of a graph $G$, denoted $R_{k}(G)$, is the graph whose vertices are the $k$-colorings of $G$ and two colorings are adjacent in $R_{k}(G)$ if they differ in color on exactly one vertex. A graph $G$ is said to be recolorable if $R_{\ell}(G)$ is connected for all $\ell \geq \chi(G)$+1. We demonstrate how to use the modular decomposition of a graph class to prove that the graphs in the class are recolorable. In particular, we prove that every ($P_5$, diamond)-free graph, every ($P_5$, house, bull)-free graph, and every ($P_5$, $C_5$, co-fork)-free graph is recolorable.
A graph is prime if it cannot be decomposed by modular decomposition except into single vertices. For a prime graph $H$, we study the complexity of deciding if $H$ is $k$-colorable and the complexity of deciding if there exists a path between two given $k$-colorings in $R_{k}(H)$. Suppose $\mathcal{G}$ is a hereditary class of graphs. We prove that if every blowup of every prime graph in $\mathcal{G}$ is recolorable, then every graph in $\mathcal{G}$ is recolorable. - [561] arXiv:2405.14022 (replaced) [pdf, html, other]
-
Title: I2I-Mamba: Multi-modal medical image synthesis via selective state space modelingComments: 10 pages, 6 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In recent years, deep learning models comprising transformer components have pushed the performance envelope in medical image synthesis tasks. Contrary to convolutional neural networks (CNNs) that use static, local filters, transformers use self-attention mechanisms to permit adaptive, non-local filtering to sensitively capture long-range context. However, this sensitivity comes at the expense of substantial model complexity, which can compromise learning efficacy particularly on relatively modest-sized imaging datasets. Here, we propose a novel adversarial model for multi-modal medical image synthesis, I2I-Mamba, that leverages selective state space modeling (SSM) to efficiently capture long-range context while maintaining local precision. To do this, I2I-Mamba injects channel-mixed Mamba (cmMamba) blocks in the bottleneck of a convolutional backbone. In cmMamba blocks, SSM layers are used to learn context across the spatial dimension and channel-mixing layers are used to learn context across the channel dimension of feature maps. Comprehensive demonstrations are reported for imputing missing images in multi-contrast MRI and MRI-CT protocols. Our results indicate that I2I-Mamba offers superior performance against state-of-the-art CNN- and transformer-based methods in synthesizing target-modality images.
- [562] arXiv:2405.14131 (replaced) [pdf, html, other]
-
Title: Statistical Advantages of Perturbing Cosine Router in Mixture of ExpertsComments: 40 pagesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The cosine router in Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical success, a comprehensive analysis of the cosine router in MoE has been lacking. Considering the least square estimation of the cosine routing MoE, we demonstrate that due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as $\mathcal{O}(1/\log^{\tau}(n))$ where $\tau > 0$ is some constant and $n$ is the sample size. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by the widely used technique in practice to stabilize the cosine router -- simply adding noises to the $L^2$ norms in the cosine router, which we refer to as \textit{perturbed cosine router}. Under the strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing MoE are significantly improved to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.
- [563] arXiv:2405.14703 (replaced) [pdf, other]
-
Title: The categorical contours of the Chomsky-Sch\"utzenberger representation theoremComments: This is a thoroughly revised and expanded version of a paper with a similar title presented at the 38th Conference on the Mathematical Foundations of Programming Semantics (MFPS 2022). On the request of the LMCS referees, an Addendum included in a previous version of this paper but not in the original conference version has been moved into a separate paper (in preparation). arXiv admin note: text overlap with arXiv:2212.09060Subjects: Category Theory (math.CT); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
We develop fibrational perspectives on context-free grammars and on nondeterministic finite-state automata over categories and operads. A generalized CFG is a functor from a free colored operad (aka multicategory) generated by a pointed finite species into an arbitrary base operad: this encompasses classical CFGs by taking the base to be a certain operad constructed from a free monoid, as an instance of a more general construction of an \emph{operad of spliced arrows} $\mathcal{W}\,\mathcal{C}$ for any category $\mathcal{C}$. A generalized NFA is a functor from an arbitrary bipointed category or pointed operad satisfying the unique lifting of factorizations and finite fiber properties: this encompasses classical word automata and tree automata without $\epsilon$-transitions, but also automata over non-free categories and operads. We show that generalized context-free and regular languages satisfy suitable generalizations of many of the usual closure properties, and in particular we give a simple conceptual proof that context-free languages are closed under intersection with regular languages. Finally, we observe that the splicing functor $\mathcal{W} : Cat \to Oper$ admits a left adjoint $\mathcal{C}: Oper \to Cat$, which we call the \emph{contour category} construction since the arrows of $\mathcal{C}\,\mathcal{O}$ have a geometric interpretation as oriented contours of operations of $\mathcal{O}$. A direct consequence of the contour / splicing adjunction is that every pointed finite species induces a universal CFG generating a language of \emph{tree contour words.} This leads us to a generalization of the Chomsky-Schützenberger Representation Theorem, establishing that a subset of a homset $L \subseteq \mathcal{C}(A,B)$ is a CFL of arrows if and only if it is a functorial image of the intersection of a $\mathcal{C}$-chromatic tree contour language with a regular language.
- [564] arXiv:2405.14765 (replaced) [pdf, html, other]
-
Title: A Quantum Speed-Up for Approximating the Top Eigenvectors of a MatrixComments: v2: Main changes are the addition of a quantum lower bound for approximating the top eigenvalue of a matrix, and small improvements in the existing text. 54 pagesSubjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS)
Finding a good approximation of the top eigenvector of a given $d\times d$ matrix $A$ is a basic and important computational problem, with many applications. We give two different quantum algorithms that, given query access to the entries of a Hermitian matrix $A$ and assuming a constant eigenvalue gap, output a classical description of a good approximation of the top eigenvector: one algorithm with time complexity $\mathcal{\tilde{O}}(d^{1.75})$ and one with time complexity $d^{1.5+o(1)}$ (the first algorithm has a slightly better dependence on the $\ell_2$-error of the approximating vector than the second, and uses different techniques of independent interest). Both of our quantum algorithms provide a polynomial speed-up over the best-possible classical algorithm, which needs $\Omega(d^2)$ queries to entries of $A$, and hence $\Omega(d^2)$ time. We extend this to a quantum algorithm that outputs a classical description of the subspace spanned by the top-$q$ eigenvectors in time $qd^{1.5+o(1)}$. We also prove a nearly-optimal lower bound of $\tilde{\Omega}(d^{1.5})$ on the quantum query complexity of approximating the top eigenvector.
Our quantum algorithms run a version of the classical power method that is robust to certain benign kinds of errors, where we implement each matrix-vector multiplication with small and well-behaved error on a quantum computer, in different ways for the two algorithms. Our first algorithm estimates the matrix-vector product one entry at a time, using a new "Gaussian phase estimation" procedure. Our second algorithm uses block-encoding techniques to compute the matrix-vector product as a quantum state, from which we obtain a classical description by a new time-efficient unbiased pure-state tomography procedure. - [565] arXiv:2405.20559 (replaced) [pdf, other]
-
Title: Information-driven design of imaging systemsSubjects: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Image and Video Processing (eess.IV); Data Analysis, Statistics and Probability (physics.data-an)
Most modern imaging systems process the data they capture algorithmically before-or instead of-human viewing. As a result, performance depends not on how interpretable the measurements appear, but how effectively they encode details for algorithmic processing. Information theory provides mathematical tools to analyze this, but developing methods that can handle the complexity of real-world measurements yet remain practical enough for widespread use has proven challenging. We introduce a data-driven approach for estimating the information content of imaging system measurements. Our framework requires only experimental measurements and noise characterization, with no need for ground truth data. We demonstrate that these information estimates reliably predict system performance across diverse imaging modalities, including color photography, radio astronomy, lensless imaging, and label-free microscopy. To automate the process of designing imaging systems that maximize information capture we introduce an optimization technique called Information-Driven Encoder Analysis Learning (IDEAL). The tools we develop in this work unlock information theory as a powerful, practical tool for analyzing and designing imaging systems across a broad range of applications.
A video summarizing this work can be found at this https URL - [566] arXiv:2406.05964 (replaced) [pdf, html, other]
-
Title: Distributionally Robust Safe Sample Elimination under Covariate ShiftHiroyuki Hanada, Tatsuya Aoyama, Satoshi Akahane, Tomonari Tanaka, Yoshito Okura, Yu Inatsu, Noriaki Hashimoto, Shion Takeno, Taro Murayama, Hanju Lee, Shinya Kojima, Ichiro TakeuchiSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We consider a machine learning setup where one training dataset is used to train multiple models across slightly different data distributions. This occurs when customized models are needed for various deployment environments. To reduce storage and training costs, we propose the DRSSS method, which combines distributionally robust (DR) optimization and safe sample screening (SSS). The key benefit of this method is that models trained on the reduced dataset will perform the same as those trained on the full dataset for all possible different environments. In this paper, we focus on covariate shift as a type of data distribution change and demonstrate the effectiveness of our method through experiments.
- [567] arXiv:2406.13292 (replaced) [pdf, html, other]
-
Title: An interpretable generative multimodal neuroimaging-genomics framework for decoding Alzheimer's diseaseGiorgio Dolci (1,2), Federica Cruciani (1), Md Abdur Rahaman (2), Anees Abrol (2), Jiayu Chen (2), Zening Fu (2), Ilaria Boscolo Galazzo (1), Gloria Menegaz (1), Vince D. Calhoun (2) ((1) Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy, (2) Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA)Comments: 28 pages, 8 figures, submitted to a journalSubjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Alzheimer's disease (AD) is the most prevalent form of dementia with a progressive decline in cognitive abilities. The AD continuum encompasses a prodromal stage known as MCI, where patients may either progress to AD (MCIc) or remain stable (MCInc). Understanding AD mechanisms requires complementary analyses relying on different data sources, leading to the development of multimodal DL models. We leveraged structural and functional MRI to investigate the disease-induced GM and functional network connectivity changes. Moreover, considering AD's strong genetic component, we introduced SNPs as a third channel. Missing one or more modalities is a typical concern of multimodal methods. We hence propose a novel DL-based classification framework where a generative module employing Cycle GAN was adopted for imputing missing data in the latent space. Additionally, we adopted an XAI method, Integrated Gradients, to extract features' relevance, enhancing our understanding of the learned representations. Two tasks were addressed: AD detection and MCI conversion prediction. Experimental results showed that our framework reached the SOA in the classification of CN/AD with an average test accuracy of $0.926\pm0.02$. For the MCInc/MCIc task, we achieved an average prediction accuracy of $0.711\pm0.01$ using the pre-trained model for CN and AD. The interpretability analysis revealed that significant GM modulations led the classification performance in cortical and subcortical brain areas well known for their association with AD. Impairments in sensory-motor and visual functional network connectivity along AD, as well as mutations in SNPs defining biological processes linked to endocytosis, amyloid-beta, and cholesterol, were identified as contributors to the results. Overall, our integrative DL model shows promise for AD detection and MCI prediction, while shading light on important biological insights.
- [568] arXiv:2406.14775 (replaced) [pdf, html, other]
-
Title: Machine Learning Global Simulation of Nonlocal Gravity Wave PropagationComments: International Conference on Machine Learning 2024Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph)
Global climate models typically operate at a grid resolution of hundreds of kilometers and fail to resolve atmospheric mesoscale processes, e.g., clouds, precipitation, and gravity waves (GWs). Model representation of these processes and their sources is essential to the global circulation and planetary energy budget, but subgrid scale contributions from these processes are often only approximately represented in models using parameterizations. These parameterizations are subject to approximations and idealizations, which limit their capability and accuracy. The most drastic of these approximations is the "single-column approximation" which completely neglects the horizontal evolution of these processes, resulting in key biases in current climate models. With a focus on atmospheric GWs, we present the first-ever global simulation of atmospheric GW fluxes using machine learning (ML) models trained on the WINDSET dataset to emulate global GW emulation in the atmosphere, as an alternative to traditional single-column parameterizations. Using an Attention U-Net-based architecture trained on globally resolved GW momentum fluxes, we illustrate the importance and effectiveness of global nonlocality, when simulating GWs using data-driven schemes.
- [569] arXiv:2407.20367 (replaced) [pdf, html, other]
-
Title: Mixed Newton Method for Optimization in Complex SpacesNikita Yudin, Roland Hildebrand, Sergey Bakhurin, Alexander Degtyarev, Anna Lisachenko, Ilya Kuruzov, Andrei Semenov, Mohammad AlkousaComments: 16 pages, 7 figures, 6 tablesSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
In this paper, we modify and apply the recently introduced Mixed Newton Method, which is originally designed for minimizing real-valued functions of complex variables, to the minimization of real-valued functions of real variables by extending the functions to complex space. We show that arbitrary regularizations preserve the favorable local convergence properties of the method, and construct a special type of regularization used to prevent convergence to complex minima. We compare several variants of the method applied to training neural networks with real and complex parameters.
- [570] arXiv:2409.08481 (replaced) [pdf, html, other]
-
Title: USTC-TD: A Test Dataset and Benchmark for Image and Video Coding in 2020sZhuoyuan Li, Junqi Liao, Chuanbo Tang, Haotian Zhang, Yuqi Li, Yifan Bian, Xihua Sheng, Xinmin Feng, Yao Li, Changsheng Gao, Li Li, Dong Liu, Feng WuComments: 23 pages. Project Page: this https URLSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Image/video coding has been a remarkable research area for both academia and industry for many years. Testing datasets, especially high-quality image/video datasets are desirable for the justified evaluation of coding-related research, practical applications, and standardization activities. We put forward a test dataset namely USTC-TD, which has been successfully adopted in the practical end-to-end image/video coding challenge of the IEEE International Conference on Visual Communications and lmage Processing (VCIP) in 2022 and 2023. USTC-TD contains 40 images at 4K spatial resolution and 10 video sequences at 1080p spatial resolution, featuring various content due to the diverse environmental factors (e.g. scene type, texture, motion, view) and the designed imaging factors (e.g. illumination, lens, shadow). We quantitatively evaluate USTC-TD on different image/video features (spatial, temporal, color, lightness), and compare it with the previous image/video test datasets, which verifies the wider coverage and more diversity of the proposed dataset. We also evaluate both classic standardized and recent learned image/video coding schemes on USTC-TD with PSNR and MS-SSIM, and provide an extensive benchmark for the evaluated schemes. Based on the characteristics and specific design of the proposed test dataset, we analyze the benchmark performance and shed light on the future research and development of image/video coding. All the data are released online: this https URL .
- [571] arXiv:2410.13986 (replaced) [pdf, html, other]
-
Title: Recurrent Neural Goodness-of-Fit Test for Time SeriesComments: 27 pages, 4 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Time series data are crucial across diverse domains such as finance and healthcare, where accurate forecasting and decision-making rely on advanced modeling techniques. While generative models have shown great promise in capturing the intricate dynamics inherent in time series, evaluating their performance remains a major challenge. Traditional evaluation metrics fall short due to the temporal dependencies and potential high dimensionality of the features. In this paper, we propose the REcurrent NeurAL (RENAL) Goodness-of-Fit test, a novel and statistically rigorous framework for evaluating generative time series models. By leveraging recurrent neural networks, we transform the time series into conditionally independent data pairs, enabling the application of a chi-square-based goodness-of-fit test to the temporal dependencies within the data. This approach offers a robust, theoretically grounded solution for assessing the quality of generative models, particularly in settings with limited time sequences. We demonstrate the efficacy of our method across both synthetic and real-world datasets, outperforming existing methods in terms of reliability and accuracy. Our method fills a critical gap in the evaluation of time series generative models, offering a tool that is both practical and adaptable to high-stakes applications.
- [572] arXiv:2410.14212 (replaced) [pdf, html, other]
-
Title: Comparative Evaluation of Clustered Federated Learning MethodsMichael Ben Ali (IRIT), Omar El-Rifai (IRIT), Imen Megdiche (IRIT, IRIT-SIG, INUC), André Peninou (IRIT, IRIT-SIG, UT2J), Olivier Teste (IRIT-SIG, IRIT, UT2J, UT)Journal-ref: The 2nd IEEE International Conference on Federated Learning Technologies and Applications (FLTA24), Sep 2024, Valencia (Espagne), SpainSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Over recent years, Federated Learning (FL) has proven to be one of the most promising methods of distributed learning which preserves data privacy. As the method evolved and was confronted to various real-world scenarios, new challenges have emerged. One such challenge is the presence of highly heterogeneous (often referred as non-IID) data distributions among participants of the FL protocol. A popular solution to this hurdle is Clustered Federated Learning (CFL), which aims to partition clients into groups where the distribution are homogeneous. In the literature, state-of-the-art CFL algorithms are often tested using a few cases of data heterogeneities, without systematically justifying the choices. Further, the taxonomy used for differentiating the different heterogeneity scenarios is not always straightforward. In this paper, we explore the performance of two state-of-theart CFL algorithms with respect to a proposed taxonomy of data heterogeneities in federated learning (FL). We work with three image classification datasets and analyze the resulting clusters against the heterogeneity classes using extrinsic clustering metrics. Our objective is to provide a clearer understanding of the relationship between CFL performances and data heterogeneity scenarios.
- [573] arXiv:2410.17184 (replaced) [pdf, html, other]
-
Title: Technical Report: Toward Applying Quantum Computing to Network VerificationKahlil Dozier, Justin Beltran, Kylie Berg, Hugo Matousek, Loqman Salamatian, Ethan Katz-Bassett, Dan RubensteinSubjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Network verification (NWV), broadly defined as the verification of properties of distributed protocols used in network systems, cannot be efficiently solved on classical hardware via brute force. Prior work has developed a variety of methods that scale by observing a structure in the search space and then evaluating classes within the search space instead of individual instances. However, even these classification mechanisms have their limitations. In this paper, we consider a radically different approach: applying quantum computing to more efficiently solve NWV problems. We provide an overview of how to map variants of NWV problems into unstructured search problems that can be solved via quantum computing with quadratic speedup, making the approach feasible in theory to problems that are double in size (of the input). Emerging quantum systems cannot yet tackle problems of practical interest, but rapid advances in hardware and algorithm development make now a great time to start thinking about their application. With this in mind, we explore the limits of scale of the problem for which quantum computing can solve NWV problems as unstructured search.
- [574] arXiv:2410.21858 (replaced) [pdf, html, other]
-
Title: Joint Estimation of Conditional Mean and Covariance for Unbalanced PanelsSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
We propose a nonparametric, kernel-based joint estimator for conditional mean and covariance matrices in large unbalanced panels. Our estimator, with proven consistency and finite-sample guarantees, is applied to a comprehensive panel of monthly US stock excess returns from 1962 to 2021, conditioned on macroeconomic and firm-specific covariates. The estimator captures time-varying cross-sectional dependencies effectively, demonstrating robust statistical performance. In asset pricing, it generates conditional mean-variance efficient portfolios with out-of-sample Sharpe ratios that substantially exceed those of equal-weighted benchmarks.
- [575] arXiv:2410.21862 (replaced) [pdf, html, other]
-
Title: Hierarchical mixtures of Unigram models for short text clustering: the role of Beta-Liouville priorsComments: 32 pages, 4 figures. SubmittedSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
This paper presents a variant of the Multinomial mixture model tailored for the unsupervised classification of short text data. Traditionally, the Multinomial probability vector in this hierarchical model is assigned a Dirichlet prior distribution. Here, however, we explore an alternative prior--the Beta-Liouville distribution--which offers a more flexible correlation structure than the Dirichlet. We examine the theoretical properties of the Beta-Liouville distribution, focusing on its conjugacy with the Multinomial likelihood. This property enables the derivation of update equations for a CAVI (Coordinate Ascent Variational Inference) variational algorithm, facilitating the approximate posterior estimation of model parameters. Additionally, we propose a stochastic variant of the CAVI algorithm that enhances scalability. The paper concludes with data examples that demonstrate effective strategies for setting the Beta-Liouville hyperparameters.