Advancing Post-OCR Correction: A Comparative Study of Synthetic Data

Abstract: This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.

Submitted 13 August, 2024; v1 submitted 5 August, 2024; originally announced August 2024.

Comments: ACL 2024 findings

arXiv:2406.04244

Benchmark Data Contamination of Large Language Models: A Survey

Authors: Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi

Abstract: The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.

Submitted 6 June, 2024; originally announced June 2024.

Comments: 31 pages, 7 figures, 3 tables

arXiv:2406.03921

Knowledge Transfer, Knowledge Gaps, and Knowledge Silos in Citation Networks

Authors: Eoghan Cunningham, Derek Greene

Abstract: The advancement of science relies on the exchange of ideas across disciplines and the integration of diverse knowledge domains. However, tracking knowledge flows and interdisciplinary integration in rapidly evolving, multidisciplinary fields remains a significant challenge. This work introduces a novel network analysis framework to study the dynamics of knowledge transfer directly from citation data. By applying dynamic community detection to cumulative, time-evolving citation networks, we can identify research areas as groups of papers sharing knowledge sources and outputs. Our analysis characterises the life-cycles and knowledge transfer patterns of these dynamic communities over time. We demonstrate our approach through a case study of eXplainable Artificial Intelligence (XAI) research, an emerging interdisciplinary field at the intersection of machine learning, statistics, and psychology. Key findings include: (i) knowledge transfer between these important foundational topics and the contemporary topics in XAI research is limited, and the extent of knowledge transfer varies across different contemporary research topics; (ii) certain application domains exist as isolated "knowledge silos"; (iii) significant "knowledge gaps" are identified between related XAI research areas, suggesting opportunities for cross-pollination and improved knowledge integration. By mapping interdisciplinary integration and bridging knowledge gaps, this work can inform strategies to synthesise ideas from disparate sources and drive innovation. More broadly, our proposed framework enables new insights into the evolution of knowledge ecosystems directly from citation data, with applications spanning literature review, research planning, and science policy.

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2403.19011

Sequential Inference of Hospitalization Electronic Health Records Using Probabilistic Models

Authors: Alan D. Kaplan, Priyadip Ray, John D. Greene, Vincent X. Liu

Abstract: In the dynamic hospital setting, decision support can be a valuable tool for improving patient outcomes. Data-driven inference of future outcomes is challenging in this dynamic setting, where long sequences such as laboratory tests and medications are updated frequently. This is due in part to heterogeneity of data types and mixed-sequence types contained in variable length sequences. In this work we design a probabilistic unsupervised model for multiple arbitrary-length sequences contained in hospitalization Electronic Health Record (EHR) data. The model uses a latent variable structure and captures complex relationships between medications, diagnoses, laboratory tests, and neurological assessments. It can be trained on original data, without requiring any lossy transformations or time binning. Inference algorithms are derived that use partial data to infer properties of the complete sequences, including their length and presence of specific values. We train this model on data from subjects receiving medical care in the Kaiser Permanente Northern California integrated healthcare delivery system. The results are evaluated against held-out data for predicting the length of sequences and presence of Intensive Care Unit (ICU) in hospitalization bed sequences. Our method outperforms a baseline approach, showing that in these experiments the trained model captures information in the sequences that is informative of their future values.

Submitted 24 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

arXiv:2401.11198

A Deep Learning Approach for Selective Relevance Feedback

Authors: Suchana Datta, Debasis Ganguly, Sean MacAvaney, Derek Greene

Abstract: Pseudo-relevance feedback (PRF) can enhance average retrieval effectiveness over a sufficiently large number of queries. However, PRF often introduces a drift into the original information need, thus hurting the retrieval effectiveness of several queries. While a selective application of PRF can potentially alleviate this issue, previous approaches have largely relied on unsupervised or feature-based learning to determine whether a query should be expanded. In contrast, we revisit the problem of selective PRF from a deep learning perspective, presenting a model that is entirely data-driven and trained in an end-to-end manner. The proposed model leverages a transformer-based bi-encoder architecture. Additionally, to further improve retrieval effectiveness with this selective PRF approach, we make use of the model's confidence estimates to combine the information from the original and expanded queries. In our experiments, we apply this selective feedback on a number of different combinations of ranking and feedback models, and show that our proposed approach consistently improves retrieval effectiveness for both sparse and dense ranking models, with the feedback models being either sparse, dense or generative.

Submitted 20 January, 2024; originally announced January 2024.

arXiv:2312.11805

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.10463

doi 10.1145/3627673.3679987

RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models

Authors: Dairui Liu, Boming Yang, Honghui Du, Derek Greene, Aonghus Lawlor, Ruihai Dong, Irene Li

Abstract: In the evolving field of personalized news recommendation, understanding the semantics of the underlying data is crucial. Large Language Models (LLMs) like GPT-4 have shown promising performance in understanding natural language. However, the extent of their applicability in news recommendation systems remains to be validated. This paper introduces RecPrompt, the first framework for news recommendation that leverages the capabilities of LLMs through prompt engineering. This system incorporates a prompt optimizer that applies an iterative bootstrapping process, enhancing the LLM-based recommender's ability to align news content with user preferences and interests more effectively. Moreover, this study offers insights into the effective use of LLMs in news recommendation, emphasizing both the advantages and the challenges of incorporating LLMs into recommendation systems.

Submitted 9 August, 2024; v1 submitted 16 December, 2023; originally announced December 2023.

Comments: 8 pages, 3 figures, and 8 tables

arXiv:2309.14984

The Role of Document Embedding in Research Paper Recommender Systems: To Breakdown or to Bolster Disciplinary Borders?

Authors: Eoghan Cunningham, Derek Greene, Barry Smyth

Abstract: In the extensive recommender systems literature, novelty and diversity have been identified as key properties of useful recommendations. However, these properties have received limited attention in the specific sub-field of research paper recommender systems. In this work, we argue for the importance of offering novel and diverse research paper recommendations to scientists. This approach aims to reduce siloed reading, break down filter bubbles, and promote interdisciplinary research. We propose a novel framework for evaluating the novelty and diversity of research paper recommendations that leverages methods from network analysis and natural language processing. Using this framework, we show that the choice of representational method within a larger research paper recommendation system can have a measurable impact on the nature of downstream recommendations, specifically on their novelty and diversity. We introduce a novel paper embedding method, which we demonstrate offers more innovative and diverse recommendations without sacrificing precision, compared to other state-of-the-art baselines.

Submitted 26 September, 2023; originally announced September 2023.

Comments: Under Review at Scientometrics

arXiv:2308.08697

doi 10.1145/3617233.3617255

Handwriting Analysis on the Diaries of Rosamond Jacob

Authors: Sharmistha S. Sawant, Saloni D. Thakare, Derek Greene, Gerardine Meaney, Alan F. Smeaton

Abstract: Handwriting is an art form that most people learn at an early age. Each person's writing style is unique with small changes as we grow older and as our mood changes. Here we analyse handwritten text in a culturally significant personal diary. We compare changes in handwriting and relate this to the sentiment of the written material and to the topic of diary entries. We identify handwritten text from digitised images and generate a canonical form for words using shape matching to compare how the same handwritten word appears over a period of time. For determining the sentiment of diary entries, we use the Hedonometer, a dictionary-based approach to scoring sentiment. We apply these techniques to the historical diary entries of Rosamond Jacob (1888-1960), an Irish writer and political activist whose daily diary entries report on the major events in Ireland during the first half of the last century.

Submitted 16 August, 2023; originally announced August 2023.

Comments: International Conference on Content-based Multimedia Indexing, September 20--22, 2023, Orleans, France

arXiv:2306.08020

doi 10.1007/978-3-030-36599-8_31

Curatr: A Platform for Semantic Analysis and Curation of Historical Literary Texts

Authors: Susan Leavy, Gerardine Meaney, Karen Wade, Derek Greene

Abstract: The increasing availability of digital collections of historical and contemporary literature presents a wealth of possibilities for new research in the humanities. The scale and diversity of such collections, however, present particular challenges in identifying and extracting relevant content. This paper presents Curatr, an online platform for the exploration and curation of literature with machine learning-supported semantic search, designed within the context of digital humanities scholarship. The platform provides a text mining workflow that combines neural word embeddings with expert domain knowledge to enable the generation of thematic lexicons, allowing researchers to curate relevant sub-corpora from a large corpus of 18th and 19th century digitised texts.

Submitted 13 June, 2023; originally announced June 2023.

Comments: 12 pages

Journal ref: Metadata and Semantic Research (MTSR 2019), Communications in Computer and Information Science, vol 1057. Springer, Cham

arXiv:2306.07506

doi 10.1145/3680295

Topic-Centric Explanations for News Recommendation

Authors: Dairui Liu, Derek Greene, Irene Li, Xuefei Jiang, Ruihai Dong

Abstract: News recommender systems (NRS) have been widely applied for online news websites to help users find relevant articles based on their interests. Recent methods have demonstrated considerable success in terms of recommendation performance. However, the lack of explanation for these recommendations can lead to mistrust among users and lack of acceptance of recommendations. To address this issue, we propose a new explainable news model to construct a topic-aware explainable recommendation approach that can both accurately identify relevant articles and explain why they have been recommended, using information from associated topics. Additionally, our model incorporates two coherence metrics applied to assess topic quality, providing a measure of the interpretability of these explanations. The results of our experiments on the MIND dataset indicate that the proposed explainable NRS outperforms several other baseline systems, while it is also capable of producing interpretable topics compared to those generated by a classical LDA topic model. Furthermore, we present a case study through a real-world example showcasing the usefulness of our NRS for generating explanations.

Submitted 6 October, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

Comments: 20 pages

Journal ref: ACM Trans. Recomm. Syst. 1, 1, Article 1 (January 2024), 26 pages.

arXiv:2304.00310

On the Feasibility and Robustness of Pointwise Evaluation of Query Performance Prediction

Authors: Suchana Datta, Debasis Ganguly, Derek Greene, Mandar Mitra

Abstract: Despite the retrieval effectiveness of queries being mutually independent of one another, the evaluation of query performance prediction (QPP) systems has been carried out by measuring rank correlation over an entire set of queries. Such a listwise approach has a number of disadvantages, notably that it does not support the common requirement of assessing QPP for individual queries. In this paper, we propose a pointwise QPP framework that allows us to evaluate the quality of a QPP system for individual queries by measuring the deviation between each prediction and the corresponding true value, and then aggregating the results over a set of queries. Our experiments demonstrate that this new approach leads to smaller variances in QPP evaluations across a range of different target metrics and retrieval models.

Submitted 1 April, 2023; originally announced April 2023.

arXiv:2303.08954

PRESTO: A Multilingual Dataset for Parsing Realistic Task-Oriented Dialogs

Authors: Rahul Goel, Waleed Ammar, Aditya Gupta, Siddharth Vashishtha, Motoki Sano, Faiz Surani, Max Chang, HyunJeong Choe, David Greene, Kyle He, Rattima Nitisaroj, Anna Trukhina, Shachi Paul, Pararth Shah, Rushin Shah, Zhou Yu

Abstract: Research interest in task-oriented dialogs has increased as systems such as Google Assistant, Alexa and Siri have become ubiquitous in everyday life. However, the impact of academic research in this area has been limited by the lack of datasets that realistically capture the wide array of user pain points. To enable research on some of the more challenging aspects of parsing realistic conversations, we introduce PRESTO, a public dataset of over 550K contextual multilingual conversations between humans and virtual assistants. PRESTO contains a diverse array of challenges that occur in real-world NLU tasks such as disfluencies, code-switching, and revisions. It is the only large-scale human-generated conversational parsing dataset that provides structured context such as a user's contacts and lists for each example. Our mT5 model-based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model, which is further pronounced in a low-resource setup.

Submitted 16 March, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: PRESTO v1 Release

arXiv:2302.01826

doi 10.1145/3543873.3587570

Graph Embedding for Mapping Interdisciplinary Research Networks

Authors: Eoghan Cunningham, Derek Greene

Abstract: Representation learning is the first step in automating tasks such as research paper recommendation, classification, and retrieval. Due to the accelerating rate of research publication, together with the recognised benefits of interdisciplinary research, systems that facilitate researchers in discovering and understanding relevant works from beyond their immediate school of knowledge are vital. This work explores different methods of research paper representation (or document embedding), to identify those methods that are capable of preserving the interdisciplinary implications of research papers in their embeddings. In addition to evaluating state of the art methods of document embedding in an interdisciplinary citation prediction task, we propose a novel Graph Neural Network architecture designed to preserve the key interdisciplinary implications of research articles in citation network node embeddings. Our proposed method outperforms other GNN-based methods in interdisciplinary citation prediction, without compromising overall citation prediction performance.

Submitted 20 March, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

arXiv:2212.08733

Counterfactual Explanations for Misclassified Images: How Human and Machine Explanations Differ

Authors: Eoin Delaney, Arjun Pakrashi, Derek Greene, Mark T. Keane

Abstract: Counterfactual explanations have emerged as a popular solution for the eXplainable AI (XAI) problem of elucidating the predictions of black-box deep-learning systems due to their psychological validity, flexibility across problem domains and proposed legal compliance. While over 100 counterfactual methods exist, claiming to generate plausible explanations akin to those preferred by people, few have actually been tested on users ($\sim7\%$). So, the psychological validity of these counterfactual algorithms for effective XAI for image data is not established. This issue is addressed here using a novel methodology that (i) gathers ground truth human-generated counterfactual explanations for misclassified images, in two user studies and, then, (ii) compares these human-generated ground-truth explanations to computationally-generated explanations for the same misclassifications. Results indicate that humans do not "minimally edit" images when generating counterfactual explanations. Instead, they make larger, "meaningful" edits that better approximate prototypes in the counterfactual class.

Submitted 16 December, 2022; originally announced December 2022.

arXiv:2206.03159

The Structure of Interdisciplinary Science: Uncovering and Explaining Roles in Citation Graphs

Authors: Eoghan Cunningham, Derek Greene

Abstract: Role discovery is the task of dividing the set of nodes on a graph into classes of structurally similar roles. Modern strategies for role discovery typically rely on graph embedding techniques, which are capable of recognising complex local structures. However, when working with large, real-world networks, it is difficult to interpret or validate a set of roles identified according to these methods. In this work, motivated by advancements in the field of explainable artificial intelligence (XAI), we propose a new framework for interpreting role assignments on large graphs using small subgraph structures known as graphlets. We demonstrate our methods on a large, multidisciplinary citation network, where we successfully identify a number of important citation patterns which reflect interdisciplinary research.

Submitted 7 June, 2022; originally announced June 2022.

Comments: submitted to the international conference on complex networks and their applications

arXiv:2204.07292

Unsupervised Probabilistic Models for Sequential Electronic Health Records

Authors: Alan D. Kaplan, John D. Greene, Vincent X. Liu, Priyadip Ray

Abstract: We develop an unsupervised probabilistic model for heterogeneous Electronic Health Record (EHR) data. Utilizing a mixture model formulation, our approach directly models sequences of arbitrary length, such as medications and laboratory results. This allows for subgrouping and incorporation of the dynamics underlying heterogeneous data types. The model consists of a layered set of latent variables that encode underlying structure in the data. These variables represent subject subgroups at the top layer, and unobserved states for sequences in the second layer. We train this model on episodic data from subjects receiving medical care in the Kaiser Permanente Northern California integrated healthcare delivery system. The resulting properties of the trained model generate novel insight from these complex and multifaceted data. In addition, we show how the model can be used to analyze sequences that contribute to assessment of mortality likelihood.

Submitted 31 August, 2022; v1 submitted 14 April, 2022; originally announced April 2022.

arXiv:2203.12504

Author Multidisciplinarity and Disciplinary Roles in Field of Study Networks

Authors: Eoghan Cunningham, Barry Smyth, Derek Greene

Abstract: When studying large research corpora, "distant reading" methods are vital to understand the topics and trends in the corresponding research space. In particular, given the recognised benefits of multidisciplinary research, it may be important to map schools or communities of diverse research topics, and to understand the multidisciplinary role that topics play within and between these communities. This work proposes Field of Study (FoS) networks as a novel network representation for use in scientometric analysis. We describe the formation of FoS networks, which relate research topics according to the authors who publish in them, from corpora of articles in which fields of study can be identified. FoS networks are particularly useful for the distant reading of large datasets of research papers when analysed through the lens of exploring multidisciplinary science. In an evolving scientific landscape, modular communities in FoS networks offer an alternative categorisation strategy for research topics and sub-disciplines, when compared to traditional prescribed discipline classification schemes. Furthermore, structural role analysis of FoS networks can highlight important characteristics of topics in such communities. To support this, we present two case studies which explore multidisciplinary research in corpora of varying size and scope; namely, 6,323 articles relating to network science research and 4,184,011 articles relating to research on the COVID-19 pandemic.

Submitted 23 March, 2022; originally announced March 2022.

arXiv:2203.12455 [pdf, other]

doi 10.1145/3487553.3524653

Assessing Network Representations for Identifying Interdisciplinarity

Authors: Eoghan Cunningham, Derek Greene

Abstract: Many studies have sought to identify interdisciplinary research as a function of the diversity of disciplines identified in an article's references or citations. However, given the constant evolution of the scientific landscape, disciplinary boundaries are shifting and blurring, making it increasingly difficult to describe research within a strict taxonomy. In this work, we explore the potential for graph learning methods to learn embedded representations for research papers that encode their 'interdisciplinarity' in a citation network. This facilitates the identification of interdisciplinary research without the use of disciplinary categories. We evaluate these representations and their ability to identify interdisciplinary research, according to their utility in interdisciplinary citation prediction. We find that those representations which preserve structural equivalence in the citation graph are best able to predict distant, interdisciplinary interactions in the network, according to multiple definitions of citation distance.

Submitted 8 April, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

arXiv:2203.07216 [pdf, other]

doi 10.18653/v1/2022.findings-acl.178

A Novel Perspective to Look At Attention: Bi-level Attention-based Explainable Topic Modeling for News Classification

Authors: Dairui Liu, Derek Greene, Ruihai Dong

Abstract: Many recent deep learning-based solutions have widely adopted the attention-based mechanism in various tasks of the NLP discipline. However, the inherent characteristics of deep learning models and the flexibility of the attention mechanism increase the models' complexity, thus leading to challenges in model explainability. In this paper, to address this challenge, we propose a novel practical framework by utilizing a two-tier attention architecture to decouple the complexity of explanation and the decision-making process. We apply it in the context of a news article classification task. The experiments on two large-scale news corpora demonstrate that the proposed model can achieve competitive performance with many state-of-the-art alternatives and illustrate its appropriateness from an explainability perspective.

Submitted 27 October, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

Comments: Findings of ACL2022

arXiv:2202.07376 [pdf, other]

Deep-QPP: A Pairwise Interaction-based Deep Learning Model for Supervised Query Performance Prediction

Authors: Suchana Datta, Debasis Ganguly, Derek Greene, Mandar Mitra

Abstract: Motivated by the recent success of end-to-end deep neural models for ranking tasks, we present here a supervised end-to-end neural approach for query performance prediction (QPP). In contrast to unsupervised approaches that rely on various statistics of document score distributions, our approach is entirely data-driven. Further, in contrast to weakly supervised approaches, our method also does not rely on the outputs from different QPP estimators. In particular, our model leverages information from the semantic interactions between the terms of a query and those in the top-documents retrieved with it. The architecture of the model comprises multiple layers of 2D convolution filters followed by a feed-forward layer of parameters. Experiments on standard test collections demonstrate that our proposed supervised approach outperforms other state-of-the-art supervised and unsupervised approaches.

Submitted 15 February, 2022; originally announced February 2022.

arXiv:2202.06306 [pdf, ps, other]

An Analysis of Variations in the Effectiveness of Query Performance Prediction

Authors: Debasis Ganguly, Suchana Datta, Mandar Mitra, Derek Greene

Abstract: A query performance predictor estimates the retrieval effectiveness of an IR system for a given query. An important characteristic of QPP evaluation is that, since the ground truth retrieval effectiveness for QPP evaluation can be measured with different metrics, the ground truth itself is not absolute, which is in contrast to other retrieval tasks, such as that of ad-hoc retrieval. Motivated by this argument, the objective of this paper is to investigate how such variances in the ground truth for QPP evaluation can affect the outcomes of QPP experiments. We consider this not only in terms of the absolute values of the evaluation metrics being reported (e.g. Pearson's $r$, Kendall's $\tau$), but also with respect to the changes in the ranks of different QPP systems when ordered by the QPP metric scores. Our experiments reveal that the observed QPP outcomes can vary considerably, both in terms of the absolute evaluation metric values and also in terms of the relative system ranks. Through our analysis, we report the optimal combinations of QPP evaluation metric and experimental settings that are likely to lead to smaller variations in the observed results.

Submitted 13 February, 2022; originally announced February 2022.

arXiv:2108.13370 [pdf, other]

doi 10.1057/s41599-021-00922-7

Collaboration in the Time of COVID: A Scientometric Analysis of Multidisciplinary SARS-CoV-2 Research

Authors: Eoghan Cunningham, Barry Smyth, Derek Greene

Abstract: The novel coronavirus SARS-CoV-2 and the COVID-19 illness it causes have inspired unprecedented levels of multidisciplinary research in an effort to address a generational public health challenge. In this work we conduct a scientometric analysis of COVID-19 research, paying particular attention to the nature of collaboration that this pandemic has fostered among different disciplines. Increased multidisciplinary collaboration has been shown to produce greater scientific impact, albeit with higher co-ordination costs. As such, we consider a collection of over 166,000 COVID-19-related articles to assess the scale and diversity of collaboration in COVID-19 research, which we compare to non-COVID-19 controls before and during the pandemic. We show that COVID-19 research teams are not only significantly smaller than their non-COVID-19 counterparts, but they are also more diverse. Furthermore, we find that COVID-19 research has increased the multidisciplinarity of authors across most scientific fields of study, indicating that COVID-19 has helped to remove some of the barriers that usually exist between disparate disciplines. Finally, we highlight a number of interesting areas of multidisciplinary research during COVID-19, and propose methodologies for visualising the nature of multidisciplinary collaboration, which may have application beyond this pandemic.

Submitted 30 August, 2021; originally announced August 2021.

Comments: Submitted to Humanities and Social Sciences Communications: accepted pending minor revisions

Journal ref: Humanit Soc Sci Commun 8, 240 (2021)

arXiv:2107.09734 [pdf, other]

Uncertainty Estimation and Out-of-Distribution Detection for Counterfactual Explanations: Pitfalls and Solutions

Authors: Eoin Delaney, Derek Greene, Mark T. Keane

Abstract: Whilst an abundance of techniques have recently been proposed to generate counterfactual explanations for the predictions of opaque black-box systems, markedly less attention has been paid to exploring the uncertainty of these generated explanations. This becomes a critical issue in high-stakes scenarios, where uncertain and misleading explanations could have dire consequences (e.g., medical diagnosis and treatment planning). Moreover, it is often difficult to determine if the generated explanations are well grounded in the training data and sensitive to distributional shifts. This paper proposes several practical solutions that can be leveraged to solve these problems by establishing novel connections with other research works in explainability (e.g., trust scores) and uncertainty estimation (e.g., Monte Carlo Dropout). Two experiments demonstrate the utility of our proposed solutions.

Submitted 20 July, 2021; originally announced July 2021.

Journal ref: ICML Workshop on Algorithmic Recourse, July 2021

arXiv:2104.14461 [pdf]

Twin Systems for DeepCBR: A Menagerie of Deep Learning and Case-Based Reasoning Pairings for Explanation and Data Augmentation

Authors: Mark T Keane, Eoin M Kenny, Mohammed Temraz, Derek Greene, Barry Smyth

Abstract: Recently, it has been proposed that fruitful synergies may exist between Deep Learning (DL) and Case Based Reasoning (CBR); that there are insights to be gained by applying CBR ideas to problems in DL (what could be called DeepCBR). In this paper, we report on a program of research that applies CBR solutions to the problem of Explainable AI (XAI) in DL. We describe a series of twin-systems pairings of opaque DL models with transparent CBR models that allow the latter to explain the former using factual, counterfactual and semi-factual explanation strategies. This twinning shows that functional abstractions of DL (e.g., feature weights, feature importance and decision boundaries) can be used to drive these explanatory solutions. We also raise the prospect that this research applies to the problem of Data Augmentation in DL, underscoring the fecundity of these DeepCBR ideas.

Submitted 13 June, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

Comments: 7 pages, 4 figures, 2 tables

Journal ref: IJCAI-21 Workshop on DL-CBR-AML, July 2021

arXiv:2009.13211 [pdf, other]

Instance-based Counterfactual Explanations for Time Series Classification

Authors: Eoin Delaney, Derek Greene, Mark T. Keane

Abstract: In recent years, there has been a rapidly expanding focus on explaining the predictions made by black-box AI systems that handle image and tabular data. However, considerably less attention has been paid to explaining the predictions of opaque AI systems handling time series data. In this paper, we advance a novel model-agnostic, case-based technique -- Native Guide -- that generates counterfactual explanations for time series classifiers. Given a query time series, $T_{q}$, for which a black-box classification system predicts class, $c$, a counterfactual time series explanation shows how $T_{q}$ could change, such that the system predicts an alternative class, $c'$. The proposed instance-based technique adapts existing counterfactual instances in the case-base by highlighting and modifying discriminative areas of the time series that underlie the classification. Quantitative and qualitative results from two comparative experiments indicate that Native Guide generates plausible, proximal, sparse and diverse explanations that are better than those produced by key benchmark counterfactual methods.

Submitted 24 June, 2021; v1 submitted 28 September, 2020; originally announced September 2020.

arXiv:2008.05223 [pdf, other]

Bone Segmentation in Contrast Enhanced Whole-Body Computed Tomography

Authors: Patrick Leydon, Martin O'Connell, Derek Greene, Kathleen M Curran

Abstract: Segmentation of bone regions allows for enhanced diagnostics, disease characterisation and treatment monitoring in CT imaging. In contrast enhanced whole-body scans accurate automatic segmentation is particularly difficult as low dose whole body protocols reduce image quality and make contrast enhanced regions more difficult to separate when relying on differences in pixel intensities. This paper outlines a U-net architecture with novel preprocessing techniques, based on the windowing of training data and the modification of sigmoid activation threshold selection to successfully segment bone-bone marrow regions from low dose contrast enhanced whole-body CT scans. The proposed method achieved mean Dice coefficients of 0.979, 0.965, and 0.934 on two internal datasets and one external test dataset respectively. We have demonstrated that appropriate preprocessing is important for differentiating between bone and contrast dye, and that excellent results can be achieved with limited data.

Submitted 13 August, 2020; v1 submitted 12 August, 2020; originally announced August 2020.

Comments: 15 pages, 10 figures and 3 tables. Submitted to The Journal of Physics in Medicine and Biology for possible publication

arXiv:2005.06898 [pdf, other]

Mitigating Gender Bias in Machine Learning Data Sets

Authors: Susan Leavy, Gerardine Meaney, Karen Wade, Derek Greene

Abstract: Artificial Intelligence has the capacity to amplify and perpetuate societal biases and presents profound ethical implications for society. Gender bias has been identified in the context of employment advertising and recruitment tools, due to their reliance on underlying language processing and recommendation algorithms. Attempts to address such issues have involved testing learned associations, integrating concepts of fairness to machine learning and performing more rigorous analysis of training data. Mitigating bias when algorithms are trained on textual data is particularly challenging given the complex way gender ideology is embedded in language. This paper proposes a framework for the identification of gender bias in training data for machine learning. The work draws upon gender theory and sociolinguistics to systematically indicate levels of bias in textual training data and associated neural word embedding models, thus highlighting pathways for both removing bias from training data and critically assessing its impact.

Submitted 18 May, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

Comments: 10 pages, 5 figures, 5 tables. Presented at the Bias2020 workshop (as part of the ECIR Conference) - http://bias.disim.univaq.it

arXiv:1908.05192 [pdf, other]

Temporal Analysis of Reddit Networks via Role Embeddings

Authors: Siobhan Grayson, Derek Greene

Abstract: Inspired by diachronic word analysis from the field of natural language processing, we propose an approach for uncovering temporal insights regarding user roles from social networks using graph embedding methods. Specifically, we apply the role embedding algorithm, struc2vec, to a collection of social networks exhibiting either "loyal" or "vagrant" characteristics derived from the popular online social news aggregation website Reddit. For each subreddit, we extract nine months of data and create network role embeddings on consecutive time windows. We are then able to compare and contrast how user roles change over time by aligning the resulting temporal embedding spaces. In particular, we analyse temporal role embeddings from an individual and a community-level perspective for both loyal and vagrant communities present on Reddit.

Submitted 14 August, 2019; originally announced August 2019.

arXiv:1810.05511 [pdf, other]

Semi-Supervised Overlapping Community Finding based on Label Propagation with Pairwise Constraints

Authors: Elham Alghamdi, Derek Greene

Abstract: Algorithms for detecting communities in complex networks are generally unsupervised, relying solely on the structure of the network. However, these methods can often fail to uncover meaningful groupings that reflect the underlying communities in the data, particularly when those structures are highly overlapping. One way to improve the usefulness of these algorithms is by incorporating additional background information, which can be used as a source of constraints to direct the community detection process. In this work, we explore the potential of semi-supervised strategies to improve algorithms for finding overlapping communities in networks. Specifically, we propose a new method, based on label propagation, for finding communities using a limited number of pairwise constraints. Evaluations on synthetic and real-world datasets demonstrate the potential of this approach for uncovering meaningful community structures in cases where each node can potentially belong to more than one community.

Submitted 21 November, 2018; v1 submitted 12 October, 2018; originally announced October 2018.

Comments: Fix tables

arXiv:1810.03046 [pdf, other]

MeetupNet Dublin: Discovering Communities in Dublin's Meetup Network

Authors: Arjun Pakrashi, Elham Alghamdi, Brian Mac Namee, Derek Greene

Abstract: Meetup.com is a global online platform which facilitates the organisation of meetups in different parts of the world. A meetup group typically focuses on one specific topic of interest, such as sports, music, language, or technology. However, many users of this platform attend multiple meetups. On this basis, we can construct a co-membership network for a given location. This network encodes how pairs of meetups are connected to one another via common members. In this work we demonstrate that, by applying techniques from social network analysis to this type of representation, we can reveal the underlying meetup community structure, which is not immediately apparent from the platform's website. Specifically, we map the landscape of Dublin's meetup communities, to explore the interests and activities of meetup.com users in the city.

Submitted 2 November, 2018; v1 submitted 6 October, 2018; originally announced October 2018.

arXiv:1710.05212 [pdf]

On Supporting Digital Journalism: Case Studies in Co-Designing Journalistic Tools

Authors: Georgiana Ifrim, Derek Greene, Mark T. Keane, Claudia Orellana-Rodriguez, Bichen Shi, Gevorg Poghosyan

Abstract: Since 2013 researchers at University College Dublin in the Insight Centre for Data Analytics have been involved in a significant research programme in digital journalism, specifically targeting tools and social media guidelines to support the work of journalists. Most of this programme was undertaken in collaboration with The Irish Times. This collaboration involved identifying key problems currently faced by digital journalists, developing tools as solutions to these problems, and then iteratively co-designing these tools with feedback from journalists. This paper reports on our experiences and learnings from this research programme, with a view to informing similar efforts in the future.

Submitted 14 October, 2017; originally announced October 2017.

Comments: Computation + Journalism Symposium (C+J 2017), October 2017, Northwestern University, Evanston, IL USA

Journal ref: Computation + Journalism Symposium (C+J 2017), October 2017, Northwestern University, Evanston, IL USA

arXiv:1702.07186 [pdf, other]

Stability of Topic Modeling via Matrix Factorization

Authors: Mark Belford, Brian Mac Namee, Derek Greene

Abstract: Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, which can potentially lead to different results being generated on the same corpus when using the same parameter values. This corresponds to the concept of "instability" which has previously been studied in the context of $k$-means clustering. In many applications of topic modeling, this problem of instability is not considered and topic models are treated as being definitive, even though the results may change considerably if the initialization process is altered. In this paper we demonstrate the inherent instability of popular topic modeling approaches, using a number of new measures to assess stability. To address this issue in the context of matrix factorization for topic modeling, we propose the use of ensemble learning strategies. Based on experiments performed on annotated text corpora, we show that a K-Fold ensemble strategy, combining both ensembles and structured initialization, can significantly reduce instability, while simultaneously yielding more accurate topic models.

Submitted 9 September, 2017; v1 submitted 23 February, 2017; originally announced February 2017.

arXiv:1702.06891 [pdf, other]

EVE: Explainable Vector Based Embedding Technique Using Wikipedia

Authors: M. Atif Qureshi, Derek Greene

Abstract: We present an unsupervised explainable word embedding technique, called EVE, which is built upon the structure of Wikipedia. The proposed model defines the dimensions of a semantic vector representing a word using human-readable labels, thereby making it readily interpretable. Specifically, each vector is constructed using the Wikipedia category graph structure together with the Wikipedia article link structure. To test the effectiveness of the proposed word embedding model, we consider its usefulness in three fundamental tasks: 1) intruder detection - to evaluate its ability to identify a non-coherent vector from a list of coherent vectors, 2) ability to cluster - to evaluate its tendency to group related vectors together while keeping unrelated vectors in separate clusters, and 3) sorting relevant items first - to evaluate its ability to rank vectors (items) relevant to the query in the top order of the result. For each task, we also propose a strategy to generate a task-specific human-interpretable explanation from the model. These demonstrate the overall effectiveness of the explainable embeddings generated by EVE. Finally, we compare EVE with the Word2Vec, FastText, and GloVe embedding techniques across the three tasks, and report improvements over the state-of-the-art.

Submitted 22 February, 2017; originally announced February 2017.

arXiv:1607.03055 [pdf, other]

Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach

Authors: Derek Greene, James P. Cross

Abstract: This study analyzes the political agenda of the European Parliament (EP) plenary, how it has evolved over time, and the manner in which Members of the European Parliament (MEPs) have reacted to external and internal stimuli when making plenary speeches. To unveil the plenary agenda and detect latent themes in legislative speeches over time, MEP speech content is analyzed using a new dynamic topic modeling method based on two layers of Non-negative Matrix Factorization (NMF). This method is applied to a new corpus of all English language legislative speeches in the EP plenary from the period 1999-2014. Our findings suggest that two-layer NMF is a valuable alternative to existing dynamic topic modeling approaches found in the literature, and can unveil niche topics and associated vocabularies not captured by existing methods. Substantively, our findings suggest that the political agenda of the EP evolves significantly over time and reacts to exogenous events such as EU Treaty referenda and the emergence of the Euro-crisis. MEP contributions to the plenary agenda are also found to be impacted upon by voting behaviour and the committee structure of the Parliament.

Submitted 11 July, 2016; originally announced July 2016.

Comments: Long version including appendix. arXiv admin note: substantial text overlap with arXiv:1505.07302

arXiv:1601.02975 [pdf, other]

Indicators of Good Student Performance in Moodle Activity Data

Authors: Ewa Młynarska, Derek Greene, Pádraig Cunningham

Abstract: In this paper we conduct an analysis of Moodle activity data focused on identifying early predictors of good student performance. The analysis shows that three relevant hypotheses are largely supported by the data. These hypotheses are: early submission is a good sign, a high level of activity is predictive of good results and evening activity is even better than daytime activity. We highlight some pathological examples where high levels of activity correlate with bad results.

Submitted 12 January, 2016; originally announced January 2016.

Comments: Short version

arXiv:1508.01067 [pdf, other]

Topic Stability over Noisy Sources

Authors: Jing Su, Oisín Boydell, Derek Greene, Gerard Lynch

Abstract: Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise will have diverse effects on the stability of different topic models. From these observations, we propose guidelines for text corpus generation, with a focus on automatic speech transcription. We also suggest topic model selection methods for noisy corpora.

Submitted 5 August, 2015; originally announced August 2015.

arXiv:1505.07302 [pdf, other]

Unveiling the Political Agenda of the European Parliament Plenary: A Topical Analysis

Authors: Derek Greene, James P. Cross

Abstract: This study analyzes political interactions in the European Parliament (EP) by considering how the political agenda of the plenary sessions has evolved over time and the manner in which Members of the European Parliament (MEPs) have reacted to external and internal stimuli when making Parliamentary speeches. It does so by considering the context in which speeches are made, and the content of those speeches. To detect latent themes in legislative speeches over time, speech content is analyzed using a new dynamic topic modeling method, based on two layers of matrix factorization. This method is applied to a new corpus of all English language legislative speeches in the EP plenary from the period 1999-2014. Our findings suggest that the political agenda of the EP has evolved significantly over time, is impacted upon by the committee structure of the Parliament, and reacts to exogenous events such as EU Treaty referenda and the emergence of the Euro-crisis, which have a significant impact on what is being discussed in Parliament.

Submitted 7 July, 2015; v1 submitted 27 May, 2015; originally announced May 2015.

Comments: Add link to implementation code on Github

arXiv:1502.04609 [pdf, other]

TextLuas: Tracking and Visualizing Document and Term Clusters in Dynamic Text Data

Authors: Derek Greene, Daniel Archambault, Václav Belák, Pádraig Cunningham

Abstract: For large volumes of text data collected over time, a key knowledge discovery task is identifying and tracking clusters. These clusters may correspond to emerging themes, popular topics, or breaking news stories in a corpus. Therefore, recently there has been increased interest in the problem of clustering dynamic data. However, there exists little support for the interactive exploration of the output of these analysis techniques, particularly in cases where researchers wish to simultaneously explore both the change in cluster structure over time and the change in the textual content associated with clusters. In this paper, we propose a model for tracking dynamic clusters characterized by the evolutionary events of each cluster. Motivated by this model, the TextLuas system provides an implementation for tracking these dynamic clusters and visualizing their evolution using a metro map metaphor. To provide overviews of cluster content, we adapt the tag cloud representation to the dynamic clustering scenario. We demonstrate the TextLuas system on two different text corpora, where they are shown to elucidate the evolution of key themes. We also describe how TextLuas was applied to a problem in bibliographic network research.

Submitted 3 November, 2014; originally announced February 2015.

Comments: 21 page version

arXiv:1407.7736 [pdf, ps, other]

A Latent Space Analysis of Editor Lifecycles in Wikipedia

Authors: Xiangju Qin, Derek Greene, Pádraig Cunningham

Abstract: Collaborations such as Wikipedia are a key part of the value of the modern Internet. At the same time there is concern that these collaborations are threatened by high levels of member turnover. In this paper we borrow ideas from topic analysis to map editor activity on Wikipedia over time into a latent space that offers an insight into the evolving patterns of editor behavior. This latent space representation reveals a number of different categories of editor (e.g. content experts, social networkers) and we show that it does provide a signal that predicts an editor's departure from the community. We also show that long term editors gradually diversify their participation by shifting edit preference from one or two namespaces to multiple namespaces and experience relatively soft evolution in their editor profiles, while short term editors generally distribute their contribution randomly among the namespaces and experience considerably fluctuating evolution in their editor profiles.

Submitted 29 July, 2014; originally announced July 2014.

Comments: 16 pages, In Proc. of 5th International Workshop on Mining Ubiquitous and Social Environments (MUSE) at ECML/PKDD 2014

arXiv:1404.4606 [pdf, other]

How Many Topics? Stability Analysis for Topic Models

Authors: Derek Greene, Derek O'Callaghan, Pádraig Cunningham

Abstract: Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.

Submitted 19 June, 2014; v1 submitted 16 April, 2014; originally announced April 2014.

Comments: Improve readability of plots. Add minor clarifications

arXiv:1403.2923 [pdf, ps, other]

Adaptive Representations for Tracking Breaking News on Twitter

Authors: Igor Brigadir, Derek Greene, Pádraig Cunningham

Abstract: Twitter is often the most up-to-date source for finding and tracking breaking news stories. Therefore, there is considerable interest in developing filters for tweet streams in order to track and summarize stories. This is a non-trivial text analytics task as tweets are short, and standard retrieval methods often fail as stories evolve over time. In this paper we examine the effectiveness of adaptive mechanisms for tracking and summarizing breaking news stories. We evaluate the effectiveness of these mechanisms on a number of recent news events for which manually curated timelines are available. Assessments based on ROUGE metrics indicate that adaptive approaches are best suited for tracking evolving stories on Twitter.

Submitted 28 November, 2014; v1 submitted 12 March, 2014; originally announced March 2014.

Comments: 8 pages

ACM Class: I.5.4; I.5.1; H.3.3

arXiv:1401.7535 [pdf, other]

Online Social Media in the Syria Conflict: Encompassing the Extremes and the In-Betweens

Authors: Derek O'Callaghan, Nico Prucha, Derek Greene, Maura Conway, Joe Carthy, Pádraig Cunningham

Abstract: The Syria conflict has been described as the most socially mediated in history, with online social media playing a particularly important role. At the same time, the ever-changing landscape of the conflict leads to difficulties in applying analytical approaches taken by other studies of online political activism. Therefore, in this paper, we use an approach that does not require strong prior assumptions or the proposal of an advance hypothesis to analyze Twitter and YouTube activity of a range of protagonists to the conflict, in an attempt to reveal additional insights into the relationships between them. By means of a network representation that combines multiple data views, we uncover communities of accounts falling into four categories that broadly reflect the situation on the ground in Syria. A detailed analysis of selected communities within the anti-regime categories is provided, focusing on their central actors, preferred online platforms, and activity surrounding "real world" events. Our findings indicate that social media activity in Syria is considerably more convoluted than reported in many other studies of online political activism, suggesting that alternative analytical approaches can play an important role in this type of scenario.

Submitted 13 August, 2014; v1 submitted 29 January, 2014; originally announced January 2014.

Comments: 8 pages, 3 figures, 3 tables. Minor changes including additional references

arXiv:1308.6149 [pdf, other]

The Extreme Right Filter Bubble

Authors: Derek O'Callaghan, Derek Greene, Maura Conway, Joe Carthy, Pádraig Cunningham

Abstract: Due to its status as the most popular video sharing platform, YouTube plays an important role in the online strategy of extreme right groups, where it is often used to host associated content such as music and other propaganda. In this paper, we develop a categorization suitable for the analysis of extreme right channels found on YouTube. By combining this with an NMF-based topic modelling method, we categorize channels originating from links propagated by extreme right Twitter accounts. This method is also used to categorize related channels, which are determined using results returned by YouTube's related video service. We identify the existence of a "filter bubble", whereby users who access an extreme right YouTube video are highly likely to be recommended further extreme right content.

Submitted 28 August, 2013; originally announced August 2013.

Comments: 10 pages, 7 figures

ACM Class: H.3.3; H.3.5

arXiv:1308.5125 [pdf, other]

Discovering Latent Patterns from the Analysis of User-Curated Movie Lists

Authors: Derek Greene, Pádraig Cunningham

Abstract: User content curation is becoming an important source of preference data, as well as providing information regarding the items being curated. One popular approach involves the creation of lists. On Twitter, these lists might contain accounts relevant to a particular topic, whereas on a community site such as the Internet Movie Database (IMDb), this might take the form of lists of movies sharing common characteristics. While list curation involves substantial combined effort on the part of users, researchers have rarely looked at mining the outputs of this kind of crowdsourcing activity. Here we study a large collection of movie lists from IMDb. We apply network analysis methods to a graph that reflects the degree to which pairs of movies are "co-listed", that is, assigned to the same lists. This allows us to uncover a more nuanced grouping of movies that goes beyond categorisation schemes based on attributes such as genre or director.

Submitted 23 August, 2013; originally announced August 2013.

Comments: 13 pages

arXiv:1306.3839 [pdf, other]

TwitterCrowds: Techniques for Exploring Topic and Sentiment in Microblogging Data

Authors: Daniel Archambault, Derek Greene, Pádraig Cunningham

Abstract: Analysts and social scientists in the humanities and industry require techniques to help visualize large quantities of microblogging data. Methods for the automated analysis of large scale social media data (on the order of tens of millions of tweets) are widely available, but few visualization techniques exist to support interactive exploration of the results. In this paper, we present extended descriptions of ThemeCrowds and SentireCrowds, two tag-based visualization techniques for this data. We subsequently introduce a new list equivalent for both of these techniques and present a number of case studies showing them in operation. Finally, we present a formal user study to evaluate the effectiveness of these list interface equivalents when comparing them to ThemeCrowds and SentireCrowds. We find that discovering topics associated with areas of strong positive or negative sentiment is faster when using a list interface. In terms of user preference, multilevel tag clouds were found to be more enjoyable to use. Despite both interfaces being usable for all tested tasks, we have evidence to support that list interfaces can be more efficient for tasks when an appropriate ordering is known beforehand.

Submitted 17 June, 2013; originally announced June 2013.

Comments: 19 pages, colour figures

arXiv:1302.1726 [pdf, other]

Uncovering the Wider Structure of Extreme Right Communities Spanning Popular Online Networks

Authors: Derek O'Callaghan, Derek Greene, Maura Conway, Joe Carthy, Pádraig Cunningham

Abstract: Recent years have seen increased interest in the online presence of extreme right groups. Although originally composed of dedicated websites, the online extreme right milieu now spans multiple networks, including popular social media platforms such as Twitter, Facebook and YouTube. Ideally therefore, any contemporary analysis of online extreme right activity requires the consideration of multiple data sources, rather than being restricted to a single platform. We investigate the potential for Twitter to act as a gateway to communities within the wider online network of the extreme right, given its facility for the dissemination of content. A strategy for representing heterogeneous network data with a single homogeneous network for the purpose of community detection is presented, where these inherently dynamic communities are tracked over time. We use this strategy to discover and analyze persistent English and German language extreme right communities.

Submitted 16 May, 2013; v1 submitted 7 February, 2013; originally announced February 2013.

Comments: 10 pages, 11 figures. Due to use of "sigchi" template, minor changes were made to ensure 10 page limit was not exceeded. Minor clarifications in Introduction, Data and Methodology sections

ACM Class: J.4; H.2.8

arXiv:1301.5809 [pdf, other]

Producing a Unified Graph Representation from Multiple Social Network Views

Authors: Derek Greene, Pádraig Cunningham

Abstract: In many social networks, several different link relations will exist between the same set of users. Additionally, attribute or textual information will be associated with those users, such as demographic details or user-generated content. For many data analysis tasks, such as community finding and data visualisation, the provision of multiple heterogeneous types of user data makes the analysis process more complex. We propose an unsupervised method for integrating multiple data views to produce a single unified graph representation, based on the combination of the k-nearest neighbour sets for users derived from each view. These views can be either relation-based or feature-based. The proposed method is evaluated on a number of annotated multi-view Twitter datasets, where it is shown to support the discovery of the underlying community structure in the data.

Submitted 18 February, 2013; v1 submitted 24 January, 2013; originally announced January 2013.

Comments: 13 pages. Clarify notation

arXiv:1207.0017 [pdf, other]

Identifying Topical Twitter Communities via User List Aggregation

Authors: Derek Greene, Derek O'Callaghan, Pádraig Cunningham

Abstract: A particular challenge in the area of social media analysis is how to find communities within a larger network of social interactions. Here a community may be a group of microblogging users who post content on a coherent topic, or who are associated with a specific event or news story. Twitter provides the ability to curate users into lists, corresponding to meaningful topics or themes. Here we describe an approach for crowdsourcing the list building efforts of many different Twitter users, in order to identify topical communities. This approach involves the use of ensemble community finding to produce stable groupings of user lists, and by extension, individual Twitter users. We examine this approach in the context of a case study surrounding the detection of communities on Twitter relating to the London 2012 Olympics.

Submitted 29 June, 2012; originally announced July 2012.

arXiv:1206.7050 [pdf, other]

An Analysis of Interactions Within and Between Extreme Right Communities in Social Media

Authors: Derek O'Callaghan, Derek Greene, Maura Conway, Joe Carthy, Pádraig Cunningham

Abstract: Many extreme right groups have had an online presence for some time through the use of dedicated websites. This has been accompanied by increased activity in social media platforms in recent years, enabling the dissemination of extreme right content to a wider audience. In this paper, we present an analysis of the activity of a selection of such groups on Twitter, using network representations based on reciprocal follower and interaction relationships, while also analyzing topics found in their corresponding tweets. International relationships between certain extreme right groups across geopolitical boundaries are initially identified. Furthermore, we also discover stable communities of accounts within local interaction networks, in addition to associated topics, where the underlying extreme right ideology of these communities is often identifiable.

Submitted 27 January, 2014; v1 submitted 29 June, 2012; originally announced June 2012.

Comments: 20 pages, 6 figures. Additional topic analysis

Showing 1–50 of 53 results for author: Greene, D