Showing 1–42 of 42 results for author: Hanbury, A

  1. arXiv:2406.08080  [pdf, other]

    cs.CL cs.AI cs.CY

    AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection

    Authors: Pia Pachinger, Janis Goldzycher, Anna Maria Planitzer, Wojciech Kusa, Allan Hanbury, Julia Neidhardt

    Abstract: Model interpretability in toxicity detection greatly profits from token-level annotations. However, currently such annotations are only available in English. We introduce a dataset annotated for offensive language detection sourced from a news forum, notable for its incorporation of the Austrian German dialect, comprising 4,562 user comments. In addition to binary offensiveness classification, we…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to Findings of the Association for Computational Linguistics: ACL 2024

    ACM Class: I.2.7

  2. arXiv:2311.12474  [pdf, other]

    cs.CL cs.IR

    CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews

    Authors: Wojciech Kusa, Oscar E. Mendoza, Matthias Samwald, Petr Knoth, Allan Hanbury

    Abstract: Systematic literature reviews (SLRs) play an essential role in summarising, synthesising and validating scientific evidence. In recent years, there has been a growing interest in using machine learning techniques to automate the identification of relevant studies for SLRs. However, the lack of standardised evaluation datasets makes comparing the performance of such automated literature screening s…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: Accepted at NeurIPS 2023 Datasets and Benchmarks Track

  3. arXiv:2309.06131  [pdf, other]

    cs.IR cs.CL

    Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection

    Authors: Sophia Althammer, Guido Zuccon, Sebastian Hofstätter, Suzan Verberne, Allan Hanbury

    Abstract: Search methods based on Pretrained Language Models (PLM) have demonstrated great effectiveness gains compared to statistical and early neural ranking models. However, fine-tuning PLM-based rankers requires a great amount of annotated training data. Annotating data involves a large manual effort and thus is expensive, especially in domain specific tasks. In this paper we investigate fine-tuning PLM…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: Accepted at SIGIR-AP 2023

  4. arXiv:2309.01684  [pdf, other]

    cs.IR cs.CL cs.DL

    CRUISE-Screening: Living Literature Reviews Toolbox

    Authors: Wojciech Kusa, Petr Knoth, Allan Hanbury

    Abstract: Keeping up with research and finding related work is still a time-consuming task for academics. Researchers sift through thousands of studies to identify a few relevant ones. Automation techniques can help by increasing the efficiency and effectiveness of this task. To this end, we developed CRUISE-Screening, a web-based application for conducting living literature reviews - a type of literature r…

    Submitted 4 September, 2023; originally announced September 2023.

    Comments: Paper accepted at CIKM 2023. The arXiv version has an extra section about limitations in the Appendix that is not present in the ACM version

  5. arXiv:2307.00381  [pdf, other]

    cs.IR cs.CL

    Effective Matching of Patients to Clinical Trials using Entity Extraction and Neural Re-ranking

    Authors: Wojciech Kusa, Óscar E. Mendoza, Petr Knoth, Gabriella Pasi, Allan Hanbury

    Abstract: Clinical trials (CTs) often fail due to inadequate patient recruitment. This paper tackles the challenges of CT retrieval by presenting an approach that addresses the patient-to-trials paradigm. Our approach involves two key components in a pipeline-based model: (i) a data enrichment technique for enhancing both queries and documents during the first retrieval stage, and (ii) a novel re-ranking sc…

    Submitted 1 July, 2023; originally announced July 2023.

    Comments: Under review

  6. Outcome-based Evaluation of Systematic Review Automation

    Authors: Wojciech Kusa, Guido Zuccon, Petr Knoth, Allan Hanbury

    Abstract: Current methods of evaluating search strategies and automated citation screening for systematic literature reviews typically rely on counting the number of relevant and not relevant publications. This established practice, however, does not accurately reflect the reality of conducting a systematic review, because not all included publications have the same influence on the final outcome of the sys…

    Submitted 30 June, 2023; originally announced June 2023.

    Comments: Accepted at ICTIR2023

  7. arXiv:2304.08188  [pdf, ps, other]

    cs.IR

    Statute-enhanced lexical retrieval of court cases for COLIEE 2022

    Authors: Tobias Fink, Gabor Recski, Wojciech Kusa, Allan Hanbury

    Abstract: We discuss our experiments for COLIEE Task 1, a court case retrieval competition using cases from the Federal Court of Canada. During experiments on the training data we observe that passage level retrieval with rank fusion outperforms document level retrieval. By explicitly adding extracted statute information to the queries and documents we can further improve the results. We submit two passage…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: Sixteenth International Workshop on Juris-informatics (JURISIN). 2022

  8. TripJudge: A Relevance Judgement Test Collection for TripClick Health Retrieval

    Authors: Sophia Althammer, Sebastian Hofstätter, Suzan Verberne, Allan Hanbury

    Abstract: Robust test collections are crucial for Information Retrieval research. Recently there is a growing interest in evaluating retrieval systems for domain-specific retrieval tasks, however these tasks often lack a reliable test collection with human-annotated relevance assessments following the Cranfield paradigm. In the medical domain, the TripClick collection was recently proposed, which contains c…

    Submitted 14 August, 2022; originally announced August 2022.

    Comments: To be published at CIKM 2022 as resource paper

  9. arXiv:2206.12993  [pdf, other]

    cs.IR cs.CL

    Are We There Yet? A Decision Framework for Replacing Term Based Retrieval with Dense Retrieval Systems

    Authors: Sebastian Hofstätter, Nick Craswell, Bhaskar Mitra, Hamed Zamani, Allan Hanbury

    Abstract: Recently, several dense retrieval (DR) models have demonstrated competitive performance to term-based retrieval that are ubiquitous in search systems. In contrast to term-based matching, DR projects queries and documents into a dense vector space and retrieves results via (approximate) nearest neighbor search. Deploying a new system, such as DR, inevitably involves tradeoffs in aspects of its perf…

    Submitted 26 June, 2022; originally announced June 2022.

  10. arXiv:2203.13088  [pdf, other]

    cs.IR cs.AI cs.CL cs.LG

    Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction

    Authors: Sebastian Hofstätter, Omar Khattab, Sophia Althammer, Mete Sertkan, Allan Hanbury

    Abstract: Recent progress in neural information retrieval has demonstrated large gains in effectiveness, while often sacrificing the efficiency and interpretability of the neural model compared to classical approaches. This paper proposes ColBERTer, a neural retrieval model using contextualized late interaction (ColBERT) with enhanced reduction. Along the effectiveness Pareto frontier, ColBERTer's reduction…

    Submitted 24 March, 2022; originally announced March 2022.

  11. arXiv:2203.06989  [pdf, other]

    cs.NI cs.DC cs.LG

    Identifying the root cause of cable network problems with machine learning

    Authors: Georg Heiler, Thassilo Gadermaier, Thomas Haider, Allan Hanbury, Peter Filzmoser

    Abstract: Good quality network connectivity is ever more important. For hybrid fiber coaxial (HFC) networks, searching for upstream high noise in the past was cumbersome and time-consuming. Even with machine learning due to the heterogeneity of the network and its topological structure, the task remains challenging. We present the automation of a simple business rule (largest change of a specific value) and…

    Submitted 15 March, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

  12. arXiv:2201.07534  [pdf, other]

    cs.IR

    Automation of Citation Screening for Systematic Literature Reviews using Neural Networks: A Replicability Study

    Authors: Wojciech Kusa, Allan Hanbury, Petr Knoth

    Abstract: In the process of Systematic Literature Review, citation screening is estimated to be one of the most time-consuming steps. Multiple approaches to automate it using various machine learning techniques have been proposed. The first research papers that apply deep neural networks to this problem were published in the last two years. In this work, we conduct a replicability study of the first two dee…

    Submitted 19 January, 2022; originally announced January 2022.

    Comments: Accepted at ECIR 2022

  13. arXiv:2201.01614  [pdf, other]

    cs.IR

    PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval

    Authors: Sophia Althammer, Sebastian Hofstätter, Mete Sertkan, Suzan Verberne, Allan Hanbury

    Abstract: Dense passage retrieval (DPR) models show great effectiveness gains in first stage retrieval for the web domain. However in the web domain we are in a setting with large amounts of training data and a query-to-passage or a query-to-document retrieval task. We investigate in this paper dense document-to-document retrieval with limited labelled target data for training, in particular legal case retr…

    Submitted 14 August, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

    Comments: Accepted at ECIR 2022

  14. arXiv:2201.00365  [pdf, ps, other]

    cs.IR cs.CL

    Establishing Strong Baselines for TripClick Health Retrieval

    Authors: Sebastian Hofstätter, Sophia Althammer, Mete Sertkan, Allan Hanbury

    Abstract: We present strong Transformer-based re-ranking and dense retrieval baselines for the recently released TripClick health ad-hoc retrieval collection. We improve the - originally too noisy - training data with a simple negative sampling policy. We achieve large gains over BM25 in the re-ranking task of TripClick, which were not achieved with the original baselines. Furthermore, we study the impact o…

    Submitted 2 January, 2022; originally announced January 2022.

    Comments: Accepted at ECIR 2022

  15. arXiv:2110.05601  [pdf]

    cs.HC cs.IR

    A Time-Optimized Content Creation Workflow for Remote Teaching

    Authors: Sebastian Hofstätter, Sophia Althammer, Mete Sertkan, Allan Hanbury

    Abstract: We describe our workflow to create an engaging remote learning experience for a university course, while minimizing the post-production time of the educators. We make use of ubiquitous and commonly free services and platforms, so that our workflow is inclusive for all educators and provides polished experiences for students. Our learning materials provide for each lecture: 1) a recorded video, upl…

    Submitted 13 October, 2021; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: Accepted at SIGCSE-TS 2022

  16. arXiv:2108.03937  [pdf, other]

    cs.IR

    DoSSIER@COLIEE 2021: Leveraging dense retrieval and summarization-based re-ranking for case law retrieval

    Authors: Sophia Althammer, Arian Askari, Suzan Verberne, Allan Hanbury

    Abstract: In this paper, we present our approaches for the case law retrieval and the legal case entailment task in the Competition on Legal Information Extraction/Entailment (COLIEE) 2021. As first stage retrieval methods combined with neural re-ranking methods using contextualized language models like BERT achieved great performance improvements for information retrieval in the web and news domain, we eva…

    Submitted 9 August, 2021; originally announced August 2021.

    Comments: Published in COLIEE 2021

  17. arXiv:2106.05768  [pdf, other]

    cs.CL cs.IR

    Linguistically Informed Masking for Representation Learning in the Patent Domain

    Authors: Sophia Althammer, Mark Buckley, Sebastian Hofstätter, Allan Hanbury

    Abstract: Domain-specific contextualized language models have demonstrated substantial effectiveness gains for domain-specific downstream tasks, like similarity matching, entity recognition or information retrieval. However successfully applying such models in highly specific language domains requires domain adaptation of the pre-trained models. In this paper we propose the empirically motivated Linguistica…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: Published at SIGIR 2021 PatentSemTech workshop

  18. arXiv:2105.09816  [pdf, other]

    cs.IR cs.CL

    Intra-Document Cascading: Learning to Select Passages for Neural Document Ranking

    Authors: Sebastian Hofstätter, Bhaskar Mitra, Hamed Zamani, Nick Craswell, Allan Hanbury

    Abstract: An emerging recipe for achieving state-of-the-art effectiveness in neural document re-ranking involves utilizing large pre-trained language models - e.g., BERT - to evaluate all individual passages in the document and then aggregating the outputs by pooling or additional Transformer layers. A major drawback of this approach is high query latency due to the cost of evaluating every passage in the d…

    Submitted 20 May, 2021; originally announced May 2021.

    Comments: Accepted at SIGIR 2021 (Full Paper Track)

  19. arXiv:2104.06967  [pdf, other]

    cs.IR cs.CL

    Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling

    Authors: Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, Allan Hanbury

    Abstract: A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing and query workflows. The neural IR community made great advancements in training effective dual-encoder dense retrieval (DR) models recently. A dense text retrieval model uses a single vector representation per query and passage to score a match, which enables low-…

    Submitted 26 May, 2021; v1 submitted 14 April, 2021; originally announced April 2021.

    Comments: Accepted at SIGIR 2021 (Full Paper track)

  20. arXiv:2101.06980  [pdf, other]

    cs.IR cs.CL

    Mitigating the Position Bias of Transformer Models in Passage Re-Ranking

    Authors: Sebastian Hofstätter, Aldo Lipani, Sophia Althammer, Markus Zlabinger, Allan Hanbury

    Abstract: Supervised machine learning models and their evaluation strongly depends on the quality of the underlying dataset. When we search for a relevant piece of information it may appear anywhere in a given passage. However, we observe a bias in the position of the correct answer in the text in two popular Question Answering datasets used for passage re-ranking. The excessive favoring of earlier position…

    Submitted 18 January, 2021; originally announced January 2021.

    Comments: Accepted at ECIR 2021 (Full paper track)

  21. arXiv:2012.11405  [pdf, other]

    cs.IR

    Cross-domain Retrieval in the Legal and Patent Domains: a Reproducibility Study

    Authors: Sophia Althammer, Sebastian Hofstätter, Allan Hanbury

    Abstract: Domain specific search has always been a challenging information retrieval task due to several challenges such as the domain specific language, the unique task setting, as well as the lack of accessible queries and corresponding relevance judgements. In the last years, pretrained language models, such as BERT, revolutionized web and news search. Naturally, the community aims to adapt these advance…

    Submitted 19 January, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

    Comments: Accepted at ECIR 2021 (Reproducibility paper track)

  22. arXiv:2010.10470  [pdf, other]

    physics.soc-ph

    Behavioral gender differences are reinforced during the COVID-19 crisis

    Authors: Tobias Reisch, Georg Heiler, Jan Hurt, Peter Klimek, Allan Hanbury, Stefan Thurner

    Abstract: Behavioral gender differences are known to exist for a wide range of human activities including the way people communicate, move, provision themselves, or organize leisure activities. Using mobile phone data from 1.2 million devices in Austria (15% of the population) across the first phase of the COVID-19 crisis, we quantify gender-specific patterns of communication intensity, mobility, and circad…

    Submitted 20 October, 2020; originally announced October 2020.

    Comments: 26 pages, 30 figures

  23. arXiv:2010.02666  [pdf, other]

    cs.IR

    Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation

    Authors: Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, Allan Hanbury

    Abstract: Retrieval and ranking models are the backbone of many applications such as web search, open domain QA, or text-based recommender systems. The latency of neural ranking models at query time is largely dependent on the architecture and deliberate choices by their designers to trade-off effectiveness for higher efficiency. This focus on low query latency of a rising number of efficient ranking archit…

    Submitted 22 January, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: Updated paper with dense retrieval results and query-level analysis

  24. arXiv:2009.03798  [pdf, other]

    physics.soc-ph stat.AP stat.CO

    The impact of COVID-19 on relative changes in aggregated mobility using mobile-phone data

    Authors: Georg Heiler, Allan Hanbury, Peter Filzmoser

    Abstract: Evaluating relative changes leads to additional insights which would remain hidden when only evaluating absolute changes. We analyze a dataset describing mobility of mobile phones in Austria before, during COVID-19 lock-down measures until recent. By applying compositional data analysis we show that formerly hidden information becomes available: we see that the elderly population groups increase r…

    Submitted 8 September, 2020; originally announced September 2020.

  25. arXiv:2008.10064  [pdf, other]

    cs.CY cs.SI stat.AP

    Country-wide mobility changes observed using mobile phone data during COVID-19 pandemic

    Authors: Georg Heiler, Tobias Reisch, Jan Hurt, Mohammad Forghani, Aida Omani, Allan Hanbury, Farid Karimipour

    Abstract: In March 2020, the Austrian government introduced a widespread lock-down in response to the COVID-19 pandemic. Based on subjective impressions and anecdotal evidence, Austrian public and private life came to a sudden halt. Here we assess the effect of the lock-down quantitatively for all regions in Austria and present an analysis of daily changes of human mobility throughout Austria using near-rea…

    Submitted 23 August, 2020; originally announced August 2020.

  26. arXiv:2008.05363  [pdf, other]

    cs.IR cs.CL

    Fine-Grained Relevance Annotations for Multi-Task Document Ranking and Question Answering

    Authors: Sebastian Hofstätter, Markus Zlabinger, Mete Sertkan, Michael Schröder, Allan Hanbury

    Abstract: There are many existing retrieval and question answering datasets. However, most of them either focus on ranked list evaluation or single-candidate question answering. This divide makes it challenging to properly evaluate approaches concerned with ranking documents and providing snippets or answers for a given query. In this work, we present FiRA: a novel dataset of Fine-Grained Relevance Annotati…

    Submitted 12 August, 2020; originally announced August 2020.

    Comments: Accepted at CIKM 2020 (Resource Track)

  27. arXiv:2005.08367  [pdf, other]

    cs.IR

    DEXA: Supporting Non-Expert Annotators with Dynamic Examples from Experts

    Authors: Markus Zlabinger, Marta Sabou, Sebastian Hofstätter, Mete Sertkan, Allan Hanbury

    Abstract: The success of crowdsourcing based annotation of text corpora depends on ensuring that crowdworkers are sufficiently well-trained to perform the annotation task accurately. To that end, a frequent approach to train annotators is to provide instructions and a few example cases that demonstrate how the task should be performed (referred to as the CONTROL approach). These globally defined "task-level…

    Submitted 17 May, 2020; originally announced May 2020.

    Comments: 4 pages, 1 figure, 3 tables, accepted to SIGIR2020

  28. arXiv:2005.04908  [pdf, other]

    cs.IR

    Local Self-Attention over Long Text for Efficient Document Retrieval

    Authors: Sebastian Hofstätter, Hamed Zamani, Bhaskar Mitra, Nick Craswell, Allan Hanbury

    Abstract: Neural networks, particularly Transformer-based architectures, have achieved significant performance improvements on several retrieval benchmarks. When the items being retrieved are documents, the time and memory cost of employing Transformers over a full sequence of document terms can be prohibitive. A popular strategy involves considering only the first n terms of the document. This can, however…

    Submitted 11 May, 2020; originally announced May 2020.

    Comments: Accepted at SIGIR 2020 (short paper)

  29. arXiv:2002.01854  [pdf, other]

    cs.IR

    Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking

    Authors: Sebastian Hofstätter, Markus Zlabinger, Allan Hanbury

    Abstract: Search engines operate under a strict time constraint as a fast response is paramount to user satisfaction. Thus, neural re-ranking models have a limited time-budget to re-rank documents. Given the same amount of time, a faster re-ranking model can incorporate more documents than a less efficient one, leading to a higher effectiveness. To utilize this property, we propose TK (Transformer-Kernel):…

    Submitted 4 February, 2020; originally announced February 2020.

    Comments: Accepted at ECAI 2020 (full paper). arXiv admin note: text overlap with arXiv:1912.01385

  30. arXiv:2001.05357  [pdf, ps, other]

    cs.IR

    DSR: A Collection for the Evaluation of Graded Disease-Symptom Relations

    Authors: Markus Zlabinger, Sebastian Hofstätter, Navid Rekabsaz, Allan Hanbury

    Abstract: The effective extraction of ranked disease-symptom relationships is a critical component in various medical tasks, including computer-assisted medical diagnosis or the discovery of unexpected associations between diseases. While existing disease-symptom relationship extraction methods are used as the foundation in the various medical tasks, no collection is available to systematically evaluate the…

    Submitted 15 January, 2020; originally announced January 2020.

    Comments: 7 pages; 3 tables; accepted as short-paper to the 42nd European Conference on Information Retrieval (ECIR), Lisbon 2020

  31. arXiv:1912.04713  [pdf, other]

    cs.IR

    Neural-IR-Explorer: A Content-Focused Tool to Explore Neural Re-Ranking Results

    Authors: Sebastian Hofstätter, Markus Zlabinger, Allan Hanbury

    Abstract: In this paper we look beyond metrics-based evaluation of Information Retrieval systems, to explore the reasons behind ranking results. We present the content-focused Neural-IR-Explorer, which empowers users to browse through retrieval results and inspect the inner workings and fine-grained results of neural re-ranking models. The explorer includes a categorized overview of the available queries, a…

    Submitted 10 December, 2019; originally announced December 2019.

    Comments: Accepted at ECIR 2020 (demo paper)

  32. arXiv:1912.01385  [pdf, other]

    cs.IR cs.CL

    TU Wien @ TREC Deep Learning '19 -- Simple Contextualization for Re-ranking

    Authors: Sebastian Hofstätter, Markus Zlabinger, Allan Hanbury

    Abstract: The usage of neural network models puts multiple objectives in conflict with each other: Ideally we would like to create a neural model that is effective, efficient, and interpretable at the same time. However, in most instances we have to choose which property is most important to us. We used the opportunity of the TREC 2019 Deep Learning track to evaluate the effectiveness of a balanced neural r…

    Submitted 3 December, 2019; originally announced December 2019.

    Comments: Presented at TREC 2019

  33. arXiv:1907.12975  [pdf]

    cs.CV cs.LG q-bio.TO

    Deep Learning architectures for generalized immunofluorescence based nuclear image segmentation

    Authors: Florian Kromp, Lukas Fischer, Eva Bozsaky, Inge Ambros, Wolfgang Doerr, Sabine Taschner-Mandl, Peter Ambros, Allan Hanbury

    Abstract: Separating and labeling each instance of a nucleus (instance-aware segmentation) is the key challenge in segmenting single cell nuclei on fluorescence microscopy images. Deep Neural Networks can learn the implicit transformation of a nuclear image into a probability map indicating the class membership of each pixel (nucleus or background), but the use of post-processing steps to turn the probabili…

    Submitted 30 July, 2019; originally announced July 2019.

    Comments: 10 pages + 3 supplementary pages

  34. arXiv:1907.04614  [pdf, other]

    cs.IR

    Let's measure run time! Extending the IR replicability infrastructure to include performance aspects

    Authors: Sebastian Hofstätter, Allan Hanbury

    Abstract: Establishing a docker-based replicability infrastructure offers the community a great opportunity: measuring the run time of information retrieval systems. The time required to present query results to a user is paramount to the users satisfaction. Recent advances in neural IR re-ranking models put the issue of query latency at the forefront. They bring a complex trade-off between performance and…

    Submitted 10 July, 2019; originally announced July 2019.

    Comments: Position paper @ SIGIR 2019 Open-Source IR Replicability Challenge (OSIRRC)

  35. arXiv:1904.12683  [pdf, other]

    cs.IR

    On the Effect of Low-Frequency Terms on Neural-IR Models

    Authors: Sebastian Hofstätter, Navid Rekabsaz, Carsten Eickhoff, Allan Hanbury

    Abstract: Low-frequency terms are a recurring challenge for information retrieval models, especially neural IR frameworks struggle with adequately capturing infrequently observed words. While these terms are often removed from neural models - mainly as a concession to efficiency demands - they traditionally play an important role in the performance of IR models. In this paper, we analyze the effects of low-…

    Submitted 30 April, 2019; v1 submitted 29 April, 2019; originally announced April 2019.

    Comments: Accepted at SIGIR'19

  36. arXiv:1812.10424  [pdf, other]

    cs.CL cs.LG stat.ML

    Measuring Societal Biases from Text Corpora with Smoothed First-Order Co-occurrence

    Authors: Navid Rekabsaz, Robert West, James Henderson, Allan Hanbury

    Abstract: Text corpora are widely used resources for measuring societal biases and stereotypes. The common approach to measuring such biases using a corpus is by calculating the similarities between the embedding vector of a word (like nurse) and the vectors of the representative words of the concepts of interest (such as genders). In this study, we show that, depending on what one aims to quantify as bias,…

    Submitted 27 April, 2021; v1 submitted 13 December, 2018; originally announced December 2018.

    Comments: In proceedings of the International AAAI Conference on Web and Social Media (ICWSM) 2021

  37. Why rankings of biomedical image analysis competitions should be interpreted with care

    Authors: Lena Maier-Hein, Matthias Eisenmann, Annika Reinke, Sinan Onogur, Marko Stankovic, Patrick Scholz, Tal Arbel, Hrvoje Bogunovic, Andrew P. Bradley, Aaron Carass, Carolin Feldmann, Alejandro F. Frangi, Peter M. Full, Bram van Ginneken, Allan Hanbury, Katrin Honauer, Michal Kozubek, Bennett A. Landman, Keno März, Oskar Maier, Klaus Maier-Hein, Bjoern H. Menze, Henning Müller, Peter F. Neher, Wiro Niessen, et al. (13 additional authors not shown)

    Abstract: International challenges have become the standard for validation of biomedical image analysis methods. Given their scientific impact, it is surprising that a critical analysis of common practices related to the organization of challenges has not yet been performed. In this paper, we present a comprehensive analysis of biomedical image analysis challenges conducted up to now. We demonstrate the imp…

    Submitted 18 September, 2019; v1 submitted 6 June, 2018; originally announced June 2018.

    Comments: Article published in Nature Communications: https://rdcu.be/bRmNr

    Journal ref: Nature communications 9.1 (2018): 5217

  38. arXiv:1711.06196  [pdf, other]

    cs.CL cs.IR

    Addressing Cross-Lingual Word Sense Disambiguation on Low-Density Languages: Application to Persian

    Authors: Navid Rekabsaz, Mihai Lupu, Allan Hanbury, Andres Duque

    Abstract: We explore the use of unsupervised methods in Cross-Lingual Word Sense Disambiguation (CL-WSD) with the application of English to Persian. Our proposed approach targets the languages with scarce resources (low-density) by exploiting word embedding and semantic similarity of the words in context. We evaluate the approach on a recent evaluation benchmark and compare it with the state-of-the-art unsu…

    Submitted 21 March, 2018; v1 submitted 16 November, 2017; originally announced November 2017.

  39. arXiv:1707.06598  [pdf, other]

    cs.IR cs.CL

    Toward Incorporation of Relevant Documents in word2vec

    Authors: Navid Rekabsaz, Bhaskar Mitra, Mihai Lupu, Allan Hanbury

    Abstract: Recent advances in neural word embedding provide significant benefit to various information retrieval tasks. However as shown by recent studies, adapting the embedding models for the needs of IR tasks can bring considerable further improvements. The embedding models in general define the term relatedness by exploiting the terms' co-occurrences in short-window contexts. An alternative (and well-stu…

    Submitted 4 April, 2018; v1 submitted 20 July, 2017; originally announced July 2017.

    Comments: Neu-IR Workshop at the ACM Conference on Research and Development in Information Retrieval (NeuIR-SIGIR 2017)

  40. Volatility Prediction using Financial Disclosures Sentiments with Word Embedding-based IR Models

    Authors: Navid Rekabsaz, Mihai Lupu, Artem Baklanov, Allan Hanbury, Alexander Duer, Linda Anderson

    Abstract: Volatility prediction--an essential concept in financial markets--has recently been addressed using sentiment analysis methods. We investigate the sentiment of annual disclosures of companies in stock markets to forecast volatility. We specifically explore the use of recent Information Retrieval (IR) term weighting models that are effectively extended by related terms using word embeddings. In par…

    Submitted 28 September, 2017; v1 submitted 7 February, 2017; originally announced February 2017.

  41. arXiv:1606.06086  [pdf, other]

    cs.CL cs.IR

    Uncertainty in Neural Network Word Embedding: Exploration of Threshold for Similarity

    Authors: Navid Rekabsaz, Mihai Lupu, Allan Hanbury

    Abstract: Word embedding, specially with its recent developments, promises a quantification of the similarity between terms. However, it is not clear to which extent this similarity value can be genuinely meaningful and useful for subsequent tasks. We explore how the similarity score obtained from the models is really indicative of term relatedness. We first observe and quantify the uncertainty factor of th…

    Submitted 4 April, 2018; v1 submitted 20 June, 2016; originally announced June 2016.

    Comments: Neu-IR Workshop at the ACM Conference on Research and Development in Information Retrieval (NeuIR-SIGIR 2016)

  42. arXiv:1512.07454  [pdf, other]

    cs.CY cs.IR

    Evaluation-as-a-Service: Overview and Outlook

    Authors: Allan Hanbury, Henning Müller, Krisztian Balog, Torben Brodt, Gordon V. Cormack, Ivan Eggel, Tim Gollub, Frank Hopfgartner, Jayashree Kalpathy-Cramer, Noriko Kando, Anastasia Krithara, Jimmy Lin, Simon Mercer, Martin Potthast

    Abstract: Evaluation in empirical computer science is essential to show progress and assess technologies developed. Several research domains such as information retrieval have long relied on systematic evaluation to measure progress: here, the Cranfield paradigm of creating shared test collections, defining search tasks, and collecting ground truth for these tasks has persisted up until now. In recent years…

    Submitted 23 December, 2015; originally announced December 2015.