Search | arXiv e-print repository

arXiv:2405.19817 [pdf]

Performance Examination of Symbolic Aggregate Approximation in IoT Applications

Authors: Suzana Veljanovska, Hans Dermot Doran

Abstract: Symbolic Aggregate approXimation (SAX) is a common dimensionality reduction approach for time-series data which has been employed in a variety of domains, including classification and anomaly detection in time-series data. Domains also include shape recognition where the shape outline is converted into time-series data, for instance epoch classification of archived arrowheads. In this paper we propose a dimensionality reduction and shape recognition approach based on the SAX algorithm, an application which requires responses on cost-efficient, IoT-like platforms. The challenge is largely dealing with the computational expense of the SAX algorithm in IoT-like applications, from simple time-series dimension reduction through shape recognition. The approach is based on lowering the dimensional space while capturing and preserving the most representative features of the shape. We present three scenarios of increasing computational complexity, backing up our statements with measurement of performance characteristics.

Submitted 30 May, 2024; originally announced May 2024.

Comments: Embedded World Conference, Nuremberg, 2024

arXiv:2405.05146 [pdf]

Hybrid Convolutional Neural Networks with Reliability Guarantee

Authors: Hans Dermot Doran, Suzana Veljanovska

Abstract: Making AI safe and dependable requires the generation of dependable models and dependable execution of those models. We propose redundant execution as a well-known technique that can be used to ensure reliable execution of the AI model. This generic technique will extend the application scope of AI-accelerators that do not feature well-documented safety or dependability properties. Typical redundancy techniques incur at least double or triple the computational expense of the original. We adopt a co-design approach, integrating reliable model execution with non-reliable execution, focusing that additional computational expense only where it is strictly necessary. We describe the design, implementation and some preliminary results of a hybrid CNN.

Submitted 9 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

Comments: 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2024). Dependable and Secure Machine Learning Workshop (DSML 2024), Brisbane, Australia, June 24-27, 2024

arXiv:2209.04649 [pdf]

Mixed Criticality Communication within an Unmanned Delivery Rotorcraft

Authors: Hans Dermot Doran, Prosper Leibundgut, Sami Qazimi, Roman Fritschi

Abstract: Stand-alone functions additional to a UAV flight-controller, such as safety-relevant flight-path monitoring or payload-monitoring and control, may be SORA-required or advised for specific flight paths of delivery-drones. These functions, articulated as discrete electronic components either internal or external to the main fuselage, can be networked with other on-board electronics systems. Such an integration requires respecting the integrity levels of each component on the network both in terms of function and in terms of power-supply. In this body of work we detail an intra-component communication system for small autonomous and semi-autonomous unmanned aerial vehicles (UAVs). We discuss the context and the (conservative) design decisions before detailing the hardware and software interfaces and reporting on a first implementation. We finish by drawing conclusions and proposing future work.

Submitted 10 September, 2022; originally announced September 2022.

Comments: Presented at the 48th European Rotorcraft Forum (ERF), Winterthur, 2022

arXiv:2207.10809 [pdf]

Security and Safety Aspects of AI in Industry Applications

Authors: Hans Dermot Doran

Abstract: In this relatively informal discussion-paper we summarise issues in the domains of safety and security in machine learning that will affect industry sectors in the next five to ten years. Various products using neural network classification, most often in vision-related applications but also in predictive maintenance, have been researched and applied in real-world applications in recent years. Nevertheless, reports of underlying problems in both safety and security related domains, for instance adversarial attacks, have unsettled early adopters and are threatening to hinder wider scale adoption of this technology. The problem for real-world applicability lies in being able to assess the risk of applying these technologies. In this discussion-paper we describe the process of arriving at a machine-learnt neural network classifier, pointing out safety and security vulnerabilities in that workflow and citing relevant research where appropriate.

Submitted 16 July, 2022; originally announced July 2022.

Comments: As presented at the Embedded World Conference, Nuremberg, 2022

arXiv:2108.02565 [pdf]

Dependable Neural Networks Through Redundancy, A Comparison of Redundant Architectures

Authors: Hans Dermot Doran, Gianluca Ielpo, David Ganz, Michael Zapke

Abstract: With edge-AI finding an increasing number of real-world applications, especially in industry, the question of functionally safe applications using AI has begun to be asked. In this body of work, we explore the issue of achieving dependable operation of neural networks. We discuss the issue of dependability in general implementation terms before examining lockstep solutions. We intuit that it is not necessarily a given that two similar neural networks generate results at precisely the same time and that synchronization between the platforms will be required. We perform some preliminary measurements that may support this intuition and introduce some work in implementing lockstep neural network engines.

Submitted 30 July, 2021; originally announced August 2021.

Comments: Presented at the Embedded World Conference 2021, Nuremberg (online). 4 pages, 5 figures

ACM Class: B.8.1; C.1.4; C.3

arXiv:2107.08997 [pdf]

Dynamic Lockstep Processors for Applications with Functional Safety Relevance

Authors: Hans Dermot Doran, Timo Lang

Abstract: Lockstep processing is a recognized technique for helping to secure functional-safety relevant processing against, for instance, single upset errors that might cause faulty execution of code. Lockstepping processors does however bind processing resources in a fashion not beneficial to architectures and applications that would benefit from multi-core/-processors. We propose a novel on-demand synchronizing of cores/processors for lock-step operation featuring post-processing resource release, a concept that facilitates the implementation of modularly redundant core/processor arrays. We discuss the fundamentals of the design and some implementation notes on work achieved to date.

Submitted 19 July, 2021; originally announced July 2021.

Comments: 4 pages, 8 figures

arXiv:2007.01900 [pdf]

Examining Redundancy in the Context of Safe Machine Learning

Authors: Hans Dermot Doran, Monika Reif

Abstract: This paper describes a set of experiments with neural network classifiers on the MNIST database of digits. The purpose is to investigate naïve implementations of redundant architectures as a first step towards safe and dependable machine learning. We report on a set of measurements using the MNIST database which ultimately serve to underline the expected difficulties in using NN classifiers in safe and dependable systems.

Submitted 3 July, 2020; originally announced July 2020.

Comments: 5 pages, 7 tables, 5 figures

arXiv:2005.07262 [pdf]

Voting Framework for Distributed Real-Time Ethernet based Dependable and Safe Systems

Authors: Hans Dermot Doran

Abstract: In many industrial sectors such as factory automation and process control, sensor redundancy is required to ensure reliable and highly-available operation. Measured values from N-redundant sensors are typically subjected to some voting scheme to determine a value which is used in further processing. In this paper we present a voting framework which allows the sensors and the voting scheme to be configured at system configuration time. The voting scheme is designed as a Real Time Ethernet profile. We describe the structure of the voting system and the design and verification of the framework. We argue the applicability of this sub-system based on a successful prototype implementation.

Submitted 30 April, 2020; originally announced May 2020.

Comments: 4 pages, 3 figures, conference - International Conference on Factory Communication Systems

arXiv:2005.00127 [pdf]

Conceptual Design of Human-Drone Communication in Collaborative Environments

Authors: Hans Dermot Doran, Monika Reif, Marco Oehler, Curdin Stoehr, Pierluigi Capone

Abstract: Autonomous robots and drones will work collaboratively and cooperatively in tomorrow's industry and agriculture. Before this becomes a reality, some form of standardised communication between man and machine must be established that specifically facilitates communication between autonomous machines and both trained and untrained human actors in the working environment. We present preliminary results on a human-drone and a drone-human language situated in the agricultural industry where interactions with trained and untrained workers and visitors can be expected. We present basic visual indicators enhanced with flight patterns for drone-human interaction and human signaling based on aircraft marshaling for human-drone interaction. We discuss preliminary results on image recognition and future work.

Submitted 30 April, 2020; originally announced May 2020.

Comments: 4 pages, 4 figures

arXiv:2004.14545 [pdf, other]

Explainable Deep Learning: A Field Guide for the Uninitiated

Authors: Gabrielle Ras, Ning Xie, Marcel van Gerven, Derek Doran

Abstract: Deep neural networks (DNNs) have become a proven and indispensable machine learning tool. As a black-box model, it remains difficult to diagnose what aspects of the model's input drive the decisions of a DNN. In countless real-world domains, from legislation and law enforcement to healthcare, such diagnosis is essential to ensure that DNN decisions are driven by aspects appropriate in the context of its use. The development of methods and studies enabling the explanation of a DNN's decisions has thus blossomed into an active, broad area of research. A practitioner wanting to study explainable deep learning may be intimidated by the plethora of orthogonal directions the field has taken. This complexity is further exacerbated by competing definitions of what it means "to explain" the actions of a DNN and to evaluate an approach's "ability to explain". This article offers a field guide to explore the space of explainable deep learning aimed at those uninitiated in the field. The field guide: i) introduces three simple dimensions defining the space of foundational methods that contribute to explainable deep learning, ii) discusses the evaluations for model explanations, iii) places explainability in the context of other related deep learning research areas, and iv) finally elaborates on user-oriented explanation designing and potential future directions on explainable deep learning. We hope the guide is used as an easy-to-digest starting point for those just embarking on research in this field.

Submitted 13 September, 2021; v1 submitted 29 April, 2020; originally announced April 2020.

Comments: Survey paper on Explainable Deep Learning, 70 pages including references, 13 figures, 5 tables

arXiv:1911.07814 [pdf, other]

A First Look at References from the Dark to Surface Web World

Authors: Mahdieh Zabihimayvan, Derek Doran

Abstract: Tor is one of the most well-known networks that protects the identity of both content providers and their clients against any tracking or tracing on the Internet. So far, most research attention has been focused on investigating the security and privacy concerns of Tor and characterizing the topic or hyperlink structure of its hidden services. However, there is still a lack of knowledge about the information leakage attributed to the linking from Tor hidden services to the surface Web. This work addresses this gap by presenting a broad evaluation of the network of referencing from Tor to the surface Web and investigates to what extent Tor hidden services are vulnerable to this type of information leakage. The analyses also consider how linking to surface websites can change the overall hyperlink structure of Tor hidden services. They also provide reports regarding the type of information and services provided by Tor domains. Results recover the dark-to-surface network as a single massive connected component where over 90% of Tor hidden services have at least one link to the surface world despite their interest in being isolated from surface Web tracking. We identify that Tor directories have closest proximity to all other Web resources and significantly contribute to both communication and information dissemination through the network, which emphasizes the main application of Tor as an information provider to the public. Our study is the product of crawling nearly 2 million pages from 23,145 onion seed addresses over a three-month period.

Submitted 18 November, 2019; originally announced November 2019.

Comments: 15 pages

arXiv:1911.02133 [pdf, other]

Contextual Grounding of Natural Language Entities in Images

Authors: Farley Lai, Ning Xie, Derek Doran, Asim Kadav

Abstract: In this paper, we introduce a contextual grounding approach that captures the context in corresponding text entities and image regions to improve the grounding accuracy. Specifically, the proposed architecture accepts pre-trained text token embeddings and image object features from an off-the-shelf object detector as input. Additional encoding to capture the positional and spatial information can be added to enhance the feature quality. There are separate text and image branches facilitating respective architectural refinements for different modalities. The text branch is pre-trained on a large-scale masked language modeling task while the image branch is trained from scratch. Next, the model learns the contextual representations of the text tokens and image objects through layers of high-order interaction respectively. The final grounding head ranks the correspondence between the textual and visual representations through cross-modal interaction. In the evaluation, we show that our model achieves the state-of-the-art grounding accuracy of 71.36% over the Flickr30K Entities dataset. No additional pre-training is necessary to deliver competitive results compared with related work that often requires task-agnostic and task-specific pre-training on cross-modal datasets. The implementation is publicly available at https://gitlab.com/necla-ml/grounding.

Submitted 5 November, 2019; originally announced November 2019.

Comments: Accepted to NeurIPS 2019 workshop on Visually Grounded Interaction and Language (ViGIL)

arXiv:1903.05675 [pdf, other]

Fuzzy Rough Set Feature Selection to Enhance Phishing Attack Detection

Authors: Mahdieh Zabihimayvan, Derek Doran

Abstract: Phishing, as one of the most well-known cybercrime activities, is a deception of online users to steal their personal or confidential information by impersonating a legitimate website. Several machine learning-based strategies have been proposed to detect phishing websites. These techniques are dependent on the features extracted from the website samples. However, few studies have actually considered efficient feature selection for detecting phishing attacks. In this work, we investigate an agreement on the definitive features which should be used in phishing detection. We apply Fuzzy Rough Set (FRS) theory as a tool to select the most effective features from three benchmarked data sets. The selected features are fed into three often used classifiers for phishing detection. To evaluate the FRS feature selection in developing a generalizable phishing detection, the classifiers are trained by a separate out-of-sample data set of 14,000 website samples. The maximum F-measure gained by FRS feature selection is 95% using Random Forest classification. Also, there are 9 universal features selected by FRS over all the three data sets. The F-measure value using this universal feature set is approximately 93%, which is a comparable result in contrast to the FRS performance. Since the universal feature set contains no features from third-party services, this finding implies that with no inquiry from external sources, we can gain a faster phishing detection which is also robust toward zero-day attacks.

Submitted 13 March, 2019; originally announced March 2019.

Comments: Preprint of accepted paper in IEEE International Conference on Fuzzy Systems 2019

arXiv:1902.06680 [pdf, other]

A Broad Evaluation of the Tor English Content Ecosystem

Authors: Mahdieh Zabihimayvan, Reza Sadeghi, Derek Doran, Mehdi Allahyari

Abstract: Tor is among the most well-known dark nets in the world. It has noble uses, including as a platform for free speech and information dissemination under the guise of true anonymity, but may be culturally better known as a conduit for criminal activity and as a platform to market illicit goods and data. Past studies on the content of Tor support this notion, but were carried out by targeting popular domains likely to contain illicit content. A survey of past studies may thus not yield a complete evaluation of the content and use of Tor. This work addresses this gap by presenting a broad evaluation of the content of the English Tor ecosystem. We perform a comprehensive crawl of the Tor dark web and, through topic and network analysis, characterize the types of information and services hosted across a broad swath of Tor domains and their hyperlink relational structure. We recover nine domain types defined by the information or service they host and, among other findings, unveil how some types of domains intentionally silo themselves from the rest of Tor. We also present measurements that (regrettably) suggest how marketplaces of illegal drugs and services do emerge as the dominant type of Tor domain. Our study is the product of crawling over 1 million pages from 20,000 Tor seed addresses, yielding a collection of over 150,000 Tor pages. We intend to make the domain structure publicly available as a dataset at https://github.com/wsu-wacs/TorEnglishContent.

Submitted 18 February, 2019; originally announced February 2019.

Comments: 11 pages

arXiv:1901.06706 [pdf, other]

Visual Entailment: A Novel Task for Fine-Grained Image Understanding

Authors: Ning Xie, Farley Lai, Derek Doran, Asim Kadav

Abstract: Existing visual reasoning datasets, such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning but the dataset is synthetic and consists of similar objects and sentence structures across the dataset. In this paper, we introduce a new inference task, Visual Entailment (VE) - consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset SNLI-VE based on the Stanford Natural Language Inference corpus and Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71% accuracy and outperforms several other state-of-the-art VQA based models. Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at https://github.com/necla-ml/SNLI-VE.

Submitted 20 January, 2019; originally announced January 2019.

arXiv:1811.10658 [pdf, other]

HELOC Applicant Risk Performance Evaluation by Topological Hierarchical Decomposition

Authors: Kyle Brown, Derek Doran, Ryan Kramer, Brad Reynolds

Abstract: Strong regulations in the financial industry mean that any decisions based on machine learning need to be explained. This precludes the use of powerful supervised techniques such as neural networks. In this study we propose a new unsupervised and semi-supervised technique known as the topological hierarchical decomposition (THD). This process breaks a dataset down into ever smaller groups, where groups are associated with a simplicial complex that approximates the underlying topology of a dataset. We apply THD to the FICO machine learning challenge dataset, consisting of anonymized home equity loan applications, using the MAPPER algorithm to build simplicial complexes. We identify different groups of individuals unable to pay back loans, and illustrate how the distribution of feature values in a simplicial complex can be used to explain the decision to grant or deny a loan by extracting illustrative explanations from two THDs on the dataset.

Submitted 26 November, 2018; originally announced November 2018.

Comments: 10 pages, 4 figures, to be published in the NIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services: the Impact of Fairness, Explainability, Accuracy, and Privacy

arXiv:1811.10582 [pdf, other]

Visual Entailment Task for Visually-Grounded Language Learning

Authors: Ning Xie, Farley Lai, Derek Doran, Asim Kadav

Abstract: We introduce a new inference task - Visual Entailment (VE) - which differs from traditional Textual Entailment (TE) tasks whereby a premise is defined by an image, rather than a natural language sentence as in TE tasks. A novel dataset SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE) is proposed for VE tasks based on the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights on how modern VQA based models perform.

Submitted 20 January, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

Comments: 4 pages, accepted by Visually Grounded Interaction and Language (ViGIL) workshop in NeurIPS 2018

arXiv:1811.04132 [pdf, other]

Reasoning over RDF Knowledge Bases using Deep Learning

Authors: Monireh Ebrahimi, Md Kamruzzaman Sarker, Federico Bianchi, Ning Xie, Derek Doran, Pascal Hitzler

Abstract: Semantic Web knowledge representation standards, and in particular RDF and OWL, often come endowed with a formal semantics which is considered to be of fundamental importance for the field. Reasoning, i.e., the drawing of logical inferences from knowledge expressed in such standards, is traditionally based on logical deductive methods and algorithms which can be proven to be sound and complete and terminating, i.e., correct in a very strong sense. For various reasons, though, in particular, the scalability issues arising from the ever-increasing amounts of Semantic Web data available and the inability of deductive algorithms to deal with noise in the data, it has been argued that alternative means of reasoning should be investigated which bear high promise for high scalability and better robustness. From this perspective, deductive algorithms can be considered the gold standard regarding correctness against which alternative methods need to be tested. In this paper, we show that it is possible to train a Deep Learning system on RDF knowledge graphs, such that it is able to perform reasoning over new RDF knowledge graphs, with high precision and recall compared to the deductive gold standard.

Submitted 9 November, 2018; originally announced November 2018.

arXiv:1810.09620 [pdf, other]

Deep Neural Ranking for Crowdsourced Geopolitical Event Forecasting

Authors: Giuseppe Nebbione, Derek Doran, Srikanth Nadella, Brandon Minnery

Abstract: There are many examples of 'wisdom of the crowd' effects in which the large number of participants imparts confidence in the collective judgment of the crowd. But how do we form an aggregated judgment when the size of the crowd is limited? Whose judgments do we include, and whose do we accord the most weight? This paper considers this problem in the context of geopolitical event forecasting, where volunteer analysts are queried to give their expertise, confidence, and predictions about the outcome of an event. We develop a forecast aggregation model that integrates topical information about a question, meta-data about a pair of forecasters, and their predictions in a deep siamese neural network that decides which forecasters' predictions are more likely to be close to the correct response. A ranking of the forecasters is induced from a tournament of pair-wise forecaster comparisons, with the ranking used to create an aggregate forecast. Preliminary results find the aggregate prediction of the best forecasters ranked by our deep siamese network model consistently beats typical aggregation techniques by Brier score.

Submitted 22 October, 2018; originally announced October 2018.

arXiv:1801.09715 [pdf, other]

doi 10.12720/jcm.13.8.473-481

Contrasting Web Robot and Human Behaviors with Network Models

Authors: Kyle Brown, Derek Doran

Abstract: The web graph is a commonly-used network representation of the hyperlink structure of a website. A network of similar structure to the web graph, which we call the session graph, has properties that reflect the browsing habits of the agents in the web server logs. In this paper, we apply session graphs to compare the activity of humans against web robots or crawlers. Understanding these properties will enable us to improve models of HTTP traffic, which can be used to predict and generate realistic traffic for testing and improving web server efficiency, as well as devising new caching algorithms. We apply large-scale network properties, such as the connectivity and degree distribution of human and Web robot session graphs in order to identify characteristics of the traffic which would be useful for modeling web traffic and improving cache performance. We find that the empirical degree distributions of session graphs for human and robot requests on one Web server are best fit by different theoretical distributions, indicating a difference in the processes which generate the traffic.

Submitted 29 January, 2018; originally announced January 2018.

Comments: 9 pages

arXiv:1712.05813 [pdf, other]

doi 10.1109/ICMLA.2017.0-161

Realistic Traffic Generation for Web Robots

Authors: Kyle Brown, Derek Doran

Abstract: Critical to evaluating the capacity, scalability, and availability of web systems are realistic web traffic generators. Although web traffic generation is a classic research problem, no generator accounts for the characteristics of web robots or crawlers that are now the dominant source of traffic to a web server. Administrators are thus unable to test, stress, and evaluate how their systems perform in the face of ever increasing levels of web robot traffic. To resolve this problem, this paper introduces a novel approach to generate synthetic web robot traffic with high fidelity. It generates traffic that accounts for both the temporal and behavioral qualities of robot traffic by statistical and Bayesian models that are fitted to the properties of robot traffic seen in web logs from North America and Europe. We evaluate our traffic generator by comparing the characteristics of generated traffic to those of the original data. We look at session arrival rates, inter-arrival times and session lengths, comparing and contrasting them between generated and real traffic. Finally, we show that our generated traffic affects cache performance similarly to actual traffic, using the common LRU and LFU eviction policies.

Submitted 15 December, 2017; originally announced December 2017.

Comments: 8 pages

arXiv:1712.05359 [pdf, other]

Seasonal Stochastic Blockmodeling for Anomaly Detection in Dynamic Networks

Authors: Jace Robinson, Derek Doran

Abstract: Sociotechnological and geospatial processes exhibit time-varying structure that makes insight discovery challenging. To detect abnormal moments in these processes, a definition of 'normal' must be established. This paper proposes a new statistical model for such systems, modeled as dynamic networks, to address this challenge. It assumes that vertices fall into one of k types and that the probability of edge formation at a particular time depends on the types of the incident nodes and the current time. The time dependencies are driven by unique seasonal processes, which many systems exhibit (e.g., predictable spikes in geospatial or web traffic each day). The paper defines the model as a generative process and an inference procedure to recover the 'normal' seasonal processes from data when they are unknown. An outline of anomaly detection experiments to be completed over Enron emails and New York City taxi trips is presented.

Submitted 14 December, 2017; originally announced December 2017.

Comments: Working manuscript, to be updated before planned conference submission in Spring 2018

arXiv:1712.05247 [pdf, other]

Intrinsic Point of Interest Discovery from Trajectory Data

Authors: Matthew Piekenbrock, Derek Doran

Abstract: This paper presents a framework for intrinsic point of interest discovery from trajectory databases. Intrinsic points of interest are regions of a geospatial area innately defined by the spatial and temporal aspects of trajectory data, and can be of varying size, shape, and resolution. Any trajectory database exhibits such points of interest, and hence they are intrinsic, as compared to most other point of interest definitions which are said to be extrinsic, as they require trajectory metadata, external knowledge about the region in which the trajectories are observed, or other application-specific information. Spatial and temporal aspects are qualities of any trajectory database, making the framework applicable to data from any domain and of any resolution. The framework is developed under recent developments on the consistency of nonparametric hierarchical density estimators and enables the possibility of formal statistical inference and evaluation over such intrinsic points of interest. Comparisons of the POIs uncovered by the framework in synthetic truth data to thousands of parameter settings for common POI discovery methods show a marked improvement in fidelity without the need to tune any parameters by hand.

Submitted 14 December, 2017; originally announced December 2017.

Comments: 10 pages, 9 figures

arXiv:1711.08006 [pdf, other]

Relating Input Concepts to Convolutional Neural Network Decisions

Authors: Ning Xie, Md Kamruzzaman Sarker, Derek Doran, Pascal Hitzler, Michael Raymer

Abstract: Many current methods to interpret convolutional neural networks (CNNs) use visualization techniques and words to highlight concepts of the input seemingly relevant to a CNN's decision. The methods hypothesize that the recognition of these concepts is instrumental in the decision a CNN reaches, but the nature of this relationship has not been well explored. To address this gap, this paper examines the quality of a concept's recognition by a CNN and the degree to which the recognitions are associated with CNN decisions. The study considers a CNN trained for scene recognition over the ADE20k dataset. It uses a novel approach to find and score the strength of minimally distributed representations of input concepts (defined by objects in scene images) across late stage feature maps. Subsequent analysis finds evidence that concept recognition impacts decision making. Strong recognition of concepts frequently occurring in few scenes is indicative of correct decisions, but recognizing concepts common to many scenes may mislead the network.

Submitted 21 November, 2017; originally announced November 2017.

Comments: 10 pages (including references), 9 figures, paper accepted by NIPS IEVDL 2017

ACM Class: I.2.10; I.4.m

arXiv:1710.04324 [pdf, other]

Explaining Trained Neural Networks with Semantic Web Technologies: First Steps

Authors: Md Kamruzzaman Sarker, Ning Xie, Derek Doran, Michael Raymer, Pascal Hitzler

Abstract: The ever increasing prevalence of publicly available structured data on the World Wide Web enables new applications in a variety of domains. In this paper, we provide a conceptual approach that leverages such data in order to explain the input-output behavior of trained artificial neural networks. We apply existing Semantic Web technologies in order to provide an experimental proof of concept.

Submitted 11 October, 2017; originally announced October 2017.

arXiv:1710.00794 [pdf, other]

What Does Explainable AI Really Mean? A New Conceptualization of Perspectives

Authors: Derek Doran, Sarah Schulz, Tarek R. Besold

Abstract: We characterize three notions of explainable AI that cut across research fields: opaque systems that offer no insight into its algorithmic mechanisms; interpretable systems where users can mathematically analyze its algorithmic mechanisms; and comprehensible systems that emit symbols enabling user-driven explanations of how a conclusion is reached. The paper is motivated by a corpus analysis of NIPS, ACL, COGSCI, and ICCV/ECCV paper titles showing differences in how work on explainable AI is positioned in various fields. We close by introducing a fourth notion: truly explainable systems, where automated reasoning is central to output crafted explanations without requiring human post processing as the final step of the generative process.

Submitted 2 October, 2017; originally announced October 2017.

arXiv:1707.04653 [pdf, other]

doi 10.1145/3106426.3106490

A Semantics-Based Measure of Emoji Similarity

Authors: Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran

Abstract: Emoji have grown to become one of the most important forms of communication on the web. With its widespread use, measuring the similarity of emoji has become an important problem for contemporary text processing since it lies at the heart of sentiment analysis, search, and interface design tasks. This paper presents a comprehensive analysis of the semantic similarity of emoji through embedding models that are learned over machine-readable emoji meanings in the EmojiNet knowledge base. Using emoji descriptions, emoji sense labels and emoji sense definitions, and with different training corpora obtained from Twitter and Google News, we develop and test multiple embedding models to measure emoji similarity. To evaluate our work, we create a new dataset called EmoSim508, which assigns human-annotated semantic similarity scores to a set of 508 carefully selected emoji pairs. After validation with EmoSim508, we present a real-world use-case of our emoji embedding models using a sentiment analysis task and show that our models outperform the previous best-performing emoji embedding model on this task. The EmoSim508 dataset and our emoji embedding models are publicly released with this paper and can be downloaded from http://emojinet.knoesis.org/.

Submitted 14 July, 2017; originally announced July 2017.

Comments: This paper is accepted at Web Intelligence 2017 as a full paper, In 2017 IEEE/WIC/ACM International Conference on Web Intelligence (WI). Leipzig, Germany: ACM, 2017

arXiv:1707.04652 [pdf, other]

EmojiNet: An Open Service and API for Emoji Sense Discovery

Authors: Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran

Abstract: This paper presents the release of EmojiNet, the largest machine-readable emoji sense inventory that links Unicode emoji representations to their English meanings extracted from the Web. EmojiNet is a dataset consisting of: (i) 12,904 sense labels over 2,389 emoji, which were extracted from the web and linked to machine-readable sense definitions seen in BabelNet, (ii) context words associated with each emoji sense, which are inferred through word embedding models trained over Google News corpus and a Twitter message corpus for each emoji sense definition, and (iii) recognizing discrepancies in the presentation of emoji on different platforms, specification of the most likely platform-based emoji sense for a selected set of emoji. The dataset is hosted as an open service with a REST API and is available at http://emojinet.knoesis.org/. The development of this dataset, evaluation of its quality, and its applications including emoji sense disambiguation and emoji sense similarity are discussed.

Submitted 14 July, 2017; originally announced July 2017.

Comments: This paper was published at ICWSM 2017 as a full paper, Proc. of the 11th International AAAI Conference on Web and Social Media (ICWSM 2017). Montreal, Canada. 2017

arXiv:1706.07895 [pdf, other]

doi 10.1145/3106426.3109424

Seasonality in Dynamic Stochastic Block Models

Authors: Jace Robinson, Derek Doran

Abstract: Sociotechnological and geospatial processes exhibit time-varying structure that makes insight discovery challenging. This paper proposes a new statistical model for such systems, modeled as dynamic networks, to address this challenge. It assumes that vertices fall into one of k types and that the probability of edge formation at a particular time depends on the types of the incident nodes and the current time. The time dependencies are driven by unique seasonal processes, which many systems exhibit (e.g., predictable spikes in geospatial or web traffic each day). The paper defines the model as a generative process and an inference procedure to recover the seasonal processes from data when they are unknown. Evaluation with synthetic dynamic networks shows the recovery of the latent seasonal processes that drive their formation.

Submitted 23 June, 2017; originally announced June 2017.

Comments: 4 page workshop

arXiv:1610.09516 [pdf, other]

Finding Street Gang Members on Twitter

Authors: Lakshika Balasuriya, Sanjaya Wijeratne, Derek Doran, Amit Sheth

Abstract: Most street gang members use Twitter to intimidate others, to present outrageous images and statements to the world, and to share recent illegal activities. Their tweets may thus be useful to law enforcement agencies to discover clues about recent crimes or to anticipate ones that may occur. Finding these posts, however, requires a method to discover gang member Twitter profiles. This is a challenging task since gang members represent a very small population of the 320 million Twitter users. This paper studies the problem of automatically finding gang members on Twitter. It outlines a process to curate one of the largest sets of verifiable gang member profiles that have ever been studied. A review of these profiles establishes differences in the language, images, YouTube links, and emojis gang members use compared to the rest of the Twitter population. Features from this review are used to train a series of supervised classifiers. Our classifier achieves a promising F1 score with a low false positive rate.

Submitted 29 October, 2016; originally announced October 2016.

Comments: 8 pages, 9 figures, 2 tables, Published as a full paper at 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2016)

Journal ref: The 2016 IEEE/ACM Int. Conf. on Advances in Social Networks Analysis and Mining. vol. 8, pp. 685-692. San Francisco, CA, USA (2016)

arXiv:1610.08597 [pdf, other]

Word Embeddings to Enhance Twitter Gang Member Profile Identification

Authors: Sanjaya Wijeratne, Lakshika Balasuriya, Derek Doran, Amit Sheth

Abstract: Gang affiliates have joined the masses who use social media to share thoughts and actions publicly. Interestingly, they use this public medium to express recent illegal actions, to intimidate others, and to share outrageous images and statements. Agencies able to unearth these profiles may thus be able to anticipate, stop, or hasten the investigation of gang-related crimes. This paper investigates the use of word embeddings to help identify gang members on Twitter. Building on our previous work, we generate word embeddings that translate what Twitter users post in their profile descriptions, tweets, profile images, and linked YouTube content to a real vector format amenable for machine learning classification. Our experimental results show that pre-trained word embeddings can boost the accuracy of supervised learning algorithms trained over gang members' social media posts.

Submitted 26 October, 2016; originally announced October 2016.

Comments: 7 pages, 1 figure, 2 tables, Published at IJCAI Workshop on Semantic Machine Learning (SML 2016)

Journal ref: IJCAI Workshop on Semantic Machine Learning (SML 2016). pp. 18-24. CEUR-WS, New York City, NY (07 2016)

arXiv:1610.07710 [pdf, other]

EmojiNet: Building a Machine Readable Sense Inventory for Emoji

Authors: Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran

Abstract: Emoji are a contemporary and extremely popular way to enhance electronic communication. Without rigid semantics attached to them, emoji symbols take on different meanings based on the context of a message. Thus, like the word sense disambiguation task in natural language processing, machines also need to disambiguate the meaning or sense of an emoji. In a first step toward achieving this goal, this paper presents EmojiNet, the first machine readable sense inventory for emoji. EmojiNet is a resource enabling systems to link emoji with their context-specific meaning. It is automatically constructed by integrating multiple emoji resources with BabelNet, which is the most comprehensive multilingual sense inventory available to date. The paper discusses its construction, evaluates the automatic resource creation process, and presents a use case where EmojiNet disambiguates emoji usage in tweets. EmojiNet is available online for use at http://emojinet.knoesis.org.

Submitted 24 October, 2016; originally announced October 2016.

Comments: 15 pages, 4 figures, 3 tables, Accepted to publish at the 8th International Conference on Social Informatics (SocInfo 2016) as a full research track paper

ACM Class: I.2.7

arXiv:1512.04456 [pdf]

Teaching the Foundations of Data Science: An Interdisciplinary Approach

Authors: Daniel Asamoah, Derek Doran, Shu Schiller

Abstract: The astronomical growth of data has necessitated the need for educating well-qualified data scientists to derive deep insights from large and complex data sets generated by organizations. In this paper, we present our interdisciplinary approach and experiences in teaching a Data Science course, the first of its kind offered at the Wright State University. Two faculty members from the Management Information Systems (MIS) and Computer Science (CS) departments designed and co-taught the course with perspectives from their previous research and teaching experiences. Students in the class had mixed backgrounds with mainly MIS and CS majors. Students' learning outcomes and post course survey responses suggested that the course delivered a broad overview of data science as desired, and that students worked synergistically with those of different majors in collaborative lab assignments and in a semester long project. The interdisciplinary pedagogy helped build collaboration and create satisfaction among learners.

Submitted 14 December, 2015; originally announced December 2015.

Comments: Presented at SIGDSA Business Analytics Conference 2015

arXiv:1509.04905 [pdf, other]

doi 10.1007/s13278-015-0290-0

On the discovery of social roles in large scale social systems

Authors: Derek Doran

Abstract: The social role of a participant in a social system is a label conceptualizing the circumstances under which she interacts within it. They may be used as a theoretical tool that explains why and how users participate in an online social system. Social role analysis also serves practical purposes, such as reducing the structure of complex systems to relationships among roles rather than alters, and enabling a comparison of social systems that emerge in similar contexts. This article presents a data-driven approach for the discovery of social roles in large scale social systems. Motivated by an analysis of the present art, the method discovers roles by the conditional triad censuses of user ego-networks, which is a promising tool because they capture the degree to which basic social forces push upon a user to interact with others. Clusters of censuses, inferred from samples of large scale networks carefully chosen to preserve local structural properties, define the social roles. The promise of the method is demonstrated by discussing and discovering the roles that emerge in both Facebook and Wikipedia. The article concludes with a discussion of the challenges and future opportunities in the discovery of social roles in large social systems.

Submitted 16 September, 2015; originally announced September 2015.

Journal ref: Social Network Analysis And Mining, Vol. 5, No. 49, December 2015

arXiv:1509.00670 [pdf, other]

doi 10.1145/2808797.2809311

Stay Awhile and Listen: User Interactions in a Crowdsourced Platform Offering Emotional Support

Authors: Derek Doran, Samir Yelne, Luisa Massari, Maria-Carla Calzarossa, LaTrelle Jackson, Glen Moriarty

Abstract: Internet and online-based social systems are rising as the dominant mode of communication in society. However, the public or semi-private environment under which most online communications operate does not make them suitable channels for speaking with others about personal or emotional problems. This has led to the emergence of online platforms for emotional support offering free, anonymous, and confidential conversations with live listeners. Yet very little is known about the way these platforms are utilized, and if their features and design foster strong user engagement. This paper explores the utilization and the interaction features of hundreds of thousands of users on 7 Cups of Tea, a leading online platform offering online emotional support. It dissects the level of activity of hundreds of thousands of users, the patterns by which they engage in conversation with each other, and uses machine learning methods to find factors promoting engagement. The study may be the first to measure activities and interactions in a large-scale online social system that fosters peer-to-peer emotional support.

Submitted 2 September, 2015; originally announced September 2015.

arXiv:1411.6462 [pdf, other]

Understanding Common Perceptions from Online Social Media

Authors: Derek Doran, Swapna Gokhale, Aldo Dagnino

Abstract: Modern society habitually uses online social media services to publicly share observations, thoughts, opinions, and beliefs at any time and from any location. These geotagged social media posts may provide aggregate insights into people's perceptions on a broad range of topics across a given geographical area beyond what is currently possible through services such as Yelp and Foursquare. This paper develops probabilistic language models to investigate whether collective, topic-based perceptions within a geographical area can be extracted from the content of geotagged Twitter posts. The capability of the methodology is illustrated using tweets from three areas of different sizes. An application of the approach to support power grid restoration following a storm is presented.

Submitted 24 November, 2014; originally announced November 2014.

Comments: In Proc. of the International Conference of Software Engineering and Knowledge Engineering, pp. 107-112, 2013

arXiv:1410.4616 [pdf, other]

Accurate Local Estimation of Geo-Coordinates for Social Media Posts

Authors: Derek Doran, Swapna Gokhale, Aldo Dagnino

Abstract: Associating geo-coordinates with the content of social media posts can enhance many existing applications and services and enable a host of new ones. Unfortunately, a majority of social media posts are not tagged with geo-coordinates. Even when location data is available, it may be inaccurate, very broad or sometimes fictitious. Contemporary location estimation approaches based on analyzing the content of these posts can identify only broad areas such as a city, which limits their usefulness. To address these shortcomings, this paper proposes a methodology to narrowly estimate the geo-coordinates of social media posts with high accuracy. The methodology relies solely on the content of these posts and prior knowledge of the wide geographical region from where the posts originate. An ensemble of language models, which are smoothed over non-overlapping sub-regions of a wider region, lies at the heart of the methodology. Experimental evaluation using a corpus of over half a million tweets from New York City shows that the approach, on average, estimates locations of tweets to within just 2.15 km of their actual positions.

Submitted 16 October, 2014; originally announced October 2014.

Comments: In Proceedings of the 26th International Conference on Software Engineering and Knowledge Engineering, pp. 642-647, 2014

Showing 1–37 of 37 results for author: Doran, D