Showing 1–45 of 45 results for author: Freire, J

Searching in archive cs.
  1. arXiv:2407.18898  [pdf, other]

    cs.IR cs.DB

    A Flexible and Scalable Approach for Collecting Wildlife Advertisements on the Web

    Authors: Juliana Barbosa, Sunandan Chakraborty, Juliana Freire

    Abstract: Wildlife traffickers are increasingly carrying out their activities in cyberspace. As they advertise and sell wildlife products in online marketplaces, they leave digital traces of their activity. This creates a new opportunity: by analyzing these traces, we can obtain insights into how trafficking networks work as well as how they can be disrupted. However, collecting such information is difficul…

    Submitted 26 July, 2024; originally announced July 2024.

  2. arXiv:2403.15553  [pdf, other]

    cs.DB

    Efficiently Estimating Mutual Information Between Attributes Across Tables

    Authors: Aécio Santos, Flip Korn, Juliana Freire

    Abstract: Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, it is challenging to efficiently discover relevant external tables to join with a given input table. Existing approaches rely on data discovery systems to identify joinable tables from external sources, typically based on…

    Submitted 22 March, 2024; originally announced March 2024.

    Comments: Accepted to IEEE ICDE 2024

  3. arXiv:2310.18208  [pdf, other]

    cs.CL cs.LG

    ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

    Authors: Benjamin Feuer, Yurong Liu, Chinmay Hegde, Juliana Freire

    Abstract: Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited st…

    Submitted 6 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: 17 pages, 8 figures

  4. arXiv:2309.16157  [pdf, other]

    cs.DB cs.DS

    Sampling Methods for Inner Product Sketching

    Authors: Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos, Haoxiang Zhang

    Abstract: Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindenstrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the co…

    Submitted 15 January, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: 17 pages, 10 figures

  5. arXiv:2308.05907  [pdf, ps, other]

    cs.DS cs.DB

    Simple Analysis of Priority Sampling

    Authors: Majid Daliri, Juliana Freire, Christopher Musco, Aécio Santos, Haoxiang Zhang

    Abstract: We prove a tight upper bound on the variance of the priority sampling method (aka sequential Poisson sampling). Our proof is significantly shorter and simpler than the original proof given by Mario Szegedy at STOC 2006, which resolved a conjecture by Duffield, Lund, and Thorup.

    Submitted 10 August, 2023; originally announced August 2023.

    Comments: 6 pages

  6. arXiv:2307.05374  [pdf, other]

    eess.SP cs.LG

    Multi-Task Learning to Enhance Generalizability of Neural Network Equalizers in Coherent Optical Systems

    Authors: Sasipim Srivallapanondh, Pedro J. Freire, Ashraful Alam, Nelson Costa, Bernhard Spinnler, Antonio Napoli, Egor Sedov, Sergei K. Turitsyn, Jaroslaw E. Prilepsky

    Abstract: For the first time, multi-task learning is proposed to improve the flexibility of NN-based equalizers in coherent systems. A "single" NN-based equalizer improves Q-factor by up to 4 dB compared to CDC, without re-training, even with variations in launch power, symbol rate, or transmission distance.

    Submitted 3 November, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

    Comments: 4 pages, European Conference on Optical Communication (ECOC)

  7. arXiv:2305.09495  [pdf, other]

    cs.LG physics.optics

    Hardware Realization of Nonlinear Activation Functions for NN-based Optical Equalizers

    Authors: Sasipim Srivallapanondh, Pedro J. Freire, Antonio Napoli, Sergei K. Turitsyn, Jaroslaw E. Prilepsky

    Abstract: To reduce the complexity of the hardware implementation of neural network-based optical channel equalizers, we demonstrate that the performance of the biLSTM equalizer with approximated activation functions is close to that of the original model.

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: 2 pages, 1 figure, 1 table, Conference on Lasers & Electro-Optics 2023

  8. arXiv:2304.08597  [pdf, other]

    cs.LG cs.IR

    eTOP: Early Termination of Pipelines for Faster Training of AutoML Systems

    Authors: Haoxiang Zhang, Juliana Freire, Yash Garg

    Abstract: Recent advancements in software and hardware technologies have enabled the use of AI/ML models in everyday applications, which has significantly improved the quality of service rendered. However, for a given application, finding the right AI/ML model is a complex and costly process that involves the generation, training, and evaluation of multiple interlinked steps (called pipelines), such as data pre-p…

    Submitted 17 April, 2023; originally announced April 2023.


  9. arXiv:2301.05811  [pdf, other]

    cs.DB cs.DS

    Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation

    Authors: Aline Bessa, Majid Daliri, Juliana Freire, Cameron Musco, Christopher Musco, Aécio Santos, Haoxiang Zhang

    Abstract: We present a new approach for computing compact sketches that can be used to approximate the inner product between pairs of high-dimensional vectors. Based on the Weighted MinHash algorithm, our approach admits strong accuracy guarantees that improve on the guarantees of popular linear sketching approaches for inner product estimation, such as CountSketch and Johnson-Lindenstrauss projection. Spec…

    Submitted 5 May, 2023; v1 submitted 13 January, 2023; originally announced January 2023.

    Comments: 23 pages, 6 figures

    Journal ref: In Proceedings of the ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS) 2023

  10. arXiv:2212.04703  [pdf, other]

    eess.SP cs.AR cs.CC cs.LG

    Implementing Neural Network-Based Equalizers in a Coherent Optical Transmission System Using Field-Programmable Gate Arrays

    Authors: Pedro J. Freire, Sasipim Srivallapanondh, Michael Anderson, Bernhard Spinnler, Thomas Bex, Tobias A. Eriksson, Antonio Napoli, Wolfgang Schairer, Nelson Costa, Michaela Blott, Sergei K. Turitsyn, Jaroslaw E. Prilepsky

    Abstract: In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to the FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardwa…

    Submitted 19 February, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: Invited paper at Journal of Lightwave Technology - IEEE

  11. arXiv:2212.04569  [pdf, other]

    eess.SP cs.LG

    Knowledge Distillation Applied to Optical Channel Equalization: Solving the Parallelization Problem of Recurrent Connection

    Authors: Sasipim Srivallapanondh, Pedro J. Freire, Bernhard Spinnler, Nelson Costa, Antonio Napoli, Sergei K. Turitsyn, Jaroslaw E. Prilepsky

    Abstract: To circumvent the non-parallelizability of recurrent neural network-based equalizers, we propose knowledge distillation to recast the RNN into a parallelizable feedforward structure. The latter shows a 38% latency decrease, while impacting the Q-factor by only 0.5 dB.

    Submitted 8 December, 2022; originally announced December 2022.

    Comments: Paper Accepted for Oral presentation - OFC 2023 (Optical Fiber Communication Conference)

  12. arXiv:2208.12866  [pdf, other]

    eess.SP cs.CC cs.ET cs.LG

    Reducing Computational Complexity of Neural Networks in Optical Channel Equalization: From Concepts to Implementation

    Authors: Pedro J. Freire, Antonio Napoli, Diego Arguello Ron, Bernhard Spinnler, Michael Anderson, Wolfgang Schairer, Thomas Bex, Nelson Costa, Sergei K. Turitsyn, Jaroslaw E. Prilepsky

    Abstract: In this paper, a new methodology is proposed that allows for the low-complexity development of neural network (NN) based equalizers for the mitigation of impairments in high-speed coherent optical transmission systems. In this work, we provide a comprehensive description and comparison of various deep model compression approaches that have been applied to feed-forward and recurrent NN designs. Add…

    Submitted 26 November, 2022; v1 submitted 26 August, 2022; originally announced August 2022.

  13. arXiv:2206.12180  [pdf, other]

    eess.SP cs.LG

    Towards FPGA Implementation of Neural Network-Based Nonlinearity Mitigation Equalizers in Coherent Optical Transmission Systems

    Authors: Pedro J. Freire, Michael Anderson, Bernhard Spinnler, Thomas Bex, Jaroslaw E. Prilepsky, Tobias A. Eriksson, Nelson Costa, Wolfgang Schairer, Michaela Blott, Antonio Napoli, Sergei K. Turitsyn

    Abstract: For the first time, recurrent and feedforward neural network-based equalizers for nonlinearity compensation are implemented in an FPGA, with a level of complexity comparable to that of a dispersion equalizer. We demonstrate that the NN-based equalizers can outperform a 1 step-per-span DBP.

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: Accepted Oral in the European Conference on Optical Communication (ECOC) 2022

  14. arXiv:2203.14362  [pdf, other]

    cs.DB

    GPU-Powered Spatial Database Engine for Commodity Hardware: Extended Version

    Authors: Harish Doraiswamy, Juliana Freire

    Abstract: Given the massive growth in the volume of spatial data, there is a great need for systems that can efficiently evaluate spatial queries over large data sets. These queries are notoriously expensive using traditional database solutions. While faster response times can be attained through powerful clusters or servers with large main-memory, these options, due to cost and complexity, are out of reach…

    Submitted 27 March, 2022; originally announced March 2022.

  15. arXiv:2202.12689  [pdf, other]

    eess.SP cs.LG

    Domain Adaptation: the Key Enabler of Neural Network Equalizers in Coherent Optical Systems

    Authors: Pedro J. Freire, Bernhard Spinnler, Daniel Abode, Jaroslaw E. Prilepsky, Abdallah A. I. Ali, Nelson Costa, Wolfgang Schairer, Antonio Napoli, Andrew D. Ellis, Sergei K. Turitsyn

    Abstract: We introduce the domain adaptation and randomization approach for calibrating neural network-based equalizers for real transmissions, using synthetic data. The approach renders up to 99% training process reduction, which we demonstrate in three experimental setups.

    Submitted 25 February, 2022; originally announced February 2022.

    Comments: Paper Accepted at OFC 2022

  16. arXiv:2201.04226  [pdf, other]

    cs.SI

    Understanding how people consume low quality and extreme news using web traffic data

    Authors: Zhouhan Chen, Haohan Chen, Juliana Freire, Jonathan Nagler, Joshua A. Tucker

    Abstract: To mitigate the spread of fake news, researchers need to understand who visits fake news sites, what brings people to those sites, where visitors come from, and what content they prefer to consume. In this paper, we analyze web traffic data from The Gateway Pundit (TGP), a popular far-right website that is known for repeatedly sharing false information that has made its web traffic available to the…

    Submitted 11 January, 2022; originally announced January 2022.

  17. arXiv:2111.02508  [pdf, other]

    cs.LG

    AlphaD3M: Machine Learning Pipeline Synthesis

    Authors: Iddo Drori, Yamuna Krishnamurthy, Remi Rampin, Raoni de Paula Lourenco, Jorge Piazentin Ono, Kyunghyun Cho, Claudio Silva, Juliana Freire

    Abstract: We introduce AlphaD3M, an automatic machine learning (AutoML) system based on meta reinforcement learning using sequence models with self play. AlphaD3M is based on edit operations performed over machine learning pipeline primitives providing explainability. We compare AlphaD3M with state-of-the-art AutoML systems: Autosklearn, Autostacker, and TPOT, on OpenML datasets. AlphaD3M achieves competiti…

    Submitted 3 November, 2021; originally announced November 2021.

    Comments: ICML 2018 AutoML Workshop

  18. arXiv:2109.08711  [pdf, ps, other]

    eess.SP cs.LG

    Experimental Evaluation of Computational Complexity for Different Neural Network Equalizers in Optical Communications

    Authors: Pedro J. Freire, Yevhenii Osadchuk, Antonio Napoli, Bernhard Spinnler, Wolfgang Schairer, Nelson Costa, Jaroslaw E. Prilepsky, Sergei K. Turitsyn

    Abstract: Addressing the neural network-based optical channel equalizers, we quantify the trade-off between their performance and complexity by carrying out the comparative analysis of several neural network architectures, presenting the results for TWC and SSMF set-ups.

    Submitted 17 September, 2021; originally announced September 2021.

    Comments: ORAL presentation at the Asia Communications and Photonics Conference (ACP 2021)

  19. arXiv:2105.06058  [pdf, other]

    cs.DB

    DataExposer: Exposing Disconnect between Data and Systems

    Authors: Sainyam Galhotra, Anna Fariha, Raoni Lourenço, Juliana Freire, Alexandra Meliou, Divesh Srivastava

    Abstract: As data is a central component of many modern systems, the cause of a system malfunction may reside in the data, and, specifically, particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in imperial units (lbs) will malfunction when encountering weight reported in metric units (kilograms). Similar to software debuggi…

    Submitted 12 May, 2021; originally announced May 2021.

  20. arXiv:2104.03353  [pdf, other]

    cs.DB cs.DS cs.IR

    Correlation Sketches for Approximate Join-Correlation Queries

    Authors: Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, Juliana Freire

    Abstract: The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column $Q$ and a join column $K_Q$ from a query table…

    Submitted 7 April, 2021; originally announced April 2021.

    Comments: Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21)

  21. arXiv:2102.05716  [pdf, other]

    cs.IR cs.DB

    Auctus: A Dataset Search Engine for Data Augmentation

    Authors: Sonia Castelo, Rémi Rampin, Aécio Santos, Aline Bessa, Fernando Chirigati, Juliana Freire

    Abstract: The large volumes of structured data currently available, from Web tables to open-data portals and enterprise data, open up new opportunities for progress in answering many important scientific, societal, and business questions. However, finding relevant data is difficult. While search engines have addressed this problem for Web documents, there are many new challenges involved in supporting the d…

    Submitted 31 August, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

  22. arXiv:2009.00449  [pdf, other]

    cs.HC

    Towards Evaluating Exploratory Model Building Process with AutoML Systems

    Authors: Sungsoo Ray Hong, Sonia Castelo, Vito D'Orazio, Christopher Benthune, Aecio Santos, Scott Langevin, David Jonker, Enrico Bertini, Juliana Freire

    Abstract: The use of Automated Machine Learning (AutoML) systems is highly open-ended and exploratory. While rigorously evaluating how end-users interact with AutoML is crucial, establishing a robust evaluation methodology for such exploratory systems is challenging. First, AutoML is complex, including multiple sub-components that support a variety of sub-tasks for synthesizing ML pipelines, such as data p…

    Submitted 1 September, 2020; originally announced September 2020.

  23. arXiv:2005.00160  [pdf, other]

    cs.HC

    PipelineProfiler: A Visual Analytics Tool for the Exploration of AutoML Pipelines

    Authors: Jorge Piazentin Ono, Sonia Castelo, Roque Lopez, Enrico Bertini, Juliana Freire, Claudio Silva

    Abstract: In recent years, a wide variety of automated machine learning (AutoML) methods have been proposed to search and generate end-to-end learning pipelines. While these techniques facilitate the creation of models for real-world applications, given their black-box nature, the complexity of the underlying algorithms, and the large number of pipelines they derive, it is difficult for their developers to…

    Submitted 3 September, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

    Comments: To appear at IEEE VIS 2020

  24. BugDoc: Algorithms to Debug Computational Processes

    Authors: Raoni Lourenço, Juliana Freire, Dennis Shasha

    Abstract: Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challen…

    Submitted 12 April, 2020; originally announced April 2020.

    Comments: To appear in SIGMOD 2020. arXiv admin note: text overlap with arXiv:2002.04640

  25. arXiv:2004.03630  [pdf, other]

    cs.DB

    A GPU-friendly Geometric Data Model and Algebra for Spatial Queries: Extended Version

    Authors: Harish Doraiswamy, Juliana Freire

    Abstract: The availability of low cost sensors has led to an unprecedented growth in the volume of spatial data. However, the time required to evaluate even simple spatial queries over large data sets greatly hampers our ability to interactively explore these data sets and extract actionable insights. Graphics Processing Units (GPUs) are increasingly being used to speed up spatial queries. However, existing…

    Submitted 7 April, 2020; originally announced April 2020.

    Comments: This is the extended version of the paper published in SIGMOD 2020

    ACM Class: H.2.1; H.2.8

  26. arXiv:2002.04640  [pdf, other]

    cs.LG cs.DB stat.ML

    Debugging Machine Learning Pipelines

    Authors: Raoni Lourenço, Juliana Freire, Dennis Shasha

    Abstract: Machine learning tasks entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous or uninformative outputs, the pipeline may fail or produce incorrect results. Inferring the root cause of failures and unexpected behavior is challenging, usually requiring much human thought, and is both time-consumin…

    Submitted 11 February, 2020; originally announced February 2020.

    Comments: 10 pages

    Journal ref: Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, June 2019, Article No.: 3

  27. arXiv:1910.08678  [pdf, other]

    cs.DB

    Effective Discovery of Meaningful Outlier Relationships

    Authors: Aline Bessa, Juliana Freire, Divesh Srivastava, Tamraparni Dasu

    Abstract: We propose PODS (Predictable Outliers in Data-trendS), a method that, given a collection of temporal data sets, derives data-driven explanations for outliers by identifying meaningful relationships between them. First, we formalize the notion of meaningfulness, which so far has been informally framed in terms of explainability. Next, since outliers are rare and it is difficult to determine whether…

    Submitted 8 April, 2020; v1 submitted 18 October, 2019; originally announced October 2019.

  28. arXiv:1910.03698  [pdf, other]

    cs.LG cs.CL stat.ML

    AutoML using Metadata Language Embeddings

    Authors: Iddo Drori, Lu Liu, Yi Nian, Sharath C. Koorathota, Jie S. Li, Antonio Khalil Moretti, Juliana Freire, Madeleine Udell

    Abstract: As a human choosing a supervised learning algorithm, it is natural to begin by reading a text description of the dataset and documentation for the algorithms you might use. We demonstrate that the same idea improves the performance of automated machine learning methods. We use language embeddings from modern NLP to improve state-of-the-art AutoML systems by augmenting their recommendations with ve…

    Submitted 8 October, 2019; originally announced October 2019.

    Journal ref: NeurIPS Workshop on Meta-Learning, 2019

  29. Visus: An Interactive System for Automatic Machine Learning Model Building and Curation

    Authors: Aécio Santos, Sonia Castelo, Cristian Felix, Jorge Piazentin Ono, Bowen Yu, Sungsoo Hong, Cláudio T. Silva, Enrico Bertini, Juliana Freire

    Abstract: While the demand for machine learning (ML) applications is booming, there is a scarcity of data scientists capable of building such models. Automatic machine learning (AutoML) approaches have been proposed that help with this problem by synthesizing end-to-end ML data processing pipelines. However, these follow a best-effort approach and a user in the loop is necessary to curate and refine the der…

    Submitted 5 July, 2019; originally announced July 2019.

    Comments: Accepted for publication in the 2019 Workshop on Human-In-the-Loop Data Analytics (HILDA'19), co-located with SIGMOD 2019

  30. arXiv:1905.10345  [pdf, other]

    cs.LG stat.ML

    Automatic Machine Learning by Pipeline Synthesis using Model-Based Reinforcement Learning and a Grammar

    Authors: Iddo Drori, Yamuna Krishnamurthy, Raoni Lourenco, Remi Rampin, Kyunghyun Cho, Claudio Silva, Juliana Freire

    Abstract: Automatic machine learning is an important problem in the forefront of machine learning. The strongest AutoML systems are based on neural networks, evolutionary algorithms, and Bayesian optimization. Recently AlphaD3M reached state-of-the-art results with an order of magnitude speedup using reinforcement learning with self-play. In this work we extend AlphaD3M by using a pipeline grammar and a pre…

    Submitted 24 May, 2019; originally announced May 2019.

    Comments: ICML Workshop on Automated Machine Learning

  31. arXiv:1905.00957  [pdf, other]

    cs.CL cs.IR cs.SI

    A Topic-Agnostic Approach for Identifying Fake News Pages

    Authors: Sonia Castelo, Thais Almeida, Anas Elghafari, Aécio Santos, Kien Pham, Eduardo Nakamura, Juliana Freire

    Abstract: Fake news and misinformation have been increasingly used to manipulate popular opinion and influence political processes. To better understand fake news, how they are propagated, and how to counter their effect, it is necessary to first identify them. Recently, approaches have been proposed to automatically classify articles as fake based on their content. An important challenge for these approach…

    Submitted 2 May, 2019; originally announced May 2019.

    Comments: Accepted for publication in the Companion Proceedings of the 2019 World Wide Web Conference (WWW'19 Companion). Presented in the 2019 International Workshop on Misinformation, Computational Fact-Checking and Credible Web (MisinfoWorkshop2019). 6 pages

  32. Bootstrapping Domain-Specific Content Discovery on the Web

    Authors: Kien Pham, Aécio Santos, Juliana Freire

    Abstract: The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest $D$, subject-matter experts (SMEs) must search for relevant websites and collect a set of representative Web pages to se…

    Submitted 25 February, 2019; originally announced February 2019.

    Comments: Accepted for publication in the Proceedings of the 2019 World Wide Web Conference (WWW'19). 11 pages, 8 figures

  33. arXiv:1809.05518  [pdf, other]

    cs.RO cs.HC

    SocialRobot: Towards a Personalized Elderly Care Mobile Robot

    Authors: David Portugal, Luís Santos, Pedro Trindade, Christophoros Christophorou, Panayiotis Andreou, Dimosthenis Georgiadis, Marios Belk, João Freire, Paulo Alvito, George Samaras, Eleni Christodoulou, Jorge Dias

    Abstract: SocialRobot is a collaborative European project, which focuses on providing a practical and interactive solution to improve the quality of life of elderly people. Having this in mind, a state-of-the-art robotic mobile platform has been integrated with virtual social care technology to meet the elderly individual needs and requirements, following a human-centered approach. In this short paper, we m…

    Submitted 14 September, 2018; originally announced September 2018.

  34. arXiv:1808.01406  [pdf, other]

    cs.SE cs.DL

    ReproServer: Making Reproducibility Easier and Less Intensive

    Authors: Remi Rampin, Fernando Chirigati, Vicky Steeves, Juliana Freire

    Abstract: Reproducibility in the computational sciences has been stymied because of the complex and rapidly changing computational environments in which modern research takes place. While many will espouse reproducibility as a value, the challenge of making it happen (both for themselves and testing the reproducibility of others' work) often outweighs the benefits. There have been a few reproducibility solut…

    Submitted 3 August, 2018; originally announced August 2018.

  35. A Collaborative Approach to Computational Reproducibility

    Authors: Fernando Chirigati, Rebecca Capone, Dennis Shasha, Remi Rampin, Juliana Freire

    Abstract: Although a standard in natural science, reproducibility has been only episodically applied in experimental computer science. Scientific papers often present a large number of tables, plots and pictures that summarize the obtained results, but then loosely describe the steps taken to derive them. Not only can the methods and the implementation be complex, but also their configuration may require se…

    Submitted 9 August, 2017; originally announced September 2017.

    Journal ref: The Journal of Information Systems, Volume 59, Pages 95-97, ISSN 0306-4379 (2016)

  36. Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets

    Authors: Fernando Chirigati, Harish Doraiswamy, Theodoros Damoulas, Juliana Freire

    Abstract: The increasing ability to collect data from urban environments, coupled with a push towards openness by governments, has resulted in the availability of numerous spatio-temporal data sets covering diverse aspects of a city. Discovering relationships between these data sets can produce new insights by enabling domain experts to not only test but also generate hypotheses. However, discovering these…

    Submitted 21 October, 2016; originally announced October 2016.

    Journal ref: Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16), pp. 1011-1025

  37. arXiv:1606.00046  [pdf, other]

    cs.DB

    The Exception that Improves the Rule

    Authors: Juliana Freire, Boris Glavic, Oliver Kennedy, Heiko Mueller

    Abstract: The database community has developed numerous tools and techniques for data curation and exploration, from declarative languages, to specialized techniques for data repair, and more. Yet, there is currently no consensus on how to best expose these powerful tools to an analyst in a simple, intuitive, and above all, flexible way. Thus, analysts continue to rely on tools such as spreadsheets, imperat…

    Submitted 31 May, 2016; originally announced June 2016.

    Comments: Authors in alphabetical order; Preprint for HILDA 2015

  38. arXiv:1601.06128  [pdf, other]

    cs.HC

    RioBusData: Outlier Detection in Bus Routes of Rio de Janeiro

    Authors: Aline Bessa, Fernando de Mesentier Silva, Rodrigo Frassetto Nogueira, Enrico Bertini, Juliana Freire

    Abstract: Buses are the primary means of public transportation in the city of Rio de Janeiro, carrying around 100 million passengers every month. Recently, real-time GPS coordinates of all operating public buses have been made publicly available, with roughly 1 million GPS entries captured each day. In an initial study, we observed that a substantial number of buses follow trajectories that do not follow th…

    Submitted 22 January, 2016; originally announced January 2016.

    Comments: In Symposium on Visualization in Data Science (VDS at IEEE VIS), Chicago, Illinois, US, 2015

  39. arXiv:1502.02403  [pdf, other]

    cs.SE

    YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts

    Authors: Timothy McPhillips, Tianhong Song, Tyler Kolisnik, Steve Aulenbach, Khalid Belhajjame, Kyle Bocinsky, Yang Cao, Fernando Chirigati, Saumen Dey, Juliana Freire, Deborah Huntzinger, Christopher Jones, David Koop, Paolo Missier, Mark Schildhauer, Christopher Schwalm, Yaxing Wei, James Cheney, Mark Bieda, Bertram Ludaescher

    Abstract: Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow…

    Submitted 9 February, 2015; originally announced February 2015.

  40. arXiv:1501.06774  [pdf, other]

    cs.DM math.CO

    Maximum Common Subelement Metrics and its Applications to Graphs

    Authors: Lauro Lins, Nivan Ferreira, Juliana Freire, Claudio Silva

    Abstract: In this paper we characterize a mathematical model called Maximum Common Subelement (MCS) Model and prove the existence of four different metrics on such a model. We generalize metrics on graphs previously proposed in the literature and identify new ones by showing three different examples of MCS Models on graphs based on (1) subgraphs, (2) induced subgraphs and (3) an extended notion of subgraphs.…

    Submitted 27 January, 2015; originally announced January 2015.

  41. arXiv:1401.2000  [pdf, other]

    cs.CE cond-mat.stat-mech physics.comp-ph

    A model project for reproducible papers: critical temperature for the Ising model on a square lattice

    Authors: M. Dolfi, J. Gukelberger, A. Hehn, J. Imriška, K. Pakrouski, T. F. Rønnow, M. Troyer, I. Zintchenko, F. Chirigati, J. Freire, D. Shasha

    Abstract: In this paper we present a simple, yet typical simulation in statistical physics, consisting of large scale Monte Carlo simulations followed by an involved statistical analysis of the results. The purpose is to provide an example publication to explore tools for writing reproducible papers. The simulation estimates the critical temperature where the Ising model on the square lattice becomes magnet…

    Submitted 9 January, 2014; originally announced January 2014.

    Comments: Authors are listed in alphabetical order by institution and name. 5 pages, 4 figures

  42. arXiv:1309.1784  [pdf, other]

    cs.SE cs.DB

    Enabling Reproducible Science with VisTrails

    Authors: David Koop, Juliana Freire, Claudio T. Silva

    Abstract: With the increasing amount of data and use of computation in science, software has become an important component in many different domains. Computing is now being used more often and in more aspects of scientific work including data acquisition, simulation, analysis, and visualization. To ensure reproducibility, it is important to capture the different computational processes used as well as their…

    Submitted 14 December, 2013; v1 submitted 6 September, 2013; originally announced September 2013.

    Comments: Accepted for WSSSPE 2013

  43. arXiv:1110.6651  [pdf, other]

    cs.DB

    Multilingual Schema Matching for Wikipedia Infoboxes

    Authors: Thanh Nguyen, Viviane Moreira, Huong Nguyen, Hoa Nguyen, Juliana Freire

    Abstract: Recent research has taken advantage of Wikipedia's multilingualism as a resource for cross-language information retrieval and machine translation, as well as proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, and in particular, to enable answers that straddle…

    Submitted 30 October, 2011; originally announced October 2011.

    Comments: VLDB2012

    Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 2, pp. 133-144 (2011)

  44. arXiv:1105.4251  [pdf]

    cs.DB

    Synthesizing Products for Online Catalogs

    Authors: Hoa Nguyen, Ariel Fuxman, Stelios Paparizos, Juliana Freire, Rakesh Agrawal

    Abstract: A high-quality, comprehensive product catalog is essential to the success of Product Search engines and shopping sites such as Yahoo! Shopping, Google Product Search or Bing Shopping. But keeping catalogs up-to-date becomes a challenging task, calling for the need of automated techniques. In this paper, we introduce the problem of product synthesis, a key component of catalog creation and maintena…

    Submitted 21 May, 2011; originally announced May 2011.

    Comments: VLDB2011

    Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 7, pp. 409-418 (2011)

  45. arXiv:cs/0310035  [pdf, ps, other]

    cs.DB

    Supporting Exploratory Queries in Database Centric Web Applications

    Authors: Abhijit Kadlag, Amol Wanjari, Juliana Freire, Jayant R. Haritsa

    Abstract: Users of database-centric Web applications, especially in the e-commerce domain, often resort to exploratory "trial-and-error" queries since the underlying data space is huge and unfamiliar, and there are several alternatives for search attributes in this space. For example, scouting for cheap airfares typically involves posing multiple queries, varying flight times, dates, and airport locatio…

    Submitted 17 October, 2003; originally announced October 2003.

    Comments: Version after changing the authors; earlier, by mistake, only the first two authors were taken

    ACM Class: H.2.4