DOI: 10.1145/1978942.1979444
research-article

Wrangler: interactive visual specification of data transformation scripts

Published: 07 May 2011

Abstract

Though data analysis tools continue to improve, analysts still expend an inordinate amount of time and effort manipulating data and assessing data quality issues. Such "data wrangling" regularly involves reformatting data values or layout, correcting erroneous or missing values, and integrating multiple data sources. These transforms are often difficult to specify and difficult to reuse across analysis tasks, teams, and tools. In response, we introduce Wrangler, an interactive system for creating data transformations. Wrangler combines direct manipulation of visualized data with automatic inference of relevant transforms, enabling analysts to iteratively explore the space of applicable operations and preview their effects. Wrangler leverages semantic data types (e.g., geographic locations, dates, classification codes) to aid validation and type conversion. Interactive histories support review, refinement, and annotation of transformation scripts. User study results show that Wrangler significantly reduces specification time and promotes the use of robust, auditable transforms instead of manual editing.
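
The contrast the abstract draws between manual editing and reusable, auditable transforms can be illustrated with a minimal sketch. It is not Wrangler's implementation or its transformation language: the names `split_column`, `fill_down`, and `run_script`, and the sample rows, are hypothetical and chosen only to show a transformation script as an ordered list of steps that can be reviewed and re-run on other data.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
Transform = Callable[[List[Row]], List[Row]]


def split_column(source: str, into: List[str], sep: str) -> Transform:
    """Hypothetical step: split one text column into several named columns."""
    def apply(rows: List[Row]) -> List[Row]:
        out = []
        for row in rows:
            # Copy every column except the one being split.
            new_row = {k: v for k, v in row.items() if k != source}
            parts = [p.strip() for p in (row.get(source) or "").split(sep)]
            for i, name in enumerate(into):
                # Empty or absent pieces become missing values (None).
                new_row[name] = parts[i] if i < len(parts) and parts[i] else None
            out.append(new_row)
        return out
    return apply


def fill_down(column: str) -> Transform:
    """Hypothetical step: fill a missing value with the last value seen above it."""
    def apply(rows: List[Row]) -> List[Row]:
        out, last = [], None
        for row in rows:
            value = row.get(column)
            if value is None:
                value = last
            else:
                last = value
            out.append({**row, column: value})
        return out
    return apply


def run_script(script: List[Transform], rows: List[Row]) -> List[Row]:
    """Apply each step in order; the input rows are never modified in place."""
    for step in script:
        rows = step(rows)
    return rows


# Illustrative input only: the second record is missing its state.
raw = [
    {"record": "Alabama, 2004, 4029"},
    {"record": ", 2005, 3900"},
]
script = [
    split_column("record", ["state", "year", "count"], ","),
    fill_down("state"),
]
clean = run_script(script, raw)
```

Because the script is an ordinary list of steps, it can be inspected, reordered, or applied unchanged to another file with the same layout, which is the kind of reuse that hand-editing individual cells does not provide.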

Supplementary Material

M4V File (paper355.m4v)

    Published In

    CHI '11: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
    May 2011
    3530 pages
    ISBN:9781450302289
    DOI:10.1145/1978942

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data analysis
    2. data cleaning
    3. transformation
    4. visualization
    5. wrangler

    Qualifiers

    • Research-article

    Conference

    CHI '11

    Acceptance Rates

    CHI '11 Paper Acceptance Rate 410 of 1,532 submissions, 27%;
    Overall Acceptance Rate 6,199 of 26,314 submissions, 24%

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 315
    • Downloads (last 6 weeks): 27
    Reflects downloads up to 12 Nov 2024

    Cited By

    • (2024) DataLoom: Simplifying Data Loading with LLMs. Proceedings of the VLDB Endowment 17:12 (4449-4452). DOI: 10.14778/3685800.3685897. Online publication date: 1-Aug-2024
    • (2024) ArcheType: A Novel Framework for Open-Source Column Type Annotation Using Large Language Models. Proceedings of the VLDB Endowment 17:9 (2279-2292). DOI: 10.14778/3665844.3665857. Online publication date: 1-May-2024
    • (2024) Chorus: Foundation Models for Unified Data Discovery and Exploration. Proceedings of the VLDB Endowment 17:8 (2104-2114). DOI: 10.14778/3659437.3659461. Online publication date: 1-Apr-2024
    • (2024) Drag, Drop, Merge: A Tool for Streamlining Integration of Longitudinal Survey Instruments. Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics (1-7). DOI: 10.1145/3665939.3665965. Online publication date: 14-Jun-2024
    • (2024) Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654979. Online publication date: 30-May-2024
    • (2024) WaitGPT: Monitoring and Steering Conversational LLM Agent in Data Analysis with On-the-Fly Code Visualization. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (1-14). DOI: 10.1145/3654777.3676374. Online publication date: 13-Oct-2024
    • (2024) SQLucid: Grounding Natural Language Database Queries with Interactive Explanations. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (1-20). DOI: 10.1145/3654777.3676368. Online publication date: 13-Oct-2024
    • (2024) Improving Steering and Verification in AI-Assisted Data Analysis with Interactive Task Decomposition. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (1-19). DOI: 10.1145/3654777.3676345. Online publication date: 13-Oct-2024
    • (2024) "We Have No Idea How Models will Behave in Production until Production": How Engineers Operationalize Machine Learning. Proceedings of the ACM on Human-Computer Interaction 8:CSCW1 (1-34). DOI: 10.1145/3653697. Online publication date: 26-Apr-2024
    • (2024) Cleenex: Support for User Involvement during an Iterative Data Cleaning Process. Journal of Data and Information Quality 16:1 (1-26). DOI: 10.1145/3648476. Online publication date: 15-Feb-2024