DOI: 10.1145/2723372.2742784
research-article

Why Big Data Industrial Systems Need Rules and What We Can Do About It

Published: 27 May 2015

Abstract

Big Data industrial systems that address problems such as classification, information extraction, and entity matching very commonly use hand-crafted rules. Today, however, little is understood about the usage of such rules. In this paper we explore this issue. We discuss how these systems differ from those considered in academia. We describe default solutions, their limitations, and reasons for using rules. We show examples of extensive rule usage in industry. Contrary to popular perceptions, we show that there is a rich set of research challenges in rule generation, evaluation, execution, optimization, and maintenance. We discuss ongoing work at WalmartLabs and UW-Madison that illustrates these challenges. Our main conclusions are (1) using rules (together with techniques such as learning and crowdsourcing) is fundamental to building semantics-intensive Big Data systems, and (2) it is increasingly critical to address rule management, given the tens of thousands of rules industrial systems often manage today in an ad-hoc fashion.

Cited By

  • (2019) Certus. Proceedings of the VLDB Endowment 12:6 (653-666). DOI: 10.14778/3311880.3311883. Online publication date: 1-Feb-2019
  • (2018) Adaptive rule monitoring system. Proceedings of the 1st International Workshop on Software Engineering for Cognitive Services (45-51). DOI: 10.1145/3195555.3195564. Online publication date: 28-May-2018
  • (2017) Synthesizing entity matching rules by examples. Proceedings of the VLDB Endowment 11:2 (189-202). DOI: 10.14778/3149193.3149199. Online publication date: 1-Oct-2017

    Published In

    SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
    May 2015
    2110 pages
    ISBN:9781450327589
    DOI:10.1145/2723372

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. big data
    2. classification
    3. rule management

    Qualifiers

    • Research-article

    Funding Sources

    • @WalmartLabs

    Conference

    SIGMOD/PODS'15: International Conference on Management of Data
    May 31 - June 4, 2015
    Melbourne, Victoria, Australia

    Acceptance Rates

    SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
