DOI: 10.1145/3294052.3319680
research-article

Regularizing Conjunctive Features for Classification

Published: 25 June 2019

Abstract

We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and the goal is to find feature queries that allow for a linear separation between the two sets of examples. We focus on conjunctive feature queries, and explore two fundamental problems: (a) deciding whether separating feature queries exist (separability), and (b) generating such queries when they exist. In the approximate versions of these problems, we allow a predefined fraction of the examples to be misclassified. To restrict the complexity of the generated classifiers, we explore various ways of regularizing (i.e., imposing simplicity constraints on) them by limiting their dimension, the number of joins in feature queries, and their generalized hypertree width (ghw). Among other results, we show that the separability problem is tractable in the case of bounded ghw; yet, the generation problem is intractable, simply because the feature queries might be too large. So, we explore a third problem: classifying new entities without necessarily generating the feature queries. Interestingly, in the case of bounded ghw we can efficiently classify without ever explicitly generating the feature queries.
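
As a rough illustration of the setting described in the abstract, the sketch below evaluates two conjunctive feature queries over a toy database, turns each labeled entity into a 0/1 feature vector, and then searches for a linear separator. Everything in it is an assumption made for illustration: the relations `Wrote` and `Cites`, the entity names, the two queries, and the helper names `satisfies`, `feature_vector`, and `perceptron` do not come from the paper. The separator search is a plain perceptron with an epoch cap, swapped in for simplicity; it is not the paper's algorithm, and a `None` result only means no separator was found within the cap.

```python
# Toy database (hypothetical): relation name -> set of facts.
DB = {
    "Wrote": {("ann", "p1"), ("bob", "p2"), ("cat", "p3"), ("dan", "p4")},
    "Cites": {("p2", "p1"), ("p4", "p3"), ("p1", "p3")},
}

# A feature query is a list of atoms (relation, variables). "x" is the free
# variable bound to the entity; all other variables are existential.
Q_CITED = [("Wrote", ("x", "p")), ("Cites", ("q", "p"))]   # x wrote a cited paper
Q_CITING = [("Wrote", ("x", "p")), ("Cites", ("p", "q"))]  # x wrote a citing paper
QUERIES = [Q_CITED, Q_CITING]


def satisfies(db, atoms, entity):
    """True if some assignment with x = entity maps every atom to a fact in db."""
    def extend(i, env):
        if i == len(atoms):
            return True
        rel, variables = atoms[i]
        for fact in db[rel]:
            new_env = dict(env)
            # Bind unbound variables; reject facts that clash with bindings.
            if all(new_env.setdefault(v, c) == c for v, c in zip(variables, fact)):
                if extend(i + 1, new_env):
                    return True
        return False
    return extend(0, {"x": entity})


def feature_vector(db, queries, entity):
    """One 0/1 coordinate per feature query."""
    return [1 if satisfies(db, q, entity) else 0 for q in queries]


def perceptron(vectors, labels, max_epochs=1000):
    """Return (w, b) with y * (w.v + b) > 0 on every example, or None if the
    epoch cap is reached first (which does not prove inseparability)."""
    w, b = [0] * len(vectors[0]), 0
    for _ in range(max_epochs):
        mistakes = 0
        for v, y in zip(vectors, labels):
            score = sum(wi * vi for wi, vi in zip(w, v)) + b
            if y * score <= 0:
                w = [wi + y * vi for wi, vi in zip(w, v)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


# Labeled entities: +1 for positive examples, -1 for negative examples.
ENTITIES = ["ann", "bob", "cat", "dan"]
LABELS = [1, -1, 1, -1]
VECTORS = [feature_vector(DB, QUERIES, e) for e in ENTITIES]
MODEL = perceptron(VECTORS, LABELS)
```

In this toy instance the first query alone distinguishes the positive from the negative entities, so a separator exists. Two entities that receive identical feature vectors but opposite labels can never be separated by these queries, which is the situation where one would need different feature queries or the approximate variant that tolerates some misclassified examples.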

Published In

PODS '19: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
June 2019
494 pages
ISBN: 9781450362276
DOI: 10.1145/3294052

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification
  2. conjunctive queries
  3. feature generation
  4. generalized hypertree width
  5. separability

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '19: International Conference on Management of Data
June 30 - July 5, 2019
Amsterdam, Netherlands

Acceptance Rates

PODS '19 Paper Acceptance Rate 29 of 87 submissions, 33%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%
