research-article

Regularizing Conjunctive Features for Classification

Authors:

Pablo Barceló,

Alexander Baumgartner,

Benny Kimelfeld

PODS '19: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pages 2 - 16

https://doi.org/10.1145/3294052.3319680

Published: 25 June 2019

Abstract

We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and the goal is to find feature queries that allow for a linear separation between the two sets of examples. We focus on conjunctive feature queries, and explore two fundamental problems: (a) deciding whether separating feature queries exist (separability), and (b) generating such queries when they exist. In the approximate versions of these problems, we allow a predefined fraction of the examples to be misclassified. To restrict the complexity of the generated classifiers, we explore various ways of regularizing (i.e., imposing simplicity constraints on) them by limiting their dimension, the number of joins in feature queries, and their generalized hypertree width (ghw). Among other results, we show that the separability problem is tractable in the case of bounded ghw; yet, the generation problem is intractable, simply because the feature queries might be too large. So, we explore a third problem: classifying new entities without necessarily generating the feature queries. Interestingly, in the case of bounded ghw we can efficiently classify without ever explicitly generating the feature queries.
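To make the setting in the abstract concrete, the following is a minimal sketch of exact separability on a toy instance. Everything in it is an illustrative assumption rather than the paper's construction: the database `DB`, the two conjunctive feature queries `Q1` and `Q2`, the 0/1 feature encoding, and the use of a perceptron to search for a linear separator. The perceptron finds a separator when one exists, but it cannot certify that none exists, so a result of `None` only means that no separator was found within the epoch budget.

```python
# Toy database (hypothetical): relation name -> set of facts.
DB = {
    "Author": {("ann", "p1"), ("bob", "p2"), ("cat", "p3"), ("dan", "p4")},
    "Venue": {("p1", "PODS"), ("p2", "PODS"), ("p3", "ICML"), ("p4", "ICML")},
    "Cites": {("p1", "p3"), ("p4", "p2")},
}

# Conjunctive feature queries with free variable ?x, written as lists of atoms.
# Q1(x) :- Author(x, p), Venue(p, "PODS")
Q1 = [("Author", ("?x", "?p")), ("Venue", ("?p", "PODS"))]
# Q2(x) :- Author(x, p), Cites(p, q)
Q2 = [("Author", ("?x", "?p")), ("Cites", ("?p", "?q"))]


def match(term, value, binding):
    """Unify one query term with one constant, extending the binding in place."""
    if term.startswith("?"):
        return binding.setdefault(term, value) == value
    return term == value


def holds(query, db, entity):
    """Return True if the query selects the entity (naive backtracking search)."""
    def extend(atoms, binding):
        if not atoms:
            return True
        (rel, args), rest = atoms[0], atoms[1:]
        for fact in db[rel]:
            new = dict(binding)
            if all(match(t, v, new) for t, v in zip(args, fact)) and extend(rest, new):
                return True
        return False
    return extend(query, {"?x": entity})


def perceptron(vectors, labels, max_epochs=1000):
    """Search for (w, b) with label * (w . v + b) > 0 on every example."""
    dim = len(next(iter(vectors.values())))
    w, b = [0] * dim, 0
    for _ in range(max_epochs):
        mistakes = 0
        for e, v in vectors.items():
            score = sum(wi * vi for wi, vi in zip(w, v)) + b
            if labels[e] * score <= 0:  # misclassified or on the boundary
                w = [wi + labels[e] * vi for wi, vi in zip(w, v)]
                b += labels[e]
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None  # no separator found within the budget


# Labeled entities: +1 for positive examples, -1 for negative examples.
labels = {"ann": 1, "bob": -1, "cat": -1, "dan": -1}

# Each entity becomes a 0/1 vector with one coordinate per feature query.
vectors = {e: tuple(int(holds(q, DB, e)) for q in (Q1, Q2)) for e in labels}

model = perceptron(vectors, labels)
print(vectors)
print(model)
```

In this instance neither feature query separates the examples on its own, but the pair does: only `ann` satisfies both queries, so the labeling behaves like a logical AND over the two coordinates, which is linearly separable.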

Published In

PODS '19: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

June 2019

494 pages

ISBN:9781450362276

DOI:10.1145/3294052

General Chairs: Dan Suciu (University of Washington, USA), Sebastian Skritek (TU Wien, Austria)

Program Chair: Christoph Koch (EPFL, Switzerland)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '19: International Conference on Management of Data

Sponsor: SIGMOD

June 30 - July 5, 2019

Amsterdam, Netherlands
