Reducing the size of databases for multirelational classification: a subgraph-based approach

Guo, Hongyu; Viktor, Herna L.; Paquet, Eric

doi:10.1007/s10844-012-0229-0

Published: 29 November 2012

Volume 40, pages 349–374, (2013)

Journal of Intelligent Information Systems

Hongyu Guo¹,
Herna L. Viktor² &
Eric Paquet¹,²

Abstract

Multirelational classification aims to discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a database spans numerous departments and subdivisions, which are involved in different aspects of the enterprise such as customer profiling, fraud detection, inventory management, and financial management. For classification, different phases of the knowledge discovery process are affected by economic utility. For instance, during data preprocessing, one must consider the cost of acquiring, cleaning, and transforming large volumes of data. When training and testing the data mining models, one must consider the impact of the data size on the running time of the learning algorithm. To address these utility-based issues, the paper presents an approach that creates a pruned database for multirelational classification while minimizing the loss in predictive performance of the final model. Our method identifies a set of strongly uncorrelated subgraphs from the original database schema to use for training, and discards all others. The experiments show that our strategy significantly reduces the size of the databases, in terms of the number of relations, tuples, and attributes, without sacrificing predictive accuracy. The approach prunes the sizes of databases by as much as 94%. This reduction also decreases the computational cost of the learning process: the method improves the execution time of the multirelational learning algorithms by as much as 80%. In particular, our results demonstrate that one may build an accurate model with only a small subset of the provided database.
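
The selection idea in the abstract (keep subgraphs that are mutually uncorrelated, discard the rest) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not state the correlation measure or the search procedure, so the sketch assumes Pearson correlation, a greedy pass, and a threshold named `max_mutual_corr`. The subgraph names and scores are hypothetical toy values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_subgraphs(signals, labels, max_mutual_corr=0.5):
    """Greedily keep subgraphs that are predictive and mutually uncorrelated.

    signals: dict mapping a (hypothetical) subgraph name to one numeric
             score per training tuple, e.g. a per-subgraph class estimate.
    labels:  class labels encoded as numbers.
    max_mutual_corr: assumed cut-off, not a value taken from the paper.
    """
    # Rank candidates by how strongly they correlate with the class.
    ranked = sorted(
        signals,
        key=lambda name: abs(pearson(signals[name], labels)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        # Keep a candidate only if it is weakly correlated with every
        # subgraph already kept; otherwise it is treated as redundant.
        if all(
            abs(pearson(signals[name], signals[k])) <= max_mutual_corr
            for k in kept
        ):
            kept.append(name)
    return kept


# Toy data: "T-A" and "T-A-B" carry nearly the same information.
labels = [1, 1, 0, 0, 1, 0]
signals = {
    "T-A": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "T-A-B": [0.8, 0.9, 0.1, 0.2, 0.8, 0.2],
    "T-C": [0.6, 0.3, 0.5, 0.2, 0.4, 0.1],
}
print(select_subgraphs(signals, labels))
```

On this toy input, only one of the two near-duplicate subgraphs survives, so the relations reachable only through the discarded subgraph would not need to be loaded for training.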

Notes

Further discussion of all the resulting join paths for this database is presented in Example 1 in this section.

Author information

Authors and Affiliations

National Research Council of Canada, 1200 Montreal Road, Ottawa, ON, K1A 0R6, Canada
Hongyu Guo & Eric Paquet
School of Electrical Engineering and Computer Science, University of Ottawa, 800 King Edward Avenue, Ottawa, ON, K1N 6N5, Canada
Herna L. Viktor & Eric Paquet

Corresponding author

Correspondence to Hongyu Guo.

About this article

Cite this article

Guo, H., Viktor, H.L. & Paquet, E. Reducing the size of databases for multirelational classification: a subgraph-based approach. J Intell Inf Syst 40, 349–374 (2013). https://doi.org/10.1007/s10844-012-0229-0

Received: 14 March 2012
Revised: 30 October 2012
Accepted: 06 November 2012
Published: 29 November 2012
Issue Date: April 2013
DOI: https://doi.org/10.1007/s10844-012-0229-0
