article

Free access

Classification in Networked Data: A Toolkit and a Univariate Case Study

Authors:

Sofus A. Macskassy,

Foster ProvostAuthors Info & Claims

The Journal of Machine Learning Research, Volume 8

Pages 935 - 983

Published: 01 May 2007 Publication History

Abstract

This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case-study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well---well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes---that is, Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection.

References

[1]

J. C. Almack. The influence of intelligence on the selection of associates. School and Society, 16: 529-530, 1922.

[2]

A. Bernstein, S. Clearwater, and F. Provost. The relational vector-space model and industry classification. In Proceedings of the Learning Statistical Models from Relational Data Workshop at the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI), 2003.

[3]

J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, 36(2):192-236, 1974.

[4]

J. Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179-195, 1975.

[5]

J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48(3): 259-302, 1986.

[6]

P. M. Blau. Inequality and Heterogeneity: A Primitive Theory of Social Structure. New York: Free Press, 1977.

[7]

A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 19-26, 2001.

Digital Library

[8]

A. Blum, J. Lafferty, R. Reddy, and M. R. Rwebangira. Semi-supervised learning using randomized mincuts. In Proceedings of the Twentyfirst International Conference on Machine Learning (ICML), pages 97-104, 2004.

Digital Library

[9]

H. Bott. Observation of play activities in a nursery school. Genetic Psychology Monographs, 4: 44-88, 1928.

[10]

Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(11):1222-1239, 2001.

Digital Library

[11]

S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 307-319, 1998.

Digital Library

[12]

C. Cortes, D. Pregibon, and C. T. Volinsky. Communities of interest. In Proceedings of the Fourth International Conference on Advances in Intelligent Data Analysis (IDA), pages 105-114, 2001.

Digital Library

[13]

R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic networks and expert systems. Springer, 1999.

Digital Library

[14]

M. Craven, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and C. Y. Quek. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), pages 509-516, 1998.

Digital Library

[15]

L. De Raedt, H. Blockeel, L. Dehaspe, and W. Van Laer. Three companions for data mining in first order logic. In S. Dzeroski and N. Lavrac, editors, Relational Data Mining, pages 105-139. Berlin; New York: Springer, 2001.

Digital Library

[16]

I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269-274, 2001.

Digital Library

[17]

R. L. Dobrushin. The description of a random field by means of conditional probabilities and conditions of its regularity. Theory of Probability and its Applications, 13(2):197-224, 1968.

[18]

P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57-66, 2001.

Digital Library

[19]

S. Dzeroski and N. Lavrac. Relational Data Mining. Berlin; New York: Springer, 2001.

Digital Library

[20]

T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 3: 291-316, 1997.

Digital Library

[21]

T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62, 1999.

Digital Library

[22]

P. A. Flach and N. Lachiche. Naive Bayesian classification of structured data. Machine Learning, 57:233-269, 2004.

Digital Library

[23]

N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1300-1309, 1999.

Digital Library

[24]

A. Galstyan and P. Cohen. Is guilt by association a bad thing? In Proceedings of the First International Conference on Intelligence Analysis (IA), 2005.

[25]

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 6:721-741, 1984.

Digital Library

[26]

W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, 1995.

[27]

D. Greig, B. Porteous, and A. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, 51(2):271-279, 1989.

[28]

D. D. Heckathorn and J. Jeffri. Jazz networks: Using respondent-driven sampling to study stratification in two jazz musician communities. Unpublished paper presented at American Sociological Association Annual Meeting, August 2003.

[29]

D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research (JMLR), 1:49-75, 2000.

Digital Library

[30]

S. Hill, F. Provost, and C. Volinsky. Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, 22(2):256-276, 2006a.

[31]

S. Hill, D. K. Agarwal, R. Bell, and C. Volinsky. Building an effective representation for dynamic networks. Journal of Computational & Graphical Statistics, 15(3):584-608, 2006b.

[32]

G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1: Foundations:282-317, 1986.

Digital Library

[33]

J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8): 2554-2558, 1982.

[34]

Z. Huang, H. Chen, and D. Zeng. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1): 116-142, 2004.

Digital Library

[35]

R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(3):267-287, 1983.

Digital Library

[36]

E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift f. Physik, 31:253-258, 1925. {German}.

[37]

D. Jensen and J. Neville. Data mining in social networks. In National Academy of Sciences Symposium on Dynamic Social Network Modeling and Analysis, 2002a.

[38]

D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML), pages 259-266, 2002b.

Digital Library

[39]

D. Jensen, J. Neville, and B. Gallagher. Why collective inference improves relational classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 593-598, 2004.

Digital Library

[40]

T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 290-297, 2003.

[41]

J. Kleinberg and É. Tardos. Approximation algorithms for classification problems with pairwise relations: Metric labeling and Markov random fields. In Proceedings of the Fortieth Annual Symposium on Foundations of Computer Science (FOCS), pages 14-23, 1999.

Digital Library

[42]

R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2): 273-324, 1997.

Digital Library

[43]

D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), pages 580-587, 1998.

Digital Library

[44]

S. Kramer, N. Lavrac, and P. Flach. Propositionalization approaches to relational data mining. In S. Dzeroski and N. Lavrac, editors, Relational Data Mining, pages 262-291. Berlin; New York: Springer, 2001.

Digital Library

[45]

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282-289, 2001.

Digital Library

[46]

P. Langley. Crafting papers on machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 1207-1212, 2000.

Digital Library

[47]

P. Lazarsfeld and R. K. Merton. Friendship as a social process: A substantive and methodological analysis. In M. Berger, T. Abel, and C. H. Page, editors, Freedom and Control in Modern Society, pages 18-66. Van Nostrand, 1954.

[48]

C. P. Loomis. Political and occupational cleavages in a Hanoverian village. Sociometry, 9:316- 3333, 1946.

[49]

Q. Lu and L. Getoor. Link-based classification. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 496-503, 2003.

[50]

S. A. Macskassy and F. Provost. A simple relational classifier. In Proceedings of the Multi-Relational Data Mining Workshop (MRDM) at the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[51]

S. A. Macskassy and F. Provost. Suspicion scoring based on guilt-by-association, collective inference, and focused data access. In Proceedings of the First International Conference on Intelligence Analysis (IA), 2005.

[52]

A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127-163, 2000.

Digital Library

[53]

M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27:415-444, 2001.

[54]

K. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief-propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 467-475, 1999.

Digital Library

[55]

J. Neville and D. Jensen. Iterative classification in relational data. In Proceedings of the Workshop on Learning Statistical Models from Relational Data at the Seventeenth National Conference on Artificial Intelligence (AAAI), pages 13-20, 2000.

[56]

J. Neville and D. Jensen. Collective classification with relational dependency networks. In Proceedings of the Multi-Relational Data Mining Workshop (MRDM) at the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[57]

J. Neville and D. Jensen. Dependency networks for relational data. In Proceedings of the Fourth IEEE International Conference in Data Mining (ICDM), pages 170-177, 2004.

Digital Library

[58]

J. Neville and D. Jensen. Leveraging relational autocorrelation with latent group models. In Proceedings of the Fifth IEEE International Conference in Data Mining (ICDM), pages 322-329, 2005.

Digital Library

[59]

J. Neville and D. Jensen. Relational dependency networks. Journal of Machine Learning Research (JMLR), 8(Mar):653-692, 2007.

Digital Library

[60]

J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning relational probability trees. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 625-630, 2003.

Digital Library

[61]

J. Neville, Ö. Simsek, D. Jensen, J. Komoroske, K. Palmer, and H. Goldberg. Using relational knowledge discovery to prevent securities fraud. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 449-458, 2005.

Digital Library

[62]

M. E. J. Newman. Mixing patterns in networks. Physical Review E, 67, 2003. 026126.

[63]

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

Digital Library

[64]

C. Perlich. Citation-based document classification. In Workshop on Information Technology and Systems (WITS), 2003.

[65]

C. Perlich and F. Provost. Aggregation-based feature invention and relational concept classes. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 167-176, 2003.

Digital Library

[66]

C. Perlich and F. Provost. Distribution-based aggregation for relational learning with identifier attributes. Machine Learning, 62(1/2):65-105, 2006.

Digital Library

[67]

C. Perlich, F. Provost, and J. Simonoff. Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research (JMLR), 4:211-255, 2003.

Digital Library

[68]

A. Popescul and L. H. Ungar. Statistical relational learning for link prediction. In Proceedings of the Learning Statistical Models from Relational Data Workshop at the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI), 2003.

[69]

R. B. Potts. Some generalized order-disorder transformations. Cambridge Philosophic Society, 48: 106-109, 1952.

[70]

H. M. Richardson. Community of values as a factor in friendships of college and adult women. Journal of Social Psychology, 11:303-312, 1940.

[71]

M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1/2):107-136, 2006.

Digital Library

[72]

J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, chapter 14, pages 313-323. Prentice-Hall, 1971.

[73]

A. Rosenfeld, R. Hummel, and S. Zucker. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics, 6:420-433, 1976.

[74]

L. J. Savage. The Foundations of Statistics. John Wiley and Sons, 1954.

[75]

E. Segal, H. Wang, and D. Koller. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 19:I264-I272, Jul 2003a.

[76]

E. Segal, R. Yelensky, and D. Koller. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19:I273-I282, Jul 2003b.

[77]

B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 485-492, 2002.

Digital Library

[78]

B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. In Proceedings of the Seventeenthth International Joint Conference on Artificial Intelligence (IJCAI) , pages 870-878, 2001.

Digital Library

[79]

B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In Proceedings of the Twentyfirst International Conference on Machine Learning (ICML), 2004.

Digital Library

[80]

K. Tumulty. Inside Bush's Secret Spy Net {electronic version}. Time, 167(21), 2006. Retrieved May 25, 2006, from http://www.time.com/time/archive/preview/0,10987,1194021, 00.html.

[81]

V. N. Vapnik. The support vector method of function estimation. In J. Suykens and J. Vandewalle, editors, Nonlinear Modeling: Advanced Black-Box Techniques, pages 55-86. Kluwer, Boston, 1998a.

[82]

V. N. Vapnik. Statistical Learning Theory. John Wiley, NY, 1998b.

[83]

M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, University of California, Berkeley, 2003.

[84]

G. Winkler. Image Analysis, Random Fields and Markov Chain Monte Carlo Methods. Springer-Verlag, 2nd edition, 2003.

Digital Library

[85]

I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools with Java Implementations . Morgan Kaufmann, San Francisco, 2000.

Digital Library

[86]

X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 912-919, 2003.

Cited By

Tomčić ASavić MRadovanović M(2024)Hub‐aware random walk graph embedding methods for classificationStatistical Analysis and Data Mining10.1002/sam.1167617:2Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.1002/sam.11676
Bhavsar SPatil VPatil A(2022)Graph partitioning and visualization in graph mining: a surveyMultimedia Tools and Applications10.1007/s11042-022-13017-581:30(43315-43356)Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.1007/s11042-022-13017-5
Saha TRangwala HDomeniconi C(2022)FLIP: Active Learning for Relational Network ClassificationMachine Learning and Knowledge Discovery in Databases10.1007/978-3-662-44845-8_1(1-18)Online publication date: 10-Mar-2022
https://dl.acm.org/doi/10.1007/978-3-662-44845-8_1
Show More Cited By

Index Terms

Classification in Networked Data: A Toolkit and a Univariate Case Study

Recommendations

Classification in Networked Data: A Toolkit and a Univariate Case Study

This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case-study of its application to ...
Unsupervised Ensemble Classification With Sequential and Networked Data
Ensemble learning, the machine learning paradigm where multiple models are combined, has exhibited promising perfomance in a variety of tasks. The present work focuses on unsupervised ensemble classification. The term unsupervised refers to the ensemble ...
Modified nearest neighbour classifier for hyperspectral data classification

A modified k-nearest neighbour k-NN classifier is proposed for supervised remote sensing classification of hyperspectral data. To compare its performance in terms of classification accuracy and computational cost, k-NN and a back-propagation neural ...

Comments

Information & Contributors

Information

Published In

cover image The Journal of Machine Learning Research

The Journal of Machine Learning Research Volume 8, Issue

5/1/2007

1150 pages

ISSN:1532-4435

EISSN:1533-7928

Issue’s Table of Contents

Publisher

JMLR.org

Publication History

Published: 01 May 2007

Published in JMLR Volume 8

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

146
Total Citations
View Citations
1,438
Total Downloads

Downloads (Last 12 months)244
Downloads (Last 6 weeks)22

Reflects downloads up to 26 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Tomčić ASavić MRadovanović M(2024)Hub‐aware random walk graph embedding methods for classificationStatistical Analysis and Data Mining10.1002/sam.1167617:2Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.1002/sam.11676
Bhavsar SPatil VPatil A(2022)Graph partitioning and visualization in graph mining: a surveyMultimedia Tools and Applications10.1007/s11042-022-13017-581:30(43315-43356)Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.1007/s11042-022-13017-5
Saha TRangwala HDomeniconi C(2022)FLIP: Active Learning for Relational Network ClassificationMachine Learning and Knowledge Discovery in Databases10.1007/978-3-662-44845-8_1(1-18)Online publication date: 10-Mar-2022
https://dl.acm.org/doi/10.1007/978-3-662-44845-8_1
Stankova MPraet SMartens DProvost F(2021)Node classification over bipartite graphs through projectionMachine Language10.1007/s10994-020-05898-0110:1(37-87)Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.1007/s10994-020-05898-0
Lebichot BSaerens M(2020)An experimental study of graph-based semi-supervised classification with additional node informationKnowledge and Information Systems10.1007/s10115-020-01500-062:11(4337-4371)Online publication date: 1-Nov-2020
https://dl.acm.org/doi/10.1007/s10115-020-01500-0
Baldi MFersini EMessina E(2020)Relational Bayesian Model Averaging for Sentiment Analysis in Social NetworksMachine Learning, Optimization, and Data Science10.1007/978-3-030-64583-0_27(285-296)Online publication date: 19-Jul-2020
https://dl.acm.org/doi/10.1007/978-3-030-64583-0_27
Yu SVorobeychik YElkind EVeloso MAgmon NTaylor M(2019)Removing Malicious Nodes from NetworksProceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems10.5555/3306127.3331708(314-322)Online publication date: 8-May-2019
https://dl.acm.org/doi/10.5555/3306127.3331708
Chin AChen YM. Altenburger KUgander J(2019)Decoupled Smoothing on GraphsThe World Wide Web Conference10.1145/3308558.3313748(263-272)Online publication date: 13-May-2019
https://dl.acm.org/doi/10.1145/3308558.3313748
Awal GBharadwaj K(2019)Leveraging collective intelligence for behavioral prediction in signed social networks through evolutionary approachInformation Systems Frontiers10.1007/s10796-017-9760-421:2(417-439)Online publication date: 1-Apr-2019
https://dl.acm.org/doi/10.1007/s10796-017-9760-4
Mao WCai ZZeng BGuan X(2019)Learning edge weights in file co-occurrence graphs for malware detectionData Mining and Knowledge Discovery10.1007/s10618-018-0593-733:1(168-203)Online publication date: 1-Jan-2019
https://dl.acm.org/doi/10.1007/s10618-018-0593-7
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Figures

Tables

Media

View Issue’s Table of Contents