research-article

Unbiased Characterization of Node Pairs over Large Graphs

Published: 01 April 2015

Abstract

Characterizing user pair relationships is important for applications such as friend recommendation and interest targeting in online social networks (OSNs). Due to the large-scale nature of such networks, it is infeasible to enumerate all user pairs and thus sampling is used. In this article, we show that it is a great challenge for OSN service providers to characterize user pair relationships, even when they possess the complete graph topology. The reason is that when sampling techniques (i.e., uniform vertex sampling (UVS) and random walk (RW)) are naively applied, they can introduce large biases, particularly for estimating similarity distribution of user pairs with constraints like existence of mutual neighbors, which is important for applications such as identifying network homophily. Estimating statistics of user pairs is more challenging in the absence of the complete topology information, as an unbiased sampling technique like UVS is usually not allowed and exploring the OSN graph topology is expensive. To address these challenges, we present unbiased sampling methods to characterize user pair properties based on UVS and RW techniques. We carry out an evaluation of our methods to show their accuracy and efficiency. Finally, we apply our methods to three OSNs—Foursquare, Douban, and Xiami—and discover that significant homophily is present in these networks.
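The bias that the abstract attributes to naively applied random walks can be illustrated at the level of single nodes. The sketch below is not the article's pair estimator. It is a minimal, generic illustration that assumes a hypothetical six-node toy graph, and the helper names (`build_graph`, `random_walk`, `naive_mean`, `reweighted_mean`) are made up for this example. A simple random walk visits a node with probability proportional to its degree, so an unweighted average over the visited nodes overestimates the mean degree, while weighting each sample by the inverse of its degree removes that bias.

```python
import random
from collections import defaultdict


def build_graph(edges):
    """Return an undirected adjacency list built from an edge list."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return dict(adj)


def random_walk(adj, start, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at each step."""
    node, visited = start, []
    for _ in range(steps):
        node = rng.choice(adj[node])
        visited.append(node)
    return visited


def naive_mean(samples, f):
    """Unweighted average of f over the visited nodes (biased toward hubs)."""
    return sum(f(v) for v in samples) / len(samples)


def reweighted_mean(samples, f, adj):
    """Weight each sample by 1/degree to undo the walk's degree bias."""
    num = sum(f(v) / len(adj[v]) for v in samples)
    den = sum(1.0 / len(adj[v]) for v in samples)
    return num / den


# Hypothetical toy graph: hub 0 linked to nodes 1..5, plus the edge 1-2.
# Degrees are 5, 2, 2, 1, 1, 1, so the true mean degree is 12 / 6 = 2.0.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2)]
adj = build_graph(edges)


def degree(v):
    return len(adj[v])


rng = random.Random(0)  # fixed seed so repeated runs give the same walk
samples = random_walk(adj, start=0, steps=100_000, rng=rng)

true_mean = sum(degree(v) for v in adj) / len(adj)
naive = naive_mean(samples, degree)
corrected = reweighted_mean(samples, degree, adj)

print(f"true mean degree:   {true_mean:.3f}")
print(f"naive RW estimate:  {naive:.3f}")
print(f"reweighted RW est.: {corrected:.3f}")
```

On this toy graph the walk's stationary distribution gives the naive estimate an expected value of 36 / 12 = 3.0 (the sum of squared degrees over the sum of degrees), while the reweighted estimate converges to the true value of 2.0. Characterizing node pairs, especially pairs constrained to share mutual neighbors, requires a more careful treatment than this node-level correction.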

    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 9, Issue 3
    TKDD Special Issue (SIGKDD'13)
    April 2015
    313 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/2737800
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 April 2015
    Accepted: 01 September 2014
    Revised: 01 May 2014
    Received: 01 February 2014
    Published in TKDD Volume 9, Issue 3

    Author Tags

    1. Social network
    2. graph sampling
    3. homophily
    4. random walks

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • NSF
    • 863 High Tech Development Plan (2012AA011003)
    • Application Foundation Research Program of SuZhou (SYG201311)
    • ARL Cooperative Agreement W911NF-09-2-0053
    • 111 International Collaboration Program of China
    • ARO under MURI W911NF-08-1-0233
    • National Natural Science Foundation of China
    • Prospective Research Project on Future Networks of Jiangsu Future Networks Innovation Institute

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 3
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 10 Oct 2024

    Cited By
    • (2023) ALEC. Computers in Biology and Medicine 158:C. DOI: 10.1016/j.compbiomed.2023.106841. Online publication date: 1-May-2023
    • (2022) An Active Learning Algorithm Based on the Distribution Principle of Bhattacharyya Distance. Mathematics 10:11 (1927). DOI: 10.3390/math10111927. Online publication date: 4-Jun-2022
    • (2021) Estimating Distributions of Large Graphs from Incomplete Sampled Data. 2021 IFIP Networking Conference (IFIP Networking) (1-9). DOI: 10.23919/IFIPNetworking52078.2021.9472848. Online publication date: 21-Jun-2021
    • (2020) Trapping Malicious Crawlers in Social Networks. Proceedings of the 29th ACM International Conference on Information & Knowledge Management (775-784). DOI: 10.1145/3340531.3412004. Online publication date: 19-Oct-2020
    • (2016) Information-agnostic coflow scheduling with optimal demotion thresholds. 2016 IEEE International Conference on Communications (ICC) (1-6). DOI: 10.1109/ICC.2016.7511241. Online publication date: May-2016
    • (2016) Accelerating graph mining algorithms via uniform random edge sampling. 2016 IEEE International Conference on Communications (ICC) (1-6). DOI: 10.1109/ICC.2016.7511156. Online publication date: May-2016
