DOI: 10.1145/2623330.2623757
research-article

Graph sample and hold: a framework for big-graph analytics

Published: 24 August 2014
  • Abstract

    Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g. web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy.
    While previous work focused particularly on sampling schemes to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH), which samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state in memory. We use a Horvitz-Thompson construction in conjunction with a scheme that samples arriving edges without adjacencies to previously sampled edges with probability p and holds edges with adjacencies with probability q. Our sample and hold framework facilitates the accurate estimation of subgraph patterns by enabling the dependence of the sampling process to vary based on previous history. Within our framework, we show how to produce statistically unbiased estimators for various graph properties from the sample. Given that the graph analytics will run on a sample instead of the whole population, the runtime complexity is kept under control. Moreover, given that the estimators are unbiased, the approximation error is also kept under control. Finally, we test the performance of the proposed framework (gSH) on various types of graphs, showing that from a sample with ≤ 40K edges, it produces estimates with relative errors < 1%.
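
    The sampling rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `gsh_sample` and `estimate_edge_count`, the `seed` parameter, and the choice to treat an arriving edge as "adjacent" when it shares an endpoint with an already sampled edge are assumptions made for illustration. Each kept edge stores the probability with which it was selected, so a Horvitz-Thompson style estimate of the total edge count is the sum of inverse selection probabilities.

```python
import random


def gsh_sample(edge_stream, p, q, seed=None):
    """One pass over the stream; returns a list of (u, v, selection_probability).

    Assumption: an arriving edge is 'adjacent' to the sample when at least
    one of its endpoints already belongs to a sampled edge.
    """
    rng = random.Random(seed)
    sampled = []      # kept edges with the probability used to select them
    touched = set()   # endpoints of edges already in the sample
    for u, v in edge_stream:
        # Hold with probability q if adjacent to the sample, else sample with p.
        prob = q if (u in touched or v in touched) else p
        if rng.random() < prob:
            sampled.append((u, v, prob))
            touched.update((u, v))
    return sampled


def estimate_edge_count(sampled):
    """Horvitz-Thompson style estimate: each kept edge counts 1 / probability."""
    return sum(1.0 / prob for _, _, prob in sampled)


if __name__ == "__main__":
    stream = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (1, 4)]
    sample = gsh_sample(stream, p=0.5, q=0.9, seed=7)
    print(len(sample), estimate_edge_count(sample))
```

    Only the sampled edges and their endpoint set are kept in memory, which reflects the small-state, single-pass setting. Estimators for other subgraph patterns such as triangles need products of these per-edge probabilities and are not covered by this sketch.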

    Supplementary Material

    MP4 File (p1446-sidebyside.mp4)

      Published In

      KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2014
      2028 pages
      ISBN:9781450329569
      DOI:10.1145/2623330
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. graph streams
      2. network sampling
      3. statistical estimation

      Qualifiers

      • Research-article

      Conference

      KDD '14
      Acceptance Rates

      KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

      Article Metrics

      • Downloads (last 12 months): 49
      • Downloads (last 6 weeks): 7
      Reflects downloads up to 12 Aug 2024.

      Cited By

      • (2024) Scalable Spatio-Temporal Top-k Interaction Queries on Dynamic Communities. ACM Transactions on Spatial Algorithms and Systems. DOI: 10.1145/3648374. Online publication date: 16-Feb-2024
      • (2024) A spanning tree approach to social network sampling with degree constraints. Social Network Analysis and Mining 14:1. DOI: 10.1007/s13278-024-01247-4. Online publication date: 18-May-2024
      • (2024) Sampling hypergraphs via joint unbiased random walk. World Wide Web 27:2. DOI: 10.1007/s11280-024-01253-8. Online publication date: 19-Feb-2024
      • (2023) Theoretical bounds on the network community profile from low-rank semi-definite programming. Proceedings of the 40th International Conference on Machine Learning (13976-13992). DOI: 10.5555/3618408.3618976. Online publication date: 23-Jul-2023
      • (2023) Triangular Stability Maximization by Influence Spread over Social Networks. Proceedings of the VLDB Endowment 16:11 (2818-2831). DOI: 10.14778/3611479.3611490. Online publication date: 24-Aug-2023
      • (2023) Scalable Approximate Butterfly and Bi-triangle Counting for Large Bipartite Networks. Proceedings of the ACM on Management of Data 1:4 (1-26). DOI: 10.1145/3626753. Online publication date: 12-Dec-2023
      • (2023) Graph-Inceptor: Towards Extreme Data Ingestion, Massive Graph Creation and Storage. Companion of the 2023 ACM/SPEC International Conference on Performance Engineering (253-254). DOI: 10.1145/3578245.3585339. Online publication date: 15-Apr-2023
      • (2023) Efficiently Counting Triangles for Hypergraph Streams by Reservoir-Based Sampling. IEEE Transactions on Knowledge and Data Engineering 35:11 (11328-11341). DOI: 10.1109/TKDE.2023.3236335. Online publication date: 1-Nov-2023
      • (2023) Mining Top-k Frequent Patterns in Large Geosocial Networks: A Mnie-Based Extension Approach. IEEE Access 11 (27662-27675). DOI: 10.1109/ACCESS.2023.3257887. Online publication date: 2023
      • (2023) A distributed streaming framework for edge–cloud triangle counting in graph streams. Knowledge-Based Systems 278:C. DOI: 10.1016/j.knosys.2023.110878. Online publication date: 25-Oct-2023
