DOI: 10.1145/3485447.3512196
research-article

SATMargin: Practical Maximal Frequent Subgraph Mining via Margin Space Sampling

Published: 25 April 2022

    Abstract

    Maximal Frequent Subgraph (MFS) mining aims to identify the maximal subgraphs that appear frequently in a set of graphs, a task that has proved valuable in social science, biology, and other domains. Previous studies focused on reducing the search space of MFSs and identified the theoretically smallest search space. Despite this theoretical success, no practical algorithm can exhaustively search that space, because it is huge even for small graphs with only tens of nodes and hundreds of edges. Moreover, deciding whether a subgraph is an MFS requires solving subgraph monomorphism (SM), an NP-complete problem, which adds a further challenge. Here, we propose SATMargin, a practical MFS mining algorithm that targets large MFSs. SATMargin performs a random walk in the search space to search efficiently and uses a customized conflict-learning Boolean Satisfiability (SAT) algorithm to accelerate SM queries. We design a mechanism that reuses SAT solutions to combine the random walk and the SAT solver effectively. We evaluate SATMargin on synthetic graphs and 6 real-world graph datasets. SATMargin outperforms the baselines, finding more and larger MFSs. We further demonstrate the effectiveness of SATMargin in a case study of RNA graphs: the frequent subgraph identified by SATMargin closely matches the functional core structure of RNAs previously detected in biological experiments. Our software can be found at https://github.com/MuyiLiu2022/SATMargin-and-Baselines.
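
To make the SM-as-SAT idea in the abstract concrete, the following is a minimal sketch of one textbook way to encode a subgraph monomorphism query as CNF and to count how many graphs in a set contain a pattern. It is not the authors' implementation: the function names (`sm_to_cnf`, `dpll`, `find_embedding`, `support`) are hypothetical, the graphs are assumed to be undirected, unlabeled, and free of self-loops, and the solver is a plain DPLL with unit propagation rather than the customized conflict-learning solver the abstract describes.

```python
from itertools import combinations


def sm_to_cnf(pattern_edges, n_p, target_edges, n_t):
    """Encode 'the pattern embeds in the target' (monomorphism) as CNF.

    Variable x(i, a) is true when pattern node i maps to target node a.
    Literals are nonzero ints, DIMACS style: +v is true, -v is false.
    """
    def var(i, a):
        return i * n_t + a + 1

    adjacent = {frozenset(e) for e in target_edges}
    clauses = []
    for i in range(n_p):
        # Each pattern node maps to at least one target node ...
        clauses.append([var(i, a) for a in range(n_t)])
        # ... and to at most one.
        for a, b in combinations(range(n_t), 2):
            clauses.append([-var(i, a), -var(i, b)])
    # Injectivity: two pattern nodes never share a target node.
    for a in range(n_t):
        for i, j in combinations(range(n_p), 2):
            clauses.append([-var(i, a), -var(j, a)])
    # Edge preservation: a pattern edge must land on a target edge.
    for i, j in pattern_edges:
        for a in range(n_t):
            for b in range(n_t):
                if a != b and frozenset((a, b)) not in adjacent:
                    clauses.append([-var(i, a), -var(j, b)])
    return clauses


def dpll(clauses, assignment=frozenset()):
    """Plain DPLL; returns a set of satisfied literals, or None if UNSAT."""
    assignment = set(assignment)
    while True:
        simplified, unit = [], None
        for clause in clauses:
            if any(lit in assignment for lit in clause):
                continue  # clause already satisfied
            rest = [lit for lit in clause if -lit not in assignment]
            if not rest:
                return None  # every literal is false: conflict
            if len(rest) == 1 and unit is None:
                unit = rest[0]
            simplified.append(rest)
        clauses = simplified
        if unit is None:
            break
        assignment.add(unit)  # unit propagation, then simplify again
    if not clauses:
        return assignment
    literal = clauses[0][0]
    for choice in (literal, -literal):  # branch on one literal
        result = dpll(clauses, assignment | {choice})
        if result is not None:
            return result
    return None


def find_embedding(pattern_edges, n_p, target_edges, n_t):
    """Return {pattern node: target node}, or None if no embedding exists."""
    model = dpll(sm_to_cnf(pattern_edges, n_p, target_edges, n_t))
    if model is None:
        return None
    return {(lit - 1) // n_t: (lit - 1) % n_t for lit in model if lit > 0}


def support(pattern_edges, n_p, graphs):
    """Count the graphs, given as (edges, node_count), containing the pattern."""
    return sum(
        find_embedding(pattern_edges, n_p, edges, n) is not None
        for edges, n in graphs
    )
```

Under these assumptions, a pattern is frequent when `support` reaches a chosen threshold, and it is maximal when no frequent pattern strictly contains it. Each `support` call issues one SM query per graph, which is why accelerating those queries matters.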


    Cited By

    • (2024) FSM-BC-BSP: Frequent Subgraph Mining Algorithm Based on BC-BSP. Applied Sciences 14:8 (3154). DOI: 10.3390/app14083154. Online publication date: 9-Apr-2024
    • (2024) Efficient Maximal Temporal Plex Enumeration. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3098-3110. DOI: 10.1109/ICDE60146.2024.00240. Online publication date: 13-May-2024
    • (2023) HardSATGEN: Understanding the Difficulty of Hard SAT Formula Generation and A Strong Structure-Hardness-Aware Baseline. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4414-4425. DOI: 10.1145/3580305.3599837. Online publication date: 6-Aug-2023
    • (2022) A Frequent Subgraph Publishing Algorithm Based on Differential Privacy. 2022 International Conference on Cloud Computing, Big Data Applications and Software Engineering (CBASE), 136-141. DOI: 10.1109/CBASE57816.2022.00032. Online publication date: Sep-2022

        Published In

        WWW '22: Proceedings of the ACM Web Conference 2022
        April 2022
        3764 pages
        ISBN:9781450390965
        DOI:10.1145/3485447
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Boolean Satisfiability
        2. Maximal Frequent Subgraph Mining

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        WWW '22
        WWW '22: The ACM Web Conference 2022
        April 25 - 29, 2022
        Virtual Event, Lyon, France

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
