
Interactive Discovery of Coordinated Relationship Chains with Maximum Entropy Models

Published: 31 January 2018

Abstract

Modern visual analytic tools promote human-in-the-loop analysis but are limited in their ability to direct the user toward interesting and promising directions of study. This problem is especially acute when the analysis task is exploratory in nature, e.g., the discovery of potentially coordinated relationships in massive text datasets. Such tasks are very common in domains like intelligence analysis and security forensics, where the goal is to uncover surprising coalitions bridging multiple types of relations. We introduce new maximum entropy models to discover surprising chains of relationships, leveraging count data about entity occurrences in documents. These models are embedded in a visual analytic system called MERCER (Maximum Entropy Relational Chain ExploRer) that treats relationship bundles as first-class objects and directs the user toward promising lines of inquiry. We demonstrate how user input can judiciously direct analysis toward valid conclusions, whereas a purely algorithmic approach could be led astray. Experimental results on both synthetic and real datasets from the intelligence community are presented.
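
To make the idea of scoring "surprising" structure against a maximum entropy model concrete, the following is a minimal sketch and not the models introduced in the article. It assumes a much simpler, standard setting: a binary entity-by-document occurrence matrix whose only background knowledge is its row and column sums. Under those constraints the maximum entropy distribution factorizes into independent cells with probability sigmoid(a[i] + b[j]). The function names, the toy matrix, and the step size are all hypothetical choices made for illustration.

```python
import math


def fit_margin_maxent(matrix, iterations=2000, step=0.1):
    """Fit the maximum entropy model whose expected row and column sums
    match those of a binary matrix (an assumed, simplified background model).

    The solution has independent cells with p[i][j] = sigmoid(a[i] + b[j]);
    a and b are found by plain gradient ascent on the concave log-likelihood.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(row[j] for row in matrix) for j in range(n_cols)]
    a = [0.0] * n_rows
    b = [0.0] * n_cols

    def cell_probs():
        return [[1.0 / (1.0 + math.exp(-(a[i] + b[j]))) for j in range(n_cols)]
                for i in range(n_rows)]

    for _ in range(iterations):
        p = cell_probs()
        # Gradient = observed margin minus expected margin under the model.
        for i in range(n_rows):
            a[i] += step * (row_sums[i] - sum(p[i]))
        for j in range(n_cols):
            b[j] += step * (col_sums[j] - sum(p[i][j] for i in range(n_rows)))
    return cell_probs()


def tile_surprisal_bits(p, rows, cols):
    """Information content (in bits) of observing all ones in rows x cols."""
    return -sum(math.log2(p[i][j]) for i in rows for j in cols)


# Hypothetical toy data: rows are entities, columns are documents.
occurrences = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
probs = fit_margin_maxent(occurrences)
score_big = tile_surprisal_bits(probs, rows=[0, 1], cols=[0, 1])
score_small = tile_surprisal_bits(probs, rows=[2, 3], cols=[2])
print(score_big, score_small)
```

In this toy example the 2x2 block of co-occurrences scores about 2.83 bits against 2 bits for the single-document block, so a ranking by surprisal would surface the larger block first. An interactive system could then add a pattern the user has already seen as a further constraint and refit, so that known structure stops looking surprising; that refitting step is not shown here.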


Published In

ACM Transactions on Knowledge Discovery from Data, Volume 12, Issue 1
Special Issue (IDEA) and Regular Papers
February 2018
363 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3178542
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 January 2018
Accepted: 01 January 2017
Revised: 01 January 2017
Received: 01 December 2015
Published in TKDD Volume 12, Issue 1

Author Tags

  1. Maximum entropy models
  2. interactive visual data exploration
  3. multi-relational pattern mining

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (last 12 months): 78
  • Downloads (last 6 weeks): 15
Reflects downloads up to 13 Jan 2025

Cited By

  • (2023) The State of the Art in Visualizing Dynamic Multivariate Networks. Computer Graphics Forum 42:3 (471-490). DOI: 10.1111/cgf.14856. Online publication date: 27-Jun-2023
  • (2023) ChemoGraph: Interactive Visual Exploration of the Chemical Space. Computer Graphics Forum 42:3 (13-24). DOI: 10.1111/cgf.14807. Online publication date: 27-Jun-2023
  • (2022) Towards Systematic Design Considerations for Visualizing Cross-View Data Relationships. IEEE Transactions on Visualization and Computer Graphics 28:12 (4741-4756). DOI: 10.1109/TVCG.2021.3102966. Online publication date: 1-Dec-2022
  • (2022) Understanding Missing Links in Bipartite Networks With MissBiN. IEEE Transactions on Visualization and Computer Graphics 28:6 (2457-2469). DOI: 10.1109/TVCG.2020.3032984. Online publication date: 1-Jun-2022
  • (2022) Multi-users interaction anomalous subgraph detection for event mining. Neurocomputing 509 (34-45). DOI: 10.1016/j.neucom.2022.08.072. Online publication date: Oct-2022
  • (2022) IISD: Integrated interaction subgraph detection for event mining. Knowledge-Based Systems (108080). DOI: 10.1016/j.knosys.2021.108080. Online publication date: Jan-2022
  • (2021) SightBi: Exploring Cross-View Data Relationships with Biclusters. IEEE Transactions on Visualization and Computer Graphics 28:1 (54-64). DOI: 10.1109/TVCG.2021.3114801. Online publication date: 24-Dec-2021
  • (2019) Interactive Bicluster Aggregation in Bipartite Graphs. 2019 IEEE Visualization Conference (VIS) (246-250). DOI: 10.1109/VISUAL.2019.8933546. Online publication date: Oct-2019
  • (2019) Subjectively interesting connecting trees and forests. Data Mining and Knowledge Discovery 33:4 (1088-1124). DOI: 10.1007/s10618-019-00627-1. Online publication date: 1-Jul-2019
