-
Linear Time Visualization and Search in Big Data using Pixellated Factor Space Mapping
Abstract: It is demonstrated how linear computational time and storage efficient approaches can be adopted when analyzing very large data sets. More importantly, interpretation is aided and furthermore, basic processing is easily supported. Such basic processing can be the use of supplementary, i.e. contextual, elements, or particular associations. Furthermore pixellated grid cell contents can be utilized a…
Submitted 27 February, 2019; originally announced February 2019.
Comments: 12 pages, 4 figures. From IFCS 2017 Conference, Tokyo, Japan
MSC Class: 62-07; 62-09; 60E99 ACM Class: G.3; H.2.8
-
Core Conflictual Relationship: Text Mining to Discover What and When
Abstract: Following detailed presentation of the Core Conflictual Relationship Theme (CCRT), there is the objective of relevant methods for what has been described as verbalization and visualization of data. Such is also termed data mining and text mining, and knowledge discovery in data. The Correspondence Analysis methodology, also termed Geometric Data Analysis, is shown in a case study to be comprehensi…
Submitted 28 May, 2018; originally announced May 2018.
Comments: 25 pages, 10 figures
-
The Geometry and Topology of Data and Information for Analytics of Processes and Behaviours: Building on Bourdieu and Addressing New Societal Challenges
Abstract: We begin by summarizing the relevance and importance of inductive analytics based on the geometry and topology of data and information. Contemporary issues are then discussed. These include how sampling data for representativity is increasingly to be questioned. While we can always avail of analytics from a "bag of tools and techniques", in the application of machine learning and predictive analyt…
Submitted 15 May, 2017; originally announced May 2017.
Comments: 16 pages, 7 figures
MSC Class: 62H25; 62P25 ACM Class: G.3; I.5.1
-
Massive Data Clustering in Moderate Dimensions from the Dual Spaces of Observation and Attribute Data Clouds
Abstract: Cluster analysis of very high dimensional data can benefit from the properties of such high dimensionality. Informally expressed, in this work, our focus is on the analogous situation when the dimensionality is moderate to small, relative to a massively sized set of observations. Mathematically expressed, these are the dual spaces of observations and attributes. The point cloud of observations is…
Submitted 6 April, 2017; originally announced April 2017.
Comments: 17 pages, 2 figures
MSC Class: 62H30; 91C20 ACM Class: H.3.3; I.5.3
-
Hierarchical Matching and Regression with Application to Photometric Redshift Estimation
Abstract: This work emphasizes that heterogeneity, diversity, discontinuity, and discreteness in data is to be exploited in classification and regression problems. A global a priori model may not be desirable. For data analytics in cosmology, this is motivated by the variety of cosmological objects such as elliptical, spiral, active, and merging galaxies at a wide range of redshifts. Our aim is matching and…
Submitted 12 December, 2016; originally announced December 2016.
Comments: 15 pages, 6 figures, 3 tables
MSC Class: 11Y35; 85-08; 62H30 ACM Class: I.5.3; H.3.3; G.3; J.2
Journal ref: Astroinformatics, Proceedings of the International Astronomical Union, Vol. 12, Issue S325, pp. 145-155, 2016
-
Contextualizing Geometric Data Analysis and Related Data Analytics: A Virtual Microscope for Big Data Analytics
Abstract: The relevance and importance of contextualizing data analytics is described. Qualitative characteristics might form the context of quantitative analysis. Topics that are at issue include: contrast, baselining, secondary data sources, supplementary data sources, dynamic and heterogeneous data. In geometric data analysis, especially with the Correspondence Analysis platform, various case studies are…
Submitted 15 September, 2017; v1 submitted 29 November, 2016; originally announced November 2016.
Comments: 19 pages, 8 figures, 2 tables, Journal of Interdisciplinary Methodologies and Issues in Science, vol. 3, 2017. This version contains DOI, ISSN
MSC Class: 62H30; 68P01; 62-07 ACM Class: G.3; H.2.8; I.2.1
Journal ref: Journal of Interdisciplinary Methodologies and Issues in Sciences (September 19, 2017) jimis:2570
-
Qualitative Judgement of Research Impact: Domain Taxonomy as a Fundamental Framework for Judgement of the Quality of Research
Abstract: The appeal of metric evaluation of research impact has attracted considerable interest in recent times. Although the public at large and administrative bodies are much interested in the idea, scientists and other researchers are much more cautious, insisting that metrics are but an auxiliary instrument to the qualitative peer-based judgement. The goal of this article is to propose availing of such…
Submitted 8 April, 2018; v1 submitted 11 July, 2016; originally announced July 2016.
Comments: 22 pages, 7 figures, Journal of Classification, Online First, March 25, 2018
MSC Class: 68P01 ACM Class: H.0, I.5.3, G.3
-
Sparse p-Adic Data Coding for Computationally Efficient and Effective Big Data Analytics
Abstract: We develop the theory and practical implementation of p-adic sparse coding of data. Rather than the standard, sparsifying criterion that uses the $L_0$ pseudo-norm, we use the p-adic norm. We require that the hierarchy or tree be node-ranked, as is standard practice in agglomerative and other hierarchical clustering, but not necessarily with decision trees. In order to structure the data, all comp…
Submitted 23 April, 2016; originally announced April 2016.
Comments: 20 pages, 6 figures
MSC Class: 94B27; 62H30; 68P01 ACM Class: E.2; E.4; G.2.2; H.3.3
Journal ref: p-Adic Numbers, Ultrametric Analysis and Applications, 8(3), 2016, pp. 236-247
-
arXiv:1604.06952
Visualization of Jacques Lacan's Registers of the Psychoanalytic Field, and Discovery of Metaphor and of Metonymy. Analytical Case Study of Edgar Allan Poe's "The Purloined Letter"
Abstract: We start with a description of Lacan's work that we then take into our analytics methodology. In a first investigation, a Lacan-motivated template of the Poe story is fitted to the data. A segmentation of the storyline is used in order to map out the diachrony. Based on this, it will be shown how synchronous aspects, potentially related to Lacanian registers, can be sought. This demonstrates the e…
Submitted 30 January, 2017; v1 submitted 23 April, 2016; originally announced April 2016.
Comments: 34 pages, 9 figures
MSC Class: 62H25; 62H30 ACM Class: I.5.3; I.5.4; I.2; G.2.2; G.3
-
Big Data Scaling through Metric Mapping: Exploiting the Remarkable Simplicity of Very High Dimensional Spaces using Correspondence Analysis
Abstract: We present new findings in regard to data analysis in very high dimensional spaces. We use dimensionalities up to around one million. A particular benefit of Correspondence Analysis is its suitability for carrying out an orthonormal mapping, or scaling, of power law distributed data. Power law distributed data are found in many domains. Correspondence factor analysis provides a latent semantic or…
Submitted 13 December, 2015; originally announced December 2015.
Comments: 13 pages, 3 figures
MSC Class: 62H25 ACM Class: E.0; G.3; H.3.3; I.5
-
Correspondence Factor Analysis of Big Data Sets: A Case Study of 30 Million Words; and Contrasting Analytics using Apache Solr and Correspondence Analysis in R
Abstract: We consider a large number of text data sets. These are cooking recipes. Term distribution and other distributional properties of the data are investigated. Our aim is to look at various analytical approaches which allow for mining of information on both high and low detail scales. Metric space embedding is fundamental to our interest in the semantic properties of this data. We consider the projec…
Submitted 6 July, 2015; originally announced July 2015.
Comments: 38 pages, 17 figures
MSC Class: 62H25; 62.07 ACM Class: G.3; H.2.8
-
Visualizing and Quantifying Impact and Effect in Twitter Narrative using Geometric Data Analysis
Abstract: We use geometric multivariate data analysis which has been termed a methodology for both the visualization and verbalization of data. The general objectives are data mining and knowledge discovery. In the first case study, we use the narrative surrounding very highly profiled tweets, and thus a Twitter event of significance and importance. In the second case study, we use eight carefully planned T…
Submitted 14 September, 2014; v1 submitted 3 September, 2014; originally announced September 2014.
Comments: 34 pages, 11 figures
MSC Class: 66H25; 62H30; 91F99 ACM Class: I.7; I.5.3; H.3.1; H.2.8; G.3
-
Pattern Recognition in Narrative: Tracking Emotional Expression in Context
Abstract: Using geometric data analysis, our objective is the analysis of narrative, with narrative of emotion being the focus in this work. The following two principles for analysis of emotion inform our work. Firstly, emotion is revealed not as a quality in its own right but rather through interaction. We study the 2-way relationship of Ilsa and Rick in the movie Casablanca, and the 3-way relationship of…
Submitted 4 May, 2015; v1 submitted 14 May, 2014; originally announced May 2014.
Comments: 21 pages, 7 figures
MSC Class: 62H25; 62H30; 62.07 ACM Class: H.2.8; H.3; I.5; I.7.0; J.5
Journal ref: Journal of Data Mining & Digital Humanities, 2015 (May 26, 2015) jdmdh:647
-
Ultrametric Component Analysis with Application to Analysis of Text and of Emotion
Abstract: We review the theory and practice of determining what parts of a data set are ultrametric. It is assumed that the data set, to begin with, is endowed with a metric, and we include discussion of how this can be brought about if a dissimilarity, only, holds. The basis for part of the metric-endowed data set being ultrametric is to consider triplets of the observables (vectors). We develop a novel co…
Submitted 13 September, 2013; originally announced September 2013.
Comments: 49 pages, 15 figures, 52 citations
MSC Class: 62H30; 68T10 ACM Class: I.2.0; H.3.3; I.5.3
-
Computational Properties of Fiction Writing and Collaborative Work
Abstract: From the earliest days of computing, there have been tools to help shape narrative. Spell-checking, word counts, and readability analysis, give today's novelists tools that Dickens, Austen, and Shakespeare could only have dreamt of. However, such tools have focused on the word, or phrase levels. In the last decade, research focus has shifted to support for collaborative editing of documents. This…
Submitted 16 August, 2013; originally announced August 2013.
Comments: 13 pages, 6 figures
MSC Class: 91C20; 62H30; 76M27 ACM Class: J.5; H.1.2; H.3.3; I.5.3; I.2.7
-
A History of Cluster Analysis Using the Classification Society's Bibliography Over Four Decades
Abstract: The Classification Literature Automated Search Service, an annual bibliography based on citation of one or more of a set of around 80 book or journal publications, ran from 1972 to 2012. We analyze here the years 1994 to 2011. The Classification Society's Service, as it was termed, has been produced by the Classification Society. In earlier decades it was distributed as a diskette or CD with the J…
Submitted 16 August, 2013; v1 submitted 1 September, 2012; originally announced September 2012.
Comments: 23 pages, 9 figures
MSC Class: 62H30 ACM Class: I.5.3; H.3.3
-
arXiv:1202.3451
The Future of Search and Discovery in Big Data Analytics: Ultrametric Information Spaces
Abstract: Consider observation data, comprised of n observation vectors with values on a set of attributes. This gives us n points in attribute space. Having data structured as a tree, implied by having our observations embedded in an ultrametric topology, offers great advantage for proximity searching. If we have preprocessed data through such an embedding, then an observation's nearest neighbor is found i…
Submitted 15 February, 2012; originally announced February 2012.
Comments: 10 pages
MSC Class: 11Z05 ACM Class: I.5.3; H.3.3; E.2
-
arXiv:1201.2719
Ultrametric Model of Mind, II: Application to Text Content Analysis
Abstract: In a companion paper, Murtagh (2012), we discussed how Matte Blanco's work linked the unrepressed unconscious (in the human) to symmetric logic and thought processes. We showed how ultrametric topology provides a most useful representational and computational framework for this. Now we look at the extent to which we can find ultrametricity in text. We use coherent and meaningful collections of nea…
Submitted 16 July, 2012; v1 submitted 12 January, 2012; originally announced January 2012.
Comments: 21 pages, 6 tables. arXiv admin note: substantial text overlap with arXiv:cs/0701181 (V3: minor corrections)
MSC Class: 68T01 ACM Class: I.2.0; I.2.3; J.4
Journal ref: p-Adic Numbers, Ultrametric Analysis and Applications, 4, 207-221, 2012
-
Ultrametric Model of Mind, I: Review
Abstract: We mathematically model Ignacio Matte Blanco's principles of symmetric and asymmetric being through use of an ultrametric topology. We use for this the highly regarded 1975 book of this Chilean psychiatrist and psychoanalyst (born 1908, died 1995). Such an ultrametric model corresponds to hierarchical clustering in the empirical data, e.g. text. We show how an ultrametric topology can be used as a…
Submitted 16 July, 2012; v1 submitted 12 January, 2012; originally announced January 2012.
Comments: 20 pages, 2 figures, 46 references. arXiv admin note: substantial text overlap with arXiv:0709.0116, arXiv:0805.2744, and arXiv:1105.0121 (V3: 2 typos corrected)
MSC Class: 68T01 ACM Class: I.2.0; I.2.3; J.4
Journal ref: p-Adic Numbers, Ultrametric Analysis and Applications, 4, 193-206, 2012
-
Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm
Abstract: The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. However there are different interpretations in the literature and there are different implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of…
Submitted 11 December, 2011; v1 submitted 27 November, 2011; originally announced November 2011.
Comments: 20 pages, 21 citations, 4 figures
MSC Class: 62H30; 91C20 ACM Class: G.3; H.3.3
Journal ref: Journal of Classification, 31 (3), 274-295, 2014
-
Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement
Abstract: We describe many vantage points on the Baire metric and its use in clustering data, or its use in preprocessing and structuring data in order to support search and retrieval operations. In some cases, we proceed directly to clusters and do not directly determine the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We offer insights also on pract…
Submitted 27 November, 2011; originally announced November 2011.
Comments: 17 pages, 45 citations, 2 figures
MSC Class: 11Z05 ACM Class: H.3.3; E.2
Journal ref: P-Adic Numbers, Ultrametric Analysis, and Applications, 4 (1), 45-56, 2012
-
Fast, Linear Time Hierarchical Clustering using the Baire Metric
Abstract: The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, contrasted with the standard quadratic time agglomerative hierarchical clustering algorithm. In this work we evaluate empirically this new approach to hierarchical clustering. We compare hierarchical clustering based on the Baire metric with (i) agglomerative hierarchical clustering, in terms of algorit…
Submitted 11 June, 2011; originally announced June 2011.
Comments: 27 pages, 6 tables, 10 figures
MSC Class: 11Z05 ACM Class: H.3.3
Journal ref: Journal of Classification, July 2012, Volume 29, Issue 2, pp 118-143
-
Current Trends in Evolving Specialization in UK Universities
Abstract: There are very significant changes taking place in the university sector and in related higher education institutes in many parts of the world. In this work we look at financial data from 2010 and 2011 from the UK higher education sector. Situating ourselves to begin with in the context of teaching versus research in universities, we look at the data in order to explore the new divergence between…
Submitted 27 November, 2011; v1 submitted 15 May, 2011; originally announced May 2011.
Comments: 58th World Statistics Congress of the International Statistical Institute, invited plenary presentation, IPS057, Data Mining and Machine Learning in Statistics Organizations
MSC Class: 62H86; 62H30 ACM Class: H.2.8; G.3
-
arXiv:1105.0121
Methods of Hierarchical Clustering
Abstract: We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering al…
Submitted 30 April, 2011; originally announced May 2011.
Comments: 21 pages, 2 figures, 1 table, 69 references
MSC Class: 62H30 ACM Class: H.3.3; H.2.8; G.3
-
arXiv:1104.4063
Fast redshift clustering with the Baire (ultra) metric
Abstract: The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, contrasted with the standard quadratic time agglomerative hierarchical clustering algorithm. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey using, in this work, about half a million astronomical objects. We want to know how well the (more costl…
Submitted 20 April, 2011; originally announced April 2011.
Comments: 14 pages, 6 figures
MSC Class: 62H30; 85-08; 11S82 ACM Class: E.5; H.3; E.2
-
New Methods of Analysis of Narrative and Semantics in Support of Interactivity
Abstract: Our work has focused on support for film or television scriptwriting. Since this involves potentially varied story-lines, we note the implicit or latent support for interactivity. Furthermore the film, television, games, publishing and other sectors are converging, so that cross-over and re-use of one form of product in another of these sectors is ever more common. Technically our work has been la…
Submitted 14 November, 2010; originally announced November 2010.
Comments: 17 pages, 6 figures
ACM Class: G.3; I.2.1; H.1.2
Journal ref: Entertainment Computing, 2, 115-121, 2011
-
Ultrametric and Generalized Ultrametric in Computational Logic and in Data Analysis
Abstract: Following a review of metric, ultrametric and generalized ultrametric, we review their application in data analysis. We show how they allow us to explore both geometry and topology of information, starting with measured data. Some themes are then developed based on the use of metric, ultrametric and generalized ultrametric in logic. In particular we study approximation chains in an ultrametric or…
Submitted 20 August, 2010; originally announced August 2010.
Comments: 19 pp., 5 figures, 3 tables
MSC Class: 91C20; 62-07; 03-XX ACM Class: I.5.3; F.4.0
-
Segmentation and Nodal Points in Narrative: Study of Multiple Variations of a Ballad
Abstract: The Lady Maisry ballads afford us a framework within which to segment a storyline into its major components. Segments and as a consequence nodal points are discussed for nine different variants of the Lady Maisry story of a (young) woman being burnt to death by her family, on account of her becoming pregnant by a foreign personage. We motivate the importance of nodal points in textual and literary…
Submitted 7 June, 2010; originally announced June 2010.
Comments: 27 pp., 13 figures. Submitted
ACM Class: H.3.1; H.3.2; I.2.7
-
Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive, High Dimensional Datasets
Abstract: Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. "Structure" can be understood as symmetry and a range of symmetries are expressed by hierarchy. Such symmetries directly point to invariants, that pinpoint intrinsic properties of the data and of the background empirical domain of interest. We review many aspects of hierarchy…
Submitted 14 May, 2010; originally announced May 2010.
Comments: 41 pages, 13 figures, 6 tables. 81 references
MSC Class: 62H30; 68P01 ACM Class: G.3; H.2.8; H.3.3
-
arXiv:0912.1262
Open Access, Intellectual Property, and How Biotechnology Becomes a New Software Science
Abstract: Innovation is slowing greatly in the pharmaceutical sector. It is considered here how part of the problem is due to overly limiting intellectual property relations in the sector. On the other hand, computing and software in particular are characterized by great richness of intellectual property frameworks. Could the intellectual property ecosystem of computing come to the aid of the biosciences…
Submitted 7 December, 2009; originally announced December 2009.
Comments: 7 pages
ACM Class: K.4; K.5
Journal ref: CEPIS UPGRADE, vol. XI, no. 4, pp. 50-64, 2010
-
Scale-Based Gaussian Coverings: Combining Intra and Inter Mixture Models in Image Segmentation
Abstract: By a "covering" we mean a Gaussian mixture model fit to observed data. Approximations of the Bayes factor can be availed of to judge model fit to the data within a given Gaussian mixture model. Between families of Gaussian mixture models, we propose the Rényi quadratic entropy as an excellent and tractable model comparison framework. We exemplify this using the segmentation of an MRI image volum…
Submitted 2 September, 2009; originally announced September 2009.
Comments: 20 pages, 5 figures
ACM Class: I.4.6
Journal ref: Entropy, 11 (3), 513-528, 2009
-
Tag Clouds for Displaying Semantics: The Case of Filmscripts
Abstract: We relate tag clouds to other forms of visualization, including planar or reduced dimensionality mapping, and Kohonen self-organizing maps. Using a modified tag cloud visualization, we incorporate other information into it, including text sequence and most pertinent words. Our notion of word pertinence goes beyond just word frequency and instead takes a word in a mathematical sense as located at…
Submitted 23 May, 2009; originally announced May 2009.
Comments: 23 pages, 7 figures
ACM Class: I.5.4; I.2.7; H.3.1
Journal ref: Information Visualization 9, 253-262, 2010
-
Ultrametric Wavelet Regression of Multivariate Time Series: Application to Colombian Conflict Analysis
Abstract: We first pursue the study of how hierarchy provides a well-adapted tool for the analysis of change. Then, using a time sequence-constrained hierarchical clustering, we develop the practical aspects of a new approach to wavelet regression. This provides a new way to link hierarchical relationships in a multivariate time series data set with external signals. Violence data from the Colombian confl…
Submitted 16 February, 2009; originally announced February 2009.
Comments: 36 pages, 13 figures
Journal ref: IEEE Transactions on Systems, Man, and Cybernetics - Part A, Systems and Humans, 2011
-
arXiv:0811.2519
Origins of Modern Data Analysis Linked to the Beginnings and Early Development of Computer Science and Information Engineering
Abstract: The history of data analysis that is addressed here is underpinned by two themes, those of tabular data analysis, and the analysis of collected heterogeneous data. "Exploratory data analysis" is taken as the heuristic approach that begins with data and information and seeks underlying explanation for what is observed or measured. I also cover some of the evolving context of research and appli…
Submitted 15 November, 2008; originally announced November 2008.
Comments: 26 pages
Journal ref: Electronic Journal for History of Probability and Statistics, Vol. 4, no. 2, Dec. 2008
-
Between the Information Economy and Student Recruitment: Present Conjuncture and Future Prospects
Abstract: In university programs and curricula, in general we react to the need to meet market needs. We respond to market stimulus, or at least try to do so. Consider now an inverted view. Consider our data and perspectives in university programs as reflecting and indeed presaging economic trends. In this article I pursue this line of thinking. I show how various past events fit very well into this new v…
Submitted 4 September, 2008; originally announced September 2008.
Comments: 18 pages, 4 figures
ACM Class: K.0; K.1; K.3.0; K.4.3; K.7.0
Journal ref: CEPIS UPGRADE, vol. IX, no. 5, pp. 56-64, Oct. 2008
-
From Data to the p-Adic or Ultrametric Model
Abstract: We model anomaly and change in data by embedding the data in an ultrametric space. Taking our initial data as cross-tabulation counts (or other input data formats), Correspondence Analysis allows us to endow the information space with a Euclidean metric. We then model anomaly or change by an induced ultrametric. The induced ultrametric that we are particularly interested in takes a sequential -…
Submitted 2 September, 2008; originally announced September 2008.
Comments: 15 pages, 6 figures. To appear in: Proceedings of Third International Conference on p-Adic Mathematical Physics: From Planck Scale Physics to Complex Systems to Biology, Steklov Mathematics Institute, Russian Academy of Sciences
Journal ref: p-Adic Numbers, Ultrametric Analysis and Applications, 1, 58-68, 2009
-
arXiv:0807.4011
Discussion of: Treelets--An adaptive multi-Scale basis for sparse unordered data
Abstract: Discussion of "Treelets--An adaptive multi-Scale basis for sparse unordered data" [arXiv:0707.0481]
Submitted 25 July, 2008; originally announced July 2008.
Comments: Published at http://dx.doi.org/10.1214/08-AOAS137A in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
Report number: IMS-AOAS-AOAS137A
Journal ref: Annals of Applied Statistics 2008, Vol. 2, No. 2, 472-473
-
The Correspondence Analysis Platform for Uncovering Deep Structure in Data and Information
Abstract: We study two aspects of information semantics: (i) the collection of all relationships, (ii) tracking and spotting anomaly and change. The first is implemented by endowing all relevant information spaces with a Euclidean metric in a common projected space. The second is modelled by an induced ultrametric. A very general way to achieve a Euclidean embedding of different information spaces based o…
Submitted 2 September, 2008; v1 submitted 6 July, 2008; originally announced July 2008.
Comments: Sixth Annual Boole Lecture in Informatics, Boole Centre for Research in Informatics, Cork, Ireland, 29 April 2008. 28 pp., 17 figures. To appear, Computer Journal. This version: 3 typos corrected
ACM Class: I.5.4; H.3.1; I.2.7
Journal ref: Computer Journal, 53 (3), 304-315, 2010
-
The Structure of Narrative: the Case of Film Scripts
Abstract: We analyze the style and structure of story narrative using the case of film scripts. The practical importance of this is noted, especially the need to have support tools for television movie writing. We use the Casablanca film script, and scripts from six episodes of CSI (Crime Scene Investigation). For analysis of style and structure, we quantify various central perspectives discussed in McKee…
Submitted 24 May, 2008; originally announced May 2008.
Comments: 28 pages, 7 figures, 21 references
ACM Class: I.5.4; I.2.7; H.3.1
Journal ref: Pattern Recognition, 42 (2), 302-312, 2009
-
The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering
Abstract: An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity i…
Submitted 16 November, 2008; v1 submitted 18 May, 2008; originally announced May 2008.
Comments: 36 pages, 18 figures, 36 references
Journal ref: Journal of Classification, 26 (3), 249-277, 2009
-
Symmetry in Data Mining and Analysis: A Unifying View based on Hierarchy
Abstract: Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. "Structure" has long been understood as symmetry which can take many forms with respect to any transformation, including point, translationa…
Submitted 1 June, 2009; v1 submitted 18 May, 2008; originally announced May 2008.
Comments: 35 pages, 3 figures, 84 references
Journal ref: Proceedings of Steklov Institute of Mathematics, 265, 177-198, 2009
-
arXiv:0804.1244
Geometric Data Analysis, From Correspondence Analysis to Structured Data Analysis (book review)
Abstract: Review of: Brigitte Le Roux and Henry Rouanet, Geometric Data Analysis, From Correspondence Analysis to Structured Data Analysis, Kluwer, Dordrecht, 2004, xi+475 pp.
Submitted 8 April, 2008; originally announced April 2008.
Comments: 5 pages, 8 citations. Accepted in Journal of Classification
ACM Class: I.5; G.3; H.3; I.7; J.4
Journal ref: Journal of Classification 25, 137-141, 2008
-
Wavelet and Curvelet Moments for Image Classification: Application to Aggregate Mixture Grading
Abstract: We show the potential for classifying images of mixtures of aggregate, based themselves on varying, albeit well-defined, sizes and shapes, in order to provide a far more effective approach compared to the classification of individual sizes and shapes. While a dominant (additive, stationary) Gaussian noise component in image data will ensure that wavelet coefficients are of Gaussian distribution,…
Submitted 24 February, 2008; originally announced February 2008.
Comments: Submitted to Pattern Recognition Letters
Journal ref: Pattern Recognition Letters, 29, 1557-1564, 2008
-
arXiv:0709.0116
On Ultrametric Algorithmic Information
Abstract: How best to quantify the information of an object, whether natural or artifact, is a problem of wide interest. A related problem is the computability of an object. We present practical examples of a new way to address this problem. By giving an appropriate representation to our objects, based on a hierarchical coding of information, we exemplify how it is remarkably easy to compute complex objec…
Submitted 29 September, 2007; v1 submitted 2 September, 2007; originally announced September 2007.
Comments: Forthcoming, Computer Journal. Minor corrections 29 Oct. 2007
ACM Class: I.2.0
Journal ref: Computer Journal, 53, 405-416, 2010
-
Hilbert Space Becomes Ultrametric in the High Dimensional Limit: Application to Very High Frequency Data Analysis
Abstract: An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a natural hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sp…
Submitted 7 February, 2007; originally announced February 2007.
Comments: 22 pp., 9 figs., 4 tables
-
arXiv:cs/0702067
The Haar Wavelet Transform of a Dendrogram: Additional Notes
Abstract: We consider the wavelet transform of a finite, rooted, node-ranked, $p$-way tree, focusing on the case of binary ($p = 2$) trees. We study a Haar wavelet transform on this tree. Wavelet transforms allow for multiresolution analysis through translation and dilation of a wavelet function. We explore how this works in our tree context.
Submitted 10 February, 2007; originally announced February 2007.
Comments: 37 pp, 1 fig. Supplementary material to "The Haar Wavelet Transform of a Dendrogram", http://arxiv.org/abs/cs.IR/0608107
ACM Class: I.5.3; H.3.1; I.1.m; I.7.m
-
arXiv:cs/0701181
A Note on Local Ultrametricity in Text
Abstract: High dimensional, sparsely populated data spaces have been characterized in terms of ultrametric topology. This implies that there are natural, not necessarily unique, tree or hierarchy structures defined by the ultrametric topology. In this note we study the extent of local ultrametric topology in texts, with the aim of finding unique "fingerprints" for a text or corpus, discriminating betwee…
Submitted 27 January, 2007; originally announced January 2007.
Comments: 18 pp
ACM Class: I.5.3; I.7.2; H.3
-
arXiv:cs/0701180
Ontology from Local Hierarchical Structure in Text
Abstract: We study the notion of hierarchy in the context of visualizing textual data and navigating text collections. A formal framework for "hierarchy" is given by an ultrametric topology. This provides us with a theoretical foundation for concept hierarchy creation. A major objective is scalable annotation or labeling of concept maps. Serendipitously we pursue other objectives such as deriving…
Submitted 27 January, 2007; originally announced January 2007.
Comments: 35 pp., 12 figures
ACM Class: H.5; I.5.3; H.5.2; I.7.2; H.3
-
arXiv:cs/0608107
The Haar Wavelet Transform of a Dendrogram
Abstract: We describe a new wavelet transform, for use on hierarchies or binary rooted trees. The theoretical framework of this approach to data analysis is described. Case studies are used to further exemplify this approach. A first set of application studies deals with data array smoothing, or filtering. A second set of application studies relates to hierarchical tree condensation. Finally, a third stud…
Submitted 19 February, 2007; v1 submitted 28 August, 2006; originally announced August 2006.
Comments: 38 pp, 8 figures. Forthcoming in Journal of Classification
ACM Class: I.5.3; H.3.1; I.1.m
Journal ref: Journal of Classification, 24, 3-32, 2007
-
arXiv:math/0605555
Ultrametric embedding: application to data fingerprinting and to fast data clustering
Abstract: We begin with pervasive ultrametricity due to high dimensionality and/or spatial sparsity. How extent or degree of ultrametricity can be quantified leads us to the discussion of varied practical cases when ultrametricity can be partially or locally present in data. We show how the ultrametricity can be assessed in text or document collections, and in time series signals. An aspect of importance…
Submitted 28 January, 2007; v1 submitted 19 May, 2006; originally announced May 2006.
Comments: 14 pages, 1 figure. New content and modified title compared to the 19 May 2006 version
Report number: P.M. Pardalos and P. Hansen, Eds., Data Mining and Mathematical Programming, CRM Proceedings & Lecture Notes Vol. 45, American Mathematical Society, 199-209, 2008 MSC Class: 62H30; 68P30; 68P20