-
Data curation via joint example selection further accelerates multimodal learning
Authors:
Talfan Evans,
Nikhil Parthasarathy,
Hamza Merzic,
Olivier J. Henaff
Abstract:
Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points. As performance improves by selecting from larger super-batches, we also leverage recent advances in model approximation to reduce the associated computational overhead. As a result, our approach--multimodal contrastive learning with joint example selection (JEST)--surpasses state-of-the-art models with up to 13$\times$ fewer iterations and 10$\times$ less computation. Essential to the performance of JEST is the ability to steer the data selection process towards the distribution of smaller, well-curated datasets via pretrained reference models, exposing the level of data curation as a new dimension for neural scaling laws.
Submitted 25 June, 2024;
originally announced June 2024.
-
Comparison of Nested Geometry Treatments within GPU-Based Monte Carlo Neutron Transport Simulations of Fission Reactors
Authors:
Elliott Biondo,
Thomas Evans,
Seth Johnson,
Steven Hamilton
Abstract:
Monte Carlo (MC) neutron transport provides detailed estimates of radiological quantities within fission reactors. This method involves tracking individual neutrons through a computational geometry. CPU-based MC codes use multiple polymorphic tracker types with different tracking algorithms to exploit the repeated configurations of reactors, but virtual function calls have high overhead on the GPU. The Shift MC code was modified to support GPU-based tracking with three strategies: (1) dynamic polymorphism (DP) with virtual functions, (2) static polymorphism (SP), and (3) a single tracker (ST) type with tree-based acceleration. Results on the Frontier supercomputer show that the DP, SP, and ST methods achieve 77.8%, 91.2%, and 83.4% of the practical maximum tracking rate in the worst case, indicating that any of these methods can be used without incurring a significant performance penalty. The flexibility of the ST method is highlighted with a hexagonal-grid microreactor problem, performed without hexagonal-grid-specific tracking routines.
Submitted 19 June, 2024;
originally announced June 2024.
-
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
Authors:
Talfan Evans,
Shreya Pathak,
Hamza Merzic,
Jonathan Schwarz,
Ryutaro Tanno,
Olivier J. Henaff
Abstract:
Power-law scaling indicates that large-scale training with uniform sampling is prohibitively slow. Active learning methods aim to increase data efficiency by prioritizing learning on the most relevant examples. Despite their appeal, these methods have yet to be widely adopted since no one algorithm has been shown to a) generalize across models and tasks b) scale to large datasets and c) yield overall FLOP savings when accounting for the overhead of data selection. In this work we propose a method which satisfies these three properties, leveraging small, cheap proxy models to estimate "learnability" scores for datapoints, which are used to prioritize data for the training of much larger models. As a result, our models require 46% and 51% fewer training updates and up to 25% less total computation to reach the same performance as uniformly trained visual classifiers on JFT and multimodal models on ALIGN. Finally, we find our data-prioritization scheme to be complementary with recent data-curation and learning objectives, yielding a new state-of-the-art in several multimodal transfer tasks.
Submitted 14 February, 2024; v1 submitted 8 December, 2023;
originally announced December 2023.
-
PADLL: Taming Metadata-intensive HPC Jobs Through Dynamic, Application-agnostic QoS Control
Authors:
Ricardo Macedo,
Mariana Miranda,
Yusuke Tanimura,
Jason Haga,
Amit Ruhela,
Stephen Lien Harrell,
Richard Todd Evans,
José Pereira,
João Paulo
Abstract:
Modern I/O applications that run on HPC infrastructures are increasingly becoming read and metadata intensive. However, having multiple concurrent applications submitting large amounts of metadata operations can easily saturate the shared parallel file system's metadata resources, leading to overall performance degradation and I/O unfairness. We present PADLL, an application and file system agnostic storage middleware that enables QoS control of data and metadata workflows in HPC storage systems. It adopts ideas from Software-Defined Storage, building data plane stages that mediate and rate limit POSIX requests submitted to the shared file system, and a control plane that holistically coordinates how all I/O workflows are handled. We demonstrate its performance and feasibility under multiple QoS policies using synthetic benchmarks, real-world applications, and traces collected from a production file system. Results show that PADLL can enforce complex storage QoS policies over concurrent metadata-aggressive jobs, ensuring fairness and prioritization.
Submitted 23 March, 2023; v1 submitted 13 February, 2023;
originally announced February 2023.
-
Local dominance unveils clusters in networks
Authors:
Dingyi Shi,
Fan Shang,
Bingsheng Chen,
Paul Expert,
Linyuan Lü,
H. Eugene Stanley,
Renaud Lambiotte,
Tim S. Evans,
Ruiqi Li
Abstract:
Clusters or communities can provide a coarse-grained description of complex systems at multiple scales, but their detection remains challenging in practice. Community detection methods often define communities as dense subgraphs, or subgraphs with few connections in-between, via concepts such as the cut, conductance, or modularity. Here we consider another perspective built on the notion of local dominance, where low-degree nodes are assigned to the basin of influence of high-degree nodes, and design an efficient algorithm based on local information. Local dominance gives rise to community centers and uncovers local hierarchies in the network. Community centers have a larger degree than their neighbors and are sufficiently distant from other centers. The strength of our framework is demonstrated on synthesized and empirical networks with ground-truth community labels. The notion of local dominance and the associated asymmetric relations between nodes are not restricted to community detection, and can be utilised in clustering problems, as we illustrate on networks derived from vector data.
Submitted 29 March, 2024; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Where is VALDO? VAscular Lesions Detection and segmentatiOn challenge at MICCAI 2021
Authors:
Carole H. Sudre,
Kimberlin Van Wijnen,
Florian Dubost,
Hieab Adams,
David Atkinson,
Frederik Barkhof,
Mahlet A. Birhanu,
Esther E. Bron,
Robin Camarasa,
Nish Chaturvedi,
Yuan Chen,
Zihao Chen,
Shuai Chen,
Qi Dou,
Tavia Evans,
Ivan Ezhov,
Haojun Gao,
Marta Girones Sanguesa,
Juan Domingo Gispert,
Beatriz Gomez Anson,
Alun D. Hughes,
M. Arfan Ikram,
Silvia Ingala,
H. Rolf Jaeger,
Florian Kofler
, et al. (24 additional authors not shown)
Abstract:
Imaging markers of cerebral small vessel disease provide valuable information on brain health, but their manual assessment is time-consuming and hampered by substantial intra- and interrater variability. Automated rating may benefit biomedical research, as well as clinical assessment, but diagnostic reliability of existing algorithms is unknown. Here, we present the results of the VAscular Lesions DetectiOn and Segmentation ("Where is VALDO?") challenge that was run as a satellite event at the international conference on Medical Image Computing and Computer Aided Intervention (MICCAI) 2021. This challenge aimed to promote the development of methods for automated detection and segmentation of small and sparse imaging markers of cerebral small vessel disease, namely enlarged perivascular spaces (EPVS) (Task 1), cerebral microbleeds (Task 2) and lacunes of presumed vascular origin (Task 3) while leveraging weak and noisy labels. Overall, 12 teams participated in the challenge proposing solutions for one or more tasks (4 for Task 1 - EPVS, 9 for Task 2 - Microbleeds and 6 for Task 3 - Lacunes). Multi-cohort data was used in both training and evaluation. Results showed a large variability in performance both across teams and across tasks, with promising results notably for Task 1 - EPVS and Task 2 - Microbleeds and not practically useful results yet for Task 3 - Lacunes. It also highlighted the performance inconsistency across cases that may deter use at an individual level, while still proving useful at a population level.
Submitted 15 August, 2022;
originally announced August 2022.
-
Recommendations on test datasets for evaluating AI solutions in pathology
Authors:
André Homeyer,
Christian Geißler,
Lars Ole Schwen,
Falk Zakrzewski,
Theodore Evans,
Klaus Strohmenger,
Max Westphal,
Roman David Bülow,
Michaela Kargl,
Aray Karjauv,
Isidre Munné-Bertran,
Carl Orge Retzlaff,
Adrià Romero-López,
Tomasz Sołtysiński,
Markus Plass,
Rita Carvalho,
Peter Steinbach,
Yu-Chia Lan,
Nassim Bouteldja,
David Haber,
Mateo Rojas-Carulla,
Alireza Vafaei Sadr,
Matthias Kraft,
Daniel Krüger,
Rutger Fick
, et al. (5 additional authors not shown)
Abstract:
Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing.
A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations for the collection of test datasets.
We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries?
The recommendations are intended to help AI developers demonstrate the utility of their products and to help regulatory agencies and end users verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.
Submitted 21 April, 2022;
originally announced April 2022.
-
Incremental Abstraction in Distributed Probabilistic SLAM Graphs
Authors:
Joseph Ortiz,
Talfan Evans,
Edgar Sucar,
Andrew J. Davison
Abstract:
Scene graphs represent the key components of a scene in a compact and semantically rich way, but are difficult to build during incremental SLAM operation because of the challenges of robustly identifying abstract scene elements and optimising continually changing, complex graphs. We present a distributed, graph-based SLAM framework for incrementally building scene graphs based on two novel components. First, we propose an incremental abstraction framework in which a neural network proposes abstract scene elements that are incorporated into the factor graph of a feature-based monocular SLAM system. Scene elements are confirmed or rejected through optimisation and incrementally replace the points yielding a more dense, semantic and compact representation. Second, enabled by our novel routing procedure, we use Gaussian Belief Propagation (GBP) for distributed inference on a graph processor. The time per iteration of GBP is structure-agnostic and we demonstrate the speed advantages over direct methods for inference of heterogeneous factor graphs. We run our system on real indoor datasets using planar abstractions and recover the major planes with significant compression.
Submitted 4 April, 2022; v1 submitted 13 September, 2021;
originally announced September 2021.
-
Cycle Analysis of Directed Acyclic Graphs
Authors:
Vaiva Vasiliauskaite,
Tim S. Evans,
Paul Expert
Abstract:
In this paper, we employ the decomposition of a directed network as an undirected graph plus its associated node metadata to characterise the cyclic structure found in directed networks by finding a Minimal Cycle Basis of the undirected graph and augmenting its components with direction information. We show that only four classes of directed cycles exist, and that they can be fully distinguished by the organisation and number of source-sink node pairs and their antichain structure. We are particularly interested in Directed Acyclic Graphs and introduce a set of metrics that characterise the Minimal Cycle Basis using the Directed Acyclic Graph's metadata information. In particular, we numerically show that Transitive Reduction stabilises the properties of Minimal Cycle Bases measured by the metrics we introduced while retaining key properties of the Directed Acyclic Graph. This makes the metrics a consistent characterisation of Directed Acyclic Graphs and the systems they represent. We measure the characteristics of the Minimal Cycle Bases of four models of Transitively Reduced Directed Acyclic Graphs and show that the metrics introduced are able to distinguish the models and are sensitive to their generating mechanisms.
Submitted 5 August, 2021;
originally announced August 2021.
-
Linking the Network Centrality Measures Closeness and Degree
Authors:
Tim S. Evans,
Bingsheng Chen
Abstract:
Measuring the importance of nodes in a network with a centrality measure is a core task in any network application. There are many measures available and it is speculated that many encode similar information. We give an explicit non-linear relationship between two of the most popular measures of node centrality: degree and closeness. Based on a shortest-path tree approximation, we give an analytic derivation that shows the inverse of closeness is linearly dependent on the logarithm of degree. We show that our hypothesis works well for a range of networks produced from stochastic network models and for networks derived from 130 real-world data sets. We connect our results with previous results for other network distance scales such as average distance. Our results imply that measuring closeness is broadly redundant unless our relationship is used to remove the dependence on degree from closeness. The success of our relationship suggests that most networks can be approximated by shortest-path spanning trees which are all statistically similar two or more steps away from their root nodes.
Submitted 4 July, 2022; v1 submitted 2 August, 2021;
originally announced August 2021.
-
A visual introduction to Gaussian Belief Propagation
Authors:
Joseph Ortiz,
Talfan Evans,
Andrew J. Davison
Abstract:
In this article, we present a visual introduction to Gaussian Belief Propagation (GBP), an approximate probabilistic inference algorithm that operates by passing messages between the nodes of arbitrarily structured factor graphs. A special case of loopy belief propagation, GBP updates rely only on local information and will converge independently of the message schedule. Our key argument is that, given recent trends in computing hardware, GBP has the right computational properties to act as a scalable distributed probabilistic inference framework for future machine learning systems.
Submitted 5 July, 2021;
originally announced July 2021.
-
Weak Form Generalized Hamiltonian Learning
Authors:
Kevin L. Course,
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We present a method for learning generalized Hamiltonian decompositions of ordinary differential equations given a set of noisy time series measurements. Our method simultaneously learns a continuous time model and a scalar energy function for a general dynamical system. Learning predictive models in this form allows one to place strong, high-level, physics inspired priors onto the form of the learnt governing equations for general dynamical systems. Moreover, having shown how our method extends and unifies some previous work in deep learning with physics inspired priors, we present a novel method for learning continuous time models from the weak form of the governing equations which is less computationally taxing than standard adjoint methods.
Submitted 11 April, 2021;
originally announced April 2021.
-
Quadruply Stochastic Gaussian Processes
Authors:
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We introduce a stochastic variational inference procedure for training scalable Gaussian process (GP) models whose per-iteration complexity is independent of both the number of training points, $n$, and the number of basis functions used in the kernel approximation, $m$. Our central contributions include an unbiased stochastic estimator of the evidence lower bound (ELBO) for a Gaussian likelihood, as well as a stochastic estimator that lower bounds the ELBO for several other likelihoods such as Laplace and logistic. Independence of the stochastic optimization update complexity on $n$ and $m$ enables inference on huge datasets using large capacity GP models. We demonstrate accurate inference on large classification and regression datasets using GPs and relevance vector machines with up to $m = 10^7$ basis functions.
Submitted 4 June, 2020;
originally announced June 2020.
-
Social Success of Perfumes
Authors:
Vaiva Vasiliauskaite,
Tim S. Evans
Abstract:
We study data on perfumes and their odour descriptors - notes - to understand how note compositions, called accords, influence successful fragrance formulas. We obtain accords which tend to be present in perfumes that receive significantly more customer ratings. Our findings show that the most popular notes and the most over-represented accords are different to those that have the strongest effect on the perfume ratings. We also used network centrality to understand which notes have the highest potential to enhance note compositions. We find that large-degree notes, such as musk and vanilla, as well as generically-named notes, e.g. floral notes, are amongst the notes that enhance accords the most. This work presents a framework which would be a timely tool for perfumers to explore a multidimensional space of scent compositions.
Submitted 11 September, 2019;
originally announced September 2019.
-
Making Communities Show Respect for Order
Authors:
Vaiva Vasiliauskaite,
Tim S. Evans
Abstract:
In this work we give a community detection algorithm in which the communities both respect the intrinsic order of a directed acyclic graph and group similar nodes. We take inspiration from classic similarity measures of bibliometrics, used to assess how similar two publications are, based on their relative citation patterns. We study the algorithm's performance and antichain properties in artificial models and in real networks, such as citation graphs and food webs. We show how well this partitioning algorithm distinguishes and groups together nodes of the same origin (in a citation network, the origin is a topic or a research field). We make the comparison between our partitioning algorithm and standard hierarchical layering tools as well as community detection methods. We show that our algorithm produces different communities from standard layering algorithms.
Submitted 11 March, 2020; v1 submitted 30 August, 2019;
originally announced August 2019.
-
Analysis of the Wikipedia Network of Mathematicians
Authors:
Bingsheng Chen,
Zhengyu Lin,
Tim S. Evans
Abstract:
We look at the network of mathematicians defined by the hyperlinks between their biographies on Wikipedia. We show how to extract this information using three snapshots of the Wikipedia data, taken in 2013, 2017 and 2018. We illustrate how such Wikipedia data can be used by performing a centrality analysis. These measures show that Hilbert and Newton are the most important mathematicians. We use our example to illustrate the strengths and weaknesses of centrality measures and to show how to provide estimates of the robustness of centrality measurements. In part, we do this by comparison to results from two other sources: an earlier study of biographies on the MacTutor website and a small informal survey of the opinion of mathematics and physics students at Imperial College London.
Submitted 21 February, 2019; v1 submitted 20 February, 2019;
originally announced February 2019.
-
Conceptual Organization is Revealed by Consumer Activity Patterns
Authors:
Adam N. Hornsby,
Thomas Evans,
Peter Riefer,
Rosie Prior,
Bradley C. Love
Abstract:
Meaning may arise from an element's role or interactions within a larger system. For example, hitting nails is more central to people's concept of a hammer than its particular material composition or other intrinsic features. Likewise, the importance of a web page may result from its links with other pages rather than solely from its content. One example of meaning arising from extrinsic relationships is the class of approaches that extract the meaning of word concepts from co-occurrence patterns in large text corpora. The success of these methods suggests that human activity patterns may reveal conceptual organization. However, texts do not directly reflect human activity, but instead serve a communicative function and are usually highly curated or edited to suit an audience. Here, we apply methods devised for text to a data source that directly reflects thousands of individuals' activity patterns, namely supermarket purchases. Using product co-occurrence data from nearly 1.3 million shopping baskets, we trained a topic model to learn 25 high-level concepts (or "topics"). These topics were found to be comprehensible and coherent by both retail experts and consumers. Topics ranged from specific (e.g., ingredients for a stir-fry) to general (e.g., cooking from scratch). Topics tended to be goal-directed and situational, consistent with the notion that human conceptual knowledge is tailored to support action. Individual differences in the topics sampled predicted basic demographic characteristics. These results suggest that human activity patterns reveal conceptual organization and may give rise to it.
Submitted 19 October, 2018;
originally announced October 2018.
-
Discretely Relaxing Continuous Variables for tractable Variational Inference
Authors:
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We explore a new research direction in Bayesian variational inference with discrete latent variable priors where we exploit Kronecker matrix algebra for efficient and exact computations of the evidence lower bound (ELBO). The proposed "DIRECT" approach has several advantages over its predecessors: (i) it can exactly compute ELBO gradients (i.e. unbiased, zero-variance gradient estimates), eliminating the need for high-variance stochastic gradient estimators and enabling the use of quasi-Newton optimization methods; (ii) its training complexity is independent of the number of training points, permitting inference on large datasets; and (iii) its posterior samples consist of sparse and low-precision quantized integers which permit fast inference on hardware-limited devices. In addition, our DIRECT models can exactly compute statistical moments of the parameterized predictive posterior without relying on Monte Carlo sampling. The DIRECT approach is not practical for all likelihoods; however, we identify a popular model structure which is practical, and demonstrate accurate inference using latent variables discretized as extremely low-precision 4-bit quantized integers. While the ELBO computations considered in the numerical studies require over $10^{2352}$ log-likelihood evaluations, we train on datasets with over two million points in just seconds.
Submitted 9 January, 2019; v1 submitted 12 September, 2018;
originally announced September 2018.
-
Student Cluster Competition 2017, Team University of Texas at Austin/Texas State University: Reproducing Vectorization of the Tersoff Multi-Body Potential on the Intel Skylake and NVIDIA V100 Architectures
Authors:
James Sullivan,
Collin Weir,
Austin Reichert,
R. Todd Evans,
W. Cyrus Proctor,
Nicolas Thorne
Abstract:
This paper satisfies the reproducibility challenge of the Student Cluster Competition at Supercomputing 2017. We attempted to reproduce the results of Höhnerbach et al. (2016) for an implementation of a vectorized code for the Tersoff multi-body potential kernel of the molecular dynamics code Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS). We investigated accuracy, optimization performance, and scaling with our Intel CPU and NVIDIA GPU based cluster.
Submitted 21 August, 2018;
originally announced August 2018.
-
Exploiting Structure for Fast Kernel Learning
Authors:
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We propose two methods for exact Gaussian process (GP) inference and learning on massive image, video, spatial-temporal, or multi-output datasets with missing values (or "gaps") in the observed responses. The first method ignores the gaps using sparse selection matrices and a highly effective low-rank preconditioner is introduced to accelerate computations. The second method introduces a novel approach to GP training whereby response values are inferred on the gaps before explicitly training the model. We find this second approach to be greatly advantageous for the class of problems considered. Both of these novel approaches make extensive use of Kronecker matrix algebra to design massively scalable algorithms which have low memory requirements. We demonstrate exact GP inference for a spatial-temporal climate modelling problem with 3.7 million training points as well as a video reconstruction problem with 1 billion points.
Submitted 9 August, 2018;
originally announced August 2018.
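The Kronecker matrix algebra mentioned in this abstract can be illustrated with a small sketch. It is not the authors' implementation. It only demonstrates the standard identity that a product with A ⊗ B can be computed from the small factors A and B without ever forming the large matrix. The function names `kron` and `kron_matvec` are hypothetical, and plain Python lists are assumed in place of a numerical library.

```python
def matmul(A, B):
    # Dense product of two matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def kron(A, B):
    # Explicit Kronecker product, built here only to check the fast version.
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def kron_matvec(A, B, x):
    # Computes (A kron B) x without forming the Kronecker product.
    # x is read as the row-major flattening of a matrix X with
    # (columns of A) rows and (columns of B) columns, so that
    # (A kron B) x is the row-major flattening of A X B^T.
    n, q = len(A[0]), len(B[0])
    X = [x[j * q:(j + 1) * q] for j in range(n)]
    Y = matmul(matmul(A, X), transpose(B))
    return [y for row in Y for y in row]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
x = [1, 2, 3, 4]

fast = kron_matvec(A, B, x)
dense = [sum(r * c for r, c in zip(row, x)) for row in kron(A, B)]
print(fast, dense)  # both are [10, 7, 22, 15]
```

The structured product touches only the two small factors, which is the property that lets Kronecker-based methods keep memory requirements low as the grid grows.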
-
Scalable Gaussian Processes with Grid-Structured Eigenfunctions (GP-GRIEF)
Authors:
Trefor W. Evans,
Prasanth B. Nair
Abstract:
We introduce a kernel approximation strategy that enables computation of the Gaussian process log marginal likelihood and all hyperparameter derivatives in $\mathcal{O}(p)$ time. Our GRIEF kernel consists of $p$ eigenfunctions found using a Nyström approximation from a dense Cartesian product grid of inducing points. By exploiting algebraic properties of Kronecker and Khatri-Rao tensor products, computational complexity of the training procedure can be practically independent of the number of inducing points. This allows us to use arbitrarily many inducing points to achieve a globally accurate kernel approximation, even in high-dimensional problems. The fast likelihood evaluation enables type-I or II Bayesian inference on large-scale datasets. We benchmark our algorithms on real-world problems with up to two million training points and $10^{33}$ inducing points.
Submitted 1 August, 2018; v1 submitted 5 July, 2018;
originally announced July 2018.
-
Opinion formation on dynamic networks: identifying conditions for the emergence of partisan echo chambers
Authors:
Tucker Evans,
Feng Fu
Abstract:
Modern political interaction is characterized by strong partisanship and a lack of interest in information sharing and agreement across party lines. It remains largely unclear how such partisan echo chambers arise and how they coevolve with opinion formation. Here we explore the emergence of these structures through the lens of coevolutionary games. In our model, the payoff of an individual is determined jointly by the magnitude of their opinion, their degree of conformity with their social neighbors, and the benefit of having social connections. Each individual can simultaneously adjust their opinion as well as the weights of their social connections. We present and validate the conditions for the emergence of partisan echo chambers, characterizing the transition from cohesive communities with consensus to divisive networks with splitting opinions. Moreover, we apply our model to voting records of the United States House of Representatives over a timespan of decades in order to understand the influence of underlying psychological and social factors on increasing partisanship in recent years. Our work helps elucidate how the division of today has come to be and how cohesion and unity could otherwise be attained on a variety of political and social issues.
Submitted 3 July, 2018;
originally announced July 2018.
-
Community Detection with Metadata in a Network of Biographies of Western Art Painters
Authors:
Michael Kitromilidis,
Tim S. Evans
Abstract:
In this work we look at the structure of the influences between Western art painters as revealed by their biographies on Wikipedia. We use a modified version of modularity maximisation with metadata to detect a partition of artists into communities based on their artistic genre and school in which they belong. We then use this community structure to discuss how influential artists reached beyond their own communities and had a lasting impact on others, by proposing modifications on standard centrality measures.
Submitted 22 February, 2018;
originally announced February 2018.
-
Diversity from the Topology of Citation Networks
Authors:
Vaiva Vasiliauskaite,
Tim S. Evans
Abstract:
We study transitivity in directed acyclic graphs and its usefulness in capturing nodes that act as bridges between more densely interconnected parts in this type of network. In transitively reduced citation networks degree centrality could be used as a measure of interdisciplinarity or diversity. We study the measure's ability to capture "diverse" nodes in random directed acyclic graphs and citation networks. We show that transitively reduced degree centrality is capable of capturing "diverse" nodes, thus this measure could be a timely alternative to text analysis techniques for retrieving papers influential in a variety of research fields.
Submitted 16 February, 2018;
originally announced February 2018.
-
Rayleigh Quotient Iteration with a Multigrid in Energy Preconditioner for Massively Parallel Neutron Transport
Authors:
R. N. Slaybaugh,
T. M. Evans,
G. G. Davidson,
P. P. H. Wilson
Abstract:
Three complementary methods have been implemented in the code Denovo that accelerate neutral particle transport calculations with methods that use leadership-class computers fully and effectively: a multigroup block (MG) Krylov solver, a Rayleigh quotient iteration (RQI) eigenvalue solver, and a multigrid in energy preconditioner. The multigroup Krylov solver converges more quickly than Gauss-Seidel and enables energy decomposition such that Denovo can scale to hundreds of thousands of cores. The new multigrid in energy preconditioner reduces iteration count for many problem types and takes advantage of the new energy decomposition such that it can scale efficiently. These two tools are useful on their own, but together they enable the RQI eigenvalue solver to work. Each individual method has been described before, but this is the first time they have been demonstrated to work together effectively.
RQI should converge in fewer iterations than power iteration (PI) for large and challenging problems. RQI creates shifted systems that would not be tractable without the MG Krylov solver. It also creates ill-conditioned matrices that cannot converge without the multigrid in energy preconditioner. Using these methods together, RQI converged in fewer iterations and in less time than all PI calculations for a full pressurized water reactor core. It also scaled reasonably well out to 275,968 cores.
Submitted 7 February, 2017;
originally announced February 2017.
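To make the Rayleigh quotient iteration named in this abstract concrete, here is a toy sketch on a small dense symmetric matrix. It is not the Denovo solver: the transport eigenvalue problem, the Krylov solver and the multigrid in energy preconditioner are all outside its scope, and every function name is hypothetical. A direct Gaussian elimination stands in for the iterative solve, which is an assumption made purely for illustration.

```python
import math


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def solve(M, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / aug[r][r]
    return x


def rayleigh_quotient_iteration(A, x0, tol=1e-10, max_iter=50):
    # A is assumed symmetric; x0 is a nonzero starting vector.
    x = [xi / norm(x0) for xi in x0]
    mu = sum(xi * yi for xi, yi in zip(x, matvec(A, x)))
    for _ in range(max_iter):
        residual = [ax - mu * xi for ax, xi in zip(matvec(A, x), x)]
        if norm(residual) < tol:
            break
        # The shifted matrix A - mu*I becomes nearly singular as mu converges.
        shifted = [[a - (mu if i == j else 0.0) for j, a in enumerate(row)]
                   for i, row in enumerate(A)]
        try:
            y = solve(shifted, x)
        except ZeroDivisionError:
            break  # the shift equals an eigenvalue to machine precision
        ny = norm(y)
        x = [yi / ny for yi in y]
        mu = sum(xi * yi for xi, yi in zip(x, matvec(A, x)))
    return mu, x


mu, vec = rayleigh_quotient_iteration([[2.0, 1.0], [1.0, 3.0]], [1.0, 0.0])
print(mu)
```

The shifted system in each step is the part that becomes ill-conditioned near convergence, which is why the abstract pairs RQI with a dedicated solver and preconditioner.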
-
Embedding Graphs in Lorentzian Spacetime
Authors:
James R. Clough,
Tim S. Evans
Abstract:
Geometric approaches to network analysis combine simply defined models with great descriptive power. In this work we provide a method for embedding directed acyclic graphs into Minkowski spacetime using Multidimensional scaling (MDS). First we generalise the classical MDS algorithm, defined only for metrics with a Euclidean signature, to manifolds of any metric signature. We then use this general method to develop an algorithm to be used on networks which have causal structure allowing them to be embedded in Lorentzian manifolds. The method is demonstrated by calculating embeddings for both causal sets and citation networks in Minkowski spacetime. We finally suggest a number of applications in citation analysis such as paper recommendation, identifying missing citations and fitting citation models to data using this geometric approach.
Submitted 9 February, 2016;
originally announced February 2016.
-
Time and Citation Networks
Authors:
James R. Clough,
Tim S. Evans
Abstract:
Citation networks emerge from a number of different social systems, such as academia (from published papers), business (through patents) and law (through legal judgements). A citation represents a transfer of information, and so studying the structure of the citation network will help us understand how knowledge is passed on. What distinguishes citation networks from other networks is time; documents can only cite older documents. We propose that existing network measures do not take account of the strong constraint imposed by time. We will illustrate our approach with two types of causally aware analysis. We apply our methods to the citation networks formed by academic papers on the arXiv, to US patents and to US Supreme Court judgements. We show that our tools can reveal that citation networks which appear to have very similar structure by standard network measures turn out to have significantly different properties. We interpret our results as indicating that many papers in a bibliography were not directly relevant to the work and that we can provide a simple indicator of the important citations. We suggest our methods may highlight papers which are of more interest for interdisciplinary research. We also quantify differences in the diversity of research directions of different fields.
Submitted 6 July, 2015;
originally announced July 2015.
-
Ranking Journals Using Altmetrics
Authors:
Tamar V. Loach,
Tim S. Evans
Abstract:
The rank of a journal based on simple citation information is a popular measure. The simplicity and availability of rankings such as Impact Factor, Eigenfactor and SciMago Journal Rank based on trusted commercial sources ensures their widespread use for many important tasks despite the well-known limitations of such rankings. In this paper we look at an alternative approach based on information on papers from social and mainstream media sources. Our data comes from altmetric.com who identify mentions of individual academic papers in sources such as Twitter, Facebook, blogs and news outlets. We consider several different methods to produce a ranking of journals from such data. We show that most (but not all) schemes produce results that are roughly similar, suggesting that there is a basic consistency between social media based approaches and traditional citation based methods. Most ranking schemes applied to one data set produce relatively little variation and we suggest this provides a measure of the uncertainty in any journal rating. The differences we find between data sources also show they are capturing different aspects of journal impact. We conclude a small number of such ratings will provide the best information on journal impact.
Submitted 2 July, 2015;
originally announced July 2015.
-
Sculplexity: Sculptures of Complexity using 3D printing
Authors:
D. S. Reiss,
J. J. Price,
T. S. Evans
Abstract:
We show how to convert models of complex systems such as 2D cellular automata into a 3D printed object. Our method takes into account the limitations inherent to 3D printing processes and materials. Our approach automates the greater part of this task, bypassing the use of CAD software and the need for manual design. As a proof of concept, a physical object representing a modified forest fire model was successfully printed. Automated conversion methods similar to the ones developed here can be used to create objects for research, for demonstration and teaching, for outreach, or simply for aesthetic pleasure. As our outputs can be touched, they may be particularly useful for those with visual disabilities.
Submitted 8 December, 2014;
originally announced December 2014.
-
Modelling Citation Networks
Authors:
S. R. Goldberg,
H. Anthony,
T. S. Evans
Abstract:
The distribution of the number of academic publications as a function of citation count for a given year is remarkably similar from year to year. We measure this similarity as a width of the distribution and find it to be approximately constant from year to year. We show that simple citation models fail to capture this behaviour. We then provide a simple three-parameter citation network model using a mixture of local and global search processes which can reproduce the correct distribution over time. We use the citation network of papers from the hep-th section of arXiv to test our model. For this data, around 20% of citations use global information to reference recently published papers, while the remaining 80% are found using local searches. We note that this is consistent with other studies though our motivation is very different from previous work. Finally, we also find that the fluctuations in the size of an academic publication's bibliography are important for the model. This is not addressed in most models and needs further work.
Submitted 13 August, 2014;
originally announced August 2014.
-
What is the dimension of citation space?
Authors:
James R. Clough,
Tim S. Evans
Abstract:
Citation networks represent the flow of information between agents. They are constrained in time and so form directed acyclic graphs which have a causal structure. Here we provide novel quantitative methods to characterise that structure by adapting methods used in the causal set approach to quantum gravity by considering the networks to be embedded in a Minkowski spacetime and measuring its dimension using Myrheim-Meyer and Midpoint-scaling estimates. We illustrate these methods on citation networks from the arXiv, supreme court judgements from the USA, and patents and find that otherwise similar citation networks have measurably different dimensions. We suggest that these differences can be interpreted in terms of the level of diversity or narrowness in citation behaviour.
Submitted 30 April, 2015; v1 submitted 6 August, 2014;
originally announced August 2014.
-
Transitive Reduction of Citation Networks
Authors:
James R. Clough,
Jamie Gollings,
Tamar V. Loach,
Tim S. Evans
Abstract:
In many complex networks the vertices are ordered in time, and edges represent causal connections. We propose methods of analysing such directed acyclic graphs taking into account the constraints of causality and highlighting the causal structure. We illustrate our approach using citation networks formed from academic papers, patents, and US Supreme Court verdicts. We show how transitive reduction reveals fundamental differences in the citation practices of different areas, how it highlights particularly interesting work, and how it can correct for the effect that the age of a document has on its citation count. Finally, we transitively reduce null models of citation networks with similar degree distributions and show the difference in degree distributions after transitive reduction to illustrate the lack of causal structure in such models.
Submitted 27 March, 2014; v1 submitted 30 October, 2013;
originally announced October 2013.
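The transitive reduction discussed in this abstract can be sketched on a toy citation network. This is a minimal illustration rather than the authors' code: the function name `transitive_reduction` and the three-paper example are hypothetical, and the input is assumed to be a directed acyclic graph given as (citing, cited) pairs.

```python
from collections import defaultdict


def transitive_reduction(edges):
    # Keep an edge (u, v) only if v cannot be reached from u by a longer path.
    # For a directed acyclic graph this yields the unique transitive reduction.
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def reachable_via_longer_path(u, v):
        # Depth-first search that starts from u's other successors.
        stack = [w for w in succ.get(u, ()) if w != v]
        seen = set(stack)
        while stack:
            w = stack.pop()
            if w == v:
                return True
            for nxt in succ.get(w, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return sorted((u, v) for u, v in set(edges)
                  if not reachable_via_longer_path(u, v))


# Paper C cites B and A, and B cites A. The direct citation C -> A is
# implied by the chain C -> B -> A, so the reduction drops it.
citations = [("C", "B"), ("B", "A"), ("C", "A")]
reduced = transitive_reduction(citations)
print(reduced)  # [('B', 'A'), ('C', 'B')]
```

Comparing each paper's in-degree before and after the reduction is one simple way to see which citations survive as direct causal links.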
-
Universality of Performance Indicators based on Citation and Reference Counts
Authors:
T. S. Evans,
N. Hopkins,
B. S. Kaube
Abstract:
We find evidence for the universality of two relative bibliometric indicators of the quality of individual scientific publications taken from different data sets. One of these is a new index that considers both citation and reference counts. We demonstrate this universality for relatively well cited publications from a single institute, grouped by year of publication and by faculty or by department. We show similar behaviour in publications submitted to the arXiv e-print archive, grouped by year of submission and by sub-archive. We also find that for reasonably well cited papers this distribution is well fitted by a lognormal with a variance of around 1.3 which is consistent with the results of Radicchi, Fortunato, and Castellano (2008). Our work demonstrates that comparisons can be made between publications from different disciplines and publication dates, regardless of their citation count and without expensive access to the whole world-wide citation graph. Further, it shows that averages of the logarithm of such relative bibliometric indices deal with the issue of long tails and avoid the need for statistics based on lengthy ranking procedures.
Submitted 20 February, 2012; v1 submitted 14 October, 2011;
originally announced October 2011.
-
The Emergence of Leadership in Social Networks
Authors:
T. Clemson,
T. S. Evans
Abstract:
We study a networked version of the minority game in which agents can choose to follow the choices made by a neighbouring agent in a social network. We show that for a wide variety of networks a leadership structure always emerges, with most agents following the choice made by a few agents. We find a suitable parameterisation which highlights the universal aspects of the behaviour and which also indicates where results depend on the type of social network.
Submitted 7 November, 2011; v1 submitted 2 June, 2011;
originally announced June 2011.
-
Turnover Rate of Popularity Charts in Neutral Models
Authors:
T. S. Evans,
A. Giometto
Abstract:
It has been shown recently that in many different cultural phenomena the turnover rate on the most popular artefacts in a population exhibits some regularities. A very simple expression for this turnover rate has been proposed by Bentley et al. and its validity in two simple models for copying and innovation is investigated in this paper. It is found that Bentley's formula is an approximation of the real behaviour of the turnover rate in the Wright-Fisher model, while it is not valid in the Moran model.
Submitted 20 May, 2011;
originally announced May 2011.
-
Uncovering space-independent communities in spatial networks
Authors:
Paul Expert,
Tim Evans,
Vincent D. Blondel,
Renaud Lambiotte
Abstract:
Many complex systems are organized in the form of a network embedded in space. Important examples include the physical Internet infrastructure, road networks, flight connections, brain functional networks and social networks. The effect of space on network topology has recently come under the spotlight because of the emergence of pervasive technologies based on geo-localization, which constantly fill databases with people's movements and thus reveal their trajectories and spatial behaviour. Extracting patterns and regularities from the resulting massive amount of human mobility data requires the development of appropriate tools for uncovering information in spatially-embedded networks. In contrast with most works that tend to apply standard network metrics to any type of network, we argue in this paper for a careful treatment of the constraints imposed by space on network topology. In particular, we focus on the problem of community detection and propose a modularity function adapted to spatial networks. We show that it is possible to factor out the effect of space in order to reveal more clearly hidden structural similarities between the nodes. Methods are tested on a large mobile phone network and computer-generated benchmarks where the effect of space has been incorporated.
Submitted 3 January, 2012; v1 submitted 15 December, 2010;
originally announced December 2010.
-
Flow graphs: interweaving dynamics and structure
Authors:
R. Lambiotte,
R. Sinatra,
J. -C. Delvenne,
T. S. Evans,
M. Barahona,
V. Latora
Abstract:
The behavior of complex systems is determined not only by the topological organization of their interconnections but also by the dynamical processes taking place among their constituents. A faithful modeling of the dynamics is essential because different dynamical processes may be affected very differently by network topology. A full characterization of such systems thus requires a formalization that encompasses both aspects simultaneously, rather than relying only on the topological adjacency matrix. To achieve this, we introduce the concept of flow graphs, namely weighted networks where dynamical flows are embedded into the link weights. Flow graphs provide an integrated representation of the structure and dynamics of the system, which can then be analyzed with standard tools from network theory. Conversely, a structural network feature of our choice can also be used as the basis for the construction of a flow graph that will then encompass a dynamics biased by such a feature. We illustrate the ideas by focusing on the mathematical properties of generic linear processes on complex networks that can be represented as biased random walks and also explore their dual consensus dynamics.
Submitted 6 December, 2010;
originally announced December 2010.
-
Clique Graphs and Overlapping Communities
Authors:
T. S. Evans
Abstract:
It is shown how to construct a clique graph in which properties of cliques of a fixed order in a given graph are represented by vertices in a weighted graph. Various definitions and motivations for these weights are given. The detection of communities or clusters is used to illustrate how a clique graph may be exploited. In particular a benchmark network is shown where clique graphs find the overlapping communities accurately while vertex partition methods fail.
Submitted 3 September, 2010;
originally announced September 2010.
-
Communities and Patterns of Scientific collaboration
Authors:
T. S. Evans,
R. Lambiotte,
P. Panzarasa
Abstract:
This paper investigates the role of homophily and focus constraint in shaping collaborative scientific research. First, homophily structures collaboration when scientists adhere to a norm of exclusivity in selecting similar partners at a higher rate than dissimilar ones. Two dimensions on which similarity between scientists can be assessed are their research specialties and status positions. Second, focus constraint shapes collaboration when connections among scientists depend on opportunities for social contact. Constraint comes in two forms, depending on whether it originates in institutional or geographic space. Institutional constraint refers to the tendency of scientists to select collaborators within rather than across institutional boundaries. Geographic constraint is the principle that, when collaborations span different institutions, they are more likely to involve scientists that are geographically co-located than dispersed. To study homophily and focus constraint, the paper will argue in favour of an idea of collaboration that moves beyond formal co-authorship to include also other forms of informal intellectual exchange that do not translate into the publication of joint work. A community-detection algorithm is applied to the co-authorship network of the scientists that submitted in Business and Management in the 2001 UK RAE. While results only partially support research-based homophily, they indicate that scientists use status positions for discriminating between potential partners by selecting collaborators from institutions with a rating similar to their own. Strong support is provided in favour of institutional and geographic constraints. Scientists tend to forge intra-institutional collaborations; yet, when they seek collaborators outside their own institutions, they tend to select those who are in geographic proximity.
Submitted 16 May, 2011; v1 submitted 9 June, 2010;
originally announced June 2010.
-
Line Graphs of Weighted Networks for Overlapping Communities
Authors:
T. S. Evans,
R. Lambiotte
Abstract:
In this paper, we develop the idea to partition the edges of a weighted graph in order to uncover overlapping communities of its nodes. Our approach is based on the construction of different types of weighted line graphs, i.e. graphs whose nodes are the links of the original graph, that encapsulate differently the relations between the edges. Weighted line graphs are argued to provide an alternative, valuable representation of the system's topology, and are shown to have important applications in community detection, as the usual node partition of a line graph naturally leads to an edge partition of the original graph. This identification allows us to use traditional partitioning methods in order to address the long-standing problem of the detection of overlapping communities. We apply it to the analysis of different social and geographical networks.
Submitted 9 June, 2010; v1 submitted 22 December, 2009;
originally announced December 2009.
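The idea in this abstract, that a node partition of a line graph gives an edge partition of the original graph and hence overlapping node communities, can be sketched as follows. The sketch builds only a plain unweighted line graph; the weighted constructions the paper proposes are not reproduced. The function names, the bow-tie example and the hand-assigned edge partition are all hypothetical.

```python
from itertools import combinations


def line_graph(edges):
    # Each edge of the original graph becomes a node of the line graph.
    # Two such nodes are linked when the original edges share an endpoint.
    nodes = [tuple(sorted(e)) for e in edges]
    links = set()
    for e, f in combinations(nodes, 2):
        if set(e) & set(f):
            links.add((e, f))
    return nodes, links


def node_memberships(edge_partition):
    # Translate an edge partition into (possibly overlapping) node communities:
    # a node belongs to every community that one of its edges belongs to.
    membership = {}
    for edge, community in edge_partition.items():
        for node in edge:
            membership.setdefault(node, set()).add(community)
    return membership


# Bow-tie graph: two triangles that share the single node "c".
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"), ("c", "e"), ("d", "e")]
lg_nodes, lg_links = line_graph(edges)

# A partition of the line-graph nodes, assigned by hand for illustration:
# one community per triangle.
edge_partition = {("a", "b"): 0, ("a", "c"): 0, ("b", "c"): 0,
                  ("c", "d"): 1, ("c", "e"): 1, ("d", "e"): 1}
membership = node_memberships(edge_partition)
print(membership["c"])  # node "c" belongs to both communities
```

Because every edge sits in exactly one community while a node inherits the communities of all its edges, the shared node ends up in both groups without any special overlap-handling step.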