Directed closure coefficient and its patterns

Bogdan Gabrys

PLOS ONE | RESEARCH ARTICLE

Mingshan Jia*, Bogdan Gabrys, Katarzyna Musial

School of Computer Science, University of Technology Sydney, Sydney, NSW, Australia

* mingshan.jia@student.uts.edu.au

OPEN ACCESS

Citation: Jia M, Gabrys B, Musial K (2021) Directed closure coefficient and its patterns. PLoS ONE 16(6): e0253822. https://doi.org/10.1371/journal.pone.0253822

Editor: Hocine Cherifi, University of Burgundy, FRANCE

Abstract

The triangle structure, being a fundamental and significant element, underlies many theories and techniques in studying complex networks. The formation of triangles is typically measured by the clustering coefficient, in which the focal node is the centre-node in an open triad. In contrast, the recently proposed closure coefficient measures triangle formation from an end-node perspective and has been proven to be a useful feature in network analysis. Here, we extend it by proposing the directed closure coefficient that measures the formation of directed triangles. By distinguishing the direction of the closing edge in building triangles, we further introduce the source closure coefficient and the target closure coefficient. Then, by categorising particular types of directed triangles (e.g., head-of-path), we propose four closure patterns. Through multiple experiments on 24 directed networks from six domains, we demonstrate that at network-level, the four closure patterns are distinctive features in classifying network types, while at node-level, adding the source and target closure coefficients leads to significant improvement in link prediction task in most types of directed networks.
Received: March 1, 2021
Accepted: June 13, 2021
Published: June 25, 2021

Peer Review History: PLOS recognizes the benefits of transparency in the peer review process; therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. The editorial history of this article is available here: https://doi.org/10.1371/journal.pone.0253822

Copyright: © 2021 Jia et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All data is available at: https://github.com/MingshanJia/explore-local-structure.

Funding: This work was supported by the Australian Research Council, grant No. DP190101087: "Dynamics and Control of Complex Social Networks".

Competing interests: The authors have declared that no competing interests exist.

PLOS ONE | https://doi.org/10.1371/journal.pone.0253822 | June 25, 2021

1 Introduction

Networks, abstracting the interactions between components, are fundamental in studying complex systems in a variety of domains ranging from social and economic networks to informational and technological networks [1, 2]. Small subgraph patterns (also known as motifs [3] or graphlets [4]) that recur at a higher frequency than those in random networks are crucial in understanding and analysing networks. Motifs underlie many descriptive and predictive applications such as community detection [5–7], anomaly detection [8], role analysis [9], and link prediction [10, 11]. Among them, 3-node connected subgraphs, which are the building blocks for higher-order motifs, are explored most often.

Further, the triangle structure, or the triadic closure [12] from a temporal perspective, has been revealed to be a natural phenomenon of networks across different areas [3, 13]. Nodes sharing a common neighbour are more likely to connect with each other. For example, in an undirected friendship network, there is an increased likelihood for two people having a common friend to become friends [14]; in a directed citation network, a paper cites two papers where one tends to cite the other [15]; and in a signed directed trust network, when Alice distrusts Bob, Alice discounts anything recommended by Bob [16].

The classic measure of triangle formation is the local clustering coefficient [17], which is defined as the ratio of the number of triangles formed with a node (referred to as node i) to the number of triangles that i could possibly form with its neighbours. Note that in this definition, the focal node i serves as the centre-node in an open triad [18]. To emphasise, an open triad is an unordered pair of edges sharing one node. With its focus on node i, the coefficient describes the extent to which edges congregate around that node. As a standard metric to describe networks, the clustering coefficient has been widely used in network analysis, such as the study of community structure [19, 20], the discovery of structural roles [21] and the detection of anomalous objects [22]. The extensions of the local clustering coefficient have been thoroughly discussed for weighted networks [23–25], directed networks [26] and signed networks [27, 28]. Another metric for triangle formation, with a focus on an edge (referred to as $e_{ij}$, connecting nodes i and j), is the edge clustering coefficient [29], which evaluates to what extent nodes cluster around this edge.

A recent study has proposed another interesting triangle formation measure, i.e., the local closure coefficient [30].
With the focal node i as the end-node of an open triad, it is quantified as the ratio of twice the number of triangles containing i to the number of open triads with i as the end-node. Conceptually, the local clustering coefficient measures the phenomenon that two friends of mine are also friends themselves, while the local closure coefficient captures the phenomenon that a friend of my friend is also a friend of mine. This new metric has been proven to be a useful tool in several network analysis tasks such as community detection and link prediction [30]. Together with the two measures mentioned above, we propose a classification diagram of all three triangle formation measures (Fig 1).

Fig 1. Classification diagram of triangle formation measures. In each of the two node-based measures, the focal node is painted in blue, and the dotted edge represents the potential closing edge in an open triad. In the edge-based measure, the focal edge is in blue, and the dotted outline circle represents the potential node that forms a triangle. https://doi.org/10.1371/journal.pone.0253822.g001

The closure coefficient was originally defined for undirected binary networks. However, in real-world complex networks, the relationships between components can be nonreciprocal (a follower is often not followed back by the followee), heterogeneous (trade volumes between countries vary significantly), and negative (an individual can be disliked or distrusted). In this paper, from the end-node perspective, we propose the directed closure coefficient [31] to measure triangle formation in binary directed networks, and we extend it to weighted directed networks and weighted signed directed networks. Since each of the three edges can take either direction in a directed triangle, there are eight different triangles in total.
According to the direction of the closing edge, i.e., the edge that closes an open triad and forms a triangle, we classify them into two groups (emanating from or pointing to the focal node, as shown in Fig 2a). Based on that, we propose the source closure coefficient and the target closure coefficient, respectively. Further, from a transitive perspective, we categorise all directed triangles into four patterns: (i) a head-of-path pattern, where the focal node is at the beginning of a length-2 path; (ii) a mid-of-path pattern, where the focal node serves as an intermediate node in a length-2 path; (iii) an end-of-path pattern, where the focal node is the endpoint of a length-2 path; (iv) a cyclic pattern, where the focal node is in a cyclical path (Fig 2b). The definitions of the four closure patterns are given accordingly. Similarly, the classic directed clustering coefficient can also be categorised into four patterns [26, 32], which are found to be useful features in classifying directed networks.

Our evaluations have revealed some interesting properties of the proposed directed closure coefficient and its patterns. A correlation analysis on various networks shows that the directed closure coefficient provides different information on triangle formation than the classic directed clustering coefficient. Besides, the correlations among the eight patterns (four closure patterns and four clustering patterns) show that many types of directed networks demonstrate distinctive characteristics.

We further apply the proposed coefficients to two machine learning tasks. First, at network-level, we show that adding the four closure patterns in network classification improves the accuracy significantly. Also, through an analysis of feature importance, we show that compared to the four clustering patterns, the four closure patterns are more important features in classifying networks.
Second, in a link prediction task, we show that at node-level, the source and target closure coefficients can be used together with common neighbours as effective predictors to improve the performance, especially in food webs, software networks, web graphs and word adjacency networks.

Fig 2. Taxonomy of directed triangles. Two solid edges connecting nodes i, j and k form an open triad, which is closed by a dotted edge connecting nodes i and k. Focal node i, painted in blue, is the end-node of an open triad. (a) Eight triangles are classified into two groups according to the direction of the closing edge. The first row shows the group where the focal node serves as the source node of the closing edge; the second row shows the group where the focal node serves as the target. (b) Eight triangles are classified into four groups from a transitive perspective. In six transitive triads, three different patterns are distinguished by the position of node i in a length-2 path (emphasised by grey curved arrows): head-of-path, mid-of-path, and end-of-path patterns. The remaining two non-transitive triads are classified as a cyclic pattern. https://doi.org/10.1371/journal.pone.0253822.g002

In summary, we propose 1) the directed closure coefficient as another measure of triangle formation in directed networks (and its extension to weighted and signed networks); 2) the source closure coefficient and the target closure coefficient; and 3) the four closure patterns from a transitive perspective. Through multiple experiments, we exhibit the intrinsic properties of the proposed metrics and how they can be used to improve some common network analysis tasks.

2 Preliminaries

This section introduces the preliminary knowledge of our work, including the classic clustering coefficient, its extension in directed networks, and the closure coefficient.
2.1 Clustering coefficient

The clustering coefficient was originally proposed to measure the cliquishness of a neighbourhood in an undirected graph [17]. Let $G = (V, E)$ be an undirected graph on a node set $V$ (the number of nodes is $|V|$) and an edge set $E$, without multiple edges and self-loops. The adjacency matrix of $G$ is denoted as $A = \{a_{ij}\}$, where $a_{ij} = 1$ if there is an edge between node i and node j, and $a_{ij} = 0$ otherwise. We denote the degree of node i as $d_i = \sum_j a_{ij}$. For any node $i \in V$, the local clustering coefficient is calculated as the number of triangles formed with node i and its neighbours (labelled as $T(i)$), divided by the number of open triads with i as the centre-node (labelled as $OTC(i)$):

$$C(i) = \frac{T(i)}{OTC(i)} = \frac{\frac{1}{2}\sum_j \sum_k a_{ij} a_{ik} a_{jk}}{\frac{1}{2} d_i (d_i - 1)}. \tag{1}$$

We assume that $C(i)$ is well defined. Clearly, $C(i) \in [0, 1]$.

In order to measure clustering or triangle formation at the network-level, the average clustering coefficient is introduced by averaging the local clustering coefficient over all nodes (an undefined local clustering coefficient is treated as zero): $\overline{C} = \frac{1}{|V|} \sum_{i \in V} C(i)$.

Another option to measure triangle formation at the network-level is the global clustering coefficient [33], which is defined as the fraction of open triads that form triangles in the entire network:

$$C = \frac{\sum_i \sum_j \sum_k a_{ij} a_{ik} a_{jk}}{\sum_{i \in V} d_i (d_i - 1)}. \tag{2}$$

Note that the global clustering coefficient is not equivalent to the average clustering coefficient. On some occasions, they can be very distinct from each other.

2.2 Directed clustering coefficient

Fagiolo [26] proposed an extension of the local clustering coefficient to directed networks, which considers all possible directed triangles formed around a focal node. In total, there are eight different triangles (each of the three edges can have two directions). When a directed open triad (or a directed triangle) contains bidirectional edges, it is treated as a combination of open triads (or triangles) with only unidirectional edges (Fig 3).
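As an illustration of the undirected definitions in Section 2.1 (Eqs 1 and 2), the following is a minimal Python sketch. The toy graph, the adjacency-set representation and the function names are assumptions made for this example and are not taken from the paper's code.

```python
from itertools import combinations

# Hypothetical toy graph (not from the paper): a triangle 0-1-2
# plus a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}


def local_clustering(adj, i):
    """Eq 1: triangles at i over open triads centred at i (None if undefined)."""
    d = len(adj[i])
    if d < 2:
        return None
    triangles = sum(1 for j, k in combinations(adj[i], 2) if k in adj[j])
    return triangles / (d * (d - 1) / 2)


def average_clustering(adj):
    """Mean of the local values; undefined values are treated as zero."""
    return sum(local_clustering(adj, i) or 0.0 for i in adj) / len(adj)


def global_clustering(adj):
    """Eq 2: sum of a_ij * a_ik * a_jk over sum of d_i * (d_i - 1)."""
    closed = sum(1 for i in adj for j in adj[i] for k in adj[i]
                 if j != k and k in adj[j])
    triads = sum(len(adj[i]) * (len(adj[i]) - 1) for i in adj)
    return closed / triads if triads else None


print(local_clustering(adj, 0))   # 1/3
print(average_clustering(adj))    # about 0.583 (7/12)
print(global_clustering(adj))     # 0.6
```

On this toy graph the average value (7/12) and the global value (0.6) differ, which matches the remark that the two network-level measures are not equivalent.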
Let us denote $A = \{a_{ij}\}$ as the adjacency matrix of a directed graph $G^D = (V, E)$, where $a_{ij} = 1$ if there is an edge from node i to node j, and $a_{ij} = 0$ otherwise. The degree of node i is denoted as $d_i$, including both outgoing edges and incoming edges: $d_i = d_i^{out} + d_i^{in} = \sum_j a_{ij} + \sum_j a_{ji}$. $d_i^{\leftrightarrow}$ denotes the degree of bidirectional edges of i: $d_i^{\leftrightarrow} = \sum_j a_{ij} a_{ji}$.

Fig 3. Dealing with bidirectional edges. The first row shows that an open triad with one bidirectional edge is counted as two unidirectional open triads; the second row shows that a triangle with two bidirectional edges is counted as four unidirectional triangles. https://doi.org/10.1371/journal.pone.0253822.g003

The local directed clustering coefficient is thus defined as the number of directed triangles formed with node i and its neighbours (counted as unidirectional ones, labelled as $T^D(i)$), divided by twice the number of directed open triads with i as the centre-node (labelled as $OTC^D(i)$):

$$C^D(i) = \frac{T^D(i)}{2\,OTC^D(i)} = \frac{\frac{1}{2}\sum_j \sum_k (a_{ij} + a_{ji})(a_{ik} + a_{ki})(a_{jk} + a_{kj})}{d_i (d_i - 1) - 2 d_i^{\leftrightarrow}}. \tag{3}$$

Note that $OTC^D(i)$ equals $\frac{1}{2}[d_i (d_i - 1) - 2 d_i^{\leftrightarrow}]$. $OTC^D(i)$ is multiplied by two because the edge that closes a directed open triad can take two directions.

Similarly, the average directed clustering coefficient of the entire network is defined as: $\overline{C}^D = \frac{1}{|V|} \sum_{i \in V} C^D(i)$.

An expected alternative, i.e., the global directed clustering coefficient, has not appeared in the literature. We therefore give a definition here:

Definition 2.1.
The global directed clustering coefficient of a directed network, denoted $C^D$, is defined as the fraction of directed open triads that form triangles in the entire network:

$$C^D = \frac{\frac{1}{2}\sum_i \sum_j \sum_k (a_{ij} + a_{ji})(a_{ik} + a_{ki})(a_{jk} + a_{kj})}{\sum_{i \in V} \left(d_i (d_i - 1) - 2 d_i^{\leftrightarrow}\right)}. \tag{4}$$

The numerator here equals three times the number of directed triangles in the entire network (each node of a triangle contributes an open triad with it as the centre-node).

2.3 Closure coefficient

Recently, Yin et al. [30] proposed the local closure coefficient and thus closed a gap in measuring triangle formation on undirected networks. Different from the ordinary centre-node focus in the local clustering coefficient, this definition is based on the end-node of an open triad. Recall that an open triad is an unordered pair of edges sharing one node. For example, in an open triad ijk with two edges ij and jk, there is no difference between (ij, jk) and (jk, ij).

Reusing the above notations for undirected graphs, the local closure coefficient of node i is defined as two times the number of triangles formed with i (labelled as $T(i)$), divided by the number of open triads with i as the end-node (labelled as $OTE(i)$):

$$E(i) = \frac{2T(i)}{OTE(i)} = \frac{\sum_j \sum_k a_{ij} a_{ik} a_{jk}}{\sum_{j \in N(i)} (d_j - 1)}, \tag{5}$$

where $N(i)$ denotes the set of neighbours of node i. $T(i)$ is multiplied by two because each triangle contains two open triads with i as the end-node. When a triangle is actually formed (e.g., with nodes i, j and k), the focal node i can be viewed as the centre-node in one open triad (jik) or as the end-node in two open triads (ijk and ikj). Obviously, $E(i) \in [0, 1]$.
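To make the end-node perspective of Eq 5 concrete, here is a minimal Python sketch of the local closure coefficient. The toy graph and the name `local_closure` are hypothetical choices for this example, not the authors' implementation.

```python
# Hypothetical toy graph (not from the paper): a triangle 0-1-2
# plus a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}


def local_closure(adj, i):
    """Eq 5: E(i) = 2T(i) / OTE(i); returns None when OTE(i) is zero."""
    # OTE(i): each neighbour j contributes its d_j - 1 edges that do not lead back to i.
    open_triads = sum(len(adj[j]) - 1 for j in adj[i])
    if open_triads == 0:
        return None
    # Ordered count of closed open triads i-j-k, which equals 2T(i).
    closed = sum(1 for j in adj[i] for k in adj[j] if k != i and k in adj[i])
    return closed / open_triads


for node in sorted(adj):
    print(node, local_closure(adj, node))
```

In this sketch node 0 obtains a closure coefficient of 1 while its clustering coefficient under Eq 1 would be 1/3, and the pendant node 3 obtains 0 although its clustering coefficient is undefined, which illustrates that the two measures look at different open triads.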
At the network-level, the average closure coefficient is then defined as the mean of the local closure coefficient over all nodes: $\overline{E} = \frac{1}{|V|} \sum_{i \in V} E(i)$. Analogous to the global clustering coefficient (Eq 2), we can give a global version of the closure coefficient:

$$E = \frac{2 \sum_{i \in V} T(i)}{\sum_{i \in V} \sum_{j \in N(i)} (d_j - 1)}. \tag{6}$$

The numerator is equal to six times the number of triangles in the entire network (each node of a triangle contributes two open triads with it as the end-node), and the denominator counts every open triad in the network twice, once from each of its end-nodes. However, this definition is actually equivalent to the global clustering coefficient (Eq 2), as globally the difference in the position of the focal node does not surface.

Proposition 1. In any undirected network, $E = C$.

Proof. Since globally the neighbourhood relationship is reciprocal, $\sum_{i \in V} \sum_{j \in N(i)} (d_j - 1)$ can be written as $\sum_{j \in V} \sum_{i \in N(j)} (d_j - 1)$, which equals $\sum_{j \in V} d_j (d_j - 1)$. Then we have $\sum_{i \in V} \sum_{j \in N(i)} (d_j - 1) = \sum_{i \in V} d_i (d_i - 1)$, so the denominators of Eq 6 and Eq 2 coincide. The numerators coincide as well, because $\sum_j \sum_k a_{ij} a_{ik} a_{jk} = 2T(i)$ by Eq 1. Thus, $E = C$.

3 Closure coefficient in directed networks

The local closure coefficient has been proven to be a useful metric in undirected networks [30]. In this section, we first provide a general extension of it to directed networks, i.e., the local directed closure coefficient. We further propose the closure coefficients of particular patterns. Finally, we extend it to weighted (signed) directed networks.

3.1 Closure coefficient in binary directed networks

Motivated by the closure coefficient and the directed clustering coefficient, we aim to measure directed triangle formation from the end-node of an open triad. There are eight different directed triangles, and similarly a triangle (or an open triad) with bidirectional edges is treated as a combination of triangles (or open triads) with only unidirectional edges (Fig 3). Reusing the notations in Section 2, we now give the definition of the closure coefficient in directed networks.

Definition 3.1.
The local directed closure coefficient of node i in a directed network, denoted $E^D(i)$, is defined as twice the number of directed triangles formed with node i (labelled as $T^D(i)$), divided by twice the number of directed open triads with i as the end-node (labelled as $OTE^D(i)$):

$$E^D(i) = \frac{2T^D(i)}{2\,OTE^D(i)} = \frac{\sum_j \sum_k (a_{ij} + a_{ji})(a_{ik} + a_{ki})(a_{jk} + a_{kj})}{2 \sum_{j \in N(i)} (a_{ij} + a_{ji})\left(d_j - (a_{ij} + a_{ji})\right)}. \tag{7}$$

When the neighbours of i are solely connected to i, the local directed closure coefficient is undefined. In real-world networks, however, nodes with undefined closure coefficients are very rare. $T^D(i)$ is multiplied by two since each triangle contains two open triads with i as the end-node. $OTE^D(i)$ is multiplied by two because the closing edge of a directed open triad can take two directions. Obviously, $E^D(i) \in [0, 1]$. When the adjacency matrix $A$ is symmetric (the network becomes undirected), Eq 7 reduces to Eq 5, i.e., $E^D(i) = E(i)$.

Similarly, in order to measure at the network-level, we propose the definition of an average directed closure coefficient and a global directed closure coefficient.

Definition 3.2. The average directed closure coefficient of a directed network, denoted $\overline{E}^D$, is defined as the average of the local directed closure coefficient over all nodes:

$$\overline{E}^D = \frac{1}{|V|} \sum_{i \in V} E^D(i), \tag{8}$$

in which an undefined local directed closure coefficient is treated as zero.

In the case of a random network, where each directed edge occurs with a probability $p$, one has that $\mathbb{E}[E^D(i)] = p$.

Definition 3.3.
The global directed closure coefficient of a directed network, denoted $E^D$, is defined as:

$$E^D = \frac{2 \sum_{i \in V} T^D(i)}{2 \sum_{i \in V} \sum_{j \in N(i)} (a_{ij} + a_{ji})\left(d_j - (a_{ij} + a_{ji})\right)}, \tag{9}$$

where the numerator equals six times the number of directed triangles in the entire network (each node of a triangle contributes two open triads with it as the end-node), and the denominator equals twice the number of directed open triads counted from their end-nodes across the network. Similar to Proposition 1 and its proof, the global directed closure coefficient is equivalent to the global directed clustering coefficient (Eq 4).

Proposition 2. In any directed network, $E^D = C^D$.

3.2 Closure coefficients of particular patterns

In addition to a general measure, we propose to have a closer look at the directed closure coefficients of particular patterns in order to gain a deeper understanding and fully realise the potential of this metric. Recall that in Fig 2(a), when a directed triangle is constructed from an end-node-based open triad, the closing edge is incident to the focal node. Therefore, we propose to classify directed triangles into two groups according to the direction of the closing edge: one group where the focal node serves as the source node of the closing edge, another group where the focal node serves as the target. Two definitions are given accordingly.

Definition 3.4. For a given node i in a directed network, the source closure coefficient, denoted $E^{src}(i)$, and the target closure coefficient, denoted $E^{tgt}(i)$, are defined as:

$$E^{src}(i) = \frac{T^{src}(i)}{OTE^D(i)} = \frac{\sum_j \sum_k (a_{ij} + a_{ji})(a_{jk} + a_{kj})\, a_{ik}}{\sum_{j \in N(i)} (a_{ij} + a_{ji})\left(d_j - (a_{ij} + a_{ji})\right)}, \tag{10}$$

$$E^{tgt}(i) = \frac{T^{tgt}(i)}{OTE^D(i)} = \frac{\sum_j \sum_k (a_{ij} + a_{ji})(a_{jk} + a_{kj})\, a_{ki}}{\sum_{j \in N(i)} (a_{ij} + a_{ji})\left(d_j - (a_{ij} + a_{ji})\right)}. \tag{11}$$

$T^{src}(i)$ indicates the number of triangles where the focal node acts as the source node of the closing edge.
Since we view triangles as being built from an end-node perspective, certain triangles (the ones where the focal node has two outgoing edges) are counted twice. Thus, $0 \le T^{src}(i) \le 2T^D(i)$. Similarly, $T^{tgt}(i)$ denotes the number of triangles where the focal node acts as the target node of the closing edge. Comparing these two equations with the directed closure coefficient (Eq 7), the denominators are $OTE^D(i)$ instead of $2\,OTE^D(i)$. This is because the closing edge here can only take one direction, either outgoing or incoming, which ensures that both coefficients lie in the range $[0, 1]$. It is obvious that $T^{src}(i) + T^{tgt}(i) = 2T^D(i)$, which then gives $E^{src}(i) + E^{tgt}(i) = 2E^D(i)$. On a small network, Fig 4 shows in a detailed table how the source closure coefficient and the target closure coefficient are calculated.

Fig 4. An example of calculating the source closure coefficient and target closure coefficient. https://doi.org/10.1371/journal.pone.0253822.g004

These two metrics evaluate the extent to which the focal node acts as the source node or the target node of the closing edges in triangle formation. Note that there are no analogous definitions for the clustering coefficient, because there the closing edge is not incident to the focal node, which serves as the centre-node of the open triad. In Section 4.3, we show how the source and target closure coefficients can be used to improve the performance in a link prediction task.

Furthermore, several studies have shown that the three-node transitive closure (also called the feedforward loop) prevails in many real-world networks [3, 13, 26, 32]. Thus, we propose to categorise the eight directed triangles into four patterns from a transitive perspective: three transitive patterns distinguished by the position of the focal node in a length-2 path, plus one non-transitive pattern (Fig 2b).
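Returning to Definitions 3.1 and 3.4, the sketch below transcribes Eqs 7, 10 and 11 directly into Python and checks the identity $E^{src}(i) + E^{tgt}(i) = 2E^D(i)$ on a toy digraph. The graph, the successor-set representation and all function names are assumptions for illustration only. The denominator uses $d_j$, matching $OTE^D(i)$ in Eq 7, and the graph is assumed to have no self-loops.

```python
# Hypothetical toy digraph (not from the paper):
# a feed-forward triangle 0->1, 1->2, 0->2, plus an extra edge 3->0.
succ = {0: {1, 2}, 1: {2}, 2: set(), 3: {0}}


def a(succ, u, v):
    """Adjacency entry a_uv: 1 if there is an edge u -> v, else 0."""
    return 1 if v in succ[u] else 0


def directed_closure(succ, i):
    """Return E^D(i), E^src(i) and E^tgt(i), or None if OTE^D(i) is zero."""
    nodes = list(succ)
    und = lambda u, v: a(succ, u, v) + a(succ, v, u)            # a_uv + a_vu
    deg = {u: sum(und(u, v) for v in nodes if v != u) for u in nodes}  # d_u
    # OTE^D(i): directed open triads with i as the end-node.
    ote = sum(und(i, j) * (deg[j] - und(i, j)) for j in nodes if j != i)
    if ote == 0:
        return None
    pairs = [(j, k) for j in nodes for k in nodes if len({i, j, k}) == 3]
    # Closing edge i -> k (source) or k -> i (target).
    t_src = sum(und(i, j) * und(j, k) * a(succ, i, k) for j, k in pairs)
    t_tgt = sum(und(i, j) * und(j, k) * a(succ, k, i) for j, k in pairs)
    return {
        "directed": (t_src + t_tgt) / (2 * ote),   # Eq 7
        "source": t_src / ote,                      # Eq 10
        "target": t_tgt / ote,                      # Eq 11
    }


for node in sorted(succ):
    print(node, directed_closure(succ, node))
```

The double loop over all node pairs mirrors the adjacency-matrix form of the equations and is meant for readability on small graphs, not for speed.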
Before introducing the definitions of the directed closure coefficients of these four patterns, we first characterise four types of directed open triads with the focal node as the end-node; a comparison with centre-node focused triads is also provided (Fig 5). Then we give the following definitions.

Fig 5. Two groups of directed open triads. (a) Four different open triads with the focal node i as the end-node. Two arrows in the superscript describe the directions of the two edges: the first arrow depicts an edge from i to j (→) or from j to i (←); the second arrow depicts an edge from j to k (→) or from k to j (←). (b) Three different open triads with i as the centre-node. The first arrow depicts the edge direction between i and j, while the second arrow depicts the edge direction between i and k. There are three instead of four since, when the focal node is in the centre, nodes j and k are symmetric to it. https://doi.org/10.1371/journal.pone.0253822.g005

Definition 3.5. The directed closure coefficients of four patterns, i.e., the head closure coefficient, denoted $E^{head}(i)$; the end closure coefficient, denoted $E^{end}(i)$; the mid closure coefficient, denoted $E^{mid}(i)$; and the cyclic closure coefficient, denoted $E^{cyc}(i)$, are defined as:

$$E^{head}(i) = \frac{2T^{head}(i)}{OTE^{\rightarrow\rightarrow}(i) + OTE^{\rightarrow\leftarrow}(i)} = \frac{\sum_j \sum_k a_{ij} a_{ik} (a_{jk} + a_{kj})}{\sum_{j \in N(i)} a_{ij}\left(d_j - (a_{ij} + a_{ji})\right)},$$

$$E^{end}(i) = \frac{2T^{end}(i)}{OTE^{\leftarrow\rightarrow}(i) + OTE^{\leftarrow\leftarrow}(i)} = \frac{\sum_j \sum_k a_{ji} a_{ki} (a_{jk} + a_{kj})}{\sum_{j \in N(i)} a_{ji}\left(d_j - (a_{ij} + a_{ji})\right)},$$

$$E^{mid}(i) = \frac{2T^{mid}(i)}{OTE^{\rightarrow\leftarrow}(i) + OTE^{\leftarrow\rightarrow}(i)} = \frac{\sum_j \sum_k (a_{ji} a_{ik} a_{jk} + a_{ki} a_{ij} a_{kj})}{\sum_{j \in N(i)} \left(a_{ij}(d_j^{in} - a_{ij}) + a_{ji}(d_j^{out} - a_{ji})\right)},$$

$$E^{cyc}(i) = \frac{2T^{cyc}(i)}{OTE^{\rightarrow\rightarrow}(i) + OTE^{\leftarrow\leftarrow}(i)} = \frac{\sum_j \sum_k (a_{ji} a_{ik} a_{kj} + a_{ki} a_{ij} a_{jk})}{\sum_{j \in N(i)} \left(a_{ji}(d_j^{in} - a_{ij}) + a_{ij}(d_j^{out} - a_{ji})\right)}.$$

As shown above, the numerator of each coefficient equals twice the number of particular triangles; the denominator can be calculated with the neighbourhood information of node i and the degree information of i's neighbours.

Fagiolo [26] and Ahnert [32] also proposed four patterns for the directed clustering coefficient. In order to better compare the four closure patterns with the four clustering patterns, we briefly list their equations here:

$$C^{head}(i) = \frac{T^{head}(i)}{2\,OTC^{\rightarrow\rightarrow}(i)}, \quad C^{end}(i) = \frac{T^{end}(i)}{2\,OTC^{\leftarrow\leftarrow}(i)}, \quad C^{mid}(i) = \frac{T^{mid}(i)}{OTC^{\rightarrow\leftarrow}(i)}, \quad C^{cyc}(i) = \frac{T^{cyc}(i)}{OTC^{\rightarrow\leftarrow}(i)}.$$

The significance of defining the four closure patterns is twofold. First, in node-level analysis, they can be applied directly to measure whether a node of interest is more of an initiator (higher value of the head closure coefficient), an intermediary (higher value of the mid closure coefficient) or a target (higher value of the end closure coefficient). Second, after averaging over all nodes, the four closure patterns can also serve as features at network-level. Section 4.2 shows how the four closure patterns are used in a supervised learning task to classify different types of directed networks.

3.3 Closure coefficient in weighted networks

So far, the study has focused on binary networks, where the value of every edge is either one or zero. In many networks, however, we need a more accurate representation of the relationships between nodes, such as the frequency of contact in a social network or the traffic flow in a road network. This is why we are also interested in defining a closure coefficient for weighted networks.

We begin with weighted undirected networks. Several versions of weighted clustering coefficients have been summarised in [34].
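The four pattern coefficients of Definition 3.5 can be sketched in Python as a direct transcription of their numerators and denominators. The toy digraph, the dictionary layout and the name `closure_patterns` are hypothetical and chosen only for this illustration; self-loops are assumed to be absent, and a pattern with a zero denominator is reported as `None`.

```python
# Hypothetical toy digraph (not from the paper):
# a feed-forward triangle 0->1, 1->2, 0->2, plus an extra edge 3->0.
succ = {0: {1, 2}, 1: {2}, 2: set(), 3: {0}}


def closure_patterns(succ, i):
    """Head, end, mid and cyclic closure coefficients of node i (Definition 3.5)."""
    a = lambda u, v: 1 if v in succ[u] else 0                # a_uv
    d_out = {u: len(succ[u]) for u in succ}
    d_in = {u: sum(a(v, u) for v in succ) for u in succ}
    others = [v for v in succ if v != i]
    num = {"head": 0, "end": 0, "mid": 0, "cyc": 0}
    den = {"head": 0, "end": 0, "mid": 0, "cyc": 0}
    for j in others:
        # Edges at j that do not connect j with i.
        rest = d_in[j] + d_out[j] - (a(i, j) + a(j, i))
        den["head"] += a(i, j) * rest
        den["end"] += a(j, i) * rest
        den["mid"] += a(i, j) * (d_in[j] - a(i, j)) + a(j, i) * (d_out[j] - a(j, i))
        den["cyc"] += a(j, i) * (d_in[j] - a(i, j)) + a(i, j) * (d_out[j] - a(j, i))
        for k in others:
            if k == j:
                continue
            num["head"] += a(i, j) * a(i, k) * (a(j, k) + a(k, j))
            num["end"] += a(j, i) * a(k, i) * (a(j, k) + a(k, j))
            num["mid"] += a(j, i) * a(i, k) * a(j, k) + a(k, i) * a(i, j) * a(k, j)
            num["cyc"] += a(j, i) * a(i, k) * a(k, j) + a(k, i) * a(i, j) * a(j, k)
    return {p: (num[p] / den[p] if den[p] else None) for p in num}


for node in sorted(succ):
    print(node, closure_patterns(succ, node))
```

In the feed-forward triangle of this toy graph, node 0 scores highest on the head pattern, node 1 on the mid pattern and node 2 on the end pattern, in line with the initiator, intermediary and target reading of the three transitive patterns.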
Among these versions, a definition given by Onnela et al. [24] and another given by Zhang and Horvath [25] are often employed. After normalisation (maximum weight normalised to one), the former takes a geometric average of the weights of actually formed triangles, divided by the number of potential triangles, which implies that all edges take the maximum weight in the denominator. The latter chooses a simple product of the weights of formed triangles, divided by the product of the two weights of an open triad, which implies that the potential triadic closing edge takes the maximum weight.

In our definition of the weighted closure coefficient, similar to the method proposed by Zhang and Horvath [25], we choose to assign a maximum weight to the closing edge. In a weighted graph $G^W$ described by its weight matrix $W = \{w_{ij}\}$, we suppose $w_{ij} \in [0, 1]$ (normalised by the maximum weight), and the strength of node i is $s_i = \sum_j w_{ij}$.

Definition 3.6. The weighted closure coefficient of node i in a weighted network, denoted $E^W(i)$, is defined as:

$$E^W(i) = \frac{\sum_j \sum_k w_{ij} w_{ik} w_{jk}}{\sum_{j \in N(i)} w_{ij} (s_j - w_{ij})}. \tag{12}$$

Obviously, $E^W(i) \in [0, 1]$. When the weight matrix becomes binary, Eq 12 reduces to Eq 5, i.e., $E^W(i) = E(i)$.

In a similar approach, the definition of the closure coefficient in weighted directed networks can be extended from Eq 7. Let us denote $W = \{w_{ij}\}$ as the weight matrix of a weighted directed graph $G^{W,D}$, with $w_{ij} \in [0, 1]$. The strength of node i is denoted by $s_i$ ($s_i = \sum_j w_{ij} + \sum_j w_{ji}$). Then we have the following definition:

Definition 3.7. The weighted directed closure coefficient of node i, denoted $E^{W,D}(i)$, is defined as:

$$E^{W,D}(i) = \frac{\sum_j \sum_k (w_{ij} + w_{ji})(w_{ik} + w_{ki})(w_{jk} + w_{kj})}{2 \sum_{j \in N(i)} (w_{ij} + w_{ji})\left(s_j - (w_{ij} + w_{ji})\right)}. \tag{13}$$

This definition can also be used in weighted signed networks ($w_{ij} \in [-1, 1]$), with a modified definition of $s_i$ ($s_i = \sum_j |w_{ij}| + \sum_j |w_{ji}|$).
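As a rough illustration of Definitions 3.6 and 3.7 for non-negative weights, the Python sketch below implements Eq 12 on a dictionary-of-dictionaries weight matrix and obtains Eq 13 by symmetrising the weights first, since Eq 13 has the form of Eq 12 applied to $u_{ij} = w_{ij} + w_{ji}$ with an extra factor of one half. The toy weights and function names are assumptions for this example; the signed case with its modified strength is not covered.

```python
# Hypothetical weighted toy graph (not from the paper): a triangle 0-1-2
# plus a pendant node 3, with symmetric weights in [0, 1].
w = {
    0: {1: 1.0, 2: 0.5, 3: 1.0},
    1: {0: 1.0, 2: 0.5},
    2: {0: 0.5, 1: 0.5},
    3: {0: 1.0},
}


def weighted_closure(w, i):
    """Eq 12: weighted closure coefficient of node i (None if the denominator is zero)."""
    strength = {u: sum(w[u].values()) for u in w}            # s_u
    den = sum(w_ij * (strength[j] - w_ij) for j, w_ij in w[i].items())
    if den == 0:
        return None
    num = sum(w_ij * w[i][k] * w_jk
              for j, w_ij in w[i].items()
              for k, w_jk in w[j].items()
              if k != i and k in w[i])
    return num / den


def weighted_directed_closure(wd, i):
    """Eq 13 for non-negative weights, via the symmetrised weights u_ij = w_ij + w_ji."""
    u = {v: {} for v in wd}                                  # every node must be a key of wd
    for src, nbrs in wd.items():
        for dst, x in nbrs.items():
            u[src][dst] = u[src].get(dst, 0.0) + x
            u[dst][src] = u[dst].get(src, 0.0) + x
    value = weighted_closure(u, i)
    return None if value is None else value / 2


# Hypothetical directed toy graph with unit weights: 0->1, 1->2, 0->2, 3->0.
wd = {0: {1: 1.0, 2: 1.0}, 1: {2: 1.0}, 2: {}, 3: {0: 1.0}}

print(weighted_closure(w, 0))            # 2/3
print(weighted_directed_closure(wd, 0))  # 0.5
```

With all weights set to one, `weighted_closure` returns the same value as the binary closure coefficient of Eq 5 for the same graph, which is the reduction stated after Eq 12.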
In many settings, the weights of relationships can be both positive and negative, as a person may trust or distrust others with different levels of intensity. Clearly, $E^{W,D}(i)$ varies in $[-1, 1]$. It is positive when the positive triangles formed around the focal node outweigh the negative ones. It equals zero when no triangles are formed with the focal node, or when positive triangles and negative triangles are balanced.

Through a brief case study on the Bitcoin Alpha trust network (TR-BTCALPHA) [35], we illustrate how the weighted directed closure coefficient can provide new understandings in network analysis. TR-BTCALPHA is a trust network on a blockchain asset trading platform, where users rate other traders in a range of [−10, 10] in steps of 1, from total distrust to total trust. There are 3,783 nodes representing the users and 24,186 edges representing the ratings in the network. First, without considering weights on edges, we find in Fig 6a that the directed closure coefficient is positively related to the node degree (Pearson correlation coefficient ρ = 0.714), implying that big traders tend to form more trustful cliques. However, when weights are put back, we see in Fig 6b that the correlation between the weighted directed closure coefficient and the node strength becomes very weak (ρ = 0.265): big traders are not significantly better at forming trustful cliques. Also, we detect some nodes with negative closure coefficients, meaning that the negative triangles outweigh the positive ones around them. In line with balance theory [36], which suggests that negative triangles are rare in a trust relationship, we find that only 138 out of 3,783 nodes have formed overall distrustful cliques.

3.4 Computational efficiency

To end this section, we briefly discuss the computational efficiency of the proposed metrics.
Taking the local directed closure coefficient (Definition 3.1) as an example, we use the adjacency matrix of the network to present the equation for ease of understanding and expression, which leads to a cost of up to $O(|V|^2)$ in computation. In actual development, however, after conveniently obtaining the neighbourhood information (both successors and predecessors in directed networks) of a given node, the average-case computational cost is $O(k^2)$, where $k$ is the average degree of the network. The average cost for computing the local directed closure coefficient across the network is therefore $O(|V| \cdot k^2)$. As in most real networks $k \ll |V|$, the computation is fast in large networks.

Fig 6. Case study of the network TR-BTCALPHA. (a) The correlation between directed closure coefficient and node degree (weights ignored). All nodes are plotted as black dots. (b) The correlation between weighted directed closure coefficient and node strength (weights taken into account). 3654 nodes with positive closure coefficients are plotted in red; 138 nodes with negative closure coefficients are plotted in blue. https://doi.org/10.1371/journal.pone.0253822.g006

4 Experiments and analysis

In this section, we evaluate the proposed directed closure coefficient and its patterns in real-world networks. First, we compare it with the classic directed clustering coefficient. Then, we demonstrate that at network-level, the four closure patterns are discriminative in classifying directed networks. Finally, at node-level, we show how the source and target closure coefficients can be applied in a link prediction task. Our code is available at https://github.com/MingshanJia/explore-local-structure.

4.1 Directed closure coefficient in real-world networks

4.1.1 Datasets. We run experiments on 24 directed networks from 6 different domains:

1.
Four trust networks. TR-BTCALPHA and TR-BTC-OTC [37]: two who-trusts-whom networks of users on Bitcoin trading platforms Bitcoin Alpha and Bitcoin OTC; TR-ADVOGATO [38]: a network of trust relationships among users on an online community Advogato; TR-EPINIONS [39]: a who-trust-whom network of members on a general consumer review site Epinions.com. 2. Four food webs. FW-MANGROVE [40]: a what-eats-what network among species found in Florida’s mangroves during the wet season; FW-BAYWET and FW-BAYDRY [41]: two food webs collected from the cypress wetlands of South Florida during the wet season and the dry season; FW-LITTLEROCK [42]: a food web among the species found in Little Rock Lake in Wisconsin. PLOS ONE | https://doi.org/10.1371/journal.pone.0253822 June 25, 2021 11 / 23 PLOS ONE Directed closure coefficient and its patterns 3. Four citation networks. CITcORA [43]: citations among papers indexed by CORA; CITHEPPH [44] and CITHEPPH [45]: citations among papers posted on arxiv.org under the hep-ph and hep-th categories, between 1993 and 2003; CITCITESSER [46]: citations among papers indexed by the CiteSeer digital library. 4. Four software networks. SW-WEKA [47]: a class dependency network of the Weka 3.6.6 framework; SW-LUCENE [48]: a class dependency network of the Lucene 4.1.0 framework; SW-JUNG [47]: a class dependency network within the JUNG 2.0.1 and javax 1.6.0.7 library namespaces; SW-JDK [47]: a class dependency network within the JDK 1.6.0.7 framework. 5. Four web graphs. WEB-STANFORD [49]: a hyperlink network of Stanford University; WEBNOTREDAME [50]: a hyper link network of the University of Notre Dame; WEB-BERKSTAN [49]: a hyperlink network between UC Berkeley and Stanford University; WEB-GOOGLE [49]: a hyperlink network of a portion of the general WWW released in 2002 by Google. 6. Four word adjacency networks. 
WA-JAPANESE, WA-ENGLISH, WA-FRENCH and WA-SPANISH [51]: directed networks of word adjacency in texts of languages including Japanese, English, French and Spanish. Table 1 lists some key statistics of these datasets. We see that in all 24 networks, the average directed closure coefficient is smaller than the average directed clustering coefficient. That is to say, in all these types of networks, fewer triangles are built from end-node-based open triads than from centre-node-based open triads. In food webs, the difference between them is not very big; while in word adjacency networks and two software networks (SW-JUNG and JDK), the directed closure coefficient is several dozen times smaller than the directed clustering coefficient. From the scatter plots of the local directed closure coefficient and the local directed clustering coefficient (Fig 7), we can see their relationship more clearly. First, in most networks covered in our study, the two coefficients have a positive Pearson correlation whereas only in five networks they show negative correlation. However, neither positive nor negative correlations are strong, ranging from −0.296 to 0.518, indicating the directed closure coefficient captures different information on triangle formation than the classic directed clustering coefficient. Secondly, the same type of a network exhibits a similar relationship between the two variables. A visual inspection of Fig 7 indicates that the plots within one type of network (in the same row) are more similar to each other than the plots in between these types of networks (between rows). For example, in citation networks, most points are congregated at the left bottom area; and in word adjacency networks, most nodes have relatively small directed closure coefficient, making most points very close to the horizontal axis. We further explore the correlations amongst the eight patterns, i.e., the four closure patterns and the four clustering patterns. 
In Fig 8, we observe that certain types of networks demonstrate particular characteristics. Specifically, in trust networks, we find strong correlations among almost all four closure patterns (except between $E^{head}$ and $E^{end}$). Also, $C^{mid}$ and $C^{cyc}$ are strongly correlated. In citation networks, the two mid-of-path patterns ($C^{mid}$ and $E^{mid}$) and the two cyclic patterns ($C^{cyc}$ and $E^{cyc}$) are highly correlated. In software networks, the two end-of-path patterns ($C^{end}$ and $E^{end}$) and the two cyclic patterns ($C^{cyc}$ and $E^{cyc}$) have higher correlations. In web graphs, the correlation between $E^{head}$ and $E^{mid}$ and the correlation between $E^{end}$ and $E^{cyc}$ are stronger. Finally, in word adjacency networks, the four closure patterns are strongly correlated with each other. These observations lead us to the following experiment, in which we utilise these patterns as features to classify different types of directed networks.

Table 1. Statistics of datasets, showing the number of nodes (|V|), the number of edges (|E|), the average degree (k), the proportion of reciprocal edges (r), the average directed clustering coefficient (C^D), and the average directed closure coefficient (E^D). Datasets having timestamps on edge creation are marked with (τ).

Network            |V|      |E|      k      r      C^D    E^D
TR-BTCALPHA (τ)    3,783    24,186   6.39   0.832  0.158  0.017
TR-BTC-OTC (τ)     5,881    35,592   6.05   0.792  0.151  0.013
TR-ADVOGATO        6,539    51,127   7.82   0.307  0.148  0.026
TR-EPINIONS        75,879   509K     6.71   0.405  0.110  0.016
FW-MANGROVE        97       1,492    15.38  0.062  0.261  0.185
FW-BAYWET          128      2,106    16.45  0.029  0.177  0.134
FW-BAYDRY          128      2,137    16.70  0.029  0.176  0.135
FW-LITTLEROCK      183      2,494    13.63  0.034  0.173  0.112
CIT-CORA           23,166   91,500   3.95   0.051  0.146  0.055
CIT-HEPTH          27,770   353K     12.70  0.003  0.157  0.061
CIT-HEPPH          34,546   422K     12.20  0.003  0.143  0.053
CIT-CITESEER       384K     1,751K   4.56   0.010  0.092  0.028
SW-WEKA            1,323    4,844    3.66   0.014  0.201  0.021
SW-LUCENE          2,956    10,872   3.68   0.005  0.217  0.029
SW-JUNG            6,120    50,535   8.26   0.010  0.454  0.006
SW-JDK             6,434    53,892   8.38   0.009  0.443  0.006
WEB-STANFORD       282K     2,312K   8.20   0.277  0.378  0.055
WEB-NOTREDAME      326K     1,497K   4.60   0.507  0.159  0.029
WEB-BERKSTAN       685K     7,601K   11.09  0.250  0.400  0.055
WEB-GOOGLE         876K     5,105K   5.83   0.307  0.370  0.097
WA-JAPANESE        2,704    8,300    3.07   0.073  0.139  0.004
WA-ENGLISH         7,381    46,281   6.27   0.090  0.252  0.005
WA-FRENCH          8,325    24,295   2.92   0.037  0.114  0.002
WA-SPANISH         11,586   45,129   3.90   0.091  0.249  0.002

https://doi.org/10.1371/journal.pone.0253822.t001

4.2 Network classification

This section presents the utility of the proposed four closure patterns in classifying different types of directed networks. Previous works have shown that normalised counts of directed triads and triangles, such as the triad significance profile [51] and the clustering signatures [32], are effective attributes in a network classification task. This motivated us to use the four closure patterns in network classification, as they represent a normalised number of directed triangles from the end-node perspective. In order to gain an intuitive understanding of the effect of the four closure patterns on detecting different types of networks, we draw a parallel coordinates plot (Fig 9).
Even without complicated conditional rules, it is clear that some types of networks show discriminative profiles in terms of certain closure patterns. For example, food webs are better separated from other types of networks with respect to their head closure coefficients, word adjacency networks with respect to their end or mid closure coefficients, and web graphs or trust networks with respect to their cyclic closure coefficients.

4.2.1 Setup. To prepare the classification dataset, we calculate, for each network, the averages of the four clustering patterns and of the four closure patterns. We then choose Decision Tree (DT) as the classifier, not only because it is a powerful algorithm, but also because it enables convenient calculation of feature importance. We also include in the experiment two tree-based ensemble models that are more stable and more powerful than a single DT, i.e., the Random Forest (RF) classifier and the Gradient Boosted Decision Tree (GBDT) classifier.

In order to test how useful the four closure patterns are in classifying networks, we feed three sets of features into these models: the first set, the baseline, includes the traditional four clustering patterns [26, 32]; the second set includes the proposed four closure patterns; the third set includes both the clustering patterns and the closure patterns. As the dataset is small, we adopt leave-one-out cross-validation [52] to evaluate the classification performance with the different sets of features. Also, because tree-based models are naturally stochastic, we repeat the experiment 1000 times and report the mean accuracy score.

Fig 7. Correlation between the directed clustering coefficient and the directed closure coefficient, together with the Pearson correlation coefficient ρ. Each dot in the plot represents a node in the network. https://doi.org/10.1371/journal.pone.0253822.g007

Fig 8. Heatmap of the correlations among the eight patterns in 24 networks. https://doi.org/10.1371/journal.pone.0253822.g008

Fig 9. Parallel coordinates plot of 24 networks on eight features, including the four closure patterns and the four clustering patterns. Each vertical axis represents one feature. In order to put all features on a similar scale, the value of each feature is standardised by removing the mean and scaling to unit variance. Different types of networks are painted in different colours, as shown in the legend. Distinct braids of line segments are highlighted by thin rectangles. https://doi.org/10.1371/journal.pone.0253822.g009

4.2.2 Results and discussion. Table 2 shows the mean accuracy of the three classifiers with different sets of features. Comparing the first row and the second row, two classifiers (DT and GBDT) perform better with the closure patterns and one classifier (RF) performs better with the clustering patterns. In order to further study how different features influence the classification results, we take the Random Forest classifier as an example and report the two average confusion matrices obtained with the two different feature sets (Fig 10). We can see that the clustering patterns are better at classifying software networks and web graphs, while the closure patterns are better at categorising food webs and word adjacency networks. This indicates that the information contained in the closure patterns is complementary to the information contained in the clustering patterns. Therefore, combining both would be expected to yield the best classification accuracy, as also illustrated in the third row of Table 2.
Indeed, comparing the first row and the third row, after adding the four closure patterns to the four clustering patterns, we observe significant improvement in all three classifiers, especially in the Decision Tree and Gradient Boosted Decision Tree classifiers (a relative gain of more than 19%). The result demonstrates that the proposed four closure patterns are useful features for telling apart different types of directed networks.

Table 2. Leave-one-out cross-validation accuracy in classifying network types. Three sets of network features (rows) are tested in three tree-based classifiers (columns).

Feature set                                              DT     RF     GBDT
with four clustering patterns                            0.557  0.734  0.667
with four closure patterns                               0.631  0.716  0.708
with four clustering patterns & four closure patterns    0.672  0.765  0.797

https://doi.org/10.1371/journal.pone.0253822.t002

Fig 10. Average confusion matrices of Random Forest model with different feature sets. https://doi.org/10.1371/journal.pone.0253822.g010

Naturally, the next question is how important each feature is in classifying these networks. Adopting the common approach to measuring feature importance in tree-based models [52], we calculate the importance score by computing the normalised total decrease in impurity brought by each feature. After repeating the experiment 1000 times, we report the average importance scores of the eight features in Fig 11. We observe that in the DT and RF classifiers, all four closure patterns have larger importance scores than the four clustering patterns, and the most important feature is $E^{mid}$. In GBDT, although $C^{end}$ has the second largest importance score, the total score of the closure patterns is still larger than that of the clustering patterns. This analysis further illustrates that the proposed four closure patterns are important features in network classification.
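To make the notion of "normalised total decrease in impurity" more concrete, the following is a minimal sketch for a single split per feature. It assumes Gini impurity (the text does not state which impurity measure was used), and the feature values, threshold and labels are hypothetical toy data, not results from the paper. In a full tree, the decrease at each split is additionally weighted by the fraction of samples reaching that node and summed over all splits that use the feature.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting the samples on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Size-weighted impurity of the two child nodes.
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children


def normalise(decreases):
    """Scale per-feature total decreases so that they sum to one."""
    total = sum(decreases.values())
    return {name: (d / total if total else 0.0) for name, d in decreases.items()}


# Hypothetical toy data: four networks, two features, two network types.
labels = ["food web", "food web", "web graph", "web graph"]
features = {
    "E_head": [0.9, 0.8, 0.1, 0.2],  # separates the two types perfectly
    "C_mid": [0.1, 0.9, 0.2, 0.8],   # carries no information about the type
}
decreases = {
    name: impurity_decrease(vals, labels, threshold=0.5)
    for name, vals in features.items()
}
importance = normalise(decreases)
print(importance)
```

In this toy setting the feature that separates the two classes receives the entire importance mass, which is the behaviour the importance scores in Fig 11 summarise over many trees and repetitions.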
4.3 Link prediction in directed networks

Many studies [53–58] have shown that future interactions among nodes can be inferred from the network topology information. The key idea is to compare the proximity or similarity between pairs of nodes, based on either the neighbourhoods [54, 55], the local structures [56] or the whole network [57, 58].

Fig 11. Importance scores of eight features in three tree-based models classifying network types. The scores of the four closure patterns are plotted in blue bars while those of the four clustering patterns are plotted in red bars. https://doi.org/10.1371/journal.pone.0253822.g011

Most existing methods, however, focus solely on undirected networks. In this experiment, we examine whether the information provided by the local directed closure coefficient can be used to enhance the performance of link prediction approaches for directed networks.

As shown in [53], neighbourhood-based methods are simple yet powerful. We choose three classic similarity indices extended for directed networks as the baseline methods [59]. Let $N_{out}(i)$ be the out-neighbour set of node i (consisting of i's successors) and $N_{in}(i)$ be the in-neighbour set (consisting of i's predecessors). The set of all neighbours $N(i)$ is the union of the two: $N(i) = N_{out}(i) \cup N_{in}(i)$. For an ordered pair of nodes (s, t), the three baseline indices are defined below:

1. Directed Common Neighbours index: $DiCN(s, t) = |N_{out}(s) \cap N_{in}(t)|$,
2. Directed Adamic-Adar index: $DiAA(s, t) = \sum_{u \in N_{out}(s) \cap N_{in}(t)} \frac{1}{\log |N(u)|}$,
3. Directed Resource Allocation index: $DiRA(s, t) = \sum_{u \in N_{out}(s) \cap N_{in}(t)} \frac{1}{|N(u)|}$.

4.3.1 Proposed indices. Combining the idea of the Common Neighbours index and the source and target closure coefficients (Definition 3.4), we propose two indices to measure the directed closeness in directed networks.

Definition 4.1. For an ordered pair of nodes (s, t), the closure closeness index, denoted CCI(s, t), and the extra closure closeness index, denoted ECCI(s, t), are defined as:

$$CCI(s, t) = |N_{out}(s) \cap N_{in}(t)| \cdot \left(E^{src}(s) + E^{tgt}(t)\right),$$
$$ECCI(s, t) = |N(s) \cap N(t)| \cdot \left(E^{src}(s) + E^{tgt}(t)\right).$$

Unlike the closure closeness index, the extra closure closeness index uses the set of all neighbours, because the source closure coefficient of node s and the target closure coefficient of node t can themselves bring in the direction inclination.

4.3.2 Setup. We model a directed network as a graph $G^D = (V, E)$. For networks having timestamps on edges, we order the edges according to their time of appearance and select the first 50% of the edges and the related nodes to form an "old graph", denoted $G_{old} = (V^*, E_{old})$. For networks without timestamps, we randomly choose 50% of the edges and the related nodes as $G_{old}$ and repeat the experiment 10 times. The total number of potential links on node set $V^*$ equals $|V^*|^2 - |E_{old}|$. Let $E_{new}$ be the set of future edges among the nodes in $V^*$. We apply each prediction method to output a list containing the similarity scores for all potential links. Each potential link represents either a positive link or a negative link, depending on whether it appears in $E_{new}$. The PR-AUC value of the prediction is then calculated. Since in very large networks it is too expensive to compute all potential links, we randomly sample 3,000 connected nodes on $G^D$ when |V| > 10,000 and repeat the above procedures 10 times.

4.3.3 Results and discussion. We compare the three baseline methods with the two proposed methods (Definition 4.1) in Table 3. We see that the closure closeness index (CCI) records the highest PR-AUC value in 12 networks, including all the food webs, all the software networks, three web graphs and one citation network.
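As an illustration of how the five indices compared in this experiment are evaluated for one ordered pair (s, t), here is a minimal sketch. It assumes that the source and target closure coefficients have already been computed and are passed in as dictionaries (`e_src` and `e_tgt` are hypothetical names, and the values used below are made up), and it assumes the natural logarithm in the Adamic-Adar index, since the base is not stated in the text.

```python
import math


def neighbour_sets(edges):
    """Build out-, in- and all-neighbour sets from directed (u, v) edges."""
    n_out, n_in = {}, {}
    for u, v in edges:
        n_out.setdefault(u, set()).add(v)
        n_in.setdefault(v, set()).add(u)
        # Make sure every node has an entry in both dictionaries.
        n_out.setdefault(v, set())
        n_in.setdefault(u, set())
    n_all = {x: n_out[x] | n_in[x] for x in n_out}
    return n_out, n_in, n_all


def similarity_scores(s, t, n_out, n_in, n_all, e_src, e_tgt):
    """Return the three baseline indices and the two proposed indices for (s, t)."""
    common = n_out[s] & n_in[t]        # nodes u on a directed path s -> u -> t
    closure = e_src[s] + e_tgt[t]      # assumed to be precomputed elsewhere
    return {
        "DiCN": len(common),
        "DiAA": sum(1.0 / math.log(len(n_all[u])) for u in common),
        "DiRA": sum(1.0 / len(n_all[u]) for u in common),
        "CCI": len(common) * closure,
        "ECCI": len(n_all[s] & n_all[t]) * closure,
    }


# Hypothetical toy graph: two directed paths s -> a -> t and s -> b -> t,
# plus a path t -> c -> s that only the all-neighbour set picks up.
edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t"), ("t", "c"), ("c", "s")]
n_out, n_in, n_all = neighbour_sets(edges)
e_src = {"s": 0.2}   # made-up source closure coefficient of s
e_tgt = {"t": 0.3}   # made-up target closure coefficient of t
scores = similarity_scores("s", "t", n_out, n_in, n_all, e_src, e_tgt)
print(scores)
```

In the toy graph, node c is a neighbour of both s and t but does not lie on a directed path from s to t, so it contributes to ECCI and not to CCI. This is the only difference between the two proposed indices.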
The extra closure closeness index (ECCI), on the other hand, records the highest PR-AUC value in 6 networks, including all the word adjacency networks and two trust networks. This suggests that in most directed networks, including the local structure information of the source and target closure coefficients leads to improvement in link prediction. The improvement is remarkable in many networks: CCI is over 25% better than the baselines in three software networks (SW-WEKA, SW-JUNG and SW-JDK), and over 10% better in all four food webs and two web graphs (WEB-STANFORD and WEB-NOTREDAME). Besides, ECCI is over 100% better than the three baselines in word adjacency networks.

Table 3. Performance comparison of five methods on link prediction in directed networks (PR-AUC). The best performance in each network is marked with *, and the second best with †. Datasets having timestamps on edge creation are marked with (τ).

Network            DiCN     DiAA     DiRA     CCI      ECCI
TR-BTCALPHA (τ)    0.0286   0.0291†  0.0199   0.0283   0.0347*
TR-BTC-OTC (τ)     0.0275   0.0308†  0.0245   0.0265   0.0316*
TR-ADVOGATO        0.1076   0.1124*  0.0899   0.1052   0.1107†
TR-EPINIONS        0.1536†  0.1559*  0.1303   0.1520   0.1491
FW-MANGROVE        0.2334   0.2438   0.2456   0.2760*  0.2666†
FW-BAYWET          0.1669   0.1705   0.1719   0.1995*  0.1903†
FW-BAYDRY          0.1738   0.1771   0.1783   0.2058*  0.1887†
FW-LITTLEROCK      0.2593†  0.2520   0.2443   0.3117*  0.2449
CIT-CORA           0.1084*  0.1056†  0.1007   0.1053   0.0819
CIT-HEPTH          0.1742   0.1897*  0.1769   0.1833†  0.1708
CIT-HEPPH          0.1428†  0.1459*  0.1324   0.1424   0.1339
CIT-CITESEER       0.1054   0.1063†  0.1029   0.1221*  0.0791
SW-WEKA            0.1231   0.1394   0.1399†  0.1901*  0.0935
SW-LUCENE          0.1853†  0.1730   0.1678   0.1930*  0.1026
SW-JUNG            0.3386†  0.3277   0.2732   0.4385*  0.1645
SW-JDK             0.3610†  0.3377   0.2785   0.4551*  0.1787
WEB-STANFORD       0.3784   0.3875†  0.3330   0.4159*  0.2927
WEB-NOTREDAME      0.2226   0.2310   0.2104   0.2934*  0.2703†
WEB-BERKSTAN       0.4002   0.4026†  0.3746   0.4784*  0.3968
WEB-GOOGLE         0.4938   0.5211*  0.4803   0.5046†  0.3903
WA-JAPANESE        0.0240   0.0197   0.0154   0.0353†  0.0568*
WA-ENGLISH         0.0421   0.0451   0.0337   0.0654†  0.0901*
WA-FRENCH          0.0136   0.0152   0.0138   0.0262†  0.0488*
WA-SPANISH         0.0571   0.0631   0.0537   0.1048†  0.1368*

https://doi.org/10.1371/journal.pone.0253822.t003

We also notice that in all four software networks and one citation network (CIT-CITESEER), where CCI records the highest PR-AUC, ECCI is nevertheless worse than the baseline methods. This suggests that the information provided by the extra neighbours, which does not consider direction inclination, sometimes conflicts with that provided by the source and target closure coefficients. Finding a method that better combines the information of common neighbours and closure coefficients is an interesting avenue for future study.

5 Related work

In this section, we summarise some additional related work that also measures directed triangle formation from an end-node perspective.

Similar to our work, Yin et al. [60] extended the local closure coefficient to directed networks by proposing a family of eight coefficients. Their definition of the local directed closure coefficients of node i consists of eight scalars $H_{xy}^{z}(i)$ with $x, y, z \in \{i, o\}$ (i and o represent the edge direction, incoming or outgoing). One major limitation of this work is that it lacks a general characterisation that unifies all eight directed triangles constructed from end-node-based open triads. Our work not only addresses this issue by giving one general definition, but also proposes taxonomies of end-node-based directed triangles, i.e., the source and target closure coefficients and the four closure patterns.

Romero and Kleinberg [61] developed a methodology for studying a particular type of directed closure process (or "link copying" phenomenon) in information networks. Lou et al. [62] later proposed a graphical model, TriFG, to predict reciprocity and triadic closure in social networks.
Nevertheless, these two works focus specifically on the feed-forward triangle and do not take all directed triangles into account.

6 Conclusion

In this paper, we introduced the directed closure coefficient and its patterns to measure directed triangle formation from an end-node perspective. Through experiments on 24 real-world networks from six domains, we revealed that 1) in all networks, the average directed closure coefficient is smaller than the average directed clustering coefficient; 2) the correlation between the directed closure coefficient and the directed clustering coefficient is weak; 3) different types of networks demonstrate different characteristics in the correlations of the eight patterns. We also showed that, at network-level, adding the four closure patterns leads to significant improvement in classifying directed networks; while in node-level analysis, such as link prediction, the source and target closure coefficients can be used together with common neighbours as effective predictors, especially in food webs, software networks, web graphs and word adjacency networks. Owing to the simplicity and interpretability of the definitions, we anticipate that the directed closure coefficient and its patterns will become standard descriptive features and be incorporated in other network mining tasks.

Acknowledgments

The authors thank the editors and anonymous reviewers for their excellent comments and suggestions. The authors would also like to thank Pim van der Hoorn, Xiaolin Zhang, Mohamad Barbar, Joakim Skarding and Yu-Xuan Qiu for their helpful comments and discussions.

Author Contributions

Conceptualization: Mingshan Jia.
Formal analysis: Mingshan Jia.
Funding acquisition: Bogdan Gabrys, Katarzyna Musial.
Investigation: Mingshan Jia.
Methodology: Bogdan Gabrys, Katarzyna Musial.
Project administration: Katarzyna Musial.
Software: Mingshan Jia.
Supervision: Bogdan Gabrys, Katarzyna Musial.
Writing – original draft: Mingshan Jia.
Writing – review & editing: Mingshan Jia, Bogdan Gabrys, Katarzyna Musial.

References

1. Newman ME. The structure and function of complex networks. SIAM Rev. 2003. https://doi.org/10.1137/S003614450342480
2. Barabási AL, et al. Network science. Cambridge University Press; 2016.
3. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U. Network motifs: simple building blocks of complex networks. Science. 2002. https://doi.org/10.1126/science.298.5594.824 PMID: 12399590
4. Pržulj N, Corneil DG, Jurisica I. Modeling interactome: scale-free or geometric? Bioinformatics. 2004. PMID: 15284103
5. Girvan M, Newman ME. Community structure in social and biological networks. PNAS. 2002. https://doi.org/10.1073/pnas.122653799 PMID: 12060727
6. Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005. https://doi.org/10.1038/nature03607 PMID: 15944704
7. Solava RW, Michaels RP, Milenković T. Graphlet-based edge clustering reveals pathogen-interacting proteins. Bioinformatics. 2012. https://doi.org/10.1093/bioinformatics/bts376 PMID: 22962470
8. Noble CC, Cook DJ. Graph-based anomaly detection. In: KDD; 2003. p. 631–636.
9. Musial K, Juszczyszyn K. Motif-based analysis of social position influence on interconnection patterns in complex social network. In: ACIIDS. IEEE; 2009. p. 34–39.
10. Zhang QM, Lü L, Wang WQ, Zhou T, et al. Potential theory for directed networks. PLoS ONE. 2013; 8(2):e55437. https://doi.org/10.1371/journal.pone.0055437 PMID: 23408979
11. Schall D. Link prediction in directed social networks. Social Network Analysis and Mining. 2014; 4(1):157. https://doi.org/10.1007/s13278-014-0157-9
12. Easley D, Kleinberg J, et al. Networks, crowds, and markets. vol. 8. Cambridge University Press, Cambridge; 2010.
13. Juszczyszyn K, Kazienko P, Musiał K. Local topology of social network based on motif analysis. In: KES. Springer; 2008. p. 97–105.
14. Rapoport A. Spread of information through a population with socio-structural bias: I. Assumption of transitivity. The Bulletin of Mathematical Biophysics. 1953. https://doi.org/10.1007/BF02476441
15. Wu ZX, Holme P. Modeling scientific-citation patterns and other triangle-rich acyclic networks. Physical Review E. 2009. https://doi.org/10.1103/PhysRevE.80.037101 PMID: 19905247
16. Jøsang A, Hayward R, Pope S. Trust network analysis with subjective logic. In: Estivill-Castro V, Dobbie G, editors. Computer Science 2006, Twenty-Nineth Australasian Computer Science Conference (ACSC2006), Hobart, Tasmania, Australia, January 16-19 2006. vol. 48 of CRPIT. Australian Computer Society; 2006. p. 85–94.
17. Watts DJ, Strogatz SH. Collective dynamics of 'small-world' networks. Nature. 1998. https://doi.org/10.1038/30918
18. ter Haar-Pomp L, de Beer C, van der Lem R, Spreen M, Bogaerts S. Monitoring risk behaviors by managing social support in the network of a forensic psychiatric patient: A single-case analysis. Journal of Forensic Psychology Practice. 2015. https://doi.org/10.1080/15228932.2015.1007779
19. Orman K, Labatut V, Cherifi H. An empirical study of the relation between community structure and transitivity. In: Complex Networks. Springer; 2013. p. 99–110.
20. Ji Q, Li D, Jin Z. Divisive Algorithm Based on Node Clustering Coefficient for Community Detection. IEEE Access. 2020; 8:142337–142347. https://doi.org/10.1109/ACCESS.2020.3013241
21. Henderson K, Gallagher B, Eliassi-Rad T, Tong H, Basu S, Akoglu L, et al. Rolx: structural role extraction & mining in large graphs. In: KDD; 2012. p. 1231–1239. https://doi.org/10.1145/2339530.2339723
22. LaFond T, Neville J, Gallagher B. Anomaly detection in networks with changing trends. In: ODD2 Workshop; 2014.
23. Barrat A, Barthelemy M, Pastor-Satorras R, Vespignani A. The architecture of complex weighted networks. PNAS. 2004. https://doi.org/10.1073/pnas.0400087101 PMID: 15007165
24. Onnela JP, Saramäki J, Kertész J, Kaski K. Intensity and coherence of motifs in weighted complex networks. Physical Review E. 2005. https://doi.org/10.1103/PhysRevE.71.065103 PMID: 16089800
25. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology. 2005. https://doi.org/10.2202/1544-6115.1128 PMID: 16646834
26. Fagiolo G. Clustering in complex directed networks. Physical Review E. 2007. https://doi.org/10.1103/PhysRevE.76.026107 PMID: 17930104
27. Kunegis J, Lommatzsch A, Bauckhage C. The slashdot zoo: mining a social network with negative edges. In: WWW; 2009. p. 741–750. https://doi.org/10.1145/1526709.1526809
28. Costantini G, Perugini M. Generalization of clustering coefficients to signed correlation networks. PLoS ONE. 2014. https://doi.org/10.1371/journal.pone.0088669 PMID: 24586367
29. Wang J, Li M, Wang H, Pan Y. Identification of essential proteins based on edge clustering coefficient. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2011.
30. Yin H, Benson AR, Leskovec J. The local closure coefficient: A new perspective on network clustering. In: WSDM; 2019. p. 303–311. https://doi.org/10.1145/3289600.3290991
31. Jia M, Gabrys B, Musial K. Closure Coefficient in Complex Directed Networks. In: International Conference on Complex Networks and Their Applications. Springer; 2020. p. 62–74.
32. Ahnert SE, Fink TM. Clustering signatures classify directed networks. Physical Review E. 2008. https://doi.org/10.1103/PhysRevE.78.036112 PMID: 18851110
33. Newman ME, Strogatz SH, Watts DJ. Random graphs with arbitrary degree distributions and their applications. Physical Review E. 2001. https://doi.org/10.1103/PhysRevE.64.026118 PMID: 11497662
34. Saramäki J, Kivelä M, Onnela JP, Kaski K, Kertesz J. Generalizations of the clustering coefficient to weighted complex networks. Physical Review E. 2007. PMID: 17358454
35. Kumar S, Spezzano F, Subrahmanian V, Faloutsos C. Edge weight prediction in weighted signed networks. In: ICDM. IEEE; 2016. p. 221–230.
36. Heider F. Attitudes and cognitive organization. The Journal of Psychology. 1946. https://doi.org/10.1080/00223980.1946.9917275 PMID: 21010780
37. Kumar S, Hooi B, Makhija D, Kumar M, Faloutsos C, Subrahmanian V. Rev2: Fraudulent user prediction in rating platforms. In: WSDM; 2018. p. 333–341.
38. Massa P, Salvetti M, Tomasoni D. Bowling alone and trust decline in social network sites. In: 2009 Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing. IEEE; 2009. p. 658–663.
39. Massa P, Avesani P. Controversial users demand local trust metrics: An experimental study on epinions.com community. In: AAAI; 2005. p. 121–126.
40. Rossi RA, Ahmed NK. The Network Data Repository with Interactive Graph Analytics and Visualization. In: AAAI; 2015. Available from: http://networkrepository.com.
41. Ulanowicz RE, DeAngelis DL. Network analysis of trophic dynamics in south florida ecosystems. US Geological Survey Program on the South Florida Ecosystem. 1999.
42. Martinez ND. Artifacts or attributes? Effects of resolution on the Little Rock Lake food web. Ecological Monographs. 1991.
43. McCallum AK, Nigam K, Rennie J, Seymore K. Automating the construction of internet portals with machine learning. Information Retrieval. 2000; 3(2):127–163. https://doi.org/10.1023/A:1009953814988
44. Gehrke J, Ginsparg P, Kleinberg J. Overview of the 2003 KDD Cup. ACM SIGKDD Explorations Newsletter. 2003; 5(2):149–151. https://doi.org/10.1145/980972.980992
45. Leskovec J, Kleinberg J, Faloutsos C. Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining; 2005. p. 177–187.
46. Bollacker KD, Lawrence S, Giles CL. CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications. In: Proceedings of the second international conference on Autonomous agents; 1998. p. 116–123.
47. Šubelj L, Bajec M. Software systems through complex networks science: Review, analysis and applications. In: Proceedings of the First International Workshop on Software Mining; 2012. p. 9–16.
48. Šubelj L, Blagus N, Bajec M. Group extraction for real-world networks: The case of communities, modules, and hubs and spokes. In: Proceedings of the International Conference on Network Science (Copenhagen, Denmark, 2013); 2013. p. 152–153.
49. Leskovec J, Lang KJ, Dasgupta A, Mahoney MW. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics. 2009; 6(1):29–123. https://doi.org/10.1080/15427951.2009.10129177
50. Albert R, Jeong H, Barabási AL. Diameter of the world-wide web. Nature. 1999. https://doi.org/10.1038/43601
51. Milo R, Itzkovitz S, Kashtan N, Levitt R, Shen-Orr S, Ayzenshtat I, et al. Superfamilies of evolved and designed networks. Science. 2004; 303(5663):1538–1542. https://doi.org/10.1126/science.1089167 PMID: 15001784
52. Friedman J, Hastie T, Tibshirani R, et al. The elements of statistical learning. vol. 1. Springer Series in Statistics, New York; 2001.
53. Liben-Nowell D, Kleinberg J. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology. 2007. https://doi.org/10.1002/asi.20591
54. Adamic LA, Adar E. Friends and neighbors on the web. Social Networks. 2003. https://doi.org/10.1016/S0378-8733(03)00009-1
55. Zhou T, Lü L, Zhang YC. Predicting missing links via local information. The European Physical Journal B. 2009. https://doi.org/10.1140/epjb/e2009-00335-8
56. Perozzi B, Al-Rfou R, Skiena S. Deepwalk: Online learning of social representations. In: KDD; 2014. p. 701–710.
57. Katz L. A new status index derived from sociometric analysis. Psychometrika. 1953. https://doi.org/10.1007/BF02289026
58. Meo PD. Trust Prediction via Matrix Factorisation. ACM Transactions on Internet Technology (TOIT). 2019.
59. Zhang X, Zhao C, Wang X, Yi D. Identifying missing and spurious interactions in directed networks. International Journal of Distributed Sensor Networks. 2015.
60. Yin H, Benson AR, Ugander J. Measuring directed triadic closure with closure coefficients. Network Science. 2020; 8(4):551–573. https://doi.org/10.1017/nws.2020.20
61. Romero D, Kleinberg J. The directed closure process in hybrid social-information networks, with an analysis of link formation on twitter. In: Proceedings of the International AAAI Conference on Web and Social Media. vol. 4; 2010.
62. Lou T, Tang J, Hopcroft J, Fang Z, Ding X. Learning to predict reciprocity and triadic closure in social networks. ACM Transactions on Knowledge Discovery from Data (TKDD). 2013; 7(2):1–25. https://doi.org/10.1145/2499907.2499908
