A management of early warning and risk control based on data fusion for COVID-19
Abstract
Existing management early warning and risk control methods suffer from low prediction efficiency and poor performance, and their shortcomings are obvious. This paper studies the C4.5, Apriori and K-means algorithms and, on the basis of association rules, fuses the data produced by these three algorithms. An early warning model is then built and optimized on the fused, processed data: the fused data serve as the base data, and association rules are used for data mining. The experimental results show that data fusion can address the problems of management early warning and risk control, and that the method has reference value for enterprise management.
1 Introduction
Almost all enterprises that fall into an operation and management crisis first show signs of financial crisis [1]. A financial crisis emerges through a gradual process of deterioration that is eventually reflected in financial indicators. Therefore, financial management, as an important part of business management [2, 3], naturally requires a corresponding financial early warning system [4]. Building an effective early warning model of financial crisis, so that warning signals of a serious deterioration in the financial situation of listed companies can be obtained as early as possible, is of great research value and practical significance and meets the increasingly urgent needs of stakeholders [5, 6]. Correctly predicting enterprise financial risk also matters for protecting the interests of investors and creditors, helping operators prevent financial crisis, and supporting government supervision of the quality of listed companies and the risk of the securities market [7]. This paper proposes a management early warning and risk control method based on data fusion mining [8–10]. The method can analyze the hidden relations between operation and management results and various financial data, support effective measures to promote the reform of operation and management, and improve the effectiveness and quality of management decisions.
2 C4.5 algorithm
In 1986, J. Ross Quinlan published the paper “Induction of decision trees” in the journal Machine Learning, presenting the ID3 algorithm [11]. The traditional ID3 algorithm uses information gain to select an attribute from the candidate attributes at each step of growing the tree, and measures the homogeneity of the samples with entropy.
2.1 Attribute metrics
Given a sample set S containing examples of a target concept, if the target attribute has c different values, the entropy of S relative to this c-class classification is defined as [12–14]:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{1}$$

where pi is the proportion of samples in S belonging to class i.
The information Gain (S, A) of an attribute A relative to the sample set S is defined as
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{2}$$
Here, Values(A) is the set of all possible values of attribute A, and Sv is the subset of S for which attribute A has value v, that is, Sv = { s ∈ S | A(s) = v }.
The ID3 algorithm selects attribute A as the test attribute so as to maximize the information gain Gain(S, A) in formula (2). However, this criterion tends to favor attributes with many values: the weighted-sum measure discounts data tuples with small counts, even though attributes with more values are not necessarily optimal. In other words, an attribute chosen purely by minimizing entropy and maximizing information gain may split the data very finely without actually providing much useful information for classification.

In short, using information gain as the test attribute selection criterion has a drawback: the measure is biased towards attributes with more values, but such attributes are not necessarily the best attributes.
To solve this problem, Quinlan proposed the C4.5 algorithm, which modifies the classification evaluation function by replacing information gain with the information gain ratio. C4.5 is not only the successor of ID3 but also the basis of many later decision tree algorithms. Among decision tree algorithms for single-machine use, C4.5 offers both high classification accuracy and fast execution.
The information gain ratio penalizes multi-valued attributes by introducing a term called split information, which measures the breadth and uniformity with which an attribute splits the data:

$$\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \tag{3}$$

where S1 through Sc are the c subsets formed by partitioning S on the c values of attribute A. Note that split information is in fact the entropy of S with respect to the values of attribute A.
The information gain ratio is the ratio of the information gain to the split information:

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)} \tag{4}$$
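The following minimal Python sketch shows how Equations (1)–(4) can be computed on a small sample set; the attribute and class values in the example are hypothetical and serve only to exercise the functions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) from Eq. (1): -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(samples, labels, attribute):
    """Gain(S, A) from Eq. (2): entropy reduction after splitting on `attribute`."""
    total = len(samples)
    remainder = 0.0
    for v in set(s[attribute] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[attribute] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def split_information(samples, attribute):
    """SplitInformation(S, A) from Eq. (3): entropy of S w.r.t. the values of A."""
    total = len(samples)
    counts = Counter(s[attribute] for s in samples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain_ratio(samples, labels, attribute):
    """GainRatio(S, A) from Eq. (4): Gain divided by SplitInformation."""
    si = split_information(samples, attribute)
    return gain(samples, labels, attribute) / si if si > 0 else 0.0

# Toy example with hypothetical attribute and class values
samples = [{"debt_level": "high"}, {"debt_level": "high"},
           {"debt_level": "low"}, {"debt_level": "low"}]
labels = ["ST", "normal", "normal", "normal"]
print(gain_ratio(samples, labels, "debt_level"))
```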
2.2 C4.5 algorithm description
Based on ID3, the C4.5 algorithm adds handling of continuous attributes and missing attribute values, and has a more mature method for tree pruning. The main idea of C4.5 is as follows: let T be a training set; when constructing a decision tree for T, the attribute with the largest GainRatio(S, A) value is selected as the split node, and T is divided into n subsets according to this criterion. If all tuples in the i-th subset Ti belong to the same class, the node becomes a leaf of the decision tree and splitting stops. For the other subsets of T that do not meet this condition, the tree is generated recursively in the same way until all tuples contained in each subset belong to a single class.
The pseudocode of C4.5 formtree(T, T_attributelist) is shown in Table 1, where T denotes the current sample set and T_attributelist denotes the current candidate attribute set.
Table 1
(1) Create root node N;
(2) If all samples in T belong to the same class C,
    then return N as a leaf node, marked with class C;
(3) If T_attributelist is empty, or the number of remaining samples in T is less than a given value,
    then return N as a leaf node, marked with the most frequently occurring class in T;
(4) For each attribute in T_attributelist,
    calculate the information gain ratio GainRatio;
(5) Set N's test attribute test_attribute to the attribute in T_attributelist with the highest information gain ratio;
(6) If the test attribute is continuous,
    then find the split threshold for this attribute;
(7) For each new leaf node grown from node N:
    { if the sample subset T' corresponding to the leaf node is empty,
      then split the leaf node, generating a new leaf node marked with the most frequently occurring class in T;
      else
      execute C4.5 formtree(T', T'_attributelist) on the leaf node and continue splitting; }
(8) Calculate the classification error of each node and perform tree pruning.
C4.5 is a greedy algorithm that constructs the decision tree in a top-down, divide-and-conquer, recursive manner. Besides the improved classification evaluation function, it improves on ID3 in the following two respects.
On the one hand, ID3 can only deal with discrete values, while C4.5 can also handle continuous-valued attributes. For a continuous-valued attribute A, C4.5 searches for the best threshold T, divides the training set into two parts using T as the split point, and creates two branches, A ≤ T (left branch) and A > T (right branch). The gain or gain ratio is calculated for each candidate partition, and the partition that maximizes it is selected.
On the other hand, for a training sample with a missing attribute value, C4.5 distributes the sample over all possible values of the missing attribute, effectively splitting the instance into several instances assigned to different branches. During execution, a probabilistic method is used and different weights are assigned; each weight is the probability of the corresponding value occurring in the classification. As a result, the number of samples passed down a path may be fractional rather than an integer, but this does not affect the calculation process of the algorithm.
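As a rough illustration of this continuous-attribute handling, the sketch below scans the midpoints between adjacent distinct values as candidate thresholds and keeps the one with the highest information gain; the example values and labels are hypothetical.

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between adjacent distinct sorted values as thresholds and
    return the threshold T maximizing the gain of the split A <= T vs. A > T."""
    base = _entropy(labels)
    n = len(labels)
    pairs = list(zip(values, labels))
    best_gain, best_t = -1.0, None
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = base - (len(left) / n) * _entropy(left) - (len(right) / n) * _entropy(right)
        if g > best_gain:
            best_gain, best_t = g, t
    return best_t, best_gain

# Hypothetical example: a continuous financial ratio and class labels
print(best_threshold([0.2, 0.5, 0.7, 1.3], ["normal", "normal", "ST", "ST"]))
```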
3 Mining and analysis of association rules
3.1 Basic concepts of association rules
Association rule mining is widely used on transaction databases.
Suppose the set I = {i1, i2, · · ·, im} consists of m different items, which corresponds to the set of all kinds of goods. Given a transaction database D, each transaction T = {t1, t2, · · ·, tm} is a set of items from I, that is T ⊆ I, which corresponds to the list of goods in one transaction. Each transaction T has a unique identifier TID. Any subset A of I is called an itemset in D; if |A| = k, A is called a k-itemset. Let tk be a transaction in D and A an itemset. If A ⊆ tk, transaction tk is said to contain itemset A; in general, if itemset A ⊆ I and A ⊆ T, transaction T is said to contain itemset A.
Definition 1: Association Rule.
Suppose A and B are non-empty itemsets, that is, A ⊆ I, A ≠ ∅, B ⊆ I, B ≠ ∅ and A ∩ B = ∅. Then an expression of the form A ⇒ B is called an association rule, meaning that the appearance of item subset A tends to lead to the appearance of item subset B. A is the antecedent (precondition) of the association rule, and B is its consequent (result).
Definition 2: Support
Let A be a non-empty itemset, that is, A ⊆ I and A ≠ ∅. The support of rule A ⇒ B in transaction database D is the ratio of the number of transactions containing both A and B to the number of all transactions, denoted Supp(A ⇒ B), that is,
$$\mathrm{Supp}(A \Rightarrow B) = \frac{|\{T \in D \mid A \cup B \subseteq T\}|}{|D|} = P(A \cup B) \tag{5}$$
Physical meaning of support: statistically, the support of itemset A, Supp(A), is the probability that itemset A appears in a transaction of database D.
Definition 3: Confidence
The confidence of rule A ⇒ B in the transaction set is the ratio of the number of transactions containing both A and B to the number of transactions containing A, denoted Conf(A ⇒ B), that is,
$$\mathrm{Conf}(A \Rightarrow B) = \frac{\mathrm{Supp}(A \cup B)}{\mathrm{Supp}(A)} = P(B \mid A) \tag{6}$$
Physical meaning of confidence: for association rule A ⇒ B, the confidence indicates how likely a transaction that contains A is to also contain B. Statistically, confidence is a conditional probability, namely the probability of B given A.
Definition 4: Strong Association Rule
Given a minimum support MinSupp and a minimum confidence MinConf, an association rule A ⇒ B is called a strong association rule if Supp(A ⇒ B) ≥ MinSupp and Conf(A ⇒ B) ≥ MinConf.

In the statistical sense, the minimum support represents the lowest importance of an association rule and the minimum confidence represents its lowest reliability, so strong association rules are important, reliable rules with expected value. Rules that do not meet both conditions are called weak association rules.
When the support of itemset A is not less than MinSupp, A is called a frequent itemset (frequency set for short).

Let A and B be itemsets in data set D. Then:

(1) if A ⊆ B, then Supp(A) ≥ Supp(B);

(2) if A ⊆ B and A is not a frequent itemset, then B is not a frequent itemset either;

(3) if A ⊆ B and B is a frequent itemset, then A is also a frequent itemset.
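To make Definitions 2 and 3 concrete, the small Python sketch below computes support and confidence over a toy transaction database; the "items" are hypothetical financial-indicator flags used purely for illustration.

```python
def support(transactions, itemset):
    """Supp from Eq. (5): fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Conf from Eq. (6): Supp(A and B together) / Supp(A)."""
    return support(transactions, set(antecedent) | set(consequent)) / support(transactions, antecedent)

# Toy transaction database with hypothetical items
D = [
    {"low_quick_ratio", "high_debt_ratio", "ST"},
    {"low_quick_ratio", "high_debt_ratio"},
    {"high_debt_ratio", "ST"},
    {"low_quick_ratio"},
]
print(support(D, {"low_quick_ratio", "high_debt_ratio"}))   # 0.5
print(confidence(D, {"high_debt_ratio"}, {"ST"}))           # 0.666...
```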
3.2 The process of mining association rules
Given a transaction set D, the problem of mining association rules is to generate association rules with support and confidence greater than the minimum support (MinSupp) and minimum confidence (MinConf) given by users. When the support and confidence of the rules are greater than MinSupp and MinConf respectively, we think the rules are effective.
The mining of association rules consists of two stages: the first stage finds all frequent itemsets in the data set, and the second stage generates association rules from those frequent itemsets.

In the first stage, all frequent itemsets must be found in the original data set. Frequent means that the frequency of an itemset, relative to all records, reaches a certain level; this frequency is the support. Taking a 2-itemset {A, B} as an example, its support can be computed with equation (5). If the support is greater than or equal to the minimum support threshold, {A, B} is a frequent itemset. A k-itemset meeting the minimum support is called a frequent k-itemset, usually written Large k or Frequent k. The algorithm generates Large k+1 from the Large k itemsets until no more frequent itemsets can be found.

The second stage of association rule mining produces the association rules. The frequent k-itemsets found in the previous stage are used to generate candidate rules, and under the minimum confidence threshold, a rule whose confidence meets the minimum confidence is accepted as an association rule. For example, the confidence of rule A ⇒ B generated from the frequent itemset {A, B} can be obtained with formula (6); if it is greater than or equal to the minimum confidence, then A ⇒ B is an association rule.
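A minimal Python sketch of this second stage follows, assuming the first stage has already produced a dictionary `frequent` mapping each frequent itemset (a frozenset) to its support; the function name and data layout are illustrative rather than the paper's implementation.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Stage 2: from each frequent itemset, emit rules A => B whose confidence
    supp(A and B) / supp(A) meets min_conf.  `frequent` maps frozenset -> support;
    by the Apriori property every subset of a frequent itemset is also in it."""
    rules = []
    for itemset, supp_ab in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp_ab / frequent[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules
```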
3.3 Apriori algorithm
Agrawal first posed the problem of mining association rules among itemsets in a customer transaction database in 1993. The core method is the Apriori algorithm, which is based on the theory of frequent itemsets. Apriori is one of the most influential algorithms for mining the frequent itemsets behind association rules: it uses already-known frequent itemsets to derive other frequent itemsets, and it is a breadth-first algorithm.
Apriori divides the mining of association rules into two sub-problems: first, mine from the transaction database D all frequent itemsets whose support is not less than the minimum support MinSupp; second, use the mined frequent itemsets to generate association rules whose confidence is not less than the minimum confidence MinConf. The algorithm is shown in Table 2.
Table 2
(1) C1 = {candidate 1-itemsets};
(2) L1 = find_frequent_1-itemsets(D);
(3) for (k = 2; Lk-1 ≠ ∅; k++)
(4) {  Ck = Apriori_Gen(Lk-1, MinSupp);
(5)    for each transaction t ∈ D
(6)    {  Ct = subset(Ck, t);
(7)       for each candidate c ∈ Ct
(8)          c.count++;  }
(9)    Lk = {c ∈ Ck | c.count ≥ MinSupp};
(10) }
(11) return L = ∪k Lk.
Input data: transaction database D; minimum support threshold MinSupp.
Output result: frequent item set L in D.
Step 2 finds the set L1 of frequent 1-itemsets. In steps 3–11, Lk-1 is used to generate the candidate set Ck in order to find Lk. Apriori_Gen performs two actions, join and prune: the candidate itemset Ck is generated by joining the frequent itemsets in Lk-1 and then pruning. The detailed procedure is described in Table 3.
Table 3
procedure Apriori_Gen(Lk-1, MinSupp)
(1) for each itemset l1 ∈ Lk-1
(2)   for each itemset l2 ∈ Lk-1
(3)     if (l1[1] = l2[1]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1])
(4)     then { c = l1 ⋈ l2;            // join step
(5)       if has_infrequent_subset(c, Lk-1)
(6)       then delete c;               // prune step
(7)       else add c to Ck; }
(8) return Ck.
According to the Apriori property, all subsets of a frequent itemset must themselves be frequent. The algorithm uses a level-wise search: for a candidate k-itemset, we only need to check whether its (k-1)-subsets are frequent. The test for infrequent subsets is described in Table 4.
Table 4
procedure has_infrequent_subset(c, Lk-1)
(1) for each (k-1)-subset s of c
(2)   if s ∉ Lk-1 then
(3)     return TRUE;
(4) return FALSE;
The key to the algorithm's efficiency is generating smaller candidate itemsets, that is, avoiding as far as possible the generation and counting of candidates that cannot become frequent itemsets. It exploits the basic property that every subset of a frequent itemset must itself be frequent, a property inherited by most current association rule algorithms.
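For concreteness, a compact Python sketch of the same level-wise join-and-prune idea is given below; it recomputes supports by rescanning the transaction list instead of keeping per-candidate counters as in Table 2, so it illustrates the scheme rather than transcribing it literally, and the toy transactions are made up.

```python
from itertools import combinations

def apriori(transactions, min_supp):
    """Minimal level-wise Apriori: returns {frozenset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items if supp(frozenset([i])) >= min_supp}
    frequent = {s: supp(s) for s in current}

    k = 2
    while current:
        # Join step: size-k candidates from frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c for c in candidates if supp(c) >= min_supp}
        frequent.update({c: supp(c) for c in current})
        k += 1
    return frequent

# Toy example with a hypothetical transaction database
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, s in sorted(apriori(D, 0.6).items(), key=lambda kv: -kv[1]):
    print(set(itemset), round(s, 2))
```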
3.4 K-means algorithm
The K-means algorithm proposed by J. B. MacQueen in 1967 is a classical clustering algorithm that is widely used in scientific research and industrial applications. K-means is an indirect clustering method based on similarity measures between samples and belongs to the unsupervised learning methods. Its task is to divide the data set into k disjoint sets of points such that the points within each set are as homogeneous as possible. That is, given n data points D = {X1, X2, . . ., Xn}, the goal of clustering is to find k clusters C = {C1, C2, . . ., Ck} such that every point is assigned to exactly one cluster Ci, i ∈ {1, 2, . . ., k}. The number k of clusters to be obtained is specified in advance.
The basic idea of the algorithm is as follows: given a database containing n data objects and the number k of clusters to be generated, randomly select k objects as the initial cluster centers; then compute the distance between each remaining sample and every cluster center, assign each sample to the nearest center, and recompute the mean of each adjusted cluster. If the centers do not change between two consecutive iterations, the adjustment of samples is finished and the clustering squared error criterion function E has converged. The function E is defined as follows:
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \tag{7}$$
Here E is the sum of squared errors over all objects in the database, p is a point in the space representing a given data object, and mi is the mean of cluster Ci (both p and mi are multidimensional). The algorithm is shown in Table 5.
Table 5
Input: the number of clusters k, and a database with n data objects
Output: k clusters, minimizing the squared error criterion
Method:
(1) Arbitrarily select k objects as the initial cluster centers;
(2) repeat
(3)   According to the mean of the objects in each cluster, (re)assign each object to the most similar cluster;
(4)   Update the cluster means, that is, calculate the mean of each (changed) cluster;
(5) until the cluster assignments no longer change
The algorithm scales well, but K-means also has drawbacks: it scans the database many times; it can only find spherical clusters, not clusters of arbitrary shape; the choice of initial centroids strongly influences the clustering result; and the algorithm is very sensitive to noise.
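A minimal Python sketch of the assign-and-update loop described above, which also reports the criterion function E from Eq. (7), might look as follows; the 2-D points are hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means: random initial centers, then assign/update until the
    centers stop changing. Returns (centers, clusters, E) with E as in Eq. (7)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: sq_dist(p, centers[j]))].append(p)
        # Update step: recompute each center as the mean of its cluster
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:       # criterion function E has converged
            break
        centers = new_centers
    E = sum(sq_dist(p, centers[i]) for i, c in enumerate(clusters) for p in c)
    return centers, clusters, E

# Hypothetical 2-D example
pts = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9)]
print(kmeans(pts, k=2))
```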
4 Establishment and optimization of management early warning model
4.1 Selection of financial indicators
Falling into financial difficulties is a gradual process: the gradual deterioration of an enterprise's production and operation is usually reflected quickly in its financial statements as abnormal values of some financial indicators. Many factors affect the financial status of an enterprise, but the data for some indicators are difficult to obtain and would require considerable human and material resources, so financial ratios with high acquisition costs are not considered. Following the principle of operability and using the indicators available in financial reports, this paper selects 29 financial indicators that jointly reflect profitability, solvency, operating capacity and cash flow to build the financial early warning model. These include four indicators of company size and growth capacity that are not usually covered in the literature but which we believe have a strong influence on financial risk prediction: log (total assets), log (net assets * total shareholders' equity), growth rate of total assets, and growth rate of operating revenue. The selected indicators are listed in Table 6.
Table 6
| No. | Index name | Category | No. | Index name | Category |
|---|---|---|---|---|---|
| 01 | Profit margin of main business | Profitability | 15 | Current ratio | Debt service ability |
| 02 | Return on equity (net profit) | | 16 | Quick ratio | |
| 03 | Return on assets | | 17 | Debt to capital ratio | |
| 04 | Earnings per share (diluted operating profit) | | 18 | Log (total assets) | Company scale |
| 05 | Earnings per share (diluted net profit) | | 19 | Log (net assets * total shareholders' equity) | |
| 06 | Return on total assets | | 20 | Total assets | |
| 07 | Net profit margin | | 21 | Total equity (including minority interests) | |
| 08 | Operating profit margin | | 22 | Net assets | |
| 09 | Debt to asset ratio | | 23 | Growth rate of total assets | Growth ability |
| 10 | Main business profit | | 24 | Growth rate of operating revenue | |
| 11 | Business income | | 25 | Cash liability ratio | Cash flow |
| 12 | Total profit | | 26 | Inventory turnover | |
| 13 | Operating profit | | 27 | Turnover rate of receivables | Operating capacity |
| 14 | Net profit | | 28 | Turnover rate of current assets | |
| | | | 29 | Turnover rate of total assets | |
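As a rough illustration of how the four size and growth indicators highlighted above might be derived from raw statement data, the following hedged pandas sketch assumes a long-format DataFrame with hypothetical column names (company_id, year, total_assets, net_assets, total_shareholders_equity, operating_revenue).

```python
import numpy as np
import pandas as pd

def add_scale_and_growth_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the two size indicators and two growth indicators from raw columns."""
    out = df.copy()
    out["log_total_assets"] = np.log(out["total_assets"])
    out["log_net_assets_x_equity"] = np.log(out["net_assets"] * out["total_shareholders_equity"])
    # Year-on-year growth rates, computed per company
    grouped = out.sort_values("year").groupby("company_id")
    out["total_assets_growth"] = grouped["total_assets"].pct_change()
    out["operating_revenue_growth"] = grouped["operating_revenue"].pct_change()
    return out
```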
4.2 Establishment and optimization of early warning model
We use seven classification methods provided by the data mining software Weka, namely Bayesian network, decision tree, rule-based classification, nearest neighbor classification, multi-layer perceptron, BP neural network and logistic regression, to build early warning models and analyze them. The data analysis is carried out in two stages: first, all financial indicators are used for risk modeling and analysis with the seven classification methods; then indicators are selected with data mining methods, and risk modeling and analysis are repeated with the selected indicators.
The modeling process is first based on the original data set without attribute selection. For each classification algorithm, two models are established, using the 2010–2015 data set and the 2010–2016 data set as training sets, with the 2016 data and the 2017 data used as test sets. Table 7 shows the test results of the different classification methods on these data sets.
Table 7
| Classification algorithm | 2010–2015 forecast 2016 | 2010–2015 forecast 2017 | 2010–2016 forecast 2017 |
|---|---|---|---|
| Bayesian network | 76.99% | 77.51% | 81.68% |
| Decision tree (J48) | 78.06% | 77.72% | 88.65% |
| Rule-based classification (JRip) | 78.29% | 78.06% | 86.53% |
| Nearest neighbor classification (1NN) | 89.14% | 90.40% | 88.18% |
| Multilayer perceptron | 87.16% | 87.70% | 91.87% |
| BP neural network (RBFNetwork) | 93.20% | 87.42% | 89.34% |
| Logistic regression | 90.37% | 84.96% | 90.98% |
The experimental results show that nearest neighbor classification, the multi-layer perceptron, the BP neural network and logistic regression perform roughly equally well. Bayesian network, decision tree and rule-based classification also perform similarly to one another; their overall accuracy is clearly lower than that of the first four methods, but their recognition accuracy for ST (specially treated) companies, about 60%, is clearly higher than that of the first four methods.
The results also show that when the 2010–2015 data are used as the training set to predict 2016 and 2017, the prediction accuracy of most methods is lower than when the 2010–2016 data are used to predict 2017. This can be understood as the behavior of the stock market in 2016 and 2017 being clearly different from that in 2010–2015, so the prediction accuracy of models trained only on 2010–2015 data is not ideal.
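As a rough sketch of the year-split experiments described above (the paper's models were built in Weka; here scikit-learn classifiers are used as stand-ins, and the DataFrame layout, column names and binary label are assumptions), the setup might look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def evaluate(df, feature_cols, train_years, test_year):
    """Train on `train_years`, test on `test_year`; `df` is a pandas DataFrame
    with a `year` column, the indicator columns and a binary `label` column."""
    train = df[df["year"].isin(train_years)]
    test = df[df["year"] == test_year]
    models = {
        "Decision tree": DecisionTreeClassifier(),
        "1-NN": KNeighborsClassifier(n_neighbors=1),
        "Multilayer perceptron": MLPClassifier(max_iter=1000),
        "Logistic regression": LogisticRegression(max_iter=1000),
    }
    return {name: m.fit(train[feature_cols], train["label"])
                   .score(test[feature_cols], test["label"])   # accuracy
            for name, m in models.items()}

# e.g. evaluate(df, indicator_columns, train_years=range(2010, 2016), test_year=2016)
```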
Based on the data from 2010 to 2015, and using the three attribute selection methods BestFirst, GreedyStepwise and LinearForwardSelection in Weka, we obtain 9 retained attributes: earnings per share (diluted operating profit), debt-to-asset ratio, log (total assets), log (net assets * total shareholders' equity), growth rate of total assets, cash liability ratio, operating profit, total owners' equity (including minority shareholders' equity) and net assets.
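A hedged Python analogue of this attribute selection step (not Weka's actual BestFirst or GreedyStepwise code) is a greedy forward search that repeatedly adds the indicator giving the largest cross-validated accuracy gain; the data objects and classifier choice below are assumptions for illustration.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def greedy_forward_selection(X, y, max_features=9):
    """Greedily add the column that most improves 5-fold CV accuracy,
    stopping when no candidate improves the score or max_features is reached.
    `X` is a pandas DataFrame of indicators, `y` the class labels."""
    selected, remaining = [], list(X.columns)
    best_score = 0.0
    while remaining and len(selected) < max_features:
        scored = []
        for col in remaining:
            score = cross_val_score(DecisionTreeClassifier(), X[selected + [col]], y, cv=5).mean()
            scored.append((score, col))
        score, col = max(scored)
        if score <= best_score:          # no candidate improves the score
            break
        best_score = score
        selected.append(col)
        remaining.remove(col)
    return selected

# e.g. kept_indicators = greedy_forward_selection(X_train, y_train)
```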
For the 9 selected indicators, two models are again established for each classification algorithm, with the 2010–2015 and 2010–2016 data sets as training sets and the 2016 and 2017 data as test sets. The test results are given in Table 8.
Table 8
| Classification algorithm | 2010–2015 forecast 2016 | 2010–2015 forecast 2017 | 2010–2016 forecast 2017 |
|---|---|---|---|
| Bayesian network | 77.22% | 77.92% | 84.14% |
| Decision tree (J48) | 78.13% | 77.79% | 89.47% |
| Rule-based classification (JRip) | 78.36% | 77.72% | 90.15% |
| Nearest neighbor classification (1NN) | 84.10% | 87.42% | 88.52% |
| Multilayer perceptron | 86.16% | 91.05% | 91.52% |
| BP neural network (RBFNetwork) | 87.61% | 94.80% | 91.93% |
| Logistic regression | 90.67% | 89.47% | 91.32% |
Comparing Table 7 and Table 8, after attribute selection the prediction accuracy of the various models changes little, with most improving slightly, while the amount of data is reduced by nearly two thirds compared with that before attribute selection, so the time needed to build a model is greatly shortened. With the multi-layer perceptron, model building time drops to 16.5% of that before attribute selection, and the modeling time of the other classification methods is also reduced to between 24.74% and 57.57% of the time before attribute selection. At the same time, the models built after attribute selection are more concise and classify new data more quickly, which shows that they have better applicability. In addition, three of the four new indicators proposed in this paper are retained after attribute selection (log (total assets), log (net assets * total shareholders' equity), and the growth rate of total assets), which suggests that adding these indicators was justified.
5 Conclusions
Based on an analysis of the financial data of non-financial listed companies from 2010 to 2017, four new indicators are introduced: log (total assets), log (net assets * total shareholders' equity), growth rate of total assets, and growth rate of operating revenue. A total of 29 indicators are used for risk analysis, and seven different classification methods are used to model financial risk. The results show that the four best-performing methods achieve essentially the same accuracy, and that a risk early warning model built with nine representative indicators can achieve good risk prediction. Generally speaking, the data set we deal with is balanced, and the machine learning methods adopted achieve good classification performance even on the minority classes of unbalanced data. Therefore, the method based on data fusion mining proposed in this paper has reference value for enterprise management early warning and risk control.
Acknowledgments
This work is supported by the Social Science Foundation of Shaanxi Province (No. 2019S025) and the Soft Science Project of Shaanxi Province (No. 2019KRM047).
References
[1] Devino G.T., A Method for Analyzing the Effect of Taxes and Financing on Investment Decisions: Comment, American Journal of Agricultural Economics 53(1) (1971), 134.
[2] He Y., Liao N., Bi J., et al., Investment decision-making optimization of energy efficiency retrofit measures in multiple buildings under financing budgetary restraint, Journal of Cleaner Production 215 (2019), 1078–1094.
[3] Grichnik D. and Hisrich R.D., Strategic and Investment Behaviour in the German and Israeli Venture Capital Industries: A Comparison with the USA, International Journal of Technology Management 34(1–2) (2006), 88–104.
[4] Migdalas A., Applications of game theory in finance and managerial accounting, Operational Research 2(2) (2002), 209–241.
[5] Athanasios M., Applications of game theory in finance and managerial accounting, Operational Research (2002).
[6] Gentry W.M., Debt, investment and endowment accumulation: the case of not-for-profit hospitals, Journal of Health Economics (2002), 21.
[7] Xing L. and Cheng P., Real Option Analysis on Interaction Effects and Harmonizing Decisions between Investment and Financing, Systems Engineering 25(4) (2007), 59–63.
[8] Szabo S., Jaeger-Waldau A. and Szabo L., Risk adjusted financial costs of photovoltaics, Energy Policy 38(7) (2010), 3807–3819.
[9] Huhtala A., Special issue on cleaner production financing, Journal of Cleaner Production 11(6) (2003), 611–613.
[10] Zhang X.Q. and Kumaraswamy M.M., BOT-Based Approaches to Infrastructure Development in China, Journal of Infrastructure Systems 7(1) (2001), 18–25.
[11] Anderson J.M.M., 1-d and 2-d system identification algorithms using higher-order statistics, Oil & Gas Journal (1992), 93.
[12] Danilova A., Risk-Sensitive Investment Management, Quantitative Finance 15(12) (2015), 1–2.
[13] Domptail S. and Nuppenau E.A., The role of uncertainty and expectations in modeling (range) land use strategies: An application of dynamic optimization modeling with recursion, Ecological Economics 69(12) (2010), 2475–2485.
[14] Safavi H.R., Chakraei I., Kabiri-Samani A., et al., Optimal Reservoir Operation Based on Conjunctive Use of Surface Water and Groundwater Using Neuro-Fuzzy Systems, Water Resources Management 27(12) (2013), 4259–4275.