In system software, the big data analysis method is mainly used to design the government economic situation prediction algorithm.
3.1 Factors Affecting Government Economic Development
The factors influencing government economic development are extracted by considering the cost-effectiveness of government infrastructure. Cost-benefit analysis mainly covers the construction cost and the operation cost of infrastructure. The former includes municipal, road, real estate, communication, and security facilities, which usually account for 5% to 10% of the total investment. The latter includes the labor, energy, and other materials consumed in the daily operation and maintenance of municipal, road, real estate, and safety systems, as well as the maintenance costs of the infrastructure, power system, heating system, water supply system, and other costs.
When analyzing the benefits of government infrastructure, it is assumed that the service life of the infrastructure is \({A}\) years, that the completion time of the infrastructure and superstructure is the initial time \({t = 0}\), and that the infrastructure operator purchases the infrastructure at the initial time. Over the period \({A}\), the income \({C}\) and the annual usage \({D}\) are assumed to be constant, and the project investment cost is \({E}\), which includes the infrastructure construction cost and the present value of the facilities. At usage level \({D}\), the fixed cost remains unchanged, so only the variable cost is considered, and all costs are calculated as opportunity costs. The condition for a positive net present value is obtained as follows:
Where, \({F(D)}\) is the annual social income of the project, \({G(D)}\) is the annual operation and maintenance cost at usage \({D}\), \({H}\) is the investment cost, and \({e}\) represents the cost error. According to formula (1), operating government infrastructure generates operation and maintenance costs, labor costs, energy consumption, and so on.
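The positive net present value condition can be illustrated with a short Python sketch. It assumes the standard discounted cash flow form, in which the constant annual net benefit (social income minus operation and maintenance cost) is discounted over the service life and compared with the investment cost. The discount rate and the way the cost error is subtracted are assumptions made for illustration; neither is specified in this section, and the function and parameter names are hypothetical.

```python
def net_present_value(annual_income, annual_om_cost, investment_cost,
                      service_life_years, discount_rate, cost_error=0.0):
    """Net present value under constant annual income and cost.

    annual_income      -- plays the role of F(D)
    annual_om_cost     -- plays the role of G(D)
    investment_cost    -- plays the role of H
    service_life_years -- plays the role of A
    discount_rate      -- ASSUMPTION: not defined in the text
    cost_error         -- plays the role of e; subtracting it is an ASSUMPTION
    """
    if discount_rate <= -1.0:
        raise ValueError("discount_rate must be greater than -1")
    annual_net = annual_income - annual_om_cost
    # Discount each year's net benefit back to the initial time t = 0.
    discounted = sum(annual_net / (1.0 + discount_rate) ** t
                     for t in range(1, service_life_years + 1))
    return discounted - investment_cost - cost_error


def is_worthwhile(*args, **kwargs):
    """The project passes the test when its net present value is positive."""
    return net_present_value(*args, **kwargs) > 0.0
```

With a zero discount rate the check reduces to comparing total net benefit with the investment cost, which makes the function easy to verify by hand.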
3.2 Collect Government Economic Development Index Data
The influencing factors of government economic development are screened, the government economic development index system is built, and the government economic index data are collected. Based on relevant economic development indicators and data, nine factors are selected: per capita GDP, per capita industrial production, per capita fixed asset investment, per capita fiscal income, per capita resident savings deposit balance, per capita total retail sales of social consumer goods, per capita net income of rural residents, the proportion of fiscal income in GDP, and the speed of economic development. This article discusses economic development in the radiation zone of government infrastructure. The comprehensive development level of a node is characterized by the comprehensive quality index and measured by composite indicators. Fourteen indicators are selected to build the comprehensive development level index system at three levels: economic development, social development, and urban construction. The maximum method is used to standardize the data, and the comprehensive development level is preliminarily measured with SPSS software. The index system is shown in Table 1.
The basic data are collected from the government statistical yearbook, i.e., China Urban Statistical Yearbook 2020.
Besides the statistical yearbook, government economic index data can also be collected with web crawlers. Taking the government's economic development as the keyword, the crawler searches the Internet for web pages on the relevant situation indicators. Starting from the seed Uniform Resource Locators (URLs) placed in the crawler queue, it downloads and parses each web page, extracts the URLs the page contains, and so obtains new URLs. Pages containing data on the government's economic situation, including all economy-related pages, are considered relevant. To simplify the calculation, however, only the top 10 pages returned by the Internet search are collected, and duplicate economic situation data are not collected. On real network traffic data, the usage heat of the various indicators is counted. The crawler reads the front page of a website, finds the other link addresses in that page, follows them to the next pages, and sets the access depth for different pages until all pages of the website are captured [12]. The captured pages are then preprocessed. With the government's economic development as the theme, pages with an inconsistent theme are filtered out, wrong or irregular tags are repaired and sorted out using the table tag, and the repaired complete pages are stored as HTML documents. The HTML document is taken as the root node to construct the tag tree, and the visual information of the page is used to process the page in blocks. In line with the forecasting needs of the government's economic development, redundant information is removed from the page, useful information is linked together, the text related to the subject content is found, the hypertext is marked, and the page is integrated [13]. Finally, the pages are downloaded over HTTP and the effective information in them is captured, including sound, text, image, and other documents, to obtain the government economic index data. Video, audio, database, picture, text, and other types of data in the page are collected, already visited URLs are eliminated, new URLs are added to the crawl queue, and the above operations are repeated. So far, the collection of government economic development index data is completed.
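The queue-based crawl described above can be sketched in Python with the standard library only. This is a minimal illustration under stated assumptions, not the collection system used in the study: the `fetch` and `is_relevant` callables, the depth limit, and the page cap are hypothetical parameters, and the tag repair, block segmentation, and media download steps are omitted.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, is_relevant, max_depth=2, max_pages=10):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url)        -- hypothetical callable returning HTML text or None
    is_relevant(html) -- hypothetical topic filter (e.g. a keyword test)
    Returns a dict mapping each relevant URL to its HTML.
    """
    seeds = list(seed_urls)
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)                      # visited or queued URLs
    collected = {}
    while queue and len(collected) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        if is_relevant(html):              # keep only on-topic pages
            collected[url] = html
        if depth >= max_depth:             # respect the access depth
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:           # skip URLs already handled
                seen.add(link)
                queue.append((link, depth + 1))
    return collected
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the network layer, so the same function can be exercised against an in-memory dictionary of pages before it is pointed at real sites.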
3.3 Preprocessing Government Economic Development Index Data
The massive government economic development index data collected are preprocessed to enable the government economic index data to accurately express the government's economic situation.
3.3.1 Cluster Processing of Government Economic Index Data.
The distributed k-means algorithm is used to cluster the government economic development index data. First, the original economic situation indicator data are cleaned to deal with wrong data, noise, and invalid data. Then, the time series of government economic development data in the historical data are counted and arranged in quarterly order to maintain the continuity of the index data, and the missing data are filled in. Through attribute mapping, the character data of the original dataset are converted into standardized numerical data. The mapping formula is as follows:
Where, \({m}\) is the processed standardized government economic index data, \({{m}_{\max}}\) and \({{m}_{\min}}\) are the maximum and minimum values of the processed government economic index data, respectively, \({n}\) is the original historical data of the government economic development index, and \({n_{\max}}\) and \({n_{\min}}\) are the maximum and minimum values of the original data, respectively. Then, \({k}\) data objects are randomly selected from the dataset as the initial cluster centers of the government economic development index data, and each remaining data object is compared with the initial cluster centers using the Mahalanobis distance. The Mahalanobis distance is a covariance-based distance and an effective way to measure the similarity of two unknown sample sets. Unlike the Euclidean distance, it takes the relationships between the features into account, which improves the effectiveness of the calculation. The calculation formula of the Mahalanobis distance is
Suppose \({{H}_{ij}}\) is the Mahalanobis distance between government economic indicator data items \({i}\) and \({j}\). The closer \({H_{ij}}\) is to 1 or −1, the higher the correlation degree and the closer the two data items are. The closer \({H_{ij}}\) is to 0, the lower the correlation degree and the farther apart they are. Each remaining data object is assigned to the nearest cluster center, and the cluster centers are then recomputed. This is iterated until the criterion function converges and the \({k}\) cluster centers no longer change. The criterion function J is defined as follows:
Where, \({Z_{r}}\) is the central point value of the cluster center of class \({r}\), and \({E_r}\) is the average value of \({Z_r}\). The clustered government economic indicator data are then cleaned, and records irrelevant to government economic development are deleted, including picture content requests, file requests, and crawler requests. When HTTP requests are initiated, illogical sessions are separated out, and a large amount of government economic development information is recorded through the HTTP headers. Based on the piecewise-clustered government economic index data, the collected data are finely classified to obtain different local data tuples. The refined data items after segmentation are shown in Table 2.
Build a distributed SQL database to represent the attribute structure of data items, and provide data support for government economic situation prediction through various refined datasets. So far, the clustering processing of government economic index data is completed.
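The clustering steps of this subsection can be sketched in Python as follows. The sketch makes several assumptions that are labeled in the comments: the attribute mapping is taken to be ordinary min-max scaling onto a target interval, the Mahalanobis distance is the standard covariance-based form restricted to two features so that the covariance matrix can be inverted by hand, and the covariance is estimated once from the whole dataset. It runs on a single machine and does not attempt the distributed execution or the criterion function J.

```python
from statistics import mean


def min_max_map(values, new_min=0.0, new_max=1.0):
    """ASSUMED form of the attribute mapping: scale n onto [m_min, m_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def covariance_2d(points):
    """Sample covariance of 2-D points, returned as (sxx, sxy, syy)."""
    mx, my = mean(p[0] for p in points), mean(p[1] for p in points)
    n = len(points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(a, b, cov):
    """Standard Mahalanobis distance between 2-D points a and b."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Quadratic form d^T S^-1 d with the 2x2 inverse written out.
    return ((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det) ** 0.5


def kmeans_2d(points, centers, max_iter=100):
    """Assign points to the nearest center, recompute centers, repeat."""
    cov = covariance_2d(points)            # ASSUMPTION: one global covariance
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: mahalanobis_2d(p, centers[c], cov))
            clusters[nearest].append(p)
        new_centers = [
            (mean(q[0] for q in cl), mean(q[1] for q in cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:         # centers unchanged: converged
            break
        centers = new_centers
    return centers, clusters
```

The initial centers are passed in explicitly rather than drawn at random so that a run is reproducible; random selection of the \({k}\) starting objects can be layered on top with `random.sample`.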
3.3.2 Ranking the Primary and Secondary Relationship of Government Economic Index Data.
The SPRINT classification algorithm is used to rank the primary and secondary relationships of the government economic development index data. Its main advantages are that it is easy to understand and has low time complexity. It can be used to process small datasets and is insensitive to missing values, so it can effectively extract the characteristics of the related data. The max-min normalization formula is used to linearly transform the continuous numerical attributes of the government economic index data so that they can be discretized. The calculation formula is
Where, \({L}\) is the data value of government economic indicators, \({M}\) and \({N}\) are the maximum and minimum values of government economic indicator data with the same attribute, respectively, \({\beta}\) is the mapping interval, and \({V}\) is the mapped value of government economic indicator data. The neural network center is used to replace the continuous values of the government economic index data, which converts the data attributes into discrete values, exposes regular patterns while preserving the relative attributes, and reduces the number of distinct values of the same attribute [14]. The SPRINT classification algorithm is then used to rank the primary and secondary relationships of the index data and to classify the economic development level of governments in surrounding areas: these governments are divided into multiple subgroups, and governments with different development levels are treated as different categories. Governments within the same subgroup have similar economic development levels. The classification is realized through a decision tree. The attribute with the highest priority in the government economic indicator data is selected as the root, which provides the preprocessed attribute set. Commonalities are searched for in the data, a series of sorting decisions is made, the decision tree nodes are split, and the data attributes are split in turn, so that the attributes are accurately associated with the child nodes and the dataset partitioned by attribute value is obtained [15]. If the number of dataset categories is \({c}\) and equals the number of leaf node categories, the splitting parameter \({F}\) is calculated as
Where, \({p_{I}}\) is the relative frequency of dataset category \({I}\). A data node is selected in the dataset. The logical judgment of the economic development level of the surrounding government is taken as an internal node of the decision tree, the branch result of that judgment is taken as an edge, and the data attributes are associated with the root node, so as to construct a multiway decision tree. When all the government economic indicator data belong to the same category, the class label is used to define the leaf node. When they do not belong to the same category, the data attribute is measured according to the information entropy, and the attribute that has been used is deleted from the original attribute set. When the candidate set is empty, a leaf node is returned and marked with the most common category. For different types of government economic index data, the information entropy \({W}\) is calculated as
Where, \({\xi}\) is the dataset given to the decision tree, and \({{C}_I}\) is the set of objects in the dataset belonging to class \({I}\). The dataset \({\xi}\) is partitioned according to the attribute characteristics to obtain multiple subsets. The weighted sum of the information entropy \({W}\) over this partition gives the partition entropy, based on which the information gain of the government economic index data can be calculated according to the following formula:
Where, \({K}\) is the information gain of government economic index data, and \({\eta}\) is the number of attribute characteristics of the dataset. In the attribute set, the attribute with the highest information gain \({K}\) is selected and used to mark the node, the score of that attribute is obtained, and the subset elements of the dataset are made to meet the score. When the categories at a node are the same, when the remaining attributes cannot be subdivided, or when the given score has no data, a class label is created and the division of the decision tree terminates, which completes the classification of the economic development level of the governments in the surrounding areas. So far, the ranking of the primary and secondary relationships of government economic index data is completed, and the preprocessing of government economic development index data is completed.
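The splitting measures used in this subsection can be illustrated with a short Python sketch. It assumes the standard definitions: the splitting parameter is read as the Gini index (the measure SPRINT uses), the information entropy is the Shannon entropy of the class frequencies, and the information gain is the entropy of the dataset minus the size-weighted entropy of its partitions. The function names, the dictionary row format, and the sample attribute names are hypothetical.

```python
from collections import Counter
from math import log2


def class_frequencies(labels):
    """Relative frequency p_I of each class I in a list of labels."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_index(labels):
    """ASSUMED form of the splitting parameter: 1 - sum(p_I ** 2)."""
    return 1.0 - sum(p * p for p in class_frequencies(labels))


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    return -sum(p * log2(p) for p in class_frequencies(labels))


def information_gain(rows, labels, attribute):
    """Entropy of the dataset minus the weighted entropy of its partitions."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in partitions.values())
    return entropy(labels) - weighted


def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain, used to split the node."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

For example, with four regions labeled by development level, an attribute that separates the two classes perfectly has a gain of 1 bit, while an attribute that leaves each partition evenly mixed has a gain of 0, so the former is chosen for the split.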
3.4 The Characteristics of Government Economic Development
Based on the semantic attention of government economic indicator data, this study highlights semantically similar government economic development indicator data and uses them to predict the government economic situation. On this basis, the factors that affect the semantic distance between data items are analyzed, including economic development, social development, and urban construction. These influencing factors are taken as the dynamic characteristics of the government economic indicator data, so that the data change over time and show different characteristics of government economic development. Multidimensional data information is obtained according to the dynamic features, data exploration is conducted to reduce the dimension of the government economic indicator data, and the multidimensional dynamic features are converted into two-dimensional dynamic features according to the dimension combinations obtained by data exploration [16]. Assuming that the characteristic dimension of government economic indicator data is \({z}\), the combination exploration condition \({R}\) of government economic indicator data is calculated as
The abstract features of government economic index data in the information space are extracted and divided into three categories: time series, network, and level. The calculation parameters of the semantic distance are determined according to the structural relationships among the three types of government economic indicator data. Suppose that the dynamic characteristic object of the government economic development index is \({s}\) and that any government economic index data object in the information space is \({x}\); then the semantic distance \({d(s,x)}\) between \({s}\) and \({x}\) is
Where, \({f(s,x)}\) is the two-dimensional display of \({x}\) on the combination of the dynamic feature dimensions of \({s}\). The implicit intention is used to determine the impact of government economic index data on the prediction of the government economic situation, and the explicit intention is used to clarify the prediction intention. \({g(s,x)}\) represents the association relationship between \({x}\) and \({s}\), \({l(s,x)}\) is the center distance after semantic representation of \({x}\) and \({s}\), and \({w}\) represents the weight.
The semantic distance is taken as an important parameter of the semantic attention of government economic indicator data. The value \({d(s,x)}\) gives the distance between \({x}\) and \({s}\) at the semantic level. Based on it, the a priori importance of the different data items in the prediction of the government economic situation can be determined, the semantic attention threshold can be set, and the collection of government economic indicator data items can be limited. The semantic attention \({P}\) of government economic indicator data item \({x}\) with respect to \({s}\) is
Where, \({k(s,x)}\) is the a priori importance of \({x}\) with respect to \({s}\), and \({C}\) is the semantic attention threshold. The greater the semantic attention, the closer the semantics of \({x}\) and \({s}\). Government economic indicator data items with similar semantics are aggregated to assist in the prediction of the government economic situation.
For the government economic development index data with similar semantics, the association rules of the government economic situation in the surrounding areas are mined, and the economic development model of each government is determined according to the relationships between the different attributes and characteristics of the government economic index data. The attribute information of the index data in the dataset is extracted and divided into three sets: the continuous attribute set, the original invariant attribute set, and the nominal attribute set. This study uses the HowNet knowledge base to define the words found in the semantic dictionary and takes the DEF item in HowNet as the concept of a word. Based on these concepts, words with similar meanings are replaced so that the words are semantically related. HowNet is a common-sense knowledge base that takes the concepts represented by Chinese and English words as its objects of description and takes revealing the relationships between concepts and between the attributes of concepts as its basic content. The semantic similarity interval between words is also considered, using the minimum and the maximum semantic similarity. The semantic similarity of different government economic indicator data is calculated according to the minimum and maximum semantic similarity. The specific calculation formula is
Where, \({{K}_a}\) and \({{K}_b}\) are the concepts of the semantic words \({a}\) and \({b}\) of government economic indicator data, and \({{K}_a \cap {K}_b}\) is the number of words shared by the definitions of the two concepts. The value of concept similarity lies within [0, 1]. The smaller the similarity, the less likely it is that the mined feature attribute is semantically related to the prediction of the government economic situation; the greater the similarity, the closer the concept semantics. A semantic similarity threshold is set, the feature attributes whose similarity \({M}\) is greater than the threshold are selected, the government economic index data are extracted, and the similarity between the data is determined. The semantic similarity matrix is used to represent the semantic similarity of all government economic index data. Combined with the semantic elements of government economic situation prediction, the index features with deep semantic connections are mined, and the common parts of the semantic elements of the index feature attributes are analyzed to obtain the key points of semantic connection and the semantic information describing the characteristics of government economic development. According to the semantic bias of the economic development characteristics toward economic situation prediction, the data mining results of the government economic indicators are semantically processed, and the characteristics of government economic development are defined. So far, the mining of the characteristics of government economic development is completed.
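The similarity matrix and threshold selection described above can be sketched in Python. The similarity measure here is an assumption: it uses the ratio of shared definition words to all distinct definition words of the two concepts (a Jaccard ratio), which lies in [0, 1] and grows with the overlap, but it is not necessarily the formula used in the study. The concept names and definition words in the usage note are invented for illustration and are not taken from HowNet.

```python
def concept_similarity(def_a, def_b):
    """ASSUMED measure: shared definition words over all distinct words."""
    set_a, set_b = set(def_a), set(def_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def similarity_matrix(concepts):
    """Square matrix of pairwise similarities, in the dict's key order."""
    names = list(concepts)
    return [[concept_similarity(concepts[a], concepts[b]) for b in names]
            for a in names]


def pairs_above_threshold(concepts, threshold):
    """Concept pairs whose similarity is strictly greater than the threshold."""
    names = list(concepts)
    matrix = similarity_matrix(concepts)
    return [(names[i], names[j], matrix[i][j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if matrix[i][j] > threshold]
```

Given hypothetical definitions such as `{"fiscal income": ["money", "government", "obtain"], "tax revenue": ["money", "government", "collect"], "road": ["facility", "route"]}`, the first two concepts share two of four distinct words and are retained at a threshold of 0.4, while "road" shares none and is dropped.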
3.5 Training of Government Prediction Model
The experimental data were collected from the National Bureau of Statistics. The types of data collected include residents’ income, production level, socio-economic level, and so on. The government economic development data are input into the BP neural network to predict the government economic situation. The BP neural network applies the radial basis function of multivariable interpolation and adopts a three-layer forward network as its typical structure. In the middle layer, it transforms the characteristic attributes of government economic development in the surrounding areas extracted from the input layer, so that the categories of these characteristic attributes move closer to the network centers. If the output value of the \({i}\)th neuron is \({{x}_i}\) and the sample point of the \({j}\)th network center is \({{G}_j}\), the corrected new network center \({B}\) is
The characteristic attributes of government economic development in surrounding areas are assigned to the new network centers, and the set of network centers is used as the value domain to replace the characteristic values, so as to eliminate the impact of differing data dimensions on the prediction and to find the change law of the government economic situation. In predicting the government's economic situation, the error of the predicted value grows as the prediction length increases. Therefore, the BP neural network is trained on the fitting error to ensure the prediction accuracy for nonlinear factors. Built on the error back-propagation algorithm, the learning algorithm of the BP neural network consists of four processes. In the first stage, the input pattern is propagated forward from the input layer through the middle layer to the output layer. In the second stage, the error signal between the expected output and the actual output of the network is propagated backward through the middle layer to the input layer, and the connection weights of the network are corrected layer by layer. The third stage is the repeated alternation of forward pattern propagation and error back propagation. The fourth stage is the convergence of learning, in which the global error of the network tends to a minimum [17, 18]. The overall process of predicting the government's economic situation with the BP neural network is shown in Figure 4.
As shown in the three-layer BP network structure in Figure 4, the input layer has 2 nodes, the hidden layer has 6 nodes, and the number of output layer nodes equals the number of output vectors. The number of output vectors for the target value of the neural network is set to 1, namely the prediction result of the government's economic situation. The characteristic attributes of the government's economic development are used as the training data and test data. The BP neural network is trained on the training data, and the predicted value of the government's economic situation is output. The predicted value of the economic situation is divided into levels 1 to 5, as shown in Table 3.
According to Table 3, the level of the government economic situation is determined. So far, the prediction of the government economic situation has been completed, and the design of the government economic situation prediction method based on big data analysis has been realized.
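The four training stages and the 2-6-1 layout can be illustrated with a small pure-Python sketch. It is a generic back-propagation network under stated assumptions, not the model trained in the study: it uses sigmoid activations rather than radial basis functions, a squared-error loss, per-sample weight updates, and a fixed random seed. The `to_level` helper maps an output in [0, 1] to levels 1 to 5 with equal-width bins, which is a hypothetical stand-in for the level boundaries of Table 3.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class BPNetwork:
    """Three-layer feed-forward network trained by error back propagation."""

    def __init__(self, n_in=2, n_hidden=6, n_out=1, seed=0):
        rng = random.Random(seed)          # fixed seed for reproducibility
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        """Stage 1: propagate the input through the hidden layer to the output."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
                  for row, b in zip(self.w2, self.b2)]
        return hidden, output

    def train_step(self, x, target, lr):
        """Stage 2: propagate the error backward and correct the weights."""
        hidden, output = self.forward(x)
        # Error terms for the loss 0.5 * (output - target) ** 2.
        delta_out = [(o - t) * o * (1.0 - o) for o, t in zip(output, target)]
        delta_hidden = [h * (1.0 - h) *
                        sum(d * self.w2[k][j] for k, d in enumerate(delta_out))
                        for j, h in enumerate(hidden)]
        for k, d in enumerate(delta_out):
            for j, h in enumerate(hidden):
                self.w2[k][j] -= lr * d * h
            self.b2[k] -= lr * d
        for j, d in enumerate(delta_hidden):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d * xi
            self.b1[j] -= lr * d

    def mse(self, samples):
        """Mean squared error of the first output over (input, target) pairs."""
        return sum((self.forward(x)[1][0] - t[0]) ** 2
                   for x, t in samples) / len(samples)

    def fit(self, samples, epochs=2000, lr=0.5):
        """Stages 3 and 4: alternate forward and backward passes for a fixed
        number of epochs (a simple stand-in for a convergence test)."""
        for _ in range(epochs):
            for x, t in samples:
                self.train_step(x, t, lr)


def to_level(score, levels=5):
    """ASSUMPTION: equal-width bins mapping [0, 1] to levels 1..levels."""
    return min(levels, int(score * levels) + 1)
```

In use, a list of `(inputs, targets)` pairs scaled to [0, 1] is passed to `fit`, after which `forward(x)[1][0]` gives the predicted score and `to_level` converts it to a level; comparing `mse` before and after training shows whether the global error has decreased.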