1 Introduction

Following the release of the State Council's "Guiding Opinions on Strengthening the Construction of Digital Government", governments across China have actively built digital systems for urban governance.

The construction of smart cities has now entered a data-centric stage [1]. By leveraging advanced technologies and data analytics, smart cities increase the efficiency of infrastructure and services, improve citizens' quality of life, and promote economic growth [2]. Governments should not only keep pace with technological developments but also maintain efficient management and policies [3]. The digital government of smart cities is a crucial means of modernizing the national governance system and governance capacity [4]: it can reduce the cost of government operations, improve the efficiency of government execution [5], and provide better services to citizens. Information systems and associated services should be promoted through integration between businesses and government organizations; such collaborations increase the efficiency of smart governance [6]. Indeed, collaboration is a key metric for evaluating the smartness of government initiatives [7].

Large-scale collaborative tasks in government are complex and burdensome for government staff. For example, neighborhood committee staff regularly visit the elderly and distribute benefits; during epidemic outbreaks, hospitals must register the consumption of medical supplies such as medications and oxygen; and all communities must maintain information about the permanent population within their jurisdiction for daily management. City government staff, citizens, and businesses therefore all participate in recording and reporting management or service tasks. Once reported information is available, the government needs to understand the progress of task handling for supervision and decision-making purposes; for instance, it needs to track supplies' usage and storage trends during an epidemic outbreak for timely dispatching.

These tasks, assigned by a government organization, are distributed hierarchically according to the government administration. Government management takes two top-down forms: management based on geographical hierarchies and management based on organizational hierarchies, as shown in Fig. 1, where staff at the lowest level of the hierarchy deliver services to citizens. Collaboration tasks comprise task creation, task dispatch, task execution, and task statistics. Common procedures in government tasks include submitting applications, executing jobs, and providing necessary materials, which require the participation of government staff, citizens, and businesses.

Fig. 1
figure 1

Government administration

To effectively manage large-scale collaborative tasks in government, we need a flexible, efficient, scalable, and controllable system for governmental collaboration. This system should address the complex and variable scenarios of grassroots governance. The main challenges in achieving these goals are as follows:

  • Task execution system: We require a system that can integrate geographical and organizational hierarchy information to promote cross-departmental and cross-hierarchical execution of large-scale government collaboration tasks. This system should also support flexible changes in business types.

  • Efficient data analysis: There is a high demand for real-time monitoring of task execution. However, aggregating data across time and hierarchy is challenging, especially with massive amounts of data, making it difficult to ensure timeliness.

Currently, the daily work of government staff still heavily relies on manual collection and maintenance of information, which has not been fully integrated into information systems [8]. Existing government management systems are isolated from each other, forming data silos that hinder collaboration and data aggregation between government organizations [9, 10]. Current research on intelligent government management mainly focuses on data sharing [11].

Current collaboration systems lack multi-hierarchy collaboration. Typically, a task is started by one manager and then assigned to several other staff. In tasks requiring collaboration across multiple hierarchies, each hierarchy must create similar tasks separately, and data does not automatically transfer between hierarchies, often necessitating significant manual intervention for data flow. In practical scenarios, data flows between hierarchies through office automation software or email. As a result, these systems cannot enable collaboration between different organizations and departments, effective monitoring of task progress, timely access to task statistics, or efficient multi-person collaboration.

As a consequence, the current systems cannot facilitate the execution of extensive collaborative tasks within governmental structures. We present a spatio-temporal task tree designed for large-scale collaborative tasks in government and propose a log-based incremental update method for updating statistical values in the tree structure. The main contributions are as follows:

  • We propose the concept of a spatio-temporal task tree, which incorporates organizational hierarchy, location hierarchy, and time. This concept accommodates cross-departmental, cross-hierarchical execution scenarios in large-scale government collaborative tasks while supporting flexible business-type changes.

  • The log-based incremental update method for statistical values in the tree structure is introduced, enabling updating of statistical results in seconds, significantly reducing computational costs, and improving query efficiency.

2 Preliminaries

In this section, we introduce the definitions and problem statements. All frequently used notations are shown in Table 1.

Table 1 Notation table

2.1 Definitions

Definition 1

Collaboration Task Tree: a collaboration task can be defined as a task tree consisting of a node set and a parent function, namely \({\mathcal {C}}=\left\{ C,\ p_C\right\}\). Specifically, \(C=\left\{ c_i|1\le i\le n_C\right\}\) denotes the tree nodes, where \(n_C\) is the number of nodes, and \(p_C\left( c_i\right)\) denotes the parent node of \(c_i\). Each \(c_i\) is associated with a node in the location tree, where the task is carried out, as well as a node in the organization tree, which takes responsibility for this task. Likewise, we can define the location tree \({\mathcal {L}}=\left\{ L,p_L\right\}\) and the organization tree \({\mathcal {O}}=\left\{ O,\ p_O\right\}\). We use \(a_l\left( c_i\right)\) and \(a_o\left( c_i\right)\) to denote the associated location node and organization node of \(c_i\), respectively. Finally, each task node \(c_i\) contains a list of jobs, denoted as \(\left\{ j_{i,1},\ldots ,j_{i,n_i}\right\}\), where \(n_i\) is the number of jobs on node \(c_i\).
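As a concrete illustration of Definition 1, the three trees can be modeled as a node set plus a parent function, with the associations \(a_l\) and \(a_o\) stored on each task node. The following sketch is hypothetical; all class and field names are illustrative, not part of the paper's system:

```python
# A minimal sketch of Definition 1; names are illustrative, not from the paper's system.

class TaskNode:
    """One node c_i of the collaboration task tree C."""
    def __init__(self, node_id, location_id, organization_id):
        self.node_id = node_id                  # identifier of c_i
        self.location_id = location_id          # a_l(c_i): node in the location tree L
        self.organization_id = organization_id  # a_o(c_i): node in the organization tree O
        self.jobs = []                          # [j_{i,1}, ..., j_{i,n_i}]

class Tree:
    """Generic tree {C, p_C}: a node set plus a parent function."""
    def __init__(self):
        self.nodes = {}   # node_id -> node payload
        self.parent = {}  # p_C: node_id -> parent node_id (None for the root)

    def add(self, node_id, payload, parent_id=None):
        self.nodes[node_id] = payload
        self.parent[node_id] = parent_id

    def children(self, node_id):
        return [n for n, p in self.parent.items() if p == node_id]

# Build a tiny collaboration task tree: a city root with two district nodes.
task_tree = Tree()
task_tree.add("c0", TaskNode("c0", "city", "municipal-gov"))
task_tree.add("c1", TaskNode("c1", "district-1", "district-gov-1"), parent_id="c0")
task_tree.add("c2", TaskNode("c2", "district-2", "district-gov-2"), parent_id="c0")
```

The location tree \({\mathcal {L}}\) and organization tree \({\mathcal {O}}\) would be built with the same `Tree` structure, differing only in their payloads.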

Definition 2

Spatio-Temporal Task Status Tree: a spatio-temporal task status tree is employed to maintain the status of a collaboration task, denoted as \(\{{\mathcal {S}}^t|0\le t\le n_t\}\), where \({\mathcal {S}}^t\) denotes the status of \({\mathcal {C}}\) at timestamp t. In particular, for each task node \(c_i\), \({\mathcal {S}}^t\) maintains the status of each job on this node, such as started, in progress, and finished (denoted as \(\{s_1^t,\ldots, s_{n_i}^t\}\)), as well as the job statistics indicators of node \(c_i\), such as the finishing rate and the delay rate (denoted as \(\{h_{i,1}^t,\ldots , h_{i,n_h}^t\}\), where \(n_h\) is the number of indicators). Figure 2 shows the \(\{{\mathcal {S}}^t|0\le t\le n_t\}\) of \(\{{\mathcal {C}}_1,{\mathcal {C}}_2,\ldots ,{\mathcal {C}}_m\}\).

Fig. 2
figure 2

Spatio-Temporal Task Status Tree

2.2 Problem Statement

The lifecycle of a collaboration task consists of the following stages:

  1. Task creation: Initially, a government organization proposes to deploy a collaboration task. This organization needs to create the collaboration task tree, including the tree structure and the corresponding location, organization, and jobs of each tree node.

  2. Task dispatch: The organization of each tree node \(c_i\) can dispatch its jobs \([j_{i,1},\ldots ,j_{i,n_i}]\) to the child nodes, i.e., \(\{c_k|p_C\left( c_k\right) =c_i\}\). There are two dispatching strategies: (1) according to the corresponding locations of the jobs; and (2) according to the organizations responsible for the jobs. Note that each job should be dispatched to exactly one child node, and the parent task node can visit and change the status of all these jobs.

  3. Task execution: After receiving jobs, the organizations can start to execute the task and change the status of jobs (i.e., \(\{s_1^t,\ldots, s_{n_i}^t\}\)) in our system. Once the status of a job is modified, all the related task nodes that contain this job perceive the change. When the execution is finished, the organization can submit its tasks.

  4. Task statistics: Near real-time statistics on the task tree are employed to compute \(\left\{ h_{i,1}^t,\ldots, h_{i,n_h}^t\right\}\) according to the changing status of jobs, so that government staff can monitor the task execution process.
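The location-based dispatching strategy of the task dispatch stage can be sketched as follows. The helper and field names are hypothetical; the sketch only shows the invariant that every job lands on exactly one child node:

```python
# Hypothetical sketch of location-based dispatch (strategy 1). Each job carries the
# location it targets and is routed to the unique child task node whose associated
# location matches, so every job is dispatched to exactly one child.

def dispatch_by_location(jobs, children):
    """jobs: list of dicts, each with a 'location' field.
    children: mapping child_node_id -> associated location a_l(c_k).
    Returns mapping child_node_id -> list of jobs dispatched to it."""
    location_to_child = {loc: child for child, loc in children.items()}
    assignment = {child: [] for child in children}
    for job in jobs:
        child = location_to_child[job["location"]]  # exactly one child per job
        assignment[child].append(job)
    return assignment

children = {"c1": "district-1", "c2": "district-2"}
jobs = [
    {"job_id": "j1", "location": "district-1"},
    {"job_id": "j2", "location": "district-2"},
    {"job_id": "j3", "location": "district-1"},
]
assignment = dispatch_by_location(jobs, children)
```

Dispatching by responsible organization (strategy 2) would be identical in shape, keyed on \(a_o\) instead of \(a_l\).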

3 System Overview

Figure 3 gives an overview of our system. It can provide a flexible, efficient, scalable, and controllable system for governmental collaboration. There are two main parts in the system: (1) information storage (detailed in Sect. 4), and (2) update and query method of statistical value (detailed in Sect. 5).

  1. Information storage: The system stores data for four task processes: task creation, task dispatch, task execution, and task statistics. First, the system stores structured location information and organizational information as \({\mathcal {L}}\) and \({\mathcal {O}}\). Then, the government uses the system to identify the responsible organization based on the area of the task execution, subsequently creating and storing \({\mathcal {C}}\). When the task is dispatched, the system stores the association of jobs and task nodes. During task execution, the system records the real-time status of the task nodes. In addition, the system stores the operations of task execution as a log, from which the statistical values are calculated. Finally, all the historical statistical values are stored for multi-dimensional queries of task statistics.

  2. Update and query of statistical values: We propose a method for incrementally updating statistical values, which enables task statistics to be updated in seconds. The system regularly reads newly generated log information and then analyzes and updates the statistical values in memory. Thus the system can keep statistical values current under the high volume and concurrency of task execution.

Fig. 3
figure 3

System overview

4 Information Storage

4.1 Basic Information Storage

We construct \({\mathcal {L}}\) based on the geographic administrative divisions of the city, as shown in Table 2, where each row represents an \(l_i\). Similarly, \({\mathcal {O}}\) is constructed based on the organizational structure of the government, as shown in Table 3, where each row represents an \(o_i\). During task creation, the government selects the organization responsible for task execution based on the area of the task. This process structures the chosen organizations into a \({\mathcal {C}}\), as shown in Table 4, where each row represents a \(c_i\). There is only one \({\mathcal {O}}\) and one \({\mathcal {L}}\), but arbitrarily many \({\mathcal {C}}\). The above tables are stored in MySQL [12].

Table 2 Structure of \({\mathcal {L}}\)
Table 3 Structure of \({\mathcal {O}}\)
Table 4 Structure of \({\mathcal {C}}\)
Fig. 4
figure 4

Code generation

The codes in the three tables above are the location code, organization code, and task code; they are all generated in the same way, as shown in Fig. 4. We use \({Code}=m_{d_0} \ldots m_{d_i}\), \(0 \le i \le level\), to denote the code of a node, where \(m_{d_i}\) indicates that the node is the \(m_{d_i}\)-th child of its parent node at depth \(d_i\). Therefore, we can recover all the ancestor nodes of a node from its code alone.

After \(c_0\) creates a task, it can dispatch jobs to its child nodes, and each job can be dispatched to only one child node. Jobs can be dispatched down the chain as far as the leaf nodes. During the task dispatch process, jobs are associated with the nodes of a task chain. This association is also stored, and staff at a task node can view and change the status of jobs associated with this node and all its child nodes.
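The prefix-code scheme of Fig. 4 can be sketched as follows. The paper does not specify segment widths, so fixed two-digit segments are an assumption made here so that prefixes align on segment boundaries:

```python
# Hypothetical sketch of the prefix code: a node's code is the concatenation of
# m_{d_0}, ..., m_{d_i}, where each segment records which child the node is at
# that depth. Fixed two-digit segments are assumed so that every ancestor's code
# is a prefix of the node's code.

SEGMENT_WIDTH = 2

def child_code(parent_code, child_index):
    """Code of the child_index-th child of the node with parent_code."""
    return parent_code + str(child_index).zfill(SEGMENT_WIDTH)

def ancestor_codes(code):
    """All proper ancestors of a node, recovered purely from its code."""
    return [code[:i] for i in range(SEGMENT_WIDTH, len(code), SEGMENT_WIDTH)]

root = "00"                             # the root node
district = child_code(root, 3)          # 3rd child of the root
subdistrict = child_code(district, 17)  # 17th child of that district
```

Because every ancestor's code is a prefix, a prefix match on the code column retrieves an entire subtree, which is the index property exploited in Sect. 5.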

4.2 Operation Log Storage

When staff change the status of jobs, each operation will generate a log, leading to a significant amount of data in large-scale government collaborative tasks. Traditional row-based storage techniques may surpass the capacity constraints of a single machine.

HBase [13] is a distributed, column-oriented NoSQL database that can expand storage capacity and processing power by adding new nodes. It supports efficient key-value indexing capabilities. HBase uses key-value column storage and supports multiple retrieval methods, including Scan and Get [14]. In the Scan method, HBase supports single-row queries based on the RowKey, range queries, filter queries, full table scans, scans of specific “column families” and scans of specific “columns”. In the Get method, HBase supports single row queries based on the RowKey, batch queries, specific “column family” queries, and specific “column” queries. We store the user’s operation logs in HBase.

We assume that each log record has a timestamp (ts), a collaboration task tree ID (\(tree\_id\)), and a job ID (\(job\_id\)). To minimize the frequency of database accesses during task execution analysis, we use \(tree\_id+ts\) as the prefix of the RowKey, so that all the operation logs of a \(tree\_id\) within a specific period can be fetched in a single range query. Given that the timestamp is precise only to the millisecond, a surge in application traffic may result in multiple job operation logs sharing the same \(tree\_id+ts\), leading to data loss. To address this issue, the RowKey is designed as:

$$\begin{aligned} tree\_id+ts+job\_id \end{aligned}$$
(1)
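A sketch of this RowKey layout follows. The separator and field widths are assumptions not specified in the paper; the key point is that HBase sorts rows lexicographically by byte, so the timestamp must be zero-padded for range scans to return logs in chronological order:

```python
# Illustrative RowKey builder for the operation log table (Eq. 1). The '|'
# separator and zero-padding widths are assumptions. HBase orders rows
# lexicographically, so the millisecond timestamp is padded to a fixed width
# to keep the sort order chronological within one tree_id prefix.

TS_WIDTH = 13  # epoch milliseconds fit in 13 digits for the foreseeable future

def log_rowkey(tree_id, ts_millis, job_id):
    """RowKey = tree_id + ts + job_id: the job_id suffix disambiguates
    multiple logs sharing the same tree_id and millisecond timestamp."""
    return f"{tree_id}|{str(ts_millis).zfill(TS_WIDTH)}|{job_id}"

def scan_range(tree_id, start_ts, end_ts):
    """Start/stop keys for a Scan over one tree's logs in [start_ts, end_ts)."""
    start = f"{tree_id}|{str(start_ts).zfill(TS_WIDTH)}"
    stop = f"{tree_id}|{str(end_ts).zfill(TS_WIDTH)}"
    return start, stop

k1 = log_rowkey("t01", 1700000000000, "j42")
k2 = log_rowkey("t01", 1700000000000, "j43")  # same tree and millisecond, different job
```

Without the \(job\_id\) suffix, `k1` and `k2` would collide and one log would be silently overwritten, which is exactly the data loss described above.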

4.3 Spatio-Temporal Task Status Tree Storage

Storing statistical values requires keeping the values of each task node at every historical timestamp, and different types of task trees have different job statistics indicators \(\{h_{i,1}^t,\ldots, h_{i,n_h}^t\}\). This means managing a large amount of data with uncertain storage requirements, which is challenging for a conventional row-based relational database. We therefore utilize HBase to store the statistical values of tasks at each historical timestamp. Assuming the ID of \(c_i\) is \(task\_id\) and the update timestamp is ts, the RowKey is designed as:

$$\begin{aligned} task\_id+ts \end{aligned}$$
(2)

4.4 Newest Status Information Storage

4.4.1 Jobs Newest Status Storage

As the system's usage grows, multiple people may change the same job at the same time. As shown in Fig. 5, when Person 1 modifies the content from a to b at time ts, and shortly afterwards, at time \(ts+\theta\), Person 2 changes the content from a to c, the final state of the job is c. However, errors occur when performing incremental statistics based only on the log information. For example, when maintaining a sum, processing these two records against the previous value sum would yield \(sum-2a+b+c\), whereas the correct value is \(sum-a+c\).

Fig. 5
figure 5

Illustration of high traffic scenarios

To resolve this problem, we require a table that records the most recent job state, storing the data from the previous successful statistical run. In this way, the operation log only needs to record the job information after modification, while the job status before modification is queried from this table. Due to the large amount of data, HBase is employed for storage. Assuming the job's ID is \(job\_id\), the RowKey is designed as:

$$\begin{aligned} job\_id \end{aligned}$$
(3)
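The correction can be sketched as follows: before applying a log, the job's previous state is fetched from the last-state table (Eq. 3), so the sum is adjusted by exactly `new - old` even under concurrent edits. A plain dict stands in for the HBase table:

```python
# Sketch of delta-based incremental aggregation using a last-state table.
# 'last_state' stands in for the HBase table keyed by job_id (Eq. 3).

last_state = {}  # job_id -> value seen at the previous successful statistics run

def apply_log(total, job_id, new_value):
    """Update the running sum with one operation log entry."""
    old_value = last_state.get(job_id, 0)  # pre-modification state, from the table
    last_state[job_id] = new_value
    return total - old_value + new_value

# Reproduce Fig. 5: two people edit the same job. Person 2's stale read of 'a'
# does not matter, because the table already holds 'b' when their log is applied.
a, b, c = 10, 20, 30
total = a                           # previous statistical value: sum = a
last_state["j1"] = a
total = apply_log(total, "j1", b)   # person 1 at ts:        a -> b
total = apply_log(total, "j1", c)   # person 2 at ts+theta:  b -> c
```

The final total equals \(sum-a+c\), not the erroneous \(sum-2a+b+c\) obtained when both logs are naively subtracted against a.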

4.4.2 Spatio-Temporal Newest Status Task Tree Storage

The statistical values are updated with a log-based incremental approach. However, during periods of high system access, or when creating tasks that involve importing extensive jobs, tens of thousands of job operation logs may arrive in a brief time. Triggering a statistical value update for each log entry would significantly drain input/output (IO) resources. Statistical values are therefore updated in batches every \(\Delta t\) seconds.

Implementing a batch update method requires retrieving the latest statistical values of the task nodes involved in the logs from the database. As described above, statistical values are stored in HBase with \(task\_id+ts\) as the RowKey. Since each node updates its statistical values at different timestamps, retrieving these values can only be done through single-row Get queries based on the RowKey. Accessing the database separately for the most recent statistical values of the task nodes involved in each log would consume considerable IO resources. To reduce this overhead, we maintain a table that stores the most recent statistical values of each task node, with the RowKey designed as:

$$\begin{aligned} task\_id \end{aligned}$$
(4)

Based on this design, HBase's batch Get can retrieve the latest statistical values of all task nodes touched by the logs generated over a period in a single operation.
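A sketch of that batched lookup follows, with a plain dict standing in for the latest-statistics table keyed by \(task\_id\) (Eq. 4); with a real HBase client this loop would become a single multi-Get call:

```python
# Sketch: one batched lookup of the newest statistics for every task node touched
# by a batch of logs. 'latest_stats' stands in for the HBase table whose RowKey
# is task_id (Eq. 4); a real client would submit the keys as one batch Get.

latest_stats = {
    "000301": {"finish_rate": 0.4},
    "000302": {"finish_rate": 0.7},
}

def batch_get_latest(task_ids):
    """Fetch the newest statistics of every task node in one pass."""
    unique_ids = sorted(set(task_ids))  # de-duplicate before the lookup
    return {tid: latest_stats.get(tid) for tid in unique_ids}

touched = ["000301", "000302", "000301"]  # task_ids extracted from a log batch
stats = batch_get_latest(touched)
```

De-duplicating the key list matters in practice, since many logs in one batch typically touch the same task chain.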

5 Update and Query of Statistical Values

In this section, we describe the method of performing updates of task nodes based on operation logs and the method of query compensation.

The method has three parts. First, we compute only incremental statistical values to avoid wasting resources: only the changed statistical values are calculated and stored. Second, triggering an update for every single operation would degrade the system's performance, so we store the operation logs and trigger statistical updates in batches to improve throughput. Third, we build indexes to improve query speed. The prefix codes of \({\mathcal {C}}\), \({\mathcal {L}}\), and \({\mathcal {O}}\) serve as an index to quickly query the statistical values of a given area. In addition, the storage of statistical values is augmented with a time index to enable fast queries of historical statistical values.

5.1 Statistical Values Update Method

Algorithm 1 illustrates the update method.

Algorithm 1
figure a

Batch Updating of Statistics

The system executes Algorithm 1 as a scheduled task every \(\Delta t\) seconds. Each scheduled execution queries all task trees awaiting statistical updates. The system calculates the statistical values of each task tree either concurrently or sequentially, exploiting shared configurations among tasks within the same tree. This significantly reduces the need for per-log database accesses, thereby conserving IO resources and enhancing the efficiency of the update process.

At the beginning of every statistical run, the system first collects all task trees awaiting updates. For each task tree, the operation logs within \([tree\_id + ts, tree\_id + c\_ts)\) are retrieved from the log database L and sorted chronologically. This ensures that updates are applied in the order they were recorded, maintaining the accuracy of the statistical values. Subsequently, for each log, the system carries out the operations depicted in Fig. 6.

  1. Task identification (5): The system identifies the task nodes affected by these logs, ensuring that updates are targeted and efficient.

  2. Retrieval of last jobs and statistics (6–7): The system fetches the last job statuses and statistical values related to the identified tasks, based on the \(task\_ids\) and \(job\_ids\) in the logs.

  3. Log processing and statistical updates (8–13): This step involves parsing each log, identifying the relevant tasks, and updating their statistical values \(\{h_{i,1}^t,\ldots, h_{i,n_h}^t\}\) in \(S^{ts}\).

  4. Save newest statistical values (16): A new timestamp \(c\_ts\) is created to store the new statistical values \(\{h_{i,1}^{c\_ts},\ldots, h_{i,n_h}^{c\_ts}\}\) in \(S^{c\_ts}\).

Fig. 6
figure 6

Illustration of Algorithm 1
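A simplified sketch of Algorithm 1's batch replay follows. In-memory dicts stand in for the HBase log, last-state, and statistics tables, and the indicator is reduced to a single "finished" count per node; all names are illustrative:

```python
# Simplified sketch of Algorithm 1: every Δt seconds, replay the new operation
# logs of one task tree in timestamp order and update a per-node counter.
# In-memory dicts stand in for the HBase log, last-state, and statistics tables.

def batch_update(logs, last_state, stats, chain_of):
    """logs: list of dicts {ts, job_id, task_id, done} (done = new status).
    chain_of: task_id -> list of task node ids on its task chain, all of
    which perceive the job's change. last_state turns each log into a
    +1/0/-1 delta so replays and concurrent edits are counted once."""
    for log in sorted(logs, key=lambda x: x["ts"]):   # chronological replay
        old = last_state.get(log["job_id"], False)
        delta = int(log["done"]) - int(old)           # -1, 0, or +1
        last_state[log["job_id"]] = log["done"]
        for node in chain_of[log["task_id"]]:         # update the whole chain
            stats[node] = stats.get(node, 0) + delta
    return stats

chain_of = {"000301": ["00", "0003", "000301"]}
stats = batch_update(
    logs=[
        {"ts": 2, "job_id": "j1", "task_id": "000301", "done": True},
        {"ts": 1, "job_id": "j2", "task_id": "000301", "done": True},
        {"ts": 3, "job_id": "j1", "task_id": "000301", "done": True},  # repeat: delta 0
    ],
    last_state={}, stats={}, chain_of=chain_of,
)
```

Note how the repeated log for `j1` contributes a zero delta, so each node on the chain ends with a count of 2 finished jobs rather than 3.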

The time complexity analysis of Algorithm 1 is as follows:

  • The complexity of algorithm initialization and data preparation is considered to be O(1).

  • The complexity of log retrieval and sorting is \(O(M\log M)\), where M is the number of logs retrieved.

  • The complexity of parsing each log for tasks and updating indicators is O(MK), where K is the average number of tasks associated with each log.

  • Therefore, the overall time complexity of the algorithm is \(O(N \cdot (M\log M + MK))\), where N is the number of trees.

This indicates that the complexity of the algorithm is mainly influenced by the number of trees, the number of logs, and the average number of tasks associated with each log.

By structuring updates around task trees and leveraging shared configurations, Algorithm 1 not only optimizes the process of updating statistical values but also minimizes the system’s IO resource consumption. This efficiency is crucial for maintaining up-to-date statistical value, facilitating the timely and accurate analysis of operation logs for enhanced decision-making and operational insights.

5.2 Statistical Values Query

Queries based on organizational lists, geographic location lists, and task types first collect the corresponding task nodes. Then, using HBase's Get, the latest statistical values are acquired in a single operation from the table that maintains the most recent statistical values. For queries at historical timestamps, the \(task\_id\) is combined with the timestamp ts of the query moment; a prefix match on \(task\_id\) is then performed in the time-sliced statistical value table to retrieve the last statistical record whose RowKey is less than \(task\_id+ts\).

Given that the statistical task is executed only every \(\Delta t\) seconds, querying the latest statistical values of a task might yield stale results. To reduce this error, a compensation method is applied during retrieval: before reading the statistical values, if the system is not currently calculating them, an incremental update is executed for the current task tree, and the statistical values computed after this update are returned.
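This compensation step can be sketched as follows; the flag and helper names are hypothetical, and the incremental update itself is reduced to folding pending deltas into a counter:

```python
# Hypothetical sketch of query-time compensation: if no scheduled batch run is
# in progress, fold the tree's pending logs into the statistics before
# answering, so the caller never sees values older than the logs already written.

class StatisticsStore:
    def __init__(self):
        self.stats = {}          # task_id -> statistical value
        self.pending_logs = []   # logs written since the last batch run
        self.updating = False    # True while the scheduled batch job runs

    def _apply(self, logs):
        for log in logs:
            tid = log["task_id"]
            self.stats[tid] = self.stats.get(tid, 0) + log["delta"]

    def query(self, task_id):
        # Compensation: run the incremental update on demand, unless the
        # scheduled batch job is already doing so.
        if not self.updating and self.pending_logs:
            self._apply(self.pending_logs)
            self.pending_logs.clear()
        return self.stats.get(task_id, 0)

store = StatisticsStore()
store.pending_logs.append({"task_id": "000301", "delta": 1})
value = store.query("000301")  # pending log is folded in before answering
```

Skipping the compensation while `updating` is set avoids double-counting logs that the in-flight batch run is about to consume.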

6 Experimentation

6.1 Task Descriptions

First, we set up a government administration task scenario. Suppose that the municipal government needs to collect information about the elderly in the city, and the task is performed by the 10 district governments under the municipal government and the 100 subdistricts under each district government. Every leaf node of the collaboration task tree has over 100 jobs to process. The task tree is then constituted as shown in Fig. 7, where the numbers represent the task codes of the task nodes.

Fig. 7
figure 7

Structure of Collaboration Task Tree

Each job requires collecting the elderly person's name, age, gender, whether they have children, and whether they are disabled. The following indicators need to be updated for each task node during task execution: the task completion rate, the number of elderly over 80 years of age, the number of elderly with disabilities, and the number of elderly without children; these constitute \(\{h_{i,1}^t,\ldots, h_{i,n_h}^t\}\) of \({\mathcal {S}}^t\).
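For this scenario, the four per-node indicators can be sketched as a simple fold over the collected records; the field names are assumptions based on the scenario description:

```python
# Sketch of the four indicators of Sect. 6.1 computed for one task node.
# Record field names ('age', 'has_children', 'disabled') are assumptions.

def compute_indicators(records, total_jobs):
    """records: completed job records for one node.
    Returns the indicator vector {h_{i,1}, ..., h_{i,4}} of that node."""
    return {
        "completion_rate": len(records) / total_jobs,
        "over_80": sum(1 for r in records if r["age"] > 80),
        "disabled": sum(1 for r in records if r["disabled"]),
        "no_children": sum(1 for r in records if not r["has_children"]),
    }

records = [
    {"age": 85, "has_children": True, "disabled": False},
    {"age": 78, "has_children": False, "disabled": True},
]
indicators = compute_indicators(records, total_jobs=100)
```

In the incremental scheme of Sect. 5, each of these values would of course be maintained as a running count updated by log deltas rather than recomputed over all records.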

6.2 Baseline

Since no existing algorithm combines job parsing with tree-structured aggregation statistics, we adopt HBase for data storage and use in-memory real-time job parsing and computation as the experimental baseline.

6.3 Experimental Environment

All experiments are conducted on a cluster of three nodes, each equipped with an 8-core CPU, 32 GB of memory, and a 1 TB disk.

6.4 Experimental Results and Analysis

6.4.1 Different Incremental Update Time

This section compares the execution time of the statistical values update method for different values of \(\Delta t\). We set the 1000 subdistrict nodes in the task tree to execute one job per second on average, so 1000 operation logs are generated every second. We then vary \(\Delta t\) from 0 to 13 and execute the statistical values update method. The results are shown in Fig. 8. When \(\Delta t\) is set to 10, the number of operation logs awaiting update has accumulated to 10,000, yet the statistics execution time remains below 1 s.

Fig. 8
figure 8

Execution time of different \(\Delta t\)

6.4.2 Benchmark Experiment Comparison

This section compares our method against the baseline to validate the performance of the statistical values update method. The experiment simulates the task tree generating operation logs at a rate gradually increasing from 1 to 5000 per minute. The statistical values of the task node with code “00” are then queried after 10 min. The results are shown in Fig. 9. With our method, the query time for these values remains constant even as the number of accesses rises, stabilizing at approximately 0.2 seconds. In contrast, real-time calculation of statistical values leads to a gradual increase in query time as accesses accumulate, demonstrating the efficiency of our method.

Fig. 9
figure 9

Query time of different access volumes

6.4.3 Concurrent Update of Statistical Values

This section evaluates the performance of our method when multiple tasks are executed simultaneously in the system, with \(\Delta t\) set to 10 seconds. Assume that multiple tasks matching the above task description execute at the same time, and each task tree generates one thousand operation logs per second. As shown in Fig. 10, although the update time gradually increases with the number of task trees, the method completes within 6 seconds even when ten tasks generate ten thousand operation logs simultaneously. This demonstrates the high availability of the method.

Fig. 10
figure 10

Time for concurrent calculation of statistics

6.4.4 Comparison with Other Schemes

We evaluate the performance of our method (batch incremental update) by comparing it with two schemes:

  • Periodical update: The system periodically recomputes and stores the current statistical values of all task nodes.

  • Single incremental update: Every single job status update directly triggers an incremental statistical update along the task chain with which the job is associated.

We compared the performance of these schemes in the following aspects:

  • Real-time response: Whether real-time response is achievable when querying task statistical value.

  • Real-time values: Whether the retrieved statistical values are real-time.

  • Quick operation: Whether it is possible to operate quickly when simultaneously modifying the status of a large number of jobs.

Table 5 Performance comparison with various schemes

As shown in Table 5, the periodical update scheme cannot respond with real-time statistical values, as its update intervals are relatively long, typically ranging from a few hours to a day. The single incremental update scheme affects the operation of tasks: since a job's state change affects multiple task nodes, a single status change triggers updates to the statistical values of multiple task nodes, and all nodes on the task chain are locked during an update to prevent concurrent updates to the same task node's statistical values. When the statuses of multiple jobs in a task tree change at the same time, this locking forces job status updates to wait. Only our method simultaneously achieves real-time responses to statistical value queries, real-time query results, and statistical updates that do not block job status changes.

7 Conclusion

This paper presents a system for large-scale collaborative government tasks based on a spatio-temporal task tree. The system supports flexible changes in operations and real-time viewing of statistical values. It further proposes a log-based incremental update method for statistical values in a tree structure, achieving updates of statistical results within seconds. Experimental results demonstrate the feasibility of this approach. Future work could further optimize the algorithm to enhance system performance and scalability.