1 Introduction

Human occupational ergonomics is important because it aims to design jobs, tasks, and work environments in a way that reduces the risk of musculoskeletal disorders (MSDs) and other types of injuries and health problems. Poor ergonomics can lead to musculoskeletal problems such as back pain, neck pain, carpal tunnel syndrome, tendinitis, and more [1]. It can also contribute to mental health problems, such as stress and burnout [2]. By designing interventions that reduce physical strain and promote healthy postures and work habits, employers can improve well-being, reduce absenteeism, and create a more positive work culture. Sensors can be used in a variety of ways to measure and assess human occupational ergonomics. For instance, motion sensors can track the movements of workers [3] and help assess the ergonomics of their workstation setup; force sensors can measure the forces exerted on workers’ bodies [4] as they perform tasks, which helps to identify potential ergonomic risks; and temperature sensors can measure the temperature of the work environment [5], which affects the comfort and safety of workers [6]. By collecting multivariate data from these and other types of sensors, employers and occupational health professionals can better understand the ergonomic risks faced by workers and take steps to mitigate them. Machine learning (ML) and artificial intelligence (AI) methods applied to wearable sensor data have been widely investigated to recognize how humans interact with their work environment and to assess their occupational ergonomics [6, 7]. Acquiring wearable sensor data in controlled settings and using AI/ML approaches to predict occupational ergonomics is a widespread trend in the scientific community.
Nevertheless, as far as we are aware, there is little research on engineering and implementing AI/ML methodologies for occupational ergonomics so that they can be integrated into practical applications and communicate in human-friendly natural language, with the ultimate aim of providing actionable insight to help mitigate MSDs among workers. Our motivation is driven by gaps in recent research at the intersection of AI and occupational health. Saadatnejad et al. [8] demonstrate the importance of modeling uncertainty in human pose forecasting, which enhances the trustworthiness and reliability of the predictions. However, we see the need to go beyond pose forecasting by integrating a machine learning pipeline with Large Language Models (LLMs) to transform posture predictions and uncertainty estimates into actionable health risk assessments and recommendations. Furthermore, Saadatnejad et al. [8] evaluate pose forecasting models on datasets such as Human3.6M, AMASS, and 3DPW, none of which is a real-world dataset reflecting occupational health risks. To address this challenge, we use the real-world Digital Worker Goldicare dataset [9], which consists of many hours of accelerometer data from home care workers. Furthermore, a recent comparative study [10] highlights that human experts outperform the LLM-based ChatGPT in generating accurate and complete medical responses in occupational health. However, the study does not consider the use of real-world sensor data and uncertainty-aware posture detection to augment the responses from an LLM. Doing so requires a comprehensive machine learning pipeline that automates the steps of converting sensor data into postures over time, followed by a natural language interpretation of the result.

This study introduces ERG-AI, a machine learning pipeline designed to predict a sequence of postures from data collected over long-term observations via various wearable sensors. The pipeline leverages the machine learning models’ performance metrics and predictive capabilities (e.g. uncertainty) to present occupational health risk assessments and improvement suggestions in a language that is easily understood by users. The ERG-AI pipeline incorporates a diverse suite of software modules for data input, feature extraction, data division, training, and inference, employing AI/ML models such as deep neural networks and decision trees. It also acknowledges the presence of uncertainties in the workplace arising from measurement errors, limitations in sensing technology, user comfort and wearability, and calibration and maintenance needs. Given that it is impractical to obtain independent and identically distributed data for every possible scenario, ERG-AI incorporates a “dropout” regularization scheme for epistemic uncertainty estimation, in which nodes are randomly eliminated during both the training and inference stages. This enables ERG-AI to generate an “uncertainty-aware confusion matrix” (UCM) that estimates the posture classification system’s accuracy and associated uncertainty. Ultimately, ERG-AI constructs a detailed prompt encapsulating the machine learning model’s posture sequence predictions and associated uncertainty estimates. The prompt is sent to the API of a large language model (LLM), such as GPT-4 [11] or LLAMA-2 [12], which returns a comprehensible occupational health risk assessment and user-specific recommendations.

We assessed the ERG-AI system using real-world data from the Digital Worker Goldicare dataset [9], which consists of 2913 hours of accelerometer data collected over 3.8 workdays from 114 home care workers. These workers, who include nurses, nursing assistants, occupational therapists, and others, wore five tri-axial accelerometers attached to various parts of their bodies. Previously, postures in this dataset were identified by applying vector mathematics to accelerometer data collected at a rate of 25 Hz within a measurement range of ±8 G. In designing a practical system for estimating occupational ergonomics risks, it is crucial to consider factors such as battery usage and ease of use by workers. Consequently, we employed ERG-AI to predict posture from down-sampled accelerometer data, reduced from 25 Hz to 1 Hz, which could conserve battery life on a low-power device. We also studied the system’s performance as we incrementally reduced the number of physical sensors from five to just one. With down-sampled data, ERG-AI’s predictions were more accurate for certain postures (lying, kneeling, sitting, and standing) than for others: it showed higher uncertainty when predicting activities such as walking, running, climbing stairs, and other less common postures. As expected, we observed that reducing the number of accelerometers feeding into ERG-AI weakened its predictive capabilities. For example, when only one sensor was used on the arm, it could only reliably detect lying, sitting, and to some extent standing and running. The system also generated higher uncertainty estimates for predictions when using fewer accelerometers. Taking into account the predictions for various postures and their uncertainty estimates over a period of time, ERG-AI generates a prompt to determine occupational health risks and recommendations.
ERG-AI invokes an LLM API, such as that of the commercial GPT-4 [11] or the open-source LLAMA-7B [12], to generate occupational health risks and recommendations. An ergonomics professional analyzed the output to evaluate its validity and usefulness. We found that ERG-AI could provide meaningful and balanced explanations of occupational health risks, based on summary statistics of detected postures and on the machine learning model’s performance and uncertainty. Nevertheless, the recommendations need to be more specific, and knowledge of a worker’s age, gender and overall health could help here. We also analyzed the differences between the outputs of a commercial LLM, GPT-4, and an open-source LLM, LLAMA-7B, which is much smaller in size and appropriate for portable devices. Lastly, we assessed ERG-AI in terms of energy consumption and carbon footprint for the training and evaluation phases, to encourage the reporting of environmental footprint. We do not evaluate the environmental footprint of LLM inference in this article as it has been analyzed by other authors [13].

To summarize, our contributions include:

  • Comprehensive Machine Learning Pipeline for Occupational Ergonomics: ERG-AI incorporates a robust machine learning pipeline that handles data ingestion, preprocessing, model training, and deployment. The pipeline is designed to efficiently manage and process large datasets, ensuring accurate posture prediction and effective implementation in real-world scenarios, with an emphasis on energy efficiency and reduced carbon footprint. Furthermore, for posture prediction it leverages the Digital Worker Goldicare dataset [9], a real-world dataset on the occupational health of home care workers.

  • Uncertainty-aware Posture Prediction: ERG-AI predicts long-term worker postures using data from wearable sensors, incorporating uncertainty estimation to enhance prediction accuracy and reliability. The system generates an “uncertainty-aware confusion matrix” to evaluate the posture classification system’s accuracy and associated uncertainty.

  • Large Language Model-driven Insights for Occupational Ergonomics: ERG-AI leverages large language models (LLMs) like GPT-4 and LLAMA-2 to transform posture predictions and uncertainty estimates into comprehensible occupational health risk assessments and personalized recommendations for workers. The integration of LLMs is intended to make the output user-friendly and actionable, facilitating better health outcomes and ergonomic practices. To the best of our knowledge, our article is the first work that combines posture prediction and large language models for generating occupational health risks and recommendations.

The rest of the paper is organized as follows. We present background work on sensor-driven occupational ergonomics, machine learning pipelines, uncertainty estimation, and LLMs in Section 2. We then present related work on the use of AI/ML for occupational ergonomics in Section 2.6 and our machine learning pipeline ERG-AI in Section 3. In Section 4 we evaluate the pipeline using sensor data acquired from caregivers, and we conclude in Section 5.

2 Background

In this section, we present background work on sensor-driven occupational ergonomics, data pipelines that pre-process sensor data and train machine learning models to predict posture, the concept of uncertainty estimation to be incorporated in the data pipeline, large language models, and the dimensions of AI engineering that need to be considered for the maintenance and deployment of machine learning models for occupational ergonomics.

2.1 Sensor-driven occupational ergonomics

Musculoskeletal disorders (MSDs) are injuries caused by stress on internal body parts such as muscles, nerves, tendons, joints, cartilage, and spinal discs during movement [14, 15]. They impact many individuals across various occupations and industries, ranging from office work to manufacturing, construction, and healthcare. These disorders can lead to long-term disability and economic losses [18].

Musculoskeletal disorders caused by workplace activities are known as work-related musculoskeletal disorders (WMSDs). The high physical demands of certain jobs and the prevalence of WMSDs contribute significantly to the elevated sickness absence rates among workers. Numerous studies have utilized advancements in portable sensor technologies to accurately measure physical work demands [15,16,17,18].

The utilization of portable sensors for healthcare, wellbeing, and behavioral analysis to prevent WMSDs associated with awkward postures has been explored using both rule-based and machine learning models. These models aim to identify risks associated with specific tasks and the ergonomic design of tasks, tools, and workplaces to align physical jobs with workers’ natural capacities [19,20,21]. Developing accurate posture assessment tools requires collecting sufficient spatiotemporal work-related data. Traditional data collection methods, including self-reporting, manual observation, and sophisticated sensor networks, are time-consuming, intrusive, and require technical expertise that may not be readily available among workers and employers [9, 18].

This research aims to design and test a methodology using an unobtrusive and automated data processing framework to classify body postures associated with the risk of developing MSDs, utilizing only wearable sensor technology (accelerometers) on workers during their activities. The activity classification output can identify ergonomic risk levels for each worker and major sources of ergonomic risks, aiding workers and decision-makers. The data used to validate the presented pipeline was pre-processed and labeled using a modified version of the custom-made software Acti4 (The National Research Centre for the Working Environment, Copenhagen, Denmark) [22]. Acti4 employs rule-based models to determine activity categories and postures, such as lying, sitting, standing, moving, slow walking, fast walking, running, cycling, stair-climbing, arm-elevation, forward trunk inclination, and kneeling, with high sensitivity and specificity [23].

2.2 Machine learning pipelines

A machine learning pipeline is a set of interconnected steps that are designed to transform raw data into a final model that can be deployed to make predictions on new data. In this paper we are using supervised learning, the branch of machine learning where a model is trained to create a function for mapping input data to expected outcome values. We employ a popular data pipeline framework called Data Version Control (DVC) [24, 25] to implement ERG-AI. A DVC pipeline has several stages that are used to manage and version large datasets, machine learning models, and other data-intensive projects. We briefly describe the role of each stage in DVC in this section.

Data ingestion: In this stage, raw data is ingested into the data pipeline from a data repository such as a file system, a database, or an API. For instance, raw data from wearable sensors is acquired by a mobile app over a serial peripheral interface, a universal asynchronous receiver-transmitter (UART), Wi-Fi, or Bluetooth, and stored on a file system.

Data preprocessing: In this stage, a pipeline profiles, cleans and prepares raw data for training by machine learning algorithms. Profiling the data [26] provides information on data quality and insights into the distributions of the various features. The data pipeline typically makes use of external libraries for profiling, such as Pandas profiling [27], and Great Expectations [28] for specifying domain-specific assertions on data quality if need be. The profiling statistics can be used to clean the data, which involves removing missing or invalid values and minimizing the use of poor-quality data. After cleaning, pre-processing entails feature extraction, which transforms raw sensor data into a set of relevant and robust features. Raw sensor data is often complex, noisy, and high-dimensional, which can make it challenging to work with and interpret. Feature extraction reduces the amount of data that needs to be processed and analyzed, while retaining the essential information required for the task at hand. A pipeline can use an external Python library for feature extraction such as TSFEL [29], which provides about 60 features extracted from time series data. After feature extraction, both raw data and features need to be normalized if they are to be used for machine learning. Normalizing sensor data is the process of rescaling the data to a common scale. This is often necessary because sensor data may have a wide range of values, and some machine learning algorithms are sensitive to the scale of the data. Two common choices are min-max scaling, which maps the data into a predefined range, and standardization, which transforms the data to have zero mean and unit variance. A data pipeline typically employs external libraries such as scikit-learn [30], which provides a number of off-the-shelf algorithms for scaling data (e.g., min-max scaler, standard scaler). Finally, preparing scaled data for training entails restructuring the data in the form of input-output pairs required by machine learning algorithms.
A window size on the input features is typically specified to represent the time horizon of sensor data used to make a prediction. Detecting posture is a classification problem where input data is used to predict one of several posture classes. In many cases, prepared data may be imbalanced, meaning that one or more classes are underrepresented compared to others. This can bias the machine learning algorithm towards the majority class, resulting in poor performance for the minority classes. Hence it is also necessary to balance the scaled data, that is, to adjust the distribution of data in a dataset so that it contains an equal number of samples from each class. Here the pipeline may employ techniques such as oversampling, under-sampling and generation of synthetic data. A recent review of balancing techniques is presented by Susan et al. [31]. DVC manages how data is versioned and stored during all the steps of the preprocessing stage, ensuring that only the affected artifacts are updated across multiple runs of the pipeline, which improves performance.
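The windowing, scaling and balancing steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the ERG-AI implementation: the function names, the non-overlapping windows, the majority-vote window label and the choice of random oversampling are all assumptions made for the example, and the synthetic signal stands in for real accelerometer data.

```python
import numpy as np

def make_windows(signal, labels, window_size):
    """Cut a (T, C) signal into non-overlapping windows of window_size samples.

    Each window is labelled with the most frequent label among its samples
    (an assumption; other labelling rules are possible).
    """
    n_windows = signal.shape[0] // window_size
    usable = n_windows * window_size
    X = signal[:usable].reshape(n_windows, window_size, signal.shape[1])
    y_win = labels[:usable].reshape(n_windows, window_size)
    y = np.array([np.bincount(row).argmax() for row in y_win])
    return X, y

def standardize(X, eps=1e-8):
    """Scale every channel to zero mean and unit variance."""
    mean = X.mean(axis=(0, 1), keepdims=True)
    std = X.std(axis=(0, 1), keepdims=True)
    return (X - mean) / (std + eps)

def oversample(X, y, rng):
    """Randomly duplicate minority-class windows until all classes are equal."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    selected = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = members[rng.integers(0, len(members), size=target - n)]
        selected.append(np.concatenate([members, extra]))
    idx = np.concatenate(selected)
    return X[idx], y[idx]

# Synthetic stand-in for a tri-axial accelerometer: 1000 samples, 3 channels,
# with 80% of the samples in class 0 and 20% in class 1.
rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 3))
labels = (np.arange(1000) >= 800).astype(int)

X, y = make_windows(signal, labels, window_size=50)
X = standardize(X)
X_bal, y_bal = oversample(X, y, rng)
```

In a real pipeline the scaling statistics would normally be computed on the training split only and reused for validation and test data, and balancing would likewise be applied to the training split only.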

Model Training and Storage: In this stage, machine learning models are trained using the processed data. This stage may also include tasks such as hyperparameter tuning, model selection, and evaluation. Trained models are stored in a separate storage system, which can be local or remote. DVC manages how models are versioned during training phases and monitors the dependencies with the preprocessing stage. It ensures that training only occurs when there are updates to the data available in the processing phase. DVC can skip the execution of certain stages and instead fetch the correct output from the cache if they have already run in the same configuration.

Model Deployment: In this stage, the trained models are deployed to a production environment, where they can be used for inference or prediction. This may include tasks such as creating an API or integrating the model into an existing software system.

2.3 Uncertainty estimation

Posture prediction for occupational ergonomics using machine learning may be affected by two broad categories of uncertainty: data uncertainty (also referred to as aleatoric uncertainty) and model uncertainty (also referred to as epistemic uncertainty) [32]. Data uncertainty refers to the inherent variability in data that can impact the accuracy of posture prediction in occupational ergonomics. It can be caused by errors in the measurement of input variables, such as tri-axial accelerometer vectors at different joints of the body, resulting in inaccurate posture predictions. The causes of measurement uncertainty include calibration errors, sensitivity to temperature, mounting errors due to the sensor not being properly aligned and secured, sensor drift over time due to temperature changes and mechanical stress, the ski-slope problem in high-frequency accelerometers, signal noise due to electromagnetic interference, and variations in sampling frequencies due to memory and power limitations. Data uncertainty can also be caused by errors and biases in the training data used to develop posture prediction models, resulting in poor generalization performance and inaccurate posture predictions for new subjects or tasks. It can further arise from inter-subject and intra-subject variability in human posture and movement, resulting in inaccurate posture predictions, especially for tasks or postures that are not well represented in the training data. Finally, data uncertainty can also occur due to environmental factors that affect posture, such as changes in lighting, temperature, or work equipment. Model uncertainty stems from the selection of the machine learning model structure and its parameters. Different neural network architectures (e.g. CNNs, FCNNs, RNNs, Transformer models) have different structures and different types and numbers of learnable parameters. Therefore, different models predict posture differently, and this model choice is itself a source of uncertainty.

Uncertainty estimation in our context is the process of quantifying the degree of uncertainty or error in posture prediction. It supports posture prediction in human occupational ergonomics by providing a measure of confidence in the posture prediction models. In this paper, we use deep neural networks (DNNs) for posture prediction and for estimating the uncertainty in their predictions. There are different approaches to estimating uncertainty in the prediction of posture, as presented in [33]. We present the three most common approaches below:

  • Monte Carlo Dropout during inference: This method [34, 35] involves randomly dropping out some neurons during the forward pass of a trained deep neural network to obtain a set of predictions. The variance of these predictions can then be used as an estimate of model uncertainty.

  • Bayesian Deep Learning: This approach involves training a probabilistic model (e.g. Bayesian CNN/RNN) [36] that provides a distribution over model parameters. For instance, Bayesian neural networks can be trained using a variation of Bayesian inference called stochastic gradient Markov chain Monte Carlo (SG-MCMC) [37]. This approach allows for the estimation of a posterior distribution over the model parameters given the training data, which can be used to make predictions and quantify uncertainties.

  • Deep Ensembles: This method involves training multiple deep neural networks [38] with different initializations, and averaging their predictions at inference time. It is important to ensure diversity in the ensembles. This can be achieved by using different architectures, regularization techniques, or hyperparameters for each network. The variance of the predictions by the different architectures can be used as an estimate of model uncertainty.

In this article, we use Monte Carlo dropout during inference to estimate posture prediction uncertainty. Dropout was originally a regularization technique used in neural networks to prevent over-fitting [39] and improve generalization. It involves randomly dropping out (ignoring) a percentage of the neurons in a neural network during training, which helps to prevent the network from becoming too complex and memorizing the training data rather than learning generalizable patterns. Here, however, we also use dropout during inference to estimate uncertainty in the prediction of posture [34, 35]. In Monte Carlo (MC) dropout, the forward pass for inference is performed multiple times on a DNN with dropout enabled. Each forward pass randomly drops a different subset of neurons and therefore produces a different output. The result is a distribution of predictions that can be used to estimate the uncertainty of the model: from it, we can quantify both the mean prediction and the variability (uncertainty) of the predictions. Specifically, MC dropout helps estimate epistemic uncertainty, which arises from the model’s lack of knowledge and can be reduced with more data. We quantify the uncertainty for classification problems such as posture prediction by computing the entropy \(\textrm{H}\) of the softmax output from the neural network:

$$\textrm{H}(X) := -\sum_{x \in N} p(x)\log p(x) \qquad (1)$$

In this equation, \(\textrm{H}(X)\) is the Shannon entropy of the random variable X representing the softmax output of the DNN predicting posture. The symbol := means “is defined as.” The summation \(\sum _{x \in N}\) runs over the set N of all values x that X can take on (here, the posture classes), and p(x) is the probability of X taking on the value x. Finally, \(\log p(x)\) is the natural logarithm (base e) of p(x). The Shannon entropy provides a measure of the uncertainty in the predictions: higher entropy indicates more uncertainty, while lower entropy indicates more confidence in the predictions. This entropy-based uncertainty measure is crucial for applications where knowing the confidence level of predictions can inform subsequent decision-making processes.
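The following NumPy sketch illustrates MC dropout and the entropy of Eq. (1). It is not the ERG-AI implementation: the two-layer network, its random (untrained) weights `W1` and `W2`, the dropout rate `p_drop` and the number of passes `n_passes` are assumptions made for the example. The entropy is computed on the softmax output averaged over the stochastic passes, which is one common choice among several.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, W1, W2, rng, p_drop=0.5, n_passes=100):
    """Run n_passes stochastic forward passes with dropout kept active."""
    passes = []
    for _ in range(n_passes):
        h = np.maximum(x @ W1, 0.0)            # hidden layer with ReLU
        keep = rng.random(h.shape) >= p_drop   # fresh random mask per pass
        h = h * keep / (1.0 - p_drop)          # inverted-dropout rescaling
        passes.append(softmax(h @ W2))
    return np.stack(passes)  # shape: (n_passes, n_samples, n_classes)

def entropy(p, eps=1e-12):
    """Shannon entropy (natural log) over the last axis, as in Eq. (1)."""
    return -np.sum(p * np.log(p + eps), axis=-1)

# Hypothetical toy setup: 5 input windows, 6 features, 4 posture classes.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 16))
W2 = rng.normal(size=(16, 4))
x = rng.normal(size=(5, 6))

samples = mc_dropout_predict(x, W1, W2, rng)
mean_probs = samples.mean(axis=0)         # mean prediction per input
predicted = mean_probs.argmax(axis=-1)    # most likely posture class
H = entropy(mean_probs)                   # one uncertainty value per input
```

With four classes the entropy lies between 0 (all probability mass on one posture) and log 4 ≈ 1.386 (a uniform prediction), which gives a natural scale for interpreting the values.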

Monte Carlo dropout is a popular method for estimating uncertainty in deep neural networks that can be easier to implement and more computationally efficient than other methods such as Bayesian neural networks and deep ensembles. Bayesian neural networks [36] require a significant amount of computation to learn the posterior distribution of model parameters and make predictions using Monte Carlo sampling. Deep ensembles [38], on the other hand, require training multiple neural networks independently and then averaging their predictions, which can be computationally expensive. In contrast, Monte Carlo dropout provides a simpler and faster way to estimate model uncertainty by randomly dropping out neurons during inference and averaging the predictions over multiple samples. This method requires only a small amount of additional computation during inference and can be easily integrated into existing models. Due to its computational efficiency and ease of implementation, we opted for Monte Carlo dropout to estimate uncertainty in our experiments.

2.4 Uncertainty-aware confusion matrix

A confusion matrix [40] for posture detection shows the performance of a classification model in predicting the postures of a human worker based on sensor data. The confusion matrix summarizes the number of correct and incorrect predictions made by the model for each posture class, as well as the types of errors made. The confusion matrix is a powerful visualization to obtain an overview of the performance of a model. Nevertheless, a confusion matrix does not directly show uncertainty because it only provides information on the number of correct and incorrect predictions for each class. It does not give information on how confident the model is in its predictions or the degree of uncertainty associated with each prediction.

We introduce the concept of an uncertainty-aware confusion matrix, which shows the uncertainty of a classification model as a quantity assigned to each predicted class. The uncertainty can be quantified using the standard deviation of the predicted probability for a class or the entropy of the predicted class distribution. The uncertainty-aware confusion matrix is similar to the traditional confusion matrix, but it also shows the uncertainty assigned to each predicted class in colored boxes, as shown in Fig. 5. The uncertainties in our case are entropies computed with the dropout method presented in Section 2.3. Visually, a darker red color represents higher uncertainty/entropy, while a lighter red color indicates lower uncertainty/entropy.

In the context of posture prediction, if the wearable sensor system is designed to classify a person’s posture into three categories (standing, sitting, and lying down), the uncertainty-aware confusion matrix would contain the entropy associated with the prediction of each posture class. The matrix can be used to obtain an overview of metrics such as the expected true positive rate or expected false positive rate, which provide a measure of the overall accuracy of the posture classification system and its associated uncertainty. By using dropout inference in conjunction with the uncertainty-aware confusion matrix, we aim to convey more trustworthy DNN outputs in real-world scenarios where data may not be independent and identically distributed (IID).
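One way to assemble such a matrix is sketched below for the three-class example. The aggregation rule is an assumption made for illustration (each cell stores the count of samples and the mean entropy of the predictions that fell into it); the exact layout used in Fig. 5 may differ, and the labels and entropy values are made up.

```python
import numpy as np

def uncertainty_aware_confusion_matrix(y_true, y_pred, entropies, n_classes):
    """Return (counts, mean_entropy), both of shape (n_classes, n_classes).

    Rows are true classes and columns are predicted classes. Cells that
    received no sample have a mean entropy of NaN.
    """
    counts = np.zeros((n_classes, n_classes), dtype=int)
    entropy_sum = np.zeros((n_classes, n_classes))
    np.add.at(counts, (y_true, y_pred), 1)
    np.add.at(entropy_sum, (y_true, y_pred), entropies)

    mean_entropy = np.full((n_classes, n_classes), np.nan)
    filled = counts > 0
    mean_entropy[filled] = entropy_sum[filled] / counts[filled]
    return counts, mean_entropy

# Hypothetical predictions: 0 = standing, 1 = sitting, 2 = lying down.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
entropies = np.array([0.1, 0.9, 0.2, 0.4, 0.1, 1.0])

counts, mean_entropy = uncertainty_aware_confusion_matrix(
    y_true, y_pred, entropies, n_classes=3
)
```

The `counts` array is the ordinary confusion matrix, while `mean_entropy` supplies the value that would drive the color intensity of each cell.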

2.5 Large language models and prompt engineering

Large language models (LLMs) are a type of artificial intelligence model developed to understand and generate human-like text. They are built using a machine learning architecture known as the Transformer [41], which enables the model to comprehend context across long pieces of text and generate coherent, contextual responses. In this article, we use LLMs to convey occupational health risks and recommendations based on posture predictions and their uncertainties. This entails transforming categorical and numerical information generated by ERG-AI into natural language that is easier for humans to comprehend. ERG-AI can invoke different LLMs via an API. We experiment with both a commercial LLM, GPT-4 [11], and a small and portable open-source LLM, LLAMA-7B [12].

GPT-4 [11], an iteration of the Generative Pretrained Transformer (GPT) series by OpenAI, represents a significant advancement in LLMs. It has been trained on a broad corpus of Internet text, allowing it to generate human-like responses in a wide range of languages and styles [42]. The official number of parameters in GPT-4 has not been disclosed, but rumours claim that it uses about 1.76 trillion parameters. We use a paid subscription to the gpt-4-32K model through OpenAI’s API to analyze and reply to input prompts. The number 32K refers to the size of the model’s context window in tokens, which bounds the prompt and the response combined.

Privacy preservation and edge deployment of LLMs can be pivotal in advancing occupational health ergonomics. By processing data on local devices or near the data source, edge deployment minimizes the latency usually associated with cloud-based solutions such as OpenAI’s GPT-4 API, enabling real-time analysis and feedback crucial for monitoring and improving ergonomic factors in a workplace. Furthermore, it significantly enhances privacy preservation as sensitive information regarding an employee’s health and behaviors is processed locally, reducing the risks associated with data transmission and storage on remote servers. Open-source models like LLaMA 2 offer a platform for developing and sharing models trained on publicly available datasets, fostering a collaborative environment for innovation while ensuring transparency and accessibility [12]. LLAMA-7B, being the smallest in the LLaMA model range with 7 billion parameters, is a particularly good candidate for edge deployment on personal mobile devices due to its balance between model size and performance. Its relatively smaller size could allow for efficient deployment on resource-constrained devices, like mobile phones, enabling real-time ergonomic analysis and feedback directly on a worker’s device, enhancing both occupational health ergonomics and privacy preservation.

Prompt engineering [43, 44] is a technique used to query LLMs. It involves carefully crafting the input prompts to elicit desired responses from the model. The aim is to guide the LLM’s response in a specific direction, enhance the output’s quality, or achieve a certain style or tone. For occupational health risk assessments, LLMs can analyze events indicating prolonged standing and suggest mitigations. For instance, they could recommend periodic rest, ergonomic footwear, or the use of sit-stand workstations based on a body of health and safety literature [45, 46]. However, given their limitations, LLMs should be used with expert oversight. LLMs can also be instrumental in summarizing the performance of machine learning models that predict posture using sensor data. They can digest complex data, performance metrics, and statistical information such as uncertainty estimates, and produce clear summaries that can be easily understood by non-experts [44]. This helps bridge the gap between the highly technical world of machine learning and practical, real-world applications, such as occupational health risk assessment.
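As an illustration of how posture predictions and uncertainty estimates might be turned into a prompt, consider the sketch below. It is hypothetical and is not the prompt used by ERG-AI: the function name `build_prompt`, the wording of the instructions and the `uncertain_above` entropy threshold are assumptions made for the example. The resulting string would then be sent to whichever LLM API is in use.

```python
def build_prompt(postures, entropies, uncertain_above=1.0):
    """Summarize a predicted posture sequence and its uncertainty as a prompt.

    postures:  one predicted posture label per time step
    entropies: the entropy of each prediction (same length as postures)
    """
    if len(postures) != len(entropies):
        raise ValueError("postures and entropies must have the same length")

    # Group the entropy values by predicted posture.
    by_posture = {}
    for name, h in zip(postures, entropies):
        by_posture.setdefault(name, []).append(h)

    # One summary line per posture, most frequent posture first.
    total = len(postures)
    lines = []
    for name, hs in sorted(by_posture.items(), key=lambda kv: -len(kv[1])):
        share = 100.0 * len(hs) / total
        mean_h = sum(hs) / len(hs)
        flag = " (low confidence)" if mean_h > uncertain_above else ""
        lines.append(
            f"- {name}: {share:.1f}% of observed time, "
            f"mean entropy {mean_h:.2f}{flag}"
        )

    return (
        "You are assisting an occupational ergonomics professional.\n"
        "Posture predictions from wearable accelerometers are summarized below.\n"
        "Higher entropy means the model was less certain.\n"
        + "\n".join(lines)
        + "\nDescribe potential occupational health risks and suggest "
        "mitigations, and state where low confidence limits the conclusions."
    )

# Made-up sequence of ten predictions with their entropies.
postures = ["sitting"] * 6 + ["standing"] * 3 + ["walking"]
entropies = [0.1] * 6 + [0.4] * 3 + [1.3]
prompt = build_prompt(postures, entropies)
print(prompt)
```

Keeping the numerical summary separate from the instructions makes it straightforward to reuse the same instructions with different LLMs and to audit exactly which statistics the model was given.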

2.6 Related work

Efforts to prevent MSDs include the development of workplace ergonomics assessment methods and strategies. These are based on the use of ergonomics rules to monitor the frequency and duration of physically demanding movements and repetitive awkward postures. Ergonomic rules for posture assessment, such as the Rapid Upper Limb Assessment (RULA) [47] and the Ovako Working Posture Analyzing System (OWAS) [48], are commonly implemented through self-reports and visual- and video-based observations, and are thus prone to inaccuracy, costly, and time-consuming [49]. Although essential, these assessments also suffer from subjectivity and sampling bias: observer presence may alter worker behavior, the methods lack real-time feedback, and they may inadequately assess complex postures.

Compared to RULA and OWAS, Machine Learning (ML) and Deep Neural Networks (DNNs) applied to data from wearable sensors such as accelerometers and Inertial Measurement Units (IMUs) can provide more objective, accurate, and real-time posture assessments, eliminating observer bias and reducing manual labor. However, they require large, high-quality datasets and careful implementation. Recent research has demonstrated the potential of ML models to identify workers’ postures and activities by analyzing motion data collected by wearable IMUs [50,51,52]. These studies primarily depended on traditional ML models that require manual, heuristic feature engineering. This can introduce engineering bias and potentially overlook valuable information present in sequential motion data. As a solution, several researchers have started to leverage DNNs for automated feature engineering to address worker posture recognition [20, 21, 53]. This approach has proven highly effective, yielding a recognition accuracy of 94% for construction workers. Building upon these advancements, the current study explores monitoring the risk of MSDs by detecting postures using wearable accelerometers and DNN methods.

Estimating uncertainty in Deep Neural Networks (DNNs) used for posture detection, based on wearable sensor data, is vital for understanding model reliability and identifying less trustworthy or ambiguous predictions. Multiple studies have employed Bayesian neural networks to quantify this predictive uncertainty, evidencing their effectiveness [54,55,56,57]. However, the computational complexity and extended training durations of Bayesian methods, especially with large-scale or high-dimensional data, can limit their real-time usage. Monte Carlo methods, on the other hand, offer an alternative perspective, utilizing the principle of randomness [35, 58]. Specifically, the Monte Carlo dropout method presents an efficient and direct approach for estimating model uncertainty in posture detection [59, 60]. This technique keeps dropout active at prediction time: a random portion of the neural network is “dropped out” in each of several stochastic forward passes, which effectively samples multiple variations of the model. The spread of the resulting predictions indicates how uncertain the model is, which is especially useful for posture detection where data can be highly variable. In our study, we apply the Monte Carlo dropout method to estimate model uncertainty. Recent work on uncertainty estimation for pose prediction includes that of Saadatnejad et al. [8]. They introduce models that incorporate inherent data noise and uncertainty priors, featuring novel epistemic uncertainty quantification through deep clustering, and achieving significant accuracy improvements in pose forecasting with the release of the UnPOSed library for standard evaluation. In contrast, we present a comprehensive machine learning pipeline for posture prediction, utilizing an uncertainty-aware confusion matrix and integrating LLMs for user-friendly health assessments, evaluated with real-world sensor data.
The work of Saadatnejad et al. addresses both aleatoric and epistemic uncertainties, is validated with standard datasets, and aims at general pose forecasting, providing an open-source library for research with significant prediction accuracy improvements. We, in turn, emphasize practical application: we use Monte Carlo dropout for uncertainty estimation, integrate a practical pipeline with LLMs for occupational ergonomics and health risk assessments, and evaluate energy consumption, carbon footprint, and LLM outputs, offering practical tools for occupational health ergonomics.

The application of LLMs such as GPT-4 within occupational health represents, to our knowledge, a novel research area. Recent work has explored the use of ChatGPT in occupational medicine. Padovan et al. [10] evaluate ChatGPT’s accuracy in answering occupational medical questions compared to human experts, highlighting its limitations and need for human supervision. Oviedo et al. [61] highlight the risks of using ChatGPT for safety advice, noting its tendency to provide oversimplified and sometimes erroneous advice, lack of transparency, keyword sensitivity, and emphasis on individual responsibility. Sridi et al. [62] present ChatGPT’s potential in enhancing data analysis, virtual assistance, task automation, education, and multilingual support in occupational medicine. However, they also point out challenges such as ethical concerns, confidentiality risks, inaccuracies, the need for expert validation, and the issue of AI hallucinations. In recent work, Farquhar et al. [63] present an approach to detect hallucinations in LLMs like ChatGPT and Gemini using semantic entropy, which measures uncertainty in meaning rather than specific word sequences. This method outperformed naive entropy estimation and other baselines across various datasets and LLMs, improving detection of incorrect answers. By clustering answers based on semantic equivalence and calculating entropy, the approach helps avoid unreliable outputs. This generalizable and unsupervised method enhances LLM reliability, especially in critical fields like law and medicine, by refusing to answer high-entropy questions, thus supporting safer and more trustworthy AI-generated content. Our work addresses the challenge of simplification and misinterpretation through the design of contextual prompts, and establishes expert-guided summarization protocols to ensure nuanced and context-aware recommendations.
LLMs have the potential to offer a rich, contextual comprehension for interpreting and articulating sensor data related to occupational health hazards. Specifically, “instruction prompting” employs definite commands or inquiries to steer an LLM’s responses. In the case of summarizing detected postures from sensor data, the model is instructed to transform the raw or processed sensor readings into a summary that is easy to understand. Singhal et al. [64] showcase cutting-edge uses of LLMs for medical knowledge, where LLMs, when tuned with instruction prompts, demonstrate reasonable performance but still lag behind actual clinicians. Our objective is to utilize instruction prompts with the statistics and sequences of detected postures to summarize occupational health risks and recommendations in language that is easy for humans to comprehend.

Fig. 1 ERG-AI Pipeline

Fig. 2 ERG-AI Training Sequence Diagram

Finally, it is also important to consider the impact of future disruptions, such as human enterprises incorporating AI-driven risks and recommendations for occupational health. The NIOSH study [65] explored how future disruptions could impact occupational safety and health (OSH). Using strategic foresight, researchers identified nine critical uncertainties and developed four scenarios: Trusted Partnerships, Multi-Polar World of OSH, Race to the Bottom, and Rugged Individualism and Trustworthy Government. Key challenges include data access, direct-to-worker communication, and misinformation management. Recommendations emphasize modernizing IT infrastructure, focusing on individualized OSH, and developing communication strategies. The study underscores the need for strategic planning, data management, and partnerships to enhance OSH readiness and resilience for future threats. Baldassarre et al. [66] discuss the impact of generative AI and LLMs on occupational health practices, emphasizing the need for tailored ethical considerations. They highlight AI applications in workplace safety, HR tools, and healthcare, and outline challenges such as data privacy, security, and misinformation risks. The European Parliament’s AI Act and WHO guidelines are referenced as regulatory frameworks. The authors advocate updating the ICOH Code of Ethics to incorporate transparency, human oversight, and data privacy. Their recommendations include developing specific AI guidelines for occupational medicine to enhance worker safety and well-being, ensuring responsible AI integration in healthcare.

3 Approach

We present ERG-AI, a machine learning pipeline (shown in Fig. 1) that generates occupational health assessments and recommendations from raw wearable sensor data. The main actors in ERG-AI are the ML Engineer, who configures the pipeline for either training or inference, and the Worker, who generates data using wearable sensors and receives an occupational health risk assessment with recommendations. We describe the training and inference lanes of the ERG-AI pipeline in the following subsections.

3.1 Training lane

The training lane as shown in Fig. 2 describes how the ERG-AI pipeline learns from labeled data and how the machine learning models are trained. We present the different stages of the training lane as follows:

Data Acquisition: The process starts with a Worker providing high-frequency multivariate sensor data. For instance, in our experiments we obtain data from five tri-axial accelerometers. A domain expert labels this data using vector mathematics between joints to obtain postures such as standing, walking, running, sitting and so on. This data is stored in the labeled data database (LabelledDataDB).

Data Pre-processing: The machine learning engineer (ML Engineer) then specifies the configuration for the ERG-AI pipeline. The pipeline extracts data from the LabelledDataDB as a CSV file and performs various preprocessing steps, including data profiling for quality and data cleaning. The results of these steps are stored in the file system used by ERG-AI. Feature engineering is then performed on the cleaned data to extract statistical properties, called features, that exhibit invariance to noise and can better inform the model about the underlying patterns in the data. Furthermore, feature-based representations of time-series data [67] perform well in classification tasks at a fraction of the computational cost of processing raw time-series data. The results of feature engineering are also stored in the file system. The ERG-AI pipeline splits the data into training, validation, and test sets and scales/normalizes the data to ensure all variables are comparable. The data is then sequenced, meaning the sensor variables are divided into sequences of a fixed length called the window size. The last stage of preprocessing combines these sequences into a training dataset and a test dataset, which are then passed to the training and evaluation stages, respectively.
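The sequencing step can be illustrated with a minimal sketch. The function name `make_windows`, the `step` parameter, and the majority-vote rule for labelling a window are assumptions made for illustration; the text only states that the data is divided into sequences of length equal to the window size.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Sequence[float]  # one time step: one value per sensor channel


def make_windows(
    samples: Sequence[Sample],
    labels: Sequence[str],
    window_size: int,
    step: int,
) -> Tuple[List[List[List[float]]], List[str]]:
    """Cut a multivariate time series into fixed-length windows.

    Assumption: each window is labelled with the most frequent posture
    it contains. The pipeline may use a different labelling rule.
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")

    windows: List[List[List[float]]] = []
    window_labels: List[str] = []
    for start in range(0, len(samples) - window_size + 1, step):
        end = start + window_size
        windows.append([list(s) for s in samples[start:end]])
        window_labels.append(Counter(labels[start:end]).most_common(1)[0][0])
    return windows, window_labels


# Toy data: 10 time steps, 2 channels, posture changes halfway through.
samples = [[float(t), float(t) * 0.1] for t in range(10)]
labels = ["sit"] * 5 + ["stand"] * 5
windows, window_labels = make_windows(samples, labels, window_size=4, step=2)
```

With `step` equal to `window_size` the windows do not overlap; a smaller step yields overlapping windows and more training sequences from the same recording.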

Training: The ERG-AI pipeline then trains and validates a machine learning model using the prepared training and validation datasets, respectively. It uses the input time-series data (of the chosen input window size) and the output posture from the training set. The ML Engineer configures the window size, learning parameters, and ML model types (architectures). The ML Engineer may choose between various types of neural networks to train models: Fully-Connected Neural Networks (FCNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). These three network types can all work on multivariate time-series data. In addition, the ML Engineer may define the specific architecture of the neural network by specifying the exact number of layers and the number of neurons in each layer. For CNNs, one may also specify whether or not to use max-pooling, and the max-pooling size. For RNNs, one can configure the unit type (Long Short-Term Memory, Gated Recurrent Unit, etc.). Before training, the pipeline sets apart a small portion of the training data (e.g., 20%) to use as a validation dataset. It automatically stops training the ML model if the prediction error on the validation dataset stops improving, preventing over-fitting of the model to the training data. It then saves the ML model in the Model Database (ModelDB) for evaluation.
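The validation hold-out and the stopping rule can be sketched in a few lines of plain Python. The names `hold_out` and `early_stopping_epoch`, the `patience` parameter, and the choice to hold out the last portion of the data are illustrative assumptions rather than the pipeline's actual settings; deep learning frameworks provide equivalent built-in mechanisms.

```python
from typing import List, Optional, Sequence, Tuple


def hold_out(items: Sequence, fraction: float = 0.2) -> Tuple[Sequence, Sequence]:
    """Reserve the last `fraction` of the training data for validation."""
    n_val = int(len(items) * fraction)
    cut = len(items) - n_val
    return items[:cut], items[cut:]


def early_stopping_epoch(
    val_losses: List[float], patience: int, min_delta: float = 0.0
) -> Optional[int]:
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


train_part, val_part = hold_out(list(range(10)), fraction=0.2)

# Hypothetical validation losses: improvement stalls after epoch 2.
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
```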

Evaluation: The model is evaluated using the unseen test dataset, and the performance metrics are stored in the ModelDB. By using an unseen test set, we treat the model as a black-box system, focusing solely on the inputs and outputs to assess its behavior without considering its internal mechanisms. This approach helps in evaluating how well the model generalizes to new data and minimizes bias due to hyper-parameter tuning during training. We compare the model output and the ground truth to determine the accuracy of the model’s predictions. To provide a visual representation of the model’s task performance, the pipeline generates plots of predictions on test data.

We use accuracy, F\(_1\)-score, recall, and precision to evaluate the model performance. These metrics provide different aspects of the model’s effectiveness in classifying the data correctly. Accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined. It is defined as:

$$ \text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. Precision, also known as positive predictive value, is the ratio of correctly predicted positive observations to the total predicted positives. It is defined as:

$$ \text {Precision} = \frac{TP}{TP + FP} $$

This metric is important when the cost of false positives is high. Recall, also known as sensitivity or true positive rate, is the ratio of correctly predicted positive observations to all observations in the actual class. It is defined as:

$$ \text {Recall} = \frac{TP}{TP + FN} $$

Recall is important when the cost of false negatives is high. The F\(_1\)-score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. It is particularly useful when the class distribution is imbalanced. The F\(_1\)-score is defined as:

$$ \text {F}_1\text {-score} = 2 \cdot \frac{\text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}} $$

By using these metrics, we ensure a balanced evaluation of the model’s performance, taking into account both the accuracy of the predictions and the balance between precision and recall.
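For a multi-class posture problem, the four metrics above can be computed per class by treating each class in turn as "positive" and all others as "negative". The sketch below assumes a confusion matrix whose rows are true classes and whose columns are predicted classes; the function name and the toy counts are illustrative only.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]], labels: List[str]) -> Dict[str, Dict[str, float]]:
    """One-vs-rest accuracy, precision, recall and F1 from a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    metrics: Dict[str, Dict[str, float]] = {}
    for i, name in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                  # true class i, predicted otherwise
        fp = sum(row[i] for row in cm) - tp   # predicted i, true class otherwise
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        metrics[name] = {
            "accuracy": (tp + tn) / total,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return metrics


# Toy confusion matrix (not taken from the paper's results).
labels = ["sit", "stand", "walk"]
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
metrics = per_class_metrics(cm, labels)
```

In this toy example the "sit" class has TP = 8, FP = 2, and FN = 2, so its precision, recall, and F\(_1\)-score all equal 0.8.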

Fig. 3 ERG-AI Inference Sequence Diagram

We also estimate model uncertainty in the predictions using Monte Carlo dropout (see Section 2.3) and obtain the uncertainty-aware confusion matrix (see Section 2.4) for all postures.
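The idea behind Monte Carlo dropout can be illustrated with a toy example. The sketch below is not the pipeline's model: a tiny linear classifier with hand-picked weights stands in for the trained network, and predictive entropy is used as the uncertainty score purely as an assumption (Section 2.3 defines the measure actually used). What it does show is the core mechanism: dropout stays active at prediction time, and several stochastic forward passes are averaged.

```python
import math
import random
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(x, weights, drop_rate, rng) -> List[float]:
    """One forward pass of a toy linear classifier with dropout kept active."""
    keep = 1.0 - drop_rate
    # Inverted dropout: surviving inputs are rescaled by 1 / keep.
    dropped = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(
    x: List[float],
    weights: List[List[float]],
    drop_rate: float = 0.5,
    n_passes: int = 100,
    seed: int = 0,
) -> Tuple[int, List[float], float]:
    """Average several stochastic passes; return class, mean probs, entropy."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    passes = [stochastic_forward(x, weights, drop_rate, rng) for _ in range(n_passes)]
    n_classes = len(weights)
    mean_probs = [sum(p[c] for p in passes) / n_passes for c in range(n_classes)]
    # Assumption: predictive entropy of the averaged distribution as uncertainty.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    return predicted, mean_probs, entropy


# Hypothetical weights for three classes and three input features.
weights = [[2.0, 2.0, 2.0], [0.0, 0.0, 0.0], [-1.0, -1.0, -1.0]]
predicted, mean_probs, entropy = mc_dropout_predict([1.0, 1.0, 1.0], weights)
```

An entropy near zero means the stochastic passes agree on one class, while an entropy near \(\log K\) for \(K\) classes means the passes spread their probability mass almost uniformly.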

Deployment: The ERG-AI pipeline is then deployed as a model-as-a-service Docker container with an inference API. The API exposes the trained, validated, and evaluated ML model as a service (e.g., a Flask web service). The API is invoked using sub-sequences of data from input sensors and returns postures. Since the pipeline trains the ML model on input data features extracted from raw data and bounded by a scaling operation (e.g., values between 0 and 1), the model cannot always use the raw input sequences as they are. The feature engineering and inference operations require ML libraries such as scikit-learn [30] and TensorFlow [68]. Therefore, the parts of the pipeline used in inference, such as the code to compute engineered features, the scaler, and the ModelDB with all its dependencies (e.g., ML libraries), are packaged as a standalone container (e.g., Docker). We can deploy the container on an edge device or in the cloud.

3.2 Inference lane

The inference lane describes how the ERG-AI pipeline uses the trained models to make predictions and how these predictions are used to generate occupational health risk assessments and recommendations. The sequence diagram of interactions for inference is shown in Fig. 3.

Data Acquisition: The process starts with a Worker providing sensor data acquired over several days to the ERG-AI Inference API. The API stores this new sensor data S in the sensor data database (SensorDataDb).

Configuration: The machine learning engineer (ML Engineer) then specifies the default ModelDB to be used via the API. The ML Engineer has elevated permission to call the inference API and invokes it with a token or credentials that grant the right to specify the model M that is retrieved by the ERG-AI Inference API. This invocation is typically done right after a new version of the API is released or after the models in ModelDB have been updated in the training lane.

Inference: The ERG-AI Inference API uses model M to predict postures and uncertainties from the new sensor data S. The set of predictions P and their uncertainties U are stored in the file system. The API generates a simple statistical summary of postures and uncertainties in text T using (P,U). For instance, the summary is the amount of time a worker is standing, sitting, walking, and so on over the period of time in S. One may also store the sequence of postures in the summary hoping that the change from one posture to another can reveal a potential occupational health risk that may not occur only due to prolonged sitting or standing.
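As an illustration of how the text summary T might be derived from (P, U), the sketch below aggregates time per posture and the mean uncertainty per posture. The function name, the output wording, and the 15-second duration per prediction in the example are assumptions for illustration; the actual summary format produced by the API may differ.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def summarize_postures(
    postures: Sequence[str],
    uncertainties: Sequence[float],
    seconds_per_prediction: float,
) -> str:
    """Build a plain-text summary of time spent and mean uncertainty per posture."""
    if len(postures) != len(uncertainties):
        raise ValueError("postures and uncertainties must have the same length")

    seconds: Dict[str, float] = defaultdict(float)
    scores: Dict[str, List[float]] = defaultdict(list)
    for posture, uncertainty in zip(postures, uncertainties):
        seconds[posture] += seconds_per_prediction
        scores[posture].append(uncertainty)

    total = sum(seconds.values())
    lines = []
    # List postures from most to least time spent.
    for posture in sorted(seconds, key=seconds.get, reverse=True):
        minutes = seconds[posture] / 60.0
        share = seconds[posture] / total
        mean_u = sum(scores[posture]) / len(scores[posture])
        lines.append(
            f"{posture}: {minutes:.1f} min ({share:.0%} of recording), "
            f"mean uncertainty {mean_u:.2f}"
        )
    return "\n".join(lines)


# Hypothetical predictions P and uncertainties U, one per 15-second window.
P = ["sit"] * 6 + ["stand"] * 2
U = [0.1] * 6 + [0.5] * 2
T = summarize_postures(P, U, seconds_per_prediction=15)
```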

Generating risks and recommendations: The Worker who seeks a summary of their occupational health risks and wants recommendations for improvement invokes the ERG-AI Inference API with a short work description W that provides context. For instance, this can be a description of what a care worker does on a daily basis. The API combines W with the text T summarizing model predictions and uncertainties to generate a prompt Pr for a large language model API such as GPT-4 or LLAMA-2. The prompt contains the work description, a description of the model, the model’s uncertainties, a summary of its predicted postures, and an instruction to generate occupational health risks and recommendations. When the LLM API (e.g., the GPT-4 API; see footnote 1) is invoked with Pr, it returns occupational health risks and recommendations based on the prompt, and these are then provided back to the Worker.
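A minimal sketch of assembling the prompt Pr from W and T is given below. The section headings, the instruction wording, and the example inputs are hypothetical and do not reproduce the prompt used in ERG-AI; the call to the LLM API is deliberately omitted.

```python
# Hypothetical instruction text; the actual wording used by ERG-AI may differ.
INSTRUCTION = (
    "Based on the information above, list the main occupational health risks "
    "for this worker and give practical recommendations. Take the reported "
    "uncertainties into account and say so when a conclusion is uncertain."
)


def build_prompt(work_description: str, posture_summary: str, model_description: str) -> str:
    """Combine work context W, model description and summary T into a prompt Pr."""
    sections = [
        "Work description:\n" + work_description.strip(),
        "Model description:\n" + model_description.strip(),
        "Predicted postures and uncertainties:\n" + posture_summary.strip(),
        "Instruction:\n" + INSTRUCTION,
    ]
    return "\n\n".join(sections)


# Example inputs (illustrative values only).
W = "Home care worker who visits patients and assists with daily activities."
T = "sit: 1.5 min (75% of recording), mean uncertainty 0.10"
model_description = (
    "Neural network that predicts posture from wearable accelerometer data; "
    "uncertainty is estimated with Monte Carlo dropout."
)
Pr = build_prompt(W, T, model_description)
```

Keeping the prompt assembly in a separate function makes it possible to inspect or log Pr before it is sent to whichever LLM API is configured.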

3.3 Implementation of the pipeline

We have implemented the ERG-AI pipeline to generate occupational health risks and recommendations using Python and Data Version Control (DVC; see footnote 2), a tool for structuring ML experiments and data versioning. Each pipeline stage is a Python program that takes input data and produces an output based on control parameters. DVC tracks the dependencies between input, output, and control parameters for each stage and stores the input and output in the cache for each pipeline execution. Therefore, DVC can automatically check whether any pipeline stage has already run with the given input and control parameters. It can then fetch the output from the cache instead of executing the stage again, reducing the computational resources needed for creating virtual sensors. The control parameters are kept in a configuration file separate from the source code. Thus, the user can explore various pipeline configurations (e.g., the type of machine learning model, window sizes of input and target sensors, splitting of data, how to train ML models, and the selection of an LLM such as GPT-4 or LLAMA-2 to generate risks and recommendations) without knowing the implementation details of the pipeline.

We have integrated CodeCarbon [69, 70], a framework for measuring the energy consumption and carbon footprint, into the ERG-AI tool. CodeCarbon offers the capability to monitor the energy usage associated with each stage of the pipeline. This not only provides valuable insights into the environmental impact of our machine learning system but also enables us to identify potential areas for energy optimization and sustainability improvements. The combination of CodeCarbon with Data Version Control (DVC) allows us to link the energy consumption data to specific pipeline executions, helping us understand the resource demands associated with different configurations and input parameters. This integration with CodeCarbon ensures that our research accounts for the ecological footprint of our AI-enhanced ergonomics system, aligning with our goal of creating a sustainable and efficient solution for the workforce.

The implementation of ERG-AI is open source and available on GitHub: https://github.com/SINTEF-9012/erg-ai.

4 Evaluation

In this section, we address the following Research Questions (RQs) based on the Digital Worker Goldicare dataset:

  • RQ1. What is the performance in predicting human ergonomic posture from accelerometer data?

  • RQ2. Can uncertainty estimation gauge our trust in predictions of rare human ergonomic posture?

  • RQ3. Can reducing the number of sensors continue to give accurate predictions of human ergonomic postures?

  • RQ4. What is the energy usage and estimated carbon footprint of the various stages of the ERG-AI pipeline?

  • RQ5. Can LLM feedback on occupational ergonomics driven by uncertainty-aware ML output be actionable?

4.1 Subject of the evaluation

The Digital Worker Goldicare dataset [9] consists of data from home care workers with \(\ge \)50% employment (minimum 18.8 working hours a week), recruited from six of a total of 13 home care service units in Trondheim, the third largest city in Norway. Only workers who had direct contact with patients were included. All workers in these home care units were given written and oral information about the research project and gave written consent before the study. Exclusion criteria were: (1) physical disability not allowing normal behavior, (2) office work, (3) allergy to bandages, band-aids, or adhesives, and (4) pregnancy. The study was conducted according to the Declaration of Helsinki and approved by the Regional Committees for Medical Research Ethics-Central Norway (No.: 64541). Five triaxial AX3 accelerometers (Axivity Ltd, Newcastle upon Tyne, UK) were mounted on the skin of the home care workers using adhesive double-sided tape (3M; Witre, Halden, Norway) and secured with waterproof medical tape (Opsite Flexifix; Mediq, Oslo, Norway). They were worn 24 h per day for up to six consecutive workdays at a sampling frequency of 25 Hz and a measurement range of ±8 g. We down-sampled the data to 1 Hz to increase the practical usefulness of the system on low-power devices. The accelerometers were attached to the following anatomical locations: (1) below the head of the fibula, on the proximal and lateral aspect of the calf, (2) on the distal, anterior and medial aspect of the femur (approximately 10 cm above the crest of the patella), (3) below the iliac crest of the hip, (4) the upper back approximately 5 cm to the side of the processus spinosus at the level of the Th1-Th2 vertebrae, and (5) on the upper arm, approximately at the insertion of the deltoid muscle. The home care workers consisted of nurses, nursing assistants, learning disability nurses, and occupational therapists, having home care as their main employer, and worked an average of 38.5 h a week. For an average of 3.8 workdays, 2913 h of accelerometer data were recorded from 114 home care workers [9].
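Down-sampling from 25 Hz to 1 Hz can be done in several ways. The sketch below averages each block of 25 consecutive samples of one channel, which is an assumption made for illustration: the text does not state whether averaging, decimation, or another method was used.

```python
from typing import List, Sequence


def downsample_mean(signal: Sequence[float], factor: int) -> List[float]:
    """Reduce the sampling rate by averaging consecutive blocks of `factor` samples.

    Trailing samples that do not fill a complete block are discarded.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_blocks = len(signal) // factor
    return [
        sum(signal[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


# Two seconds of one accelerometer axis at 25 Hz (toy values).
signal_25hz = [1.0] * 25 + [3.0] * 25
signal_1hz = downsample_mean(signal_25hz, factor=25)
```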

4.2 Results

RQ1: What is the performance in predicting human ergonomic posture from accelerometer data? To identify the most appropriate machine learning algorithm and architecture for our posture classification task, we combined manual experimentation with empirical validation, evaluating Fully-Connected Neural Networks (FCNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). We tested various configurations for all three network types by altering the number of hidden layers and nodes per layer. We used the ReLU activation function during all trials. For the CNNs, which are well-suited for capturing the temporal and spatial dependencies in accelerometer data, we experimented with different numbers of convolutional layers, kernel sizes, filters per layer, and the use of max-pooling layers. Recurrent Neural Networks (RNN), including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), were also explored to capture temporal dependencies in the data. We tested various configurations by adjusting the number of recurrent layers and units per layer. Although LSTM and GRU units were effective in learning long-term dependencies, CNNs ultimately outperformed RNNs in terms of classification accuracy and computational efficiency. Throughout the experimentation phase, hyperparameter tuning was conducted manually, refining the learning rate, batch size, and the number of epochs, with a window size of 15 seconds used for each prediction to balance temporal context and computational efficiency. Our final CNN architecture, which yielded the best results, included four convolutional layers with 64 kernels each, followed by two fully-connected layers with 32 nodes each, using ReLU activation functions. The full list of configuration parameters is shown in Table 1.

Table 1 Configuration parameters and hyperparameters used in experiments
Table 2 Accuracy for all classes on different sets of features

The overall performance scores of the models are summarized in Table 2. Our findings demonstrate that the proposed ERG-AI framework exhibits high performance when predicting human ergonomic postures, even when operating with down-sampled data. Particularly noteworthy is the performance achieved when utilizing data from the complete set of sensors, including the arm, trunk, thigh, calf, and hip positions, where the overall accuracy reached 0.92 (see Table 3). Table 2 provides a breakdown of accuracy per posture class, showing high classification accuracy for postures such as lying, sitting and standing, while other classes exhibit lower accuracy rates across the various feature sets.

Table 3 Accuracy and uncertainty for different feature sets
Fig. 4 Accuracy for all classes on different sets of features

In Fig. 4 we present a visual overview of the accuracy of different posture classes across various sensor configurations. This plot offers a clearer and more immediate understanding of how the number and type of sensors influence the model’s accuracy for each posture class. It is evident that using the complete set of sensors (arm, trunk, thigh, calf, and hip) consistently results in the highest accuracy across almost all posture classes. This configuration is particularly effective for static postures such as lying, sitting, and standing, where the accuracy remains above 0.85. However, as the number of sensors is reduced, a noticeable decline in accuracy is observed, especially for dynamic activities like walking and running. For instance, when only the arm sensor is used, the accuracy for walking and running drops to 0.15 and 0.50, respectively, highlighting the importance of multiple sensor inputs for these activities. The “Stairs” class continues to exhibit the lowest accuracy across all sensor configurations, indicating the inherent complexity in predicting this particular activity. This is likely due to the diverse nature of stair-climbing movements, which may require a more sophisticated approach or additional sensors to capture accurately.

Overall, the results underscore the trade-offs between sensor configurations and classification accuracy. While the inclusion of more sensors generally enhances performance, it also increases the complexity and computational demands of the model. Therefore, a balanced approach is necessary to optimize both accuracy and efficiency for real-world applications.

The confusion matrix shown in Fig. 5 provides a visual representation of the model’s performance, highlighting the distribution of true positives, false positives, and misclassifications for each posture class. The results in Fig. 5 are from the model trained on feature set 1, corresponding to the first row of Table 2. When examining the confusion matrix, we can observe that misclassifications of the “Walk” class predominantly shift into the “Other” class, whereas misclassifications of “Run” mostly transition to the “Walk” class. Stair-climbing is notably challenging to predict, with confusion occurring between “Walk,” “Other,” and the “Stairs” class. Notably, active postures, including “Walk,” “Run,” and “Stairs,” exhibit more confusion compared to static postures, indicating the increased difficulty in accurately predicting dynamic activities. The complexity of stair-climbing, which includes both ascending and descending stairs, adds an additional layer of intricacy to this class.

The “Other” class encompasses a broad range of activities that do not fall into the explicitly defined categories such as lying, sitting, standing, kneeling, or stair-climbing. This class is inherently challenging to classify accurately due to its diversity, potentially including activities that involve dynamic and unpredictable movements. The low accuracy score of 0.51 suggests that ERG-AI may have difficulty distinguishing between these diverse activities.

Our results underscore the robustness of the ERG-AI system in classifying postures, especially those related to static positions. However, it is important to acknowledge the potential impact of higher sampling rates on the system’s performance. While our results with down-sampled data are promising, future investigations may explore whether higher sampling rates can further enhance the accuracy of posture classification. This exploration could provide valuable insights into the trade-offs between data granularity and computational efficiency.


RQ2: Can uncertainty estimation gauge our trust in predictions of rare human ergonomic posture? Uncertainty estimation involves generating a secondary output for each prediction. In our context, it entails quantifying the level of uncertainty or potential error in posture prediction. This quantified value serves as an estimate of uncertainty, essentially reflecting how confident the model is in its prediction. By providing this supplementary information, decision-makers can make more informed choices and gain greater confidence in their decisions. Additionally, uncertainty estimation enables the detection of potential anomalies or alterations in the surrounding context of the prediction of a posture.

We present the uncertainty-aware confusion matrix in Fig. 5, where the intensity of the red color indicates the level of uncertainty. This figure represents the results from the model trained on all available sensors (feature set 1). The darker the red, the higher the uncertainty in the prediction, and vice versa. The model trained by ERG-AI in our experiment demonstrates low uncertainty when predicting Sitting, Running and Lying. The uncertainty is moderate for Kneeling, Standing, and Walking, while there is high uncertainty in predicting Other and Stairs. It is also interesting to note that there is low uncertainty and low confusion between predictions of Sitting or Stairs and Lying, and high uncertainty and moderate confusion when predicting Stairs and Other. Predictions of some postures have both high uncertainty and high confusion, such as Kneeling and Other as well as Stairs and Kneeling. Knowledge of the uncertainty helps in targeting data acquisition to reduce confusion and increase certainty in our prediction of postures. For instance, Stairs is a relatively rare event compared to other movements and hence exhibits high uncertainty in its prediction. Balancing the dataset by acquiring more data involving climbing stairs can help reduce the uncertainty in predicting Stairs.
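One way to compute the quantities behind such a figure is to keep, for every (true, predicted) cell, both the number of predictions and their average uncertainty. The sketch below follows that reading; the function name and the aggregation by mean are assumptions for illustration, and Section 2.4 gives the definition used in the paper.

```python
from typing import List, Optional, Sequence, Tuple


def uncertainty_aware_confusion(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    uncertainties: Sequence[float],
    classes: Sequence[str],
) -> Tuple[List[List[int]], List[List[Optional[float]]]]:
    """Return per-cell counts and per-cell mean uncertainty (None if empty).

    Rows correspond to true classes and columns to predicted classes.
    """
    index = {name: i for i, name in enumerate(classes)}
    n = len(classes)
    counts = [[0] * n for _ in range(n)]
    sums = [[0.0] * n for _ in range(n)]
    for t, p, u in zip(true_labels, predicted_labels, uncertainties):
        i, j = index[t], index[p]
        counts[i][j] += 1
        sums[i][j] += u
    mean_uncertainty: List[List[Optional[float]]] = [
        [sums[i][j] / counts[i][j] if counts[i][j] else None for j in range(n)]
        for i in range(n)
    ]
    return counts, mean_uncertainty


# Toy predictions (not taken from the paper's results).
classes = ["sit", "stairs", "other"]
y_true = ["sit", "sit", "stairs", "stairs"]
y_pred = ["sit", "sit", "stairs", "other"]
unc = [0.1, 0.3, 0.6, 0.9]
counts, mean_uncertainty = uncertainty_aware_confusion(y_true, y_pred, unc, classes)
```

Plotting `counts` as the cell values and `mean_uncertainty` as the color intensity would give a matrix in the spirit of Fig. 5.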

Fig. 5 Uncertainty-aware confusion matrix for predictions on the test set using all sensors as input features

Fig. 6 Accuracy and uncertainty for different feature sets. The bar plot shows the accuracy, while the lines represent the uncertainty levels for correct classifications and misclassifications

Additionally, Fig. 6 visualizes the relationship between prediction accuracy and average uncertainty across different feature sets. The bar plot illustrates the accuracy of posture prediction for each feature set, while the line plots indicate the uncertainty levels for correct classifications and misclassifications. As expected, models utilizing a more comprehensive set of sensors (e.g., Arm, trunk, thigh, calf, hip) exhibit higher accuracy and lower uncertainty. Conversely, models relying on fewer sensors show increased uncertainty and reduced accuracy. Understanding the interplay between accuracy and uncertainty enables more informed decisions regarding sensor deployment and data acquisition strategies.

A more detailed breakdown of the uncertainty is presented in Fig. 7, showing the uncertainty estimation across all posture classes for every feature set. These results reinforce the conclusion that incorporating more sensors generally reduces uncertainty. For instance, the average uncertainty for correctly predicted “Sit” postures decreases from 8 with only the arm sensor to 0.26 when using the full set of sensors (arm, trunk, thigh, calf, and hip). This trend is consistent across most postures, highlighting the value of comprehensive sensor data in enhancing prediction confidence. However, certain postures such as “Stairs” and “Kneel” still exhibit high uncertainty even with the full sensor set, suggesting these activities are inherently more challenging to predict accurately. In contrast, the uncertainty for misclassifications remains significantly higher, indicating that the uncertainty estimation can be useful for identifying unreliable predictions and guiding further model improvements. This suggests that high uncertainty values can serve as a red flag for misclassifications, allowing for targeted interventions such as additional data collection or model refinement for the most problematic postures.
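
Using uncertainty as a red flag can be sketched as a simple threshold rule. The threshold value and the function name below are hypothetical; a suitable threshold would have to be chosen from validation data.

```python
def flag_unreliable(predictions, threshold):
    """Split (label, uncertainty) pairs into accepted and flagged lists."""
    accepted, flagged = [], []
    for label, u in predictions:
        if u > threshold:
            flagged.append((label, u))   # candidate for review or more data
        else:
            accepted.append((label, u))
    return accepted, flagged


# Toy example: the threshold of 5.0 is an arbitrary illustrative choice.
accepted, flagged = flag_unreliable(
    [("Sit", 0.26), ("Stairs", 11.84), ("Kneel", 8.88)], threshold=5.0
)
```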

Fig. 7 Uncertainty heatmap for all classes across the different feature sets

Uncertainty estimation provides a good indication of both good and poor performance on unforeseen data. As shown in Fig. 5, good performance in correctly predicting a posture is correlated with low uncertainty and vice versa. For instance, all values with high accuracy on the diagonal have light red colors indicating low uncertainty, while predictions of Stairs and Other have both low accuracy and high uncertainty. ERG-AI predicts posture on unforeseen wearable sensor data along with its uncertainty computed using Monte Carlo dropout. This uncertainty helps us know whether the prediction can be trusted and whether we need to improve the underlying model by acquiring more data for certain postures.
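
The idea behind Monte Carlo dropout can be sketched in pure Python as follows: dropout stays active at prediction time, the same input is passed through the network several times, and the spread of the outputs is read as uncertainty. The tiny two-layer network, its weights, and the choice of the standard deviation of the predicted class probability as the uncertainty measure are all assumptions made for illustration; the text does not specify the exact measure or architecture used in ERG-AI.

```python
import math
import random
import statistics


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(x, w_hidden, w_out, drop_rate, rng):
    """One forward pass with dropout kept active on the hidden layer."""
    keep = 1.0 - drop_rate
    hidden = []
    for row in w_hidden:
        act = max(0.0, sum(w * xi for w, xi in zip(row, x)))  # ReLU
        # Inverted dropout: keep the unit with probability `keep` and rescale.
        hidden.append(act / keep if rng.random() < keep else 0.0)
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(logits)


def mc_dropout_predict(x, w_hidden, w_out, n_passes=100, drop_rate=0.2, seed=0):
    """Return (predicted class index, mean probabilities, uncertainty)."""
    rng = random.Random(seed)
    samples = [
        stochastic_forward(x, w_hidden, w_out, drop_rate, rng)
        for _ in range(n_passes)
    ]
    n_classes = len(samples[0])
    mean_probs = [sum(s[c] for s in samples) / n_passes for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    # Assumed measure: spread of the predicted class probability across passes.
    uncertainty = statistics.pstdev([s[predicted] for s in samples])
    return predicted, mean_probs, uncertainty


# Hypothetical weights for a network with 2 inputs, 3 hidden units, 2 classes.
W_HIDDEN = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
W_OUT = [[1.0, -1.0, 0.5], [-0.5, 1.2, 0.3]]
pred, probs, unc = mc_dropout_predict([1.0, 2.0], W_HIDDEN, W_OUT, n_passes=50, drop_rate=0.3, seed=1)
```

With `drop_rate=0.0` every pass is identical and the uncertainty collapses to zero, which is a quick sanity check for a sketch like this.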

RQ3: Can reducing the number of sensors continue to give accurate predictions of human ergonomic postures? To investigate the impact of reducing the number of sensors on the accuracy and uncertainty in predicting ergonomic postures, we created four additional sets of features, each with progressively fewer sensors on different parts of the body. We trained five unique models for each set. Our goal was to understand how sensor quantity influences prediction accuracy, which is vital for system affordability, user comfort, and practical deployment. Table 3 presents data that illustrates the impact of reducing the number of sensors on the accuracy and uncertainty of predicting ergonomic postures. It compares five different sensor configurations, each with a varying number of sensors placed on different parts of the body: Arm, Trunk, Thigh, Calf, and Hip. Set 1 features a full array of sensors (Arm, Trunk, Thigh, Calf, Hip) for extensive coverage. Set 2 omits the hip sensor, while Set 3 additionally removes the calf. Set 4 limits to arm and trunk sensors, excluding lower body parts. Set 5 simplifies to an arm-only sensor setup.
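
The five sensor configurations can be written down as a small lookup table, as in the sketch below. The set contents follow the description above; the column naming scheme (`<sensor>_<axis>`) and the helper function are hypothetical and only serve to show how input features could be filtered per set.

```python
# Sensor locations per feature set, as described in the text.
FEATURE_SETS = {
    1: ("arm", "trunk", "thigh", "calf", "hip"),
    2: ("arm", "trunk", "thigh", "calf"),
    3: ("arm", "trunk", "thigh"),
    4: ("arm", "trunk"),
    5: ("arm",),
}


def select_columns(columns, feature_set):
    """Keep the columns whose sensor prefix belongs to the chosen feature set."""
    sensors = FEATURE_SETS[feature_set]
    return [c for c in columns if c.split("_", 1)[0] in sensors]


# Hypothetical column names; the real dataset may use a different scheme.
columns = ["arm_x", "hip_x", "calf_y", "trunk_z", "thigh_x"]
set4_columns = select_columns(columns, 4)
```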

The accuracy metric evaluates the effectiveness of each sensor configuration in predicting ergonomic postures, with higher values signifying better performance. For Sets 1 and 2, accuracy is comparably high at 0.92, suggesting minimal impact from the removal of the hip sensor. However, as sensors are further reduced in Sets 3 to 5, there’s a marked decline in accuracy, dropping to 0.91, 0.79, and 0.73 respectively. This trend highlights a direct correlation between the number of sensors and the ability to accurately predict postures, with fewer sensors leading to less effective posture prediction.

Uncertainty quantifies confidence in posture prediction, with lower values being more desirable. It’s divided into “Uncertainty (correct)” for accurately predicted cases, and “Uncertainty (misclassifications)” for errors. As sensor count decreases, uncertainty rises for both correct and incorrect predictions. This trend implies that with fewer sensors, the system’s confidence in its predictions diminishes, leading to more ambiguous outcomes, especially in misclassified instances. This increase in uncertainty with sensor reduction highlights the challenge of maintaining prediction reliability and clarity with a limited sensor setup.

The sensor placement’s relevance is evident in posture prediction accuracy. A comprehensive array (Set 1), with sensors on multiple body parts, achieves high accuracy and low uncertainty, emphasizing the need for diverse data points. Set 2’s exclusion of the hip sensor slightly increases uncertainty, indicating its role in prediction confidence, yet doesn’t heavily impact accuracy. However, further reductions in Sets 3 to 5, notably removing thigh and calf sensors, lead to significant drops in both accuracy and certainty. This underscores the critical role of lower body sensors, particularly on the thigh and calf, for precise and reliable ergonomic posture predictions.

The analysis of class-specific performance for ergonomic postures, based on data from Tables 2, 4, 5, 6, and 7, reveals significant variation in accuracy, precision, recall, F1-score, and uncertainty across different sensor configurations. Notably, the “Lie” posture maintains a high accuracy (0.99) and precision (0.99) across most sensor sets, with the lowest uncertainty (0.21) observed in the most comprehensive set (Set 1). In contrast, the “Kneel” posture experiences a dramatic accuracy drop to 0.00 and high uncertainty (8.88) as sensors are reduced. The “Sit” posture shows remarkable stability in accuracy (0.99) and consistent recall (around 0.98-0.99), indicating reliable detection. However, the “Stand” posture’s accuracy decreases significantly from 0.89 to 0.40, with a corresponding increase in uncertainty (up to 17.97) as the number of sensors is reduced. The “Stairs” posture presents the lowest accuracy (0.23) and high uncertainty (11.84), underscoring the challenges in its prediction. These findings highlight the high performance in detecting stable postures like “Lie” and “Sit” across sensor configurations, whereas complex movements like “Kneel” and “Stairs” pose substantial challenges, particularly with fewer sensors.
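
For reference, the per-class precision, recall, and F1-score reported in the tables follow the standard definitions, which can be sketched as below. The function name and the toy labels are illustrative assumptions; a class that is never predicted correctly receives zeros rather than raising a division error.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} using the standard definitions."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[label] = (precision, recall, f1)
    return scores


# Toy example (hypothetical labels, not data from the study).
scores = per_class_scores(
    ["Lie", "Lie", "Sit", "Kneel"],
    ["Lie", "Lie", "Sit", "Sit"],
)
```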

RQ4: What is the energy usage and estimated carbon footprint of the various stages of the ERG-AI pipeline? The ERG-AI pipeline includes functionality for measuring the energy usage of each stage of the pipeline, which enables monitoring and reporting of the resource consumption and carbon footprint of using the pipeline. By using the framework CodeCarbon [70], these metrics are automatically recorded each time the pipeline is run. By investigating the energy usage and carbon footprint of the various stages of the ERG-AI pipeline we can reveal important insights regarding the environmental impact of our machine learning model creation process.

Table 4 Precision score for all classes on different sets of features
Table 5 Recall score for all classes on different sets of features
Table 6 F\(_1\)-score for all classes on different sets of features
Table 7 Uncertainty for all classes on different sets of features
Table 8 Energy usage report for all stages of the ERG-AI pipeline

Table 8 presents the duration, energy consumption, and carbon emissions for each pipeline stage. Every experiment was performed on computational infrastructure located in Norway, on a computer with 8 physical cores, 128 GB CPU RAM and 2 NVIDIA A30 GPUs. The values for carbon emissions are based on Norway’s average carbon intensity of 27.55 gCO\(_2\)eq/kWh (grams of CO\(_2\) equivalents per kilowatt-hour). The first two stages, “Profile” and “Clean”, were only performed once for the data set, since those stages were not affected by which features we chose to use as input variables to the model. The rest of the stages had to be rerun for each new feature set we explored, and we present here the mean duration, energy consumption and carbon emissions across the five different feature sets we used, in addition to the standard deviation for those five runs.

The “Train” stage, responsible for the actual model training, stands out as the most energy-intensive phase with an energy consumption of 0.842 kWh, and estimated emissions of 23.193 gCO\(_2\)eq. This accounts for a substantial portion, \(83\%,\) of both energy consumption and carbon emissions of the whole model creation process. This is as expected, since model training, especially for complex machine learning models like neural networks, demands significant computational resources. The “Clean” stage, responsible for the first part of the data preprocessing, and the “Evaluate” stage, which includes both model evaluation and uncertainty estimation, also contribute significantly to our carbon footprint, with estimated emissions of 1.352 gCO\(_2\)eq and 3.035 gCO\(_2\)eq, respectively. This is an order of magnitude lower than the “Train” stage, but an order of magnitude higher than the rest of the preprocessing stages. The “Clean” stage involves reading the full data set in order to clean up the data and remove unwanted parts, and makes heavy use of the Pandas framework [83], which is convenient to use, but relatively computationally expensive. The “Evaluate” stage involves employing Bayesian dropout on the neural network, which likely contributes to its higher emissions. Additionally, the creation of plots and visualizations within the “Evaluate” stage may explain its larger emissions compared to some of the preprocessing stages.
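
The relation between the reported energy and emission figures can be checked with a short calculation, assuming the carbon intensity of 27.55 is expressed in gCO2eq per kWh. The function names below are illustrative; the small gap between the computed value (about 23.2 gCO2eq) and the reported 23.193 gCO2eq is presumably due to rounding of the energy figure.

```python
NORWAY_INTENSITY_G_PER_KWH = 27.55  # carbon intensity quoted in the text


def emissions_g(energy_kwh, intensity=NORWAY_INTENSITY_G_PER_KWH):
    """Estimated emissions in gCO2eq for a given energy use in kWh."""
    return energy_kwh * intensity


def energy_kwh(emissions, intensity=NORWAY_INTENSITY_G_PER_KWH):
    """Energy in kWh implied by emissions given in gCO2eq."""
    return emissions / intensity


train_emissions = emissions_g(0.842)   # energy reported for the "Train" stage
clean_energy = energy_kwh(1.352)       # emissions reported for the "Clean" stage
```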

Fig. 8 ERG-AI Generated Prompt

It is important to note that while the “Train” stage consumes the most energy, the preprocessing stages also make a notable contribution to our carbon emissions. As we consider the environmental impact of our machine learning pipeline, optimizing the energy usage and emissions in all stages, not just the training phase, becomes imperative. This optimization could include more energy-efficient algorithms, hardware, and data processing techniques to reduce our overall carbon footprint.

RQ5: Can LLM feedback on occupational ergonomics driven by uncertainty-aware ML output be actionable? ERG-AI generates a prompt summarizing the activity of a worker as shown in Fig. 8. This prompt instructs an LLM API to analyze recorded activities of a healthcare worker in home care, including time spent in various postures and movements, along with a model’s prediction metrics like accuracy, precision, and uncertainty for each activity as computed by ERG-AI. The task is to assess ergonomic risks based on this data and provide five tailored recommendations to improve the worker’s occupational health and safety.
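
A prompt of this kind could be assembled from the pipeline output roughly as in the sketch below. The wording of the prompt, the function name, and the toy durations and metrics are hypothetical; the prompt actually generated by ERG-AI is the one shown in Fig. 8.

```python
def build_prompt(minutes_by_posture, metrics_by_posture, n_recommendations=5):
    """Build a plain-text prompt from posture durations and model metrics."""
    lines = [
        "Recorded activities of a healthcare worker in home care:",
    ]
    for posture, minutes in minutes_by_posture.items():
        m = metrics_by_posture[posture]
        lines.append(
            f"- {posture}: {minutes} min (accuracy {m['accuracy']:.2f}, "
            f"precision {m['precision']:.2f}, uncertainty {m['uncertainty']:.2f})"
        )
    lines.append(
        "Assess the ergonomic risks based on this data and provide "
        f"{n_recommendations} tailored recommendations."
    )
    return "\n".join(lines)


# Toy input (hypothetical durations and metrics).
prompt = build_prompt(
    {"Sit": 120, "Stand": 95},
    {
        "Sit": {"accuracy": 0.99, "precision": 0.98, "uncertainty": 0.26},
        "Stand": {"accuracy": 0.89, "precision": 0.85, "uncertainty": 2.10},
    },
)
```

Including the per-activity uncertainty in the prompt is what lets the LLM qualify its advice for activities the model is unsure about.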

Fig. 9 ERG-AI calls GPT-4 to Generate Risks and Recommendations

The prompt in Fig. 8 is used to perform Retrieval-Augmented Generation (RAG) using a commercial LLM, namely OpenAI’s GPT-4, and a smaller and portable open source LLM called LLAMA-7B. RAG is a technique in natural language processing that enhances text generation by integrating a retriever model to fetch relevant information from external sources, which is then used by a generator model to produce more informed and contextually accurate outputs. In our case, we fetch relevant information from the ERG-AI pipeline. The generator model is the LLM that produces more information given the context from ERG-AI.

Analysis of GPT-4’s output: The risk assessment generated by OpenAI’s GPT-4 is shown in Fig. 9. The ergonomic risk assessment tailors recommendations to the worker’s specific activities, like prolonged sitting and standing, offering actionable solutions such as sit-stand workstations and proper footwear. It also thoughtfully addresses uncertain “Other” activities by suggesting manual logging for greater clarity and emphasizes safety, especially in emergency scenarios. However, it falls short in detailing strategies for the ambiguous “Other” category and primarily concentrates on posture-related risks, potentially overlooking other ergonomic concerns like manual handling. Moreover, recommendations may face feasibility issues in diverse home care settings due to resource limitations and assumptions about environmental control. Additionally, the advice, while activity-specific, lacks personalization considering the worker’s unique physical conditions and preferences.

Analysis of LLAMA-7B’s output: The risk assessment using LLAMA-7B’s API for the same prompt in Fig. 8 is provided in Fig. 10. The evaluation of ergonomic risks for a healthcare worker presents several pros and cons. Pros include a comprehensive set of recommendations addressing back pain, stress, and fatigue, practical solutions such as regular movement breaks and correct lifting techniques, a holistic focus on both physical and mental health, and the innovative use of technology like activity trackers. However, there are cons: some recommendations lack specific guidance on break duration and frequency, there are assumptions about the work environment that may not hold in all settings, and there is a reliance on the worker’s self-motivation and consistency in using tools like fitness trackers.

OpenAI’s GPT-4 vs. LLAMA-7B: Comparing GPT-4 and LLAMA-7B outputs reveals distinct approaches to ergonomic risk assessment. GPT-4 delves into specific activities like sitting and standing, offering detailed, actionable advice such as sit-stand workstations and proper footwear. It also acknowledges data accuracy issues, suggesting manual logging for unclear activities. Conversely, LLAMA-7B covers broader ergonomic risks with generalized advice on movement, lifting, and stress management, but lacks specificity. While GPT-4 proposes technology integration through manual logging, LLAMA-7B recommends using an activity tracker. GPT-4 validates its recommendations against specific activity data, emphasizing continuous monitoring. In contrast, LLAMA-7B’s conclusion is more general, without specific validation, offering a broader but less detailed perspective.

Feedback from Occupational Health Professional: LLM feedback on occupational ergonomics based on ML output is in line with general perspectives on healthy working conditions. Subject-specific feedback on identified musculoskeletal health risks is appropriate but still seems to lack specificity, especially with regard to each worker’s age, gender, overall health condition, and specific role in the organization, all of which must be taken into consideration when providing recommendations for healthier working habits. An important aspect of occupational ergonomics is the design of tasks that fit the worker, rather than forcing a worker’s body into postures. This is where more context information on the worker and their tasks will be crucial in order to understand how the working activities and environments can be tailored to suit the needs of the worker in order to reduce the risk of musculoskeletal disorders.

Fig. 10 ERG-AI calls open source LLAMA-7B to Generate Risks and Recommendations

4.3 Threats to validity

The ERG-AI study on enhancing occupational ergonomics with ML and LLMs faces several validity threats.

Internal validity: The ERG-AI system, designed to improve occupational ergonomics through machine learning and LLMs like GPT-4, encounters multiple internal validity challenges. The posture labels of the Digital Worker Goldicare dataset were identified using vector mathematics, which means that the ground truth of the dataset relies on the accuracy of that method. Confounding variables such as environmental factors or individual health issues could mislead conclusions about musculoskeletal disorders. The selection of home care workers from a specific region introduces selection bias, limiting the generalizability of results to other populations. History effects, including personal or professional events during the study, might independently affect the workers’ behaviors. Maturation effects, such as physical or mental changes over time, can also skew results, as can testing effects where familiarity with sensors alters natural movements. Instrumentation changes, involving modifications in sensors or data processes, threaten data consistency. Regression to the mean may lead natural score fluctuations to be falsely interpreted as intervention impacts. Experimental mortality, or participant dropout, could bias outcomes if dropouts have different characteristics. Placebo effects might cause behavior changes due to belief in the intervention’s efficacy. Lastly, experimenter bias could influence data handling, affecting study conclusions.

External validity: The study’s external validity faces several threats. Its generalizability is limited as its findings, based on the Digital Worker Goldicare dataset, may not extend to varied worker types or work environments. The use of LLMs like GPT-4 and LLAMA-7B for ergonomic risk assessments represents very specific choices that might not be universally relevant across all occupational settings. To mitigate this threat, LLMs need to be customized for the purpose by fine-tuning them. Additionally, the study overlooks the environmental impact of LLM inference, an important aspect for the ERG-AI system’s broader applicability and sustainability. The study’s recommendations lack individual worker personalization, potentially narrowing its external applicability. Finally, the fast-paced evolution of AI and ML technologies could render the study’s findings less relevant over time, as new methods and data might offer different insights.

5 Conclusions and future work

5.1 Conclusions

ERG-AI represents a pivotal development in occupational ergonomics, amalgamating machine learning’s analytical robustness with the communicative proficiency of Large Language Models (LLMs) such as GPT-4 and LLAMA-7B. The system’s efficacy was validated using the Digital Worker Goldicare dataset, where it adeptly predicted various human ergonomic postures, especially static ones like sitting and lying, even under energy-efficient down-sampled data conditions. This aspect underscores ERG-AI’s practical utility in real-world scenarios. Our work notably investigated the impact of sensor reduction on the accuracy and uncertainty of posture prediction. Findings revealed that while basic postures were consistently identified with fewer sensors, complex movements such as kneeling and stair climbing presented significant prediction challenges. This emphasizes the crucial role of specific sensors, particularly thigh-based ones, and highlights a balance to be struck between system affordability and predictive accuracy. Incorporating uncertainty estimation into ERG-AI offered insights into the confidence level of posture predictions, thereby aiding in decision-making and pinpointing areas needing model enhancement. Nonetheless, this feature also contributed to an increase in the system’s energy use and carbon footprint, particularly during training and evaluation phases.

The application of LLMs to transform ERG-AI’s technical data into actionable ergonomic advice illustrated the potential for AI-driven occupational health risk assessments. Although GPT-4 and LLAMA-7B provided valuable insights, occupational health experts suggested the need for more personalized recommendations, considering individual worker attributes and specific job roles. This indicates an avenue for further refinement of LLMs to produce more tailored ergonomic advice.

To summarize, ERG-AI embodies the intersection of advanced AI techniques and occupational health, offering an innovative means to improve workplace ergonomics and worker welfare. Its proficiency in processing intricate sensor data and relaying findings in an accessible format renders it an instrumental tool in mitigating musculoskeletal disorders among workers. However, ongoing advancements and customizations are essential to optimize its effectiveness across varied occupational contexts.

5.2 Future work

Building on the foundational work of ERG-AI in occupational ergonomics, several future research directions can further enhance its utility and applicability:

GuardRails for Risk Assessment: Future research could focus on developing robust “GuardRails” [72, 73] for Large Language Model (LLM) outputs in ergonomic risk assessments. These mechanisms would involve implementing checks and balances to ensure that the LLMs, such as GPT-4 and LLAMA-7B, provide not only accurate but also ethically and contextually appropriate recommendations. This may include developing algorithms to filter out biases, inaccuracies, or inappropriate suggestions from LLM outputs, ensuring that the advice given is safe, relevant, and practical for diverse occupational contexts. For instance, GuardRails can be used to constrain recommendations to be specific with regard to gender and age.

Portable Deployment: Advancing ERG-AI’s portability to different platforms using technologies like Intermediate Representation (IR) and frameworks such as Apache TVM [74] and Modular.AI’s Mojo language [75] could be another area of exploration. This would involve optimizing the pipeline for deployment on various hardware platforms, ensuring it is lightweight, efficient, and capable of running on devices with limited computational power. Such portability would facilitate the widespread adoption of ERG-AI, especially in remote or resource-constrained environments.

Privacy-preserving Federated Learning: Integrating privacy-preserving techniques like Federated Learning [76] into ERG-AI would enable the system to learn from decentralized data sources without compromising individual privacy. This approach allows the model to be trained across multiple devices or servers, ensuring that sensitive data does not leave its original location, thereby adhering to privacy regulations and enhancing user trust. Privacy-preserving federated learning (FL) has been explored with positive consequences for occupational health. The paper by Moe et al. [76] presents a novel approach to predict worker safety in construction environments using FL to train deep learning models on edge devices, enhancing safety management while maintaining data privacy. Prasad et al. [77] provide a comprehensive survey on FL in the Internet-of-Medical-Things (IoMT), addressing integration challenges and presenting a case study on blockchain-based FL for decentralized data analytics in healthcare. Additionally, Alahmadi et al. [78] introduce a privacy-preserved mental stress detection framework using IoMT and FL, demonstrating significant reductions in communication overhead and ensuring data security and efficiency.

Uncertainty-driven Continual Learning: Incorporating continual learning driven by uncertainty estimates [79] in ERG-AI could be a significant step forward. This approach would allow the system to continuously adapt and improve based on new data while being aware of its limitations. By focusing on areas with high uncertainty, the system could prioritize learning new or rare postures, enhancing its overall predictive accuracy and robustness.

Data Sovereignty and Data Spaces: Data sovereignty is the ability of an individual or organization to control how, when, and at what price its data is used across the value chain. A data space is like a digital ecosystem that brings together relevant data infrastructures and governance frameworks to facilitate data pooling and sharing [80]. Data spaces have the potential to significantly contribute to accelerating digital transformation within and across domains [80]. Data sovereignty provides the legal and ethical framework within which data spaces operate. Data spaces, such as International Data Spaces (IDS), represent the technological basis for trustworthy data exchange, supporting data sovereignty between businesses while complying with relevant standards, values, and regulations. Addressing data sovereignty [81] concerns in the context of occupational ergonomics is crucial. By focusing on data sovereignty and data spaces in the future development of ERG-AI, the pipeline can enhance its compliance with standards and legal requirements, encourage active participation of workers in studies and data sharing, and support the use of insights for better workplace ergonomics, research, innovation and policy making. Insights gained from ERG-AI can be valuable for broader initiatives focusing on workplace health or the European Health Data Space.

Small Language Models: The use of Large Language Models requires significant computational resources. This entails not only a large energy consumption, but also often the need for using cloud services and thereby sharing potentially sensitive data. Small language models (SLMs) can make it easier to perform such tasks locally and on resource-constrained devices, reducing the carbon footprint and improving data control and privacy. LLAMA-7B, which we experimented with in this paper, is currently one of the smallest open source language models, but with the rapid development in the field of generative AI, we may see smaller, more capable models emerge, such as Orca 2 and Phi 2. Phi 2 outperformed LLAMA-2 in commonsense reasoning and language understanding [82]. It would be interesting to integrate such SLMs to process ERG-AI’s posture predictions and generate risk assessments and recommendations.

Feasibility studies: Future research will need to conduct an economic feasibility assessment to evaluate the cost-effectiveness of sensor deployment and data processing, considering both initial investments and long-term savings from improved ergonomic interventions. A social feasibility assessment is underway, engaging with diverse stakeholders, including employees, occupational health experts, and industry representatives, to ensure the system’s acceptability and practicality in real-world settings. This involves addressing social ethics concerns, such as user privacy and data protection. Legal feasibility will be evaluated by reviewing relevant regulations and standards in occupational health and safety, data protection, and AI deployment. Ensuring compliance with laws such as the General Data Protection Regulation (GDPR) and other local data privacy laws will be paramount. Moreover, the legal implications of AI-driven ergonomic assessments will be considered, ensuring that the system offers recommendations while preserving human oversight and decision-making.

Limitations of this study: The study identified several limitations that warrant future exploration. Sensor dependency is a significant factor, with the accuracy of ERG-AI’s predictions being influenced by the number and placement of sensors; while basic postures were reliably identified with fewer sensors, complex movements required a more extensive network. To address this, future research could optimize sensor placement and explore alternative sensing technologies to reduce costs without sacrificing accuracy. Additionally, incorporating uncertainty estimation, though beneficial for decision-making, increased energy consumption, suggesting a need for more energy-efficient algorithms and hardware solutions. The generalizability of the model is another limitation, as validation was conducted using the Digital Worker Goldicare dataset, which may not represent all workplace scenarios. Building on the dataset to include a broader variety of tasks and environments would enhance the model’s applicability. Ethical concerns and data privacy require continuous monitoring and updating to adapt to evolving standards and regulations. Future research will address these and other limitations by focusing on developing dynamic compliance frameworks that adjust to new legal and ethical guidelines in real time. Lastly, ensuring unbiased and fair recommendations from the LLMs is crucial; we identified the need for further investigation into algorithms that can identify and mitigate biases, ensuring inclusive ergonomic solutions for all workers.