2.1. IoT-Based Sensor for Monitoring System
Recent technologies such as IoT, sensors, big data, and machine learning can be utilized for monitoring and can play important roles in predicting disease, improving production, reducing cost, providing an early warning system, and facilitating better decision making for management. Several studies have been conducted on IoT-based monitoring systems and showed significant advantages. Mora et al. proposed an IoT-based framework for monitoring human vital signs [
7]. A case study on monitoring footballers’ heart rates during a football match was conducted. The proposed system was able to monitor the players’ vital signs and predict not only the worst situation (i.e., sudden death) but also possible injuries. Zhang et al. proposed a monitoring system based on IoT for the agricultural field [
8]. The developed system was used to monitor citrus soil moisture and nutrients for fertilization and irrigation decision making. Case study-based results showed that the proposed system helped farmers make better decisions, improve citrus production, and reduce labor costs as well as the pollution caused by chemical fertilizers. Manes et al. proposed a distributed monitoring system for leakage detection and gas levels in hazardous environments [
9]. A wireless sensor network was utilized to gather the sensor data. The collected environmental sensor data were transmitted to a remote server and presented to the manager via a user interface. The proposed system was effective for monitoring the environment and triggered a warning when critical events were detected. Finally, Cheung et al. proposed real-time monitoring based on information modeling and a wireless sensor network to monitor the safety of construction sites [
10]. Hazardous gas levels and environmental conditions (i.e., temperature and humidity) were collected by wireless sensor nodes and sent to a remote server. The proposed system triggered a warning/alarm once an abnormal situation was detected. Case study-based results showed that the proposed system improved the safety of the construction site and helped the management with better decision making in real-time.
Current studies utilize IoT-based sensors to determine the environmental conditions at a real site, so that the sensor data can be presented in real-time. IoT-based sensors provide an important solution for many research areas, including smart building and healthcare. Several studies have been conducted and showed significant results for IoT-based sensors in improving system performance. Plageras et al. proposed a monitoring system using IoT-based sensors for smart buildings [
55]. The proposed system was implemented in a simulation environment. The results showed that a better monitoring system in a smart building can be achieved by using several IoT-based sensors. The proposed system is expected to improve energy efficiency as well as facilitate green smart buildings. Blanco-Novoa et al. proposed an IoT-based sensor for monitoring the radon gas level inside a building [
56]. The proposed system monitored the radon gas level, triggered programmed actions, and notified users once a specified radon level was reached, helping to prevent dangerous situations. Benammar et al. proposed a modular indoor air quality monitoring system that collects several types of sensor data such as CO2, CO, SO2, NO2, O3, Cl2, temperature, and humidity [57]. A single-board computer (Raspberry Pi) was utilized as a gateway to process the sensor data. The experimental results showed that the proposed system effectively monitored the indoor air quality for six kinds of gases in addition to temperature and humidity. Sood and Mahajan proposed a healthcare system based on wearable IoT-based sensors for detecting and preventing the outbreak of chikungunya virus [
58]. The collected health, environmental, medical, location, and meteorological data were used to classify individuals as possibly infected or uninfected. The results showed that the proposed system could be used to detect infected individuals as well as to send a warning alert to the nearest governments and healthcare clinics to prevent further outbreaks. Finally, Bayo-Monton et al. developed an IoT-based sensor utilizing Arduino and Raspberry Pi to enhance eHealth care [
59]. The performance of the proposed sensor was compared with that of a personal computer. The results confirmed that the proposed IoT-based sensor was suitable for scalable eHealth systems.
Several studies have been conducted in the manufacturing industry and showed significant advantages from IoT-based sensors in improving working conditions, preventing erroneous designs, providing fault diagnosis and quality prediction, and helping managers with better decision making. Moon et al. developed an IoT-based sensor to measure the air quality inside a factory [
11]. Temperature, humidity, CO2 level, dust, and odor sensor data were collected and transmitted via wireless communication. The experimental results showed that the proposed system is robust, can accurately measure the environmental conditions in the factory in real-time, and is expected to help managers maintain an optimum working environment for the workers inside the factory. Salamone et al. proposed an environmental monitoring system based on low-cost IoT sensors for preventing errors during the design phase in additive manufacturing [
12]. The sensors were used to gather temperature and humidity data. The study revealed that knowledge of environmental conditions could help prevent errors during the design phase in additive manufacturing. Li et al. utilized IoT sensors to collect data for the fault diagnosis of mine hoisting equipment [
13]. The study revealed that IoT sensors can help provide complete diagnosis data as well as improve diagnosis results. Lee et al. proposed a framework by utilizing IoT and machine learning to predict the quality of a product and optimize operation control [
14]. Metal casting was used as a real-case implementation of the proposed system. The proposed system was able to effectively predict the quality of the metal casting and efficiently improve the operation control. Finally, Calderón Godoy et al. proposed the integration of sensors and the SCADA system for implementation of the fourth industrial revolution framework [
15]. Experimental results confirmed the feasibility of the proposed system, which is expected to help managers during the migration of legacy systems to the Industry 4.0 framework.
The number of IoT-based sensors and other related components is increasing significantly. The adoption of IoT in manufacturing enables the transition from traditional to modern digitalized manufacturing. As the number of devices collecting sensor data in manufacturing increases, the potential for new types of applications that can handle the input of large amounts of sensor data such as big data technology also increases. Ge et al. developed a conceptual framework by integrating big data technology in IoT, which is expected to support critical decision making [
60]. By utilizing big data processing, the enormous amount of data collected from many heterogeneous sources (sensor devices) can be handled and presented in an efficient manner, thereby assisting managers with better decision making.
2.2. Big Data Processing
With the increasing number of IoT and sensing devices, data generated from manufacturing systems are expected to grow exponentially, producing so-called “big data” [
16]. Big data is often described in terms of 4 V’s. The first V is volume in reference to the size of the data, the second V is variety in reference to the different types/formats of the data, the third V is velocity in reference to the speed of data generation, and the last V is veracity in reference to the reliability of the data [
61]. The data generated during manufacturing are increasing daily in different types and formats (e.g., process logs, events, images, and sensor data); hence, the processing and storage of these data are becoming a challenging issue that needs to be addressed. There are several applications of big data analytics in the manufacturing industry. Zhang et al. proposed a big data framework for reducing energy consumption and emission in an energy-intensive manufacturing industry [
17]. The proposed system consists of two components, data acquisition for gathering the energy data and data analytics for analyzing the energy usage. Based on a real-case implementation, the results showed that the proposed system was capable of eliminating three percent of the energy consumption and four percent of energy costs. Zhong et al. proposed a big data system for logistics discovery from RFID-enabled production data for mining knowledge [
18]. An experiment was used to demonstrate the feasibility of the proposed system and the results showed that the knowledge gained from big data could be used for production scheduling and logistics planning. Mani et al. studied the application of big data analytics for mitigating supply chain social risk [
19]. A case study was used to elaborate the application of big data analytics in the supply chain. The results of the study revealed that big data analytics can help management predict various social problems and mitigate social risks. Finally, Li et al. proposed a big data framework for active sensing and processing of complex events in manufacturing processes [
20]. To effectively process complex event big data, a relation model and unified XML-based manufacturing processes were developed. The Apriori frequent item mining algorithm was used to find frequent patterns in the complex event data. The feasibility and effectiveness of the proposed system were confirmed with an implementation in a local chili sauce manufacturing company. The proposed model is expected to provide practical guidance for management decision-making.
Several big data technologies can be utilized in the manufacturing industry to process and store large volumes of data quickly, such as Apache Kafka, Apache Storm, and NoSQL MongoDB. Apache Kafka is a scalable messaging queue system used for building real-time applications [
62]. It is fault-tolerant, high-throughput, and scalable. Several studies have shown significant benefits from using Kafka for healthcare, transportation, manufacturing, and IoT-generated sensor data. Alfian et al. proposed real-time data processing for monitoring diabetic patients [
21]. Apache Kafka and MongoDB were utilized to handle and store sensor data from the patients. The proposed system was sufficiently efficient at monitoring diabetic patients. Ji et al. proposed a cloud-based car parking system consisting of several technologies, including Apache Kafka [
63]. The proposed system was capable of efficiently handling massive amounts of sensor data when the amount of data and the number of clients increased. D’silva et al. proposed a framework for handling real-time IoT event data [
22]. The proposed framework utilized Apache Kafka as a message queue system and was efficient enough to process real-time IoT events data. Canizo et al. proposed a framework based on big data technologies and machine learning for online fault prediction for wind turbines [
23]. Apache Kafka was used to handle incoming data in real-time and send the data to a streaming system for further analysis. The proposed system could be used to monitor the status of wind turbines and is expected to help reduce operation and management costs. Du et al. proposed a framework for handling huge amounts of incoming unstructured connected vehicle (CV) data [
24]. The proposed framework utilized Apache Kafka as a distributed message broker. Experimental results showed the proposed system is efficient enough in handling huge amounts of incoming CV data and achieved the minimal recommended latency value defined by the U.S. Department of Transportation for CV applications. Park and Chi proposed an architecture for an ingestion system based on Apache Kafka for machine logs in the manufacturing industry [
25]. The proposed system collects machine logs from a set of milling machines, handles them in a Kafka messaging queue, and delivers them to external systems for further analysis. Finally, Ferry et al. proposed a data management system based on big data technologies for machine-generated data on a manufacturing shop floor [
26]. The proposed system utilizes Apache Kafka as a message queue and Apache Storm as a real-time processing system. Implementation of the proposed system is expected to reduce infrastructure and deployment costs.
Apache Storm is a real-time distributed parallel system for processing high-velocity stream data [
64]. It is fault-tolerant and scalable, with guaranteed data processing. Previous studies have utilized Apache Storm for real-time data processing. Ma et al. proposed a stream-based framework for providing real-time information services on public transit [
27]. The proposed framework utilized Apache Storm as a real-time distributed processing engine. The results showed that the proposed framework was capable of handling large amounts of real-time data with lower latency. Furthermore, the performance of the proposed framework increased when the number of nodes/servers utilized increased. Manzoor and Morgan proposed a real-time intrusion detection system based on Apache Storm [
28]. The proposed system was evaluated using the KDD 99 network intrusion dataset and the results showed that the proposed system was feasible for processing network traffic data and detecting network intrusion with high accuracy. Chen et al. proposed a real-time geographic information system for managing environmental big data using Apache Storm [
29]. The proposed system was tested with two use-cases (i.e., real-time air quality monitoring and soil moisture monitoring). The results showed that the proposed system was effective enough for managing real-time environmental big data. In addition, several studies have been conducted regarding the performance of Apache Storm as a real-time data processing system. Qian et al. performed a performance comparison between Apache Storm and Spark [
30]. The latency and throughput of the systems were considered, and the results showed that Apache Storm has shorter latency while Spark has higher throughput. Finally, Chatterjee and Morin performed a comparative performance analysis between several data streaming platforms (i.e., Flink, Storm, and Heron) [
31]. Various performance metrics were considered such as fault tolerance and resource usage. The results showed that Storm has better fault tolerance and less memory usage than the other systems.
The increasing amount of IoT-generated sensor data has led to increased demand for sensor-friendly data storage platforms. NoSQL databases have become popular in the last couple of years because of their flexibility, scalability, and availability. The term ‘NoSQL’ collectively refers to data storage platforms that do not follow the strict data model of relational databases. MongoDB is a document-oriented NoSQL database that offers a flexible data schema, high performance, scalability, and availability [
65]. A previous study compared the performance of MongoDB and Oracle with insert, update, and delete tests [
32]. MongoDB outperformed Oracle in all tests. In addition, MongoDB has been proven to be effective for storing data from the supply chain, geographic information systems, and manufacturing. Alfian et al. utilized MongoDB to store IoT-generated sensor data for monitoring a perishable food supply chain [
33]. In the study, MongoDB was capable of processing a huge amount of input/output sensor data efficiently when the number of sensors and clients increased. In addition, MongoDB outperformed MySQL in read and write tests. Hu et al. conducted a comparative study among six popular databases (i.e., Rasdaman, SciDB, Spark, ClimateSpark, Hive, and MongoDB) for handling a variety of geospatial data [
34]. The results showed that MongoDB was adequate in terms of parallel query and resource consumption (i.e., CPU, memory, network). Chen et al. proposed MongoSOS, a sensor observation service based on MongoDB, for handling spatiotemporal data [
35]. The proposed system was capable of handling read and write access for navigation and positioning data in a millisecond and the performance improved by around two percent compared with the traditional model. Putri et al. proposed a big data processing system based on Apache Spark and MongoDB to identify profitable areas from large amounts of taxi trip data [
36]. The experimental results showed that the proposed system was scalable and efficient enough in processing profitable-area queries from huge amounts of big taxi trip data. Finally, Angrish et al. proposed a flexible data schema based on NoSQL MongoDB for the virtualization of manufacturing machines [
37]. The proposed system was evaluated against several query statements. The results showed that MongoDB can accommodate any type of machine data and could easily be implemented across a variety of machines on the factory floor.
Previous studies have shown a significant impact from the integration of several big data technologies. Lohokare et al. proposed a scalable framework for home automation in smart cities [
38]. The proposed framework utilized Apache Kafka as a message broker to handle incoming IoT data and MongoDB to store the sensor data. The proposed system was able to reduce the processing time when the amount of data and nodes increased. Jung et al. proposed a smart city system using Apache Kafka and Apache Storm to handle and process IoT-generated data in real-time [
39]. Experimental results showed that the proposed system was capable of effectively and efficiently processing the IoT-generated data in real-time. Villari et al. proposed a management system for smart environments using big data technologies [
40]. The proposed system utilized Apache Storm to process the data in real-time and MongoDB to store huge amounts of sensor data. A case study on smart homes was performed, and the results showed that the proposed system was able to manage large amounts of smart environmental data in real-time. Zhou et al. proposed an integration of Apache Kafka, Apache Storm, and MongoDB for processing streaming spatiotemporal data [
41]. The proposed system was tested using the Taiyuan BeiDou bus location data. The proposed system was capable of processing large amounts of sensor data per second and was around three times faster than the traditional model. Finally, Syafrudin et al. proposed an open source-based real-time data processing system consisting of Apache Kafka, Apache Storm, and MongoDB [
42]. The proposed system was implemented to monitor the injection molding process in real-time. The proposed system was capable of processing a massive amount of sensor data efficiently when the amount of data and the number of devices increased.
Integration of Apache Kafka, Apache Storm, and MongoDB can be used for big data processing to handle manufacturing sensor data. Previous studies have shown that these three technologies can be used for big data processing so that large amounts of streaming sensor data can be promptly processed, stored, and presented in real-time [
41,
42]. Thus, in our study, Apache Kafka, Apache Storm, and MongoDB were utilized for big data processing to monitor the manufacturing process in real-time. In addition, the integration of big data processing with a machine-learning model is expected to help managers with decision-making and to prevent unexpected losses caused by faults during the manufacturing process.
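The division of roles described above (a message queue that buffers incoming sensor readings, a stream-processing step that handles each reading, and a document store that keeps the results) can be illustrated with a minimal, purely in-memory sketch. It does not use Apache Kafka, Apache Storm, or MongoDB; the `deque`, the `process` function, and the list of dictionaries are stand-ins, and all names, fields, and the temperature threshold are hypothetical choices made only for illustration.

```python
from collections import deque


def produce(topic, device_id, temperature, humidity):
    """Append one sensor reading to the queue (stand-in for a message-queue topic)."""
    topic.append({"device": device_id, "temperature": temperature, "humidity": humidity})


def process(message, max_temperature=30.0):
    """Stand-in for a stream-processing step: enrich one reading with an alert flag.

    The threshold is a hypothetical value, not one taken from the study.
    """
    document = dict(message)
    document["alert"] = message["temperature"] > max_temperature
    return document


def consume_all(topic, collection):
    """Drain the queue in arrival order and store each processed document."""
    while topic:
        collection.append(process(topic.popleft()))


topic = deque()      # stand-in for the message queue
collection = []      # stand-in for a document collection
produce(topic, "sensor-1", 25.0, 40.0)
produce(topic, "sensor-1", 35.0, 42.0)
consume_all(topic, collection)
```

In a real deployment each stand-in would be replaced by a client for the corresponding system, and the three stages would run as separate, distributed processes rather than in one loop.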
2.3. Machine Learning Methods in Manufacturing
The manufacturing industry is experiencing an increase in data generation, e.g., sensor data from the production line, environmental data, etc. New developments in technology such as machine learning offer great potential to analyze data repositories, and thus can provide support for management in decision-making or can be used to improve system performance. Machine learning techniques are utilized to detect certain patterns or regularities and have been successfully implemented in various areas such as fault detection, quality prediction, defect classification, and visual inspection. Several studies have utilized machine learning and showed significant results in the manufacturing industry. Kim et al. employed seven different machine learning-based novelty detection methods to detect faulty wafers [
43]. The models were trained with Fault Detection and Classification (FDC) data to detect faulty wafers. The experimental results showed that machine learning-based models had a high possibility of detecting faulty wafers. Lee et al. evaluated four machine learning algorithms (i.e., decision tree, random forest, artificial neural network, and support vector machine) for predicting the quality of metal casting products [14]. The results showed that all four machine learning algorithms can effectively be used to predict product quality. Chen et al. utilized a support vector machine algorithm to predict the quality of welding in a high-power disk layer [44]. The results showed that the proposed quality prediction model can be used in a real-time monitoring system. An intelligent system was developed by Chen et al. to minimize incorrect warnings in detecting product quality in manufacturing [45]. They utilized three methods (i.e., visual inspection, support vector machine, and similarity matching). Through a real-case implementation in a manufacturing company in Taiwan, the proposed system was shown to effectively minimize misclassifications and improve the performance of quality prediction. Finally, two machine learning algorithms (i.e., decision tree and Naïve Bayes) were used by Ravikumar et al. to automate the quality inspection of machine components [46]. Three types of machine component quality (i.e., good, minor scratch, and deep scratch) were measured. The results showed that the proposed method can effectively be used to automate product quality inspection in practical cases.
Fault detection and diagnosis is an important problem in process engineering and is utilized to detect abnormal events in a process. Early detection of process faults can help avoid productivity loss. Machine learning algorithms such as Random Forest showed significant efficacy in detecting process faults in manufacturing. Random Forest is an ensemble prediction method that aggregates the results of individual decision trees [
66]. Generally, Random Forest works by utilizing the bagging method to generate subsets of the training data. For each training subset, a decision tree is built. The final prediction is then selected by majority vote (the most voted class) over all the trees in the forest. Recently, Random Forest was used by Quiroz et al. for detecting rotor bar failure. They compared the performance of Random Forest with that of other models (i.e., decision tree, Naïve Bayes, logistic regression, linear ridge, and support vector machine). The experimental results showed that Random Forest outperformed the other models, with an accuracy of around 98.8%. The proposed model can be used for real-time fault monitoring systems as well as preventive maintenance systems in factories. Random Forest was also utilized by Patel and Giri for detecting bearing failure [48]. The results were compared with those obtained from an existing artificial intelligence technique, the neural network. The results showed that Random Forest had better performance and higher accuracy than the neural network algorithm. The results of this study are expected to be used for bearing fault detection and diagnosis. Finally, Cerrada et al. proposed fault diagnosis in spur gears based on a genetic algorithm and Random Forest [49]. The proposed system consisted of two parts, namely a genetic algorithm for attribute selection and Random Forest for classification. The proposed system was tested on real vibration signals, and Random Forest had better performance for fault diagnosis.
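The bagging and majority-vote mechanism described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than any of the cited implementations: each "tree" is reduced to a depth-1 decision tree (a stump) on one randomly chosen feature, which is an assumption made to keep the example short, and the names `fit_stump`, `fit_forest`, and `predict` are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a depth-1 decision tree on one feature by minimizing training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls on the right side, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Bagging: train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))  # random feature per tree (assumption)
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], feature)
        )
    return forest


def predict(forest, row):
    """Majority vote: return the class predicted by the most trees."""
    votes = Counter(
        left_label if row[feature] <= threshold else right_label
        for feature, threshold, left_label, right_label in forest
    )
    return votes.most_common(1)[0][0]
```

Given feature rows (for example, temperature and vibration readings) labeled "normal" or "fault", `fit_forest(rows, labels)` returns the list of trained stumps and `predict(forest, row)` returns the majority class. Full Random Forest implementations grow deeper trees and sample a feature subset at every split.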
Machine learning algorithms encounter problems with outlier data, which can reduce the accuracy of the classification model. Outlier detection can be utilized in the preprocessing step to identify inconsistent data (outliers) so that a better classifier can be generated for decision making. Previous studies showed that removing outliers can improve classification accuracy. Tallón-Ballesteros and Riquelme utilized outlier detection for a classification model [50]. The authors proposed a statistical outlier detection method based on the interquartile range (IQR) with classes. The results showed that removing the outliers from the training set improved the classification performance of C4.5. Podgorelec et al. utilized an outlier prediction method to improve classification model performance in medical datasets [51]. The results showed that removing the identified outliers from the training set improved the classification accuracy, especially for the Naïve Bayes classifier.
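As a rough illustration of IQR-based outlier removal before training, the sketch below computes the usual fences Q1 - k·IQR and Q3 + k·IQR separately for each class and drops the samples that fall outside them. Applying the fences per class is one possible reading of "with classes" and is an assumption here, as are the function names, the single-feature input, and the conventional factor k = 1.5; the cited method may differ in its details.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers_per_class(samples, k=1.5):
    """Filter (value, label) pairs, computing the fences separately for each label."""
    by_class = {}
    for value, label in samples:
        by_class.setdefault(label, []).append(value)
    bounds = {label: iqr_bounds(values, k) for label, values in by_class.items()}
    return [
        (value, label)
        for value, label in samples
        if bounds[label][0] <= value <= bounds[label][1]
    ]
```

The filtered list would then be used as the training set for the classifier, while the test set is left untouched.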
One of the techniques used for outlier detection is DBSCAN [
52]. The algorithm works by identifying dense regions, which are determined based on the number of objects close to a given point. Finally, the algorithm identifies points that do not belong to any cluster, which are treated as outliers. DBSCAN has been implemented in different areas and showed significant accuracy by detecting true outliers. Tian et al. proposed an outlier detection method involving soft sensor modeling of time series [
53]. They utilized DBSCAN for outlier detection and the proposed outlier detection method demonstrated good performance. Abid et al. proposed outlier detection based on DBSCAN for sensor data in wireless sensor networks [
54]. The proposed model successfully separated outliers from normal sensor data. Based on experiments with synthetic datasets, the proposed model showed significant accuracy in detecting outliers, with an accuracy rate of 99%.
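The density-based idea described above can be sketched as follows. This is a minimal, unoptimized version of DBSCAN written for illustration only: `eps` (the neighborhood radius) and `min_pts` (the minimum number of points, including the point itself, needed to form a dense region) are the conventional parameter names, and the function name and the `NOISE` marker are hypothetical. Points that end up labeled `NOISE` are the ones treated as outliers.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(reachable)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
# labels == [0, 0, 0, 0, 1, 1, 1, 1, -1]; the isolated point (5, 5) is the outlier
outliers = [p for p, label in zip(points, labels) if label == NOISE]
```

The neighbor search here is quadratic in the number of points; practical implementations use a spatial index to make the search faster.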
Existing studies showed that Random Forest can be utilized for fault prediction with high classification accuracy. Furthermore, several studies showed significant results for DBSCAN-based outlier detection with regard to improving the classification accuracy. We propose a hybrid prediction model that consists of DBSCAN-based outlier detection to remove outlier data and Random Forest to detect whether the manufacturing process is functioning normally or abnormally. The hybrid prediction model is integrated with a real-time big data processing system, enabling the processing of sensor data from IoT-based sensor devices (e.g., temperature, humidity, accelerometer, and gyroscope) and fault prediction in real-time.