Model-free robot anomaly detection

Rachel Hornung; Patrick van der Smagt; Christian  Osendorfer

Proc. IEEE/RSJ International Conference on Robotics and Systems (IROS), 2014 Model-free robot anomaly detection Rachel Hornung, Holger Urbanek, Julian Klodmann, Christian Osendorfer, Patrick van der Smagt, Member, IEEE Abstract— Safety is one of the key issues in the use of robots, especially when human–robot interaction is targeted. Although unforeseen environment situations, such as collisions or unexpected user interaction, can be handled with specially tailored control algorithms, hard- or software failures typically lead to situations where too large torques are controlled, which cause an emergency state: hitting an end stop, exceeding a torque, and so on—which often halts the robot when it is too late. No sufficiently fast and reliable methods exist which can early detect faults in the abundance of sensor and controller data. This is especially difficult since, in most cases, no anomaly data are available. In this paper we introduce a new robot anomaly detection system (RADS) which can cope with abundant data in which no or very little anomaly information is present. I. I NTRODUCTION Reliability and, depending on the task, high accuracy are critical for the successful application of a robot. Early detection of faults is crucial to ensure safety and high performance of the system. Faults can be soft- and hardware-related and include communication problems, wear-dependent deviation, and broken sensors. Especially in the early stages of a disturbance, these faults will not result in a failure or malfunction as defined by [13] since the control algorithms are sophisticated enough to handle changes in measurement and control values. Besides increasing reliability of the system, utilizing redundant hardware can identify faults by comparing the outputs of two or more substituting systems. Unfortunately, a redundant setup increases costs, complexity, required space, and system weight [19]. Thus, robots often realize fault detection by setting thresholds on sensor data, which should not be exceeded [23]. Yet, simply applying thresholds disregards dependencies between different data and requires unacceptably large possible data ranges to avoid frequent false positives. Especially for robots, the acceptable data range strongly depends on the particular configuration. Moreover, differences to concurrently running simulators give insight in aberrations. These model-based techniques vary from comparing the state estimates to the actual states and only accepting residuals below a predefined threshold [21], over imprecise models accepting measured values within a specified bandwith [20], to robot state estimating RH, HU, and JK are with the Institute of Robotics and Mechatronics, DLR / German Aerospace Center, {rachel.hornung, holger.urbanek, julian.klodmann}@dlr.de CO and PvdS are with the Faculty for Informatics, Technische Universität München, {osendorf@in., smagt@}tum.de PvdS is further with fortiss, An-Institut Technische Universität München algorithms like particle filters [25]. Some of these approaches can only recognize faults with known failure patterns [9], which facilitates an immediate diagnosis of the fault, and others have a specific state for unmodeled faults [24]. However, all of these approaches have drawbacks. Simulations are merely approximations of the real system; The more detailed the model, the more cumbersome the generation and the more expensive the evaluation. Yet, the performance of model-based fault detection depends on model accuracy. Besides, using models implicitly defines the detectable faults, even if no specific fault pattern is required. Only those errors influencing modeled relations can be detected. E.g. if nothing but the relationship between desired Cartesian end effector position and joint position is modeled, a gear slippage forcing the controller to increase the impelling current will remain undetected until further damage is caused to the system. Therefore, we focus on a data-driven approach. A major challenge is the impracticality to generate and record fault data. Such data are often caused by hardware failure, which can hardly be emulated (for instance, deliberate destruction does not reflect the effects of wear), but is also a combination of many different states, which cannot be exhaustively emulated. Similarly, the number of valid configurations is very large, and not all valid data combinations are seen during normal operation. Depending on the robot the data space can easily cover several hundreds of dimensions. We introduce a semi-supervised Robot Anomaly Detection System (RADS) which incrementally learns valid data during normal operation, building up a compact representation of these states. After switching from training to application, aberrations of these are detected and correspondingly labeled. The method is validated with a KUKA/DLR LightWeight Robot (LWR) III arm [3], but is applicable to other input-output systems. The presented approach is aiming for the recognition of anomalies not yet prohibiting the proper operation of the robot, but potentially leading to increasing problems. In contrast to many fault identification and diagnostic systems, it does not model distinct fault states allowing for a clear diagnosis. However, it can detect faults not considered during modeling or occluded by model assumptions. Moreover, it is conceivable to employ RADS to recognize improper use of the robot in tele-operated systems, e.g. deviations from the common workflow. Thus, it should be considered an additional tool to ensure system reliability, but not as an ultimate replacement for current fault detection, isolation, and identification methods. II. S TATE OF THE A RT Few publications deal with the problem of anomaly detection in high-dimensional data while only positive samples are available. Using the following six requirements for novelty detection stated by [16] existing methods are assessed (see Tab. I): 1) robust trade-off: exclude novel samples, yet include known, 2) generalization: avoid false positives and negatives, 3) adaptability: capable to incorporate new information, 4) minimized complexity: applicable for online evaluation, 5) independence: handle varying dimensions and features, 6) parameter minimization: little input from the user. Uniform data scaling further requested by [16] is part of preprocessing [26] rather than the novelty detection algorithm, and makes the application of many algorithms easier. Statistical methods as aggregated by [5] evaluate whether a new data point belongs to the same distribution as the training data. Presuming the type of distribution is especially difficult for high-dimensional data, as well as the construction of hypothesis tests. Density estimation methods such as Parzen windowing [4] do not scale with data dimensions. The same holds for variational Bayesian techniques [4]. Furthermore, the created clusters span over large, sparse areas, leading to overgeneralization. Clustering algorithms such as k-means [15] require prior data space knowledge, e.g. the number of centers that make up the data model. More elaborate clustering algorithms, e.g. growing neural gas [17], can adapt the number of cluster centers but require many equally distributed data. One-class Support Vector Machines [22] lead to an unmanageable number of support vectors, and compel a specified percentage of the training data to be considered as outliers. Extreme Learning Machines [12] use single-hidden layer feed-forward networks to approximate the data distribution, and are only applicable if counter examples are available. Other algorithms inherently deal with time series effects, e.g. Peer Group Analysis [8]. However, these methods compare similarly behaving dimensions against each other, and TABLE I A NOMALY DETECTION ALGORITHMS IN THE LIGHT OF REQUIREMENTS method violated requirements statistical methods density estimation variational Bayesian techniques clustering one-class Support Vector Machines Extreme Learning Machines Peer Group Analysis genetic algorithms Replicator Neural Networks Sliding window Mahalanobis threshold 5, 4 1, 1, 1, 5 5 3, 1 1 6 2, 4 2, 6 3 4 classify sudden divergence as suspicious. In robot anomaly detection, many of the dimensions cannot be compared by common distance metrics as Euclidean distance. [26] offers an extensive overview of outlier detection methods especially for high-dimensional data. However, either these methods require outliers to be present during training, or they evaluate test data with respect to all training points resulting in too slow evaluations, or the approaches use only selected subspaces of the data. E.g. [2] consider both high dimensionality and single-class discrimination by genetic algorithms, which project data onto several lowerdimensional subspaces. The idea is promising but very time consuming, and only those faults differing in the selected subspaces will be identified [26]. [10] use a multi-layer feed-forward network, called Replicator Neural Network, to compress and restore the data (similar to the method introduced in Sec. III-D). However, this technique cannot detect faults exhibiting a similar intra-data relationship as the training data. In 2011 [14] have introduced another model-free approach to anomaly detection. They learn the distribution of measurements and control commands from the latest data history and use a similarity threshold on the Mahalanobis distance to evaluate whether new data points fit into that distribution. The algorithm has returned very good results in their evaluations. However, as this method is based on windows over the latest history, slinking appearance of errors, like wear, will not be detected. In the beginning these faults will not differ enough to be considered an anomaly, but the further the error proceeds, the more of it will be contained in the latest history and thus distort the training data. Although the training step in this algorithm is computationally efficient, it has to be repeated for each measurement, thus prohibiting too high update rates—in the paper the assessment is repeated with a frequency of at most 10 Hz. None of the available approaches sufficiently considers all of the six requirements while being able to handle highdimensional one-class data and preventing adjustment to faults (cf. Tab. I). The aim of this paper is to introduce a system that is capable of detecting anomalies in highdimensional data while the robot is operating. First, a robust detection mechanism is developed. Subsequently, applicability to high-dimensional data is ensured. Each concept of RADS is evaluated according to the fulfillment of the above listed requirements and the detection capability to prove its satisfactory performance for anomaly detection. III. M ACHINE L EARNING -BASED A NOMALY D ETECTION We tackle the problem with a multi-stage approach. Our method starts off by calculating maps of valid and invalid states from the robot data; inter alia joint torques, motor torques, currents, temperatures in the joints, and the applied load. Either of the two maps can identify faults, but lack computational efficiency. Thus, the maps are used to generate two labeled classes. These classes are then separated using a Support Vector Machine (SVM) ensuring a quick decision. Whenever new information is available, e.g. because the robot is equipped with a new tool or the robot is operating in a different area of the workspace, the system can be amended without requiring the former training data. Finally, dimensionality reduction ensures applicability to high-dimensional data. Positive data comprises all areas of the data space accessed during regular use. Naturally, this space should be maximally sampled. In contrast, negative data looms in unknown areas of the data space. Any data point classified thus requires further investigation: Either the maps are incomplete and have to be amended or an invalid state is detected. A. A Map of Positive Data To test whether a data point is within the positive data it could be compared against all valid points. However, an infinite number of test points would be required to cover all possible measurement variations making training and evaluation infeasible. Instead, we approximate the positive data space by summing K radial basis function kernels. The map assigns each point x in the data space a function value corresponding to the probability of belonging to the trained data. If this value is above a threshold α, x is considered known. Otherwise the data point is an outlier. The probability is evaluated using the χ2 goodness-of-fit test [18]. Each kernel κk has a specific center µk and variance Σk . The function value φ(x, κk ) of a data point x in a kernel κk only depends on the Mahalanobis distance, q (1) mk = (x − µk )T Σ−1 k (x − µk ), to the respective kernel center and the dimensionality d of the data space: φ(x, κk ) = Qχ2 (mk , d) . (2) Qχ2 (mk , d) is the χ2 goodness-of-fit test probability, describing the likelihood that a point with distance mk to a kernel belongs to same distribution as the respective kernel in d-dimensional space. Combining multiple kernels a multi-modal and multivariate distribution can be modeled: rbf(x) = K X φ(x, κk ). (3) i=1 Although a single kernel might not surpass the threshold at a specific point, it is possible that the combination of several kernels still identifies this data space as known. Training is performed sequentially. Once trained, the data can be discarded. For each training point rbf(x) is evaluated, i.e. x is tested against all kernels. If rbf(x) > α, (4) with α as threshold, x is situated in a known area, otherwise the data point is considered unknown. If Eq. (4) rates x unknown, a new kernel κnew , covering the area around x, is added to the network. The initial center µnew of the new kernel κnew is placed at the position of x. The covariance matrix has to be initialized in a way that the typical noise variation a, but not more, is covered. Accepting noise-like divergence establishes a d-dimensional ellipsoid of recognition around the initial data point. Any point within its border will be considered known. Assuming a normal distribution of measurement errors around the actual value, choosing a Gaussian kernel to include all neighbors is well suited. This does not imply that the underlying data distribution has to be Gaussian; indeed, the data can be arbitrarily distributed. If x is in a known area, the center and the covariance of the corresponding kernel are adjusted: µ0 i = ηi µ i + x , ηi + 1 (5) where µi is the center of the kernel at the time of observation, µ0 i the updated center, and ηi the number of points that have already contributed. The covariance has to be adjusted accordingly, ensuring that all points, covered before, will be covered after the transformation. To test whether a data point is represented by the map during the application phase only Eq. (4) applies. If the probability threshold is surpassed, the point is accepted as inlier. In case the map has to be extended, the same process as for initial training can be applied. The old map is used as starting point and, where necessary, amended c.q. extended. B. A Map of Negative Data A map of the unknown data space can also classify the data adequately. Instead of modeling a data distribution, owed to the fact of missing fault data, a geometric description of the negative space is desired. One way to generate such a map is Negative Selection (NS) [7]. Any space not covered by (positive) training data is filled with detectors reacting once a data point is within their area of influence. If a detector is activated during the application phase, an anomaly is detected. A detector τi is modeled as a sphere with center ci and radius ri . The detector τi is activated whenever an observation is within the sphere, i.e. its distance to ci is smaller than ri . Since there is no information on how the negative space is set up, a detector is generated by first drawing its center ci uniformly from the sample space, determined by the minimum and maximum values of the positive data in each dimension. To avoid redundancy a new ci is discarded if it activates any of the existing detectors. Otherwise, it has to be ensured that none of the positive data points is too close: ci must have at least a distance of rmin + a. The noise amplitude a ensures that data points differing less from the training data than the typical noise range will not activate the detector; rmin guarantees that τi has a minimum area of influence. Finally, the radius ri of τi has to be determined. Since valid assumptions about the structure of the error space are intangible, a proportional divergence in all directions is The new radius rnew will be the lower value of rmax and the minimum distance to the new training data ri (Eq. (6)). r rmin C. Fast Differentiation Between Two Classes rold rmax Fig. 1. NS Update Example: any of the small blue dots, representing new data, caused the large, pale detector in the back to turn invalid. Two exemplary replacements (bright orange) have been inserted. The radius of the smaller one is defined by the minimum distance to the training data, the larger is bound by size of the old detector. The two green potential detector centers have been discarded. One is outside of the bounding hull, one is too close to the hull. More replacement detectors have to be generated until the detector is sufficiently covered. assumed and thus ri is set to be the minimum distance to the training data D shielded by a: ri = min kci − xk − a. (6) x∈D Accepting radii larger than the minimum radius reduces the number of required detectors to cover the invalid states. An established detector is not changed anymore. Detector generation is continued until the desired coverage c of the data space is reached, i.e. 1/(1 − c) potential detectors have been discarded in a row. Once detector generation completed the training data can be discarded. Since NS generally requires all data to be available during training, map updating cannot be conducted in the same way as initial training. Instead, the reaction of the detectors to each data point of a new batch of training data is evaluated. If none of the detectors react, the point can be discarded without further processing. Otherwise, each activated detector is treated separately. If the active detector has only the minimum radius rmin , no smaller detectors can be created to replace the old one, thus the detector is removed. Otherwise replacement detectors are generated. Since the available information on the space around the detector is insufficient given only the new batch of training data (the original training data has been discarded), a new detector may cover at most the same area as the old detector. Thus, kcnew − cold k < rold − rmin (7) with τnew being a new, valid detector having the center cnew , and the old detector τold with center cold and radius rold . cnew is drawn uniformly from within the sphere of τold (Fig. 1). Similar to the previously described approach on the original data, a newly generated center is compared to the new training data and maintained if it is not too close to it. Because the new detector can not exceed the original one, the maximum radius rmax the new detector can have is the difference between the old radius and the distance of the two centers: rmax = rold − kcold − cnew k . (8) Both, the RBF-based approach and NS, provide stable error detection results. However, they are computationally intensive. While training may take a long time, the decision whether the actual data represents a fault has to be executed at a 1 kHz rate (for the LWR). In order to meet the required decision time frame, but still benefit from considering positive and negative data, we use the two maps as labled data to train an SVM, reverting to the standard libsvm [6] implementation. We have chosen Gaussian kernels for the SVM and the parameters—C for misclassification punishment and σ 2 for kernel width—are determined by using a grid search together with cross validation. D. Upgrade for High-Dimensional Data The previously defined RADS approach works well for low-dimensional observations. In order to evaluate complex behavior over time, the dimensionality increases to several hundreds of dimensions, leading to two major drawbacks. On the one hand, calculations in the high-dimensional data space are expensive, prohibiting online evaluation of the data points. On the other hand, typical distance metrics like the Euclidean, respectively Mahalanobis distance have little meaning in high dimensions [1], [26]. 1) Dimensionality Reduction: The simplest way to overcome the problem of relevance in high dimensionality is to use a distance metric that is less influenced by the number of dimensions than the Euclidean distance. [1] introduce fractional distances as a promising alternative. Switching the norm may improve the discrimination of data, but it does not solve the problem of computational complexity. In contrast, dimensionality reduction can decrease the time required for testing, as d decreases. However, this introduces problems of its own. The only data available during training is valid data, therefore it is unknown how an error will appear. While projecting the known data onto a lowerdimensional manifold is straightforward, it is impossible to predict the projection behavior of errors. Projection might place a negative sample into the area of positive data, as depicted in Fig. 2a, making a decision about validity impossible. In order to benefit from projection, back projection is essential. Back projection tries to move the data point back to its original position in the high-dimensional data space. If the point is an inlier, back projection will work fine and the point is returned to a position close to its initial location. However, an outlier projected onto the manifold will be reprojected to a part of the space related to inliers (cf. Fig. 2b). During training the system learns the regular back projection error. If a test error surpasses the acceptable value, the point is considered an outlier. alarming difference typical error range PI PII (a) Problems of dimensionality reduction: The correlated, blue 2D data points are projected onto a single dimension (green). However, the yellow outlier is also projected into the same area (red). (b) Reprojection overcomes lack of projection: The green data is reprojected into the 2D space. A small error remains after reconstruction. The outlier is projected into the same area. RBF NS Dataflow during training Fig. 2. Projection vs. Reprojection 2) Linear Projection: One type of dimensionality reduction is linear projection. The originally many dimensions are mapped onto a lower-dimensional subset of dimensions, e.g. through Principal Component Analysis (PCA) [4]. The transformation matrix P, expressing the principal components in the original data space, is determined from the training data. Reprojection is achieved using its pseudoinverse P† . Since P is constant, P† is calculated offline. Untrained valid data should behave similarly to training data. Since projection is capable of projecting untrained, but comparably behaving data, it decreases the required amount of training setups and improves generalization. Linear projection is well-suited to handle direct relations like the dependency between desired and actual current and can diminish constant values. However, PCA cannot exploit non-linear constraints, e.g. between joint angle and Cartesian position. Thus, a projection covering most of the variance will need more principal components than latent variables exist. However, our initial results using autoencoders and stacked autoencoders [11] for non-linear projection do not exhibit any advantage within the data that was available. Thus, we confine ourselves to linear projection. 3) Combining RADS with Dimensionality Reduction: Although observing the reconstruction error from dimensionality reduction is auspicious, it is not sufficient. If an outlier is projected to a point outside of the projected training data, reprojection might locate it close to its original position. The reconstruction error would be in the acceptable range. Combining dimensionality reduction with the previously introduced RADS in RADS for High-Dimensional Data (HDRADS) merges the advantages of both tools. Dimensionality reduction and reprojection decrease the time required for computation and identify outliers projected into the training data. The SVM discriminates the space of projected training data from the unknown projection space. Besides, projecting the data will reduce the number of required training samples due to improved generalization. It suffices to train on a few different configurations of the robot, Dataflow during application SVM Fig. 3. Setup of HDRADS: First, projection PI and reprojection PII are computed. Using PI the data is reduced and the basic RADS is applied. For evaluations the data is projected, evaluated with the SVM, and reprojected. If either the SVM or reprojection identify an anomaly, an alarm is issued. rather than having to learn every possible setup, in addition reducing the need for retraining. Fig. 3 depicts the combined setup. Data generated by the robot is recorded for training. First, a projection (PI ) and the corresponding backprojection (PII ) are calculated. Afterwards, the training data is reduced using PI . The reduced data set is employed to create the positive (RBF) and negative (NS) maps, which are used to train the SVM. During the application, the data is projected with PI , evaluated with the SVM, and reprojection with PII . If either the SVM or reprojection classify the data as outlier an alarm is issued. IV. E XPERIMENTS In the following, all five methods (positive map alone, negative map alone, RADS, reprojection alone, and HDRADS) are compared with regard to their anomaly detection performance. All tests have been conducted for an LWR [3]. A. Experimental Setup Initial tests have been conducted running a simulation of the robot with 71 computed data dimensions. Several pointto-point trajectories have served as training data. Perturbing the simulation during test runs (adding constants and sine waves to single dimensions, or replacing data by constants) on the same trajectories has clarified, that unusual changes in a single dimension suffice to cause a RADS alarm while undisturbed data is recognized as valid—we can achieve true positive rates of 1, yet keeping the false positive rate at 0. Although using the simulation facilitates the assessment according to a ground truth and the exact identification of false positive and negative rates, it is difficult to introduce realistic faults. Just to name a few of the necessary considerations: Which data dimensions are affected simultaneously? How is the correlation between these influences? What is the B. Results of Basic RADS and its Components Tab. II displays the percentage of data points that have been considered anomalous in the full state test data using the positive and negative map and the SVM, resp. RADS. The amount of resulting reactions for the enhanced state data are higher, but cover the same areas. For the validation set the same randomly selected 30% of the training data have been withheld from the algorithms during training. The results on the validation set show that the algorithm has a low rate of false positives if the trajectory is known, as none of the validation data points should be unknown. The table TABLE II A NOMALY ALARMS IN PERCENT OF THE NUMBER OF DATA POINTS . PARAMETERS USED FOR THE EVALUATION : POSITIVE MAP : a = 0.2, α = 0.05; NEGATIVE MAP : a = 0.2, rmin = 0.2, c = 0.99; SVM: C = 1, σ 2 = 1/d Test Set validation faster slower control load collision RBF 0 19.8 16.5 0.2 100 3.2 NS 0.1 15.1 27.1 1.9 100 1.9 SVM 0 0.001 0.01 0 100 1.3 scaled torque Anomaly Detection on Full State Collision Data with Positive Map 2 0 2 2 3 4 5 6 time step 4 x 10 scaled torque Anomaly Detection on Full State Collision Data with Negative Map 2 0 2 2 3 4 5 6 time step 4 x 10 Anomaly Detection on Full State Collision Data with SVM scaled torque a realistic fault behavior? Hence, the evaluations would forfeit validity and relevance, and we refrained from extending the simulation-based evaluation. In order to assess the anomaly detection performance on authentic data, the data produced by a real robot during the motion along given trajectories has been recorded. To cover a wide range of the data space each of the robot joints has been moved from upper to lower joint limit and vice versa. The robot has remained at the limits for five seconds, before backtracing the trajectory. Testing is performed against different recordings of the same trajectories. Generating real, yet predictable faults— to retrieve a ground truth—would require arbitrary damage to the hardware. However, that is without the bounds of possibility. Thus, we change hardware specific parameters: we increase and decrease in the maximum velocity, change the load applied to the endeffector, or switch the controller of the robot. Besides, in order to simulate acute and nonconstant faults, collisions have deliberately been induced during an additional recording of the same trajectories by gently hitting the robot. Without supplementary generated dimensions (full state), the data has 63 dimensions comprising control input and measurements. In order to observe the relationship between succeeding data points, the enhanced state set further contains first and second order derivatives of each dimension— except for quasi-constant dimensions like the currently applied load—and consists of 176 dimensions. As stated in [14] using the time derivative is reasonable, as when dealing with robots often the changes in states are of greater interest than the actual states. Each method is trained separately with full and enhanced state data. 2 0 2 2 detected anomaly torque in joint 1 Fig. 4. 3 4 time step 5 6 4 x 10 Anomaly sensitivity of the different RADS components further indicates, that an error has to be more pronounced to be detected by the SVM. The two preliminary stages assume that the majority of the movements are anomalous when a different velocity is used since currents, etc. vary from the trained data. They only accept static positions, which we expect to have the similar data values as in the training speed. The SVM, in contrast, fails to identify the comparatively marginal deviation. Whereas applying a different load to the endeffector influences the data dimensions more strongly and thus the entire sequence is considered anomalous. Of course, the load affects each data point in the test set, and thus anomalies on the entire sequence are expected. Fig. 4 depicts a segment of the collision trajectory evaluated with the RBF-based approach, NS, and an SVM. The orange bars in the background indicate a detected anomaly. The black graph in front displays the torque in joint one. Spikes in the graph hint at an induced collision. Even if the collision is induced at a distal link, it will be apparent in proximal joints. Thus, all ascertainable collisions of this setup are visible in these graphs. The positive map and the SVM detect all collisions. Yet, the positive map shows a slightly higher amount of false negatives. NS lags behind the other two algorithms: while the robot is moving, the rate of false positives is much higher. Solely applying NS, a sufficient amount of valid training data would have to be higher. However, the RBF-based version is too computationally expensive, and the SVM cannot be trained without NS. Using the information from both preliminary stages, the SVM is able to differentiate between valid and anomalous states most accurately. Using the enhanced state data the algorithm shows a marginal change of behavior. Due to the filter applied to smooth the data, the algorithm recognizes errors whenever they impact the filtered data. Using RADS with both data sets concurrently, the enhanced state data observation will react with a slight delay until a sufficient impact of the disturbance PCA error on enhanced state data PCA error on full state data scaled torque reconstruction error Anomaly Detection on Full State Collision Data with HDRADS on Two Speeds Mean Reconstruction Error after PCA 100 50 0 0 10 20 30 40 50 2 0 2 retained dimensions 2 3 4 Fig. 5. PCA reconstruction error on training data vs. retained dimensions. The reconstruction errors between 50 and 63 retained dimensions for the full state data and between 50 and 176 for the enhanced state data are very close to zero, and are cropped for better visibility of the changes in lower dimensions. Anomaly Detection on Full State Collision Data with PCA20 Reconstruction Error Fig. 7. 5 time step detected anomaly torque in joint 1 6 4 x 10 Anomaly alarms from HDRADS during collision trajectory overgeneralization is avoided. Fig. 7 displays the detected anomalies on a collision trajectory at a trained speed. scaled torque 2 E. Evaluation of Experiments 1 0 1 2 2 3 4 5 6 time step 4 x 10 scaled torque Anomaly Detection on Full State Collision Data with PCA40 Reconstruction Error 2 0 2 2 detected anomaly torque in joint 1 Fig. 6. 3 4 5 time step 6 4 x 10 Anomaly sensitivity based on the number of retained dimensions is perceived. It will also notice a stop of the disturbance a little later than the full state observation because the filter needs to compensate the error impact. C. Results using the Reprojection Error As expected, the reconstruction error increases as fewer dimensions are retained during projection. Fig. 5 shows the reconstruction error resulting from PCA depending on the number of maintained dimensions. The anomaly detection capabilities of (re-)projection alone are depicted in Fig. 6. The graph on top results from PCA with 20 retained dimensions, the one below from retaining 40 dimensions. The fewer dimensions are retained, the higher the typical error. Thus, to be identified as an error, the data points have to diverge more leading to smaller areas of detection. As in RADS, the enhanced state data have resulted in the same areas of detection—more pronounced due to filtering. D. Results of HDRADS For the evaluation of its generalization capability, HDRADS is trained with the same trajectory recorded at two different speeds, 0.2 m/s and 0.4 m/s. For the evaluation a third velocity, 0.6 m/s, is introduced. Neither reprojection nor RADS on the reduced data return any anomalies, although only two of all possible velocities have been used for training—the false positive rate is at 0. The generalization of the algorithm improves, as it learns to generalize between different setups, in turn raising the alarm less often. Although fewer false positives are detected, Tab. III assesses HDRADS and its precursors with regard to the six requirements presented in Sec. II. The RBFbased approach identifies anomalies well while recognizing positive data, but is too computationally intensive in the test phase. NS detects too many anomalies or would require a higher amount of training data, but is faster at evaluating test points. Using an SVM, we can further accelerate the evaluation. However, taken by itself adapting the classifier is tedious and generalization is limited as we only have positive training data. Besides, for many training points, the decision time of SVMs increases with the number of support vectors, respectively training points. The drawback of depending on two distinct classes is overcome by training RADS with the positive and the negative model. This way we increase robustness and generalization while rendering model updates easy. Using only the two maps to train the SVM its complexity is forced to be at most that of the single maps. The major problem—little generalization between different data parameters—can be overcome by including projection. Yet, projection and reprojection alone are not fail safe as anomalies that are not projected to the positive data space might not be identified as such. Combining the entire setup the most accurate anomaly detection even in high-dimensional data is achieved, and the generalization compares to that of projection. Additionally, all parts of the algorithm can be improved with subsequent data using the available positive and negative maps, without requiring the initial training data. The dimensionality reducTABLE III F ULFILMENT OF THE REQUIREMENTS (R) INTRODUCED IN S EC . II. 1= ROBUSTNESS AND TRADEOFF , 2= GENERALIZATION , 3= ADAPTABILITY, 4= COMPLEXITY, 5= INDEPENDENCE , 6= PARAMETER MINIMIZATION , ++= FULL COMPLIANCE , += GOOD , 0= SATISFACTORY, −= UNSATISFACTORY, −−= DEFICIENT R 1 2 3 4 5 6 RBF + + ++ −− + + NS 0 + ++ − + + SVM + + − + − + RADS ++ + ++ ++ + 0 PCA + ++ 0 ++ + + HDRADS ++ ++ ++ ++ ++ 0 tion incorporated in HDRADS further ensures fast training of the single maps. In case dimensionality reduction has to be improved, the reprojected positive and negative maps are sufficient to improve the mapping and reestimate the reconstruction error. Since HDRADS applies dimensionality reduction and could use both, positive and negative data for training, it is the most independent algorithm of those presented. A minor flaw of HDRADS is related to the least important requirement. Parameters for projection, the positive and negative map, and the SVM have to be identified. Concluding, for low-dimensional data depending on few parameters, RADS performs well. It ensures sufficiently fast detection of unknown errors to run in parallel to the test robot. The higher the initial dimensionality, the more important the application of HDRADS to ensure satisfactory generalization in the increased parameter space. None of the basic algorithms can outperform RADS or its enhanced version. V. C ONCLUSION By means of its redundant setup HDRADS is capable of detecting unknown anomalies reliably. While the system generalizes well enough to classify data points similar to the training data as known, anomalous divergence from the training data will cause an alarm. Supported by the projection and reprojection steps, the algorithm can handle high-dimensional inputs and can generalize between different inputs, thereby greatly reducing the number of required training samples. The algorithm meets the requirements introduced by [16]. Yet, most testing has been performed on repeating trajectories, ensuring the applicability of RADS especially for industrial robots. Tracing predefined trajectories containing critical setups before putting the robot in service can ensure system operability if training all possible configurations is infeasible. Moreover, instead of providing pre-specified derivatives to the algorithm, it might be helpful to consider other features calculated on the latest data. Alternatively, investigating the preformance when providing data of entire time windows rather than single points in time to the algorithm might show strong increases in performance. Using HDRADS on windowed data would automatically account for data smoothing and identify correlations between time steps that we are not aware of. In the following, continuous evaluation of HDRADS in parallel to state-of-the-art fault detection algorithms has to demonstrate the performance on subtle and or slinking faults, while facilitating the assessment of the effective benefit. The projection mechanisms introduced in Ch. III-D are sufficient for the test cases. Yet, only a small set of all possible techniques has been considered. Further investigation could reveal more suitable methods. R EFERENCES [1] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the Surprising Behavior of Distance Metrics in High Dimensional Space,” in Proc. 8th Int. Conf. Database Theory, 2001, pp. 420–434. [2] C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional data,” in Proc. 2001 ACM Int. Conf. Management of Data, 2001, pp. 37–46. [3] A. Albu-Schäffer, S. Haddadin, C. Ott, A. Stemmer, T. Wimböck, and G. Hirzinger, “The DLR Lightweight Robot: Design and Control Concepts for Robots in Human Environments,” Industrial Robot: An Int. Jour., vol. 34, pp. 376–385, 2007. [4] C. M. Bishop, Pattern Recognition And Machine Learning, 1st ed. Springer, 2006. [5] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection: A Survey,” ACM Computing Surveys, vol. 41, pp. 15:1–15:58, 2009. [6] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Trans. Intelligent Systems and Technology, vol. 2, pp. 1–27, 2011. [7] D. Dasgupta and S. Forrest, “Novelty Detection in Time Series Data using Ideas from Immunology,” in Proc. 5th Int. Conf. Intelligent Systems, 1995. [8] Z. Ferdousi and A. Maeda, “Unsupervised Outlier Detection in Time Series Data,” in Proc. 22nd Int. Conf. Data Engineering Wksps., 2006, pp. 51–56. [9] P. Goel, G. Dedeoglu, S. I. Roumeliotis, and G. S. Sukhatme, “Fault Detection and Identification in a Mobile Robot using Multiple Model Estimation and Neural Network.” in IEEE Int. Conf. Robotics and Automation, 2000, pp. 2302–2309. [10] S. Hawkins, H. He, G. Williams, and R. Baxter, “Outlier Detection Using Replicator Neural Networks,” in Data Warehousing and Knowledge Discovery, ser. Lecture Notes in Computer Science, Y. Kambayashi, W. Winiwarter, and M. Arikawa, Eds. Springer, 2002, vol. 2454, pp. 170–180. [11] G. E. Hinton and R. R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,” Science, vol. 212, pp. 504–507, 2006. [12] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Exteme Learning Machine: Theory and Applications,” Neurocomputing, vol. 70, pp. 489–501, 2006. [13] R. Isermann, Fault-Diagnosis Applications—Model-Based Condition Monitoring: Actuators, Drives, Machinery, Plants, Sensors and Faulttolerant Systems. Springer, 2011. [14] E. Khalastchi, M. Kalech, G. A. Kaminka, and R. Lin, “Online anomaly detection in unmanned vehicles,” in Proc. of 10th Int. Joint Conf. Autonomous Agents and Multi-Agent Systems, 2011, pp. 115– 122. [15] S. P. Lloyd, “Least Sqares Quantization in PCM,” IEEE Trans. Information Theory, vol. 28, pp. 129–137, 1982. [16] M. Markou and S. Singh, “Novelty Detection: A Review - Part 1: Statistical Approaches,” Signal Processing, vol. 83, pp. 2481–2497, 2003. [17] T. Martinetz and K. Schulten, “A ”Neural-Gas” Network Learns Topologies,” Artificial Neural Networks, vol. I, pp. 397–402, 1991. [18] R. L. Plackett, “Karl Pearson and the Chi-Squared Test,” Int. Statistical Review, vol. 51, pp. 59–72, 1983. [19] B. Rinner, “Detecting and Diagnosing Faults (Guest Editor’s introduction),” Telematik, vol. 2, pp. 6–8, 2002. [20] B. Rinner and U. Weiss, “Online Monitoring of Hybrid Systems using Imprecise Models,” in Proc. 5th IFAC Symp. Fault Detection, Supervision and Saftey of Technical Processes, 2003, pp. 1811–1822. [21] R. Schneider and P. M. Frank, “Fuzzy Logic Based Threshold Adaption for Fault Detection in Robots,” in Proc. 3rd IEEE Conf. Control Applications, 1994, pp. 1127–1132. [22] B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the Support of a High-Dimensional Distribution,” Neural Computation, vol. 13, pp. 1443–1471, 2001. [23] J. Stoustrup, H. Niemann, and A. la Cour-Harbo, “Optimal Threshold Functions for Fault Detection and Isolation,” in Proc. American Control Conf. 2003, vol. 2, 2003, pp. 1782–1787. [24] M. Wang and R. Dearden, “Detecting and Learning Unknown Fault States in Hybrid Diagnosis,” in 20th Int. Wksp. Principles of Diagnosis, 2009, pp. 19–26. [25] K. Zhou and L. Liu, “Unknown Fault Diagnosis for Nonlinear Hybrid Systems Using Strong State Tracking Particle Filter,” in Int. Conf. Intelligent System Design and Engineering Application, 2010, pp. 850– 853. [26] A. Zimek, E. Schubert, and H.-P. Kriegel, “A Survey on Unsupervised Outlier Detection in High-Dimensional Numerical Data,” Statistical Analysis and Data Mining, vol. 5, pp. 363–387, 2012.

RELATED PAPERS

RELATED TOPICS

Log In

Model-free robot anomaly detection

Model-free robot anomaly detection

Related Papers

RELATED PAPERS

RELATED TOPICS