ISSN 1088-467X (P)
ISSN 1571-4128 (E)
Impact Factor 2024: 0.9
Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing.
In particular, preference is given to papers that discuss the development of new AI-related data analysis architectures, methodologies, and techniques, and their applications to various domains.
Papers published in this journal are geared heavily towards applications, with an anticipated split of 70% applications-oriented papers and 30% more theoretical research. Manuscripts should be submitted in *.pdf format only. Please prepare your manuscripts with single spacing, and include figures and tables in the body of the text where they are referred to. For all enquiries regarding the submission of your manuscript, please contact the IDA journal editor: [email protected]
Abstract: In machine learning, classification involves identifying the categories or classes to which a new observation belongs based on a training set. The performance of a classification model is generally measured by the classification accuracy on a test set. The first step in developing a classification model is to divide an acquired dataset into training and test sets through random sampling. In general, random sampling does not guarantee that test accuracy reflects the performance of the developed classification model. If random sampling produces biased training/test sets, the resulting classification model may be biased as well. In this study, we show the problems of random sampling and propose balanced sampling as an alternative. We also propose a measure for evaluating sampling methods. We perform empirical experiments using benchmark datasets to verify that our sampling algorithm produces proper training and test sets. The results confirm that our method produces better training and test sets than random sampling and several non-random sampling methods do.
Keywords: Classification, training and test sets, accuracy, random sampling, balanced sampling
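To illustrate the kind of class-balanced splitting the abstract contrasts with plain random sampling, here is a minimal stratified train/test split in Python. This is a generic sketch, not the authors' balanced-sampling algorithm: the function name `balanced_split` and the per-class proportional criterion are illustrative assumptions.

```python
import random
from collections import defaultdict

def balanced_split(labels, test_frac=0.3, seed=0):
    """Class-stratified split: each class contributes the same fraction of
    its examples to the test set, so class proportions are preserved.
    (Illustrative sketch, not the paper's exact balanced-sampling method.)"""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# 70/30 class mix stays 70/30 in both splits
labels = ["a"] * 70 + ["b"] * 30
train, test = balanced_split(labels, test_frac=0.3)
```

Plain random sampling, by contrast, can by chance leave a minority class under- or over-represented in the test set, which is exactly the bias the paper targets.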
Abstract: To avoid missing representative features when using machine learning algorithms in stock trading, as many features as possible should be selected. Meanwhile, such high-dimensional features can lead to redundant information and reduce the efficiency and accuracy of learning algorithms. Dimensionality reduction operations (DROs) are one of the main means of dealing with high-dimensional stock data. However, there are few studies on whether a DRO can significantly improve the trading performance of deep neural network (DNN) algorithms. Therefore, this paper selects large-scale stock datasets from the American and Chinese markets as the research objects. For each stock, we first apply four of the most widely used DROs, namely principal component analysis (PCA), least absolute shrinkage and selection operator (LASSO), classification and regression trees (CART), and autoencoder (AE), to the original features, and then use the new features as inputs to six of the most popular DNN algorithms, namely Multilayer Perceptron (MLP), Deep Belief Network (DBN), Stacked Auto-Encoders (SAE), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU), to generate trading signals. Finally, we apply the trading signals in extensive daily-trading back-testing and non-parametric statistical testing. The experiments show that LASSO can significantly improve the performance of RNN, LSTM, and GRU. In addition, none of the DROs considered in this paper significantly improves the trading performance or the speed of generating trading signals for the other DNN algorithms.
Keywords: Deep neural networks, dimensionality reduction, statistical test, trading performance
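As a rough sketch of the DRO step described above, PCA can compress a wide block of raw per-stock indicators into a handful of inputs for a downstream network. This is a generic SVD-based PCA, not the paper's pipeline; the shapes (200 samples, 50 raw features, 10 components) are illustrative assumptions.

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X onto the top-k principal components (SVD-based PCA).
    Illustrative stand-in for a DRO applied before the DNN inputs."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # (n_samples, k) reduced features

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # e.g. 50 raw per-stock indicators (made up)
Z = pca_reduce(X, k=10)          # compressed to 10 inputs for the DNN
```

The same interface could be swapped for LASSO-based selection or an autoencoder bottleneck, which is the comparison the paper actually performs.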
Abstract: Research on content-based image retrieval (CBIR) has been under development for decades, and numerous methods have competed to extract the most discriminative features for improved representation of image content. Recently, deep learning methods have gained attention in computer vision, including CBIR. In this paper, we present a comparative investigation of different features, including low-level and high-level features, for CBIR. We compare the performance of CBIR systems using different deep features with state-of-the-art low-level features such as SIFT, SURF, HOG, LBP, and LTP, using different dictionaries and coefficient learning techniques. Furthermore, we conduct comparisons with a set of primitive and popular features that have been used in this field, including colour histograms and Gabor features. We also investigate the discriminative power of deep features using certain similarity measures under different validation approaches, and examine the effects of dimensionality reduction of deep features on the performance of CBIR systems using principal component analysis, discrete wavelet transform, and discrete cosine transform. The experimental results demonstrate unprecedentedly high mean average precisions (95% and 93%) when using the VGG-16 FC7 deep features of the Corel-1000 and Coil-20 datasets with 10-D and 20-D K-SVD, respectively.
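The retrieval step common to all the feature types compared above can be sketched as a nearest-neighbour search under cosine similarity. This is a minimal generic CBIR ranking loop, not the paper's system; `retrieve` and the toy feature vectors are illustrative assumptions.

```python
import numpy as np

def retrieve(query_feat, db_feats, top_k=5):
    """Rank database images by cosine similarity of their feature vectors
    (works the same for deep features such as VGG-16 FC7 or for low-level
    descriptors such as colour histograms)."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q                    # cosine similarity to every database image
    return np.argsort(-sims)[:top_k] # indices of the top-k matches

# toy database of 4 feature vectors; the query matches image 1 exactly
db = np.eye(4)
idx = retrieve(np.array([0.0, 1.0, 0.0, 0.0]), db, top_k=2)
```

Mean average precision, the metric reported in the abstract, is then computed over the ranked lists returned by such a function.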
Abstract: In meteorology, ensemble forecasting aims to post-process the forecasts of multiple ensemble members to make better weather predictions. Although multiple individual forecasts are generated to represent the uncertainty of the weather system, the performance of raw ensemble forecasts is often unsatisfactory. In this paper we conduct data analysis based on the expertise of human forecasters and introduce a machine learning method for ensemble forecasting. The proposed method, Label Distribution Learning with Climate Probability (LDLCP), can improve the accuracy of both deterministic and probabilistic forecasting. LDLCP uses the relevant variables of previous forecasts to construct the feature matrix and applies label distribution learning (LDL) to adjust the probability distribution of the ensemble forecast. Our proposal is novel in its specialized target function and appropriate conditional probability function for the ensemble forecasting task, which can optimize the forecasts to be consistent with the local climate. Experimental testing is performed on both artificial data and a dataset for ensemble forecasting of precipitation in East China from August to November 2017. Experimental results show that, compared with a baseline method and two state-of-the-art machine learning methods, LDLCP performs significantly better on RMSE and average continuous ranked probability score.
Keywords: Ensemble forecasting, label distribution learning, post-processing, domain knowledge
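The continuous ranked probability score (CRPS) used to evaluate the probabilistic forecasts above has a standard empirical form for a finite ensemble: E|X − y| − ½·E|X − X′|, where X, X′ are independent draws from the ensemble and y is the observation. A minimal sketch (generic formula, not the paper's evaluation code):

```python
import numpy as np

def ensemble_crps(members, obs):
    """Empirical CRPS of an ensemble forecast against one observation:
    mean |x_i - y|  -  0.5 * mean over all pairs |x_i - x_j|.
    Lower is better; it rewards forecasts that are both sharp and calibrated."""
    x = np.asarray(members, dtype=float)
    term1 = np.abs(x - obs).mean()
    term2 = 0.5 * np.abs(x[:, None] - x[None, :]).mean()
    return term1 - term2

# a well-centered ensemble scores better than an equally sharp biased one
good = ensemble_crps([4.0, 6.0], obs=5.0)
bad = ensemble_crps([7.0, 9.0], obs=5.0)
```

For a single-member (deterministic) forecast the pairwise term vanishes and CRPS reduces to the absolute error, which is why it can compare deterministic and probabilistic forecasts on one scale.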
Abstract: Measuring the distance between two mixed data objects is the basis of many learning algorithms. The complex relevance between heterogeneous attributes (of various types and scales) has a significant influence on the measured results. In this paper, we propose an End-to-End Distance Measuring method for mixed data based on deep relevance learning, called E2DM. Existing methods confuse the attribute space by mapping discrete attribute values to new continuous values, or discretize continuous attribute values without considering the relevance. In contrast, E2DM operates directly on the original data, performing data conversion and relevance learning simultaneously to avoid information loss and attribute-space confusion. E2DM first estimates an intra-attribute relevance-influenced distance (i.e., relevance within an attribute) by considering categorical attribute value frequencies and by mapping numerical attribute values into multiple bins. It then takes a wrapper approach to iteratively optimize the relevance-influenced distance and the bin boundaries, using a Frobenius-norm deviation as its objective function. A Co-occurrence Mover's Distance is proposed to explicitly explore the relevance between attributes in each iteration. Finally, the distance for numerical attribute values is refined based on the original values and the centers of the bins into which they fall. Experimental results on a number of real-world datasets demonstrate that E2DM outperforms the state-of-the-art methods.
Keywords: Distance measuring, mixed data, deep relevance learning
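For context on what a mixed-data distance looks like, here is the classic Gower-style baseline that methods like E2DM aim to improve on: numeric attributes contribute a range-normalized absolute difference, categorical ones a 0/1 mismatch. This is the standard textbook formula, not E2DM itself; the argument layout (`is_numeric`, `ranges`) is an illustrative assumption.

```python
def gower_distance(a, b, is_numeric, ranges):
    """Gower-style distance between two mixed records.
    is_numeric[i] marks attribute i as numeric; ranges[i] is its value range
    (ignored for categorical attributes). Result is in [0, 1]."""
    total = 0.0
    for x, y, num, r in zip(a, b, is_numeric, ranges):
        if num:
            total += abs(x - y) / r if r else 0.0  # normalized numeric gap
        else:
            total += 0.0 if x == y else 1.0        # categorical mismatch
    return total / len(a)

# one numeric attribute (range 4.0) and one categorical attribute
d = gower_distance((1.0, "red"), (3.0, "red"), (True, False), (4.0, None))
```

Note how this baseline treats every attribute independently; E2DM's contribution is precisely to learn the inter-attribute relevance this formula ignores.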
Abstract: Distant supervision for relation extraction aims to automatically obtain a large number of relational facts as training data, but it often leads to the noisy-label problem. In this paper, we propose a self-directed confidence learning based latent-label denoising method for distantly supervised relation extraction. Concretely, a self-directed algorithm that combines the semantic information of model predictions and distant supervision is designed to predict the confidence score of latent labels. Since this mechanism uses the obtained latent labels of easy examples to produce the latent labels of hard examples step by step, it is a robust and reliable learning process. Besides, it facilitates dynamic exploration of the confidence space to achieve better denoising performance. Moreover, to cope with the common imbalance problem in large corpora, where negative instances account for a much larger percentage, we introduce a discriminative loss function to address the misclassification between non-relational and relational instances. Empirically, to verify the generality of the proposed denoising method, we use different neural models (CNN, PCNN, and BiLSTM) for representation learning. Experimental results show that our method can correct noisy labels with high accuracy and outperforms state-of-the-art relation extraction systems.
Abstract: Community structure, a foundational concept in understanding networks, is one of the most important properties of dynamic networks. Many of the proposed dynamic community detection methods are based on the temporal smoothness framework, which holds that abrupt changes of clustering within a short period are undesirable. However, how to improve community detection performance by incorporating network topology information over a short period is a challenging problem, and previous efforts to exploit such properties are insufficient. In this paper, we introduce the geometric structure of a network to represent temporal smoothness over a short time and propose a novel Dynamic Graph Regularized Symmetric NMF method (DGR-SNMF) to detect communities in dynamic networks. This method fully incorporates geometric structure information into the current detection process via Symmetric Non-negative Matrix Factorization (SNMF). We also prove the convergence of the iterative update rules by constructing auxiliary functions. Extensive experiments on multiple synthetic networks and two real-world datasets demonstrate that the proposed DGR-SNMF method outperforms state-of-the-art algorithms at detecting dynamic communities.
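To make the SNMF building block concrete: symmetric NMF factors a nonnegative adjacency matrix A as A ≈ HHᵀ with H ≥ 0, and rows of H act as soft community memberships. The sketch below uses a generic damped multiplicative update for plain SNMF; it does not include the paper's dynamic graph regularizer, and the damping constant and iteration count are illustrative assumptions.

```python
import numpy as np

def snmf(A, k, iters=200, seed=0, beta=0.5):
    """Plain symmetric NMF: minimize ||A - H @ H.T||_F with H >= 0 via the
    damped multiplicative update H <- H * ((1-beta) + beta * (A H)/(H H^T H)).
    (Generic sketch, without DGR-SNMF's graph regularization term.)"""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))
    for _ in range(iters):
        num = A @ H
        den = H @ (H.T @ H) + 1e-12   # small epsilon avoids division by zero
        H *= (1 - beta) + beta * num / den
    return H

# tiny graph with two obvious blocks: a triangle {0,1,2} and an edge {3,4}
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
H = snmf(A, k=2)
memberships = H.argmax(axis=1)   # harden soft memberships into community labels
```

The paper's method adds a graph-regularization term to this objective so that consecutive snapshots of the dynamic network yield smoothly varying H matrices.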
Abstract: Access to copious amounts of information has reached unprecedented levels and can generate very large data sources. These big data sources often contain a plethora of useful information but, in some cases, finding what is actually useful can be quite problematic. For binary classification problems, such as fraud detection, a major concern is class imbalance: a dataset has more of one label than another, such as a large number of non-fraud observations with comparatively few observations of fraud (which we consider the class of interest). Class rarity further delineates class imbalance, with significantly smaller numbers in the class of interest. In this study, we assess the impacts of class rarity in big data and apply data sampling to mitigate some of the performance degradation caused by rarity. Real-world Medicare claims datasets with known excluded providers used as fraud labels form a fraud detection scenario incorporating three machine learning models. We discuss the data processing and engineering steps necessary to understand, integrate, and use the Medicare data. From these already imbalanced datasets, we generate three additional datasets representing varying levels of class rarity. We show that, as expected, rarity significantly decreases model performance, but that data sampling, specifically random undersampling, can help significantly with rare-class detection in identifying Medicare claims fraud.
Keywords: Big data, Medicare fraud detection, class imbalance, data sampling, rare classes
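Random undersampling, the technique the study found most helpful, simply discards majority-class examples until the classes are balanced. A minimal sketch (generic technique; the fully balanced 1:1 target ratio is an illustrative assumption, not necessarily the ratio used in the study):

```python
import random

def random_undersample(X, y, seed=0):
    """Randomly keep only as many examples of each class as the rarest class
    has, discarding the surplus majority-class examples."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(v) for v in by_class.values())
    Xs, ys = [], []
    for yi, xs in by_class.items():
        for xi in rng.sample(xs, n_min):   # random subset of each class
            Xs.append(xi)
            ys.append(yi)
    return Xs, ys

# 95 non-fraud vs 5 fraud examples -> 5 of each after undersampling
X = list(range(100))
y = [0] * 95 + [1] * 5
Xs, ys = random_undersample(X, y)
```

The trade-off is that discarded majority examples carry some information loss, but with big data there are usually plenty of majority examples to spare, which is part of why undersampling works well in this setting.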
Abstract: The purpose of this study is to present a novel method that can objectively identify the subjective perception of tonic pain. To achieve this goal, scalp EEG data are recorded from 16 subjects under a cold-stimulus condition. The proposed method is capable of classifying four classes of tonic pain states: No Pain, Minor Pain, Moderate Pain, and Severe Pain. Owing to the multi-class nature of the problem, an extended Common Spatial Pattern (ECSP) method is first proposed for accurately extracting features of tonic pain from the captured EEG data. Then, a single-hidden-layer feedforward network is used as a classifier for pain identification, trained with the extreme learning machine (ELM) algorithm. The ELM-based classifier can obtain an optimal and well-generalized solution for multi-class tonic cold pain. Experimental results demonstrate that the proposed method discriminates tonic pain successfully. Additionally, comparisons with the well-known support vector machine (SVM) method show that the ELM-based classifier outperforms the SVM-based classifier. These findings may pave the way for a direct and objective measure of the subjective perception of tonic pain.
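The ELM training scheme mentioned above has a very compact core: hidden-layer weights are drawn at random and frozen, and only the output weights are solved in closed form by least squares, with no backpropagation. The sketch below uses a made-up two-class toy problem standing in for the four EEG pain states; the hidden size and data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def elm_train(X, Y, n_hidden=50, seed=0):
    """Extreme learning machine: random fixed hidden layer, output weights
    fitted in one shot by least squares (Moore-Penrose pseudoinverse)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights, frozen
    b = rng.normal(size=n_hidden)                 # random biases, frozen
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    beta = np.linalg.pinv(H) @ Y                  # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# toy, well-separated two-class data (stand-in for ECSP feature vectors)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.3, (40, 4)), rng.normal(1, 0.3, (40, 4))])
Y = np.vstack([np.tile([1, 0], (40, 1)), np.tile([0, 1], (40, 1))])
W, b, beta = elm_train(X, Y)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
```

Because training reduces to a single pseudoinverse, ELM is typically much faster to train than an iteratively optimized SVM or backpropagated network, which is part of the appeal the abstract alludes to.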