
    Joshua Zhexue Huang

    In this paper, an Observation Points Classifier Ensemble (OPCE) algorithm is proposed to deal with High‐Dimensional Imbalanced Classification (HDIC) problems based on data processed using the Multi‐Dimensional Scaling (MDS) feature extraction technique. First, dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible. Second, a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low‐dimensional data space. Third, optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples. Exhaustive experiments have been conducted to evaluate the feasibility, rationality, and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets. Experimental results show that (1) the OPCE algorithm can be trained faster on low‐dimensional imbalanced data than on high‐dimensional data; (2) the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased; and (3) statistical analysis reveals that OPCE yields better HDIC performances on the selected data sets in comparison with eight other HDIC algorithms. This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.
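    As a rough illustration of the MDS preprocessing step described above, the sketch below reduces a toy high-dimensional imbalanced data set with scikit-learn's MDS implementation; the data, sizes, and library choice are assumptions for illustration, and the observation-point ensemble itself is not reproduced.

```python
# Hedged sketch: MDS-based dimensionality reduction as a preprocessing step.
# Data sizes and the use of scikit-learn are assumptions for illustration.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))            # toy high-dimensional data
y = (rng.random(200) < 0.1).astype(int)    # ~10% minority class (imbalanced labels)

# Reduce dimensionality while preserving pairwise distances as well as possible.
mds = MDS(n_components=10, dissimilarity="euclidean", random_state=0)
X_low = mds.fit_transform(X)

print(X_low.shape)  # (200, 10) -- low-dimensional data for the downstream classifier ensemble
```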
    Gene selection is an important step in the analysis of gene data sets in which the number of genes greatly exceeds the number of samples. In this paper, we propose a new method that uses a random forest model to select genes from high dimensional gene data sets. In this method, Breiman's random forest algorithm is first used to generate a random forest model from a high dimensional data set. Then, features appearing in the component tree models of the random forest are analyzed using measures of feature correlation. Features are divided into two sets: those appearing in the roots of the component trees and those appearing in other nodes of the trees. The frequency of the features is calculated, and the features whose frequency is greater than given thresholds are selected as candidates. Finally, the correlation of the candidate features with the class feature is measured with symmetrical uncertainty, and the top features (with the highest symmetrical uncertainty values) are selected. Nineteen gene data sets were used to evaluate the new gene selection method. The comparison results show that the models built with the gene features selected by the new method outperformed other random forest models in classification accuracy.
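    The sketch below illustrates, under stated assumptions, the kind of pipeline described: fit a random forest, count how often each feature splits the root versus other nodes, take the frequent ones as candidates, and rank them by symmetrical uncertainty with the class. The data, thresholds, and discretization are illustrative, not the paper's values.

```python
# Hedged sketch of the described pipeline; thresholds, binning, and data are illustrative.
import numpy as np
from collections import Counter
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mutual_info_score

def symmetrical_uncertainty(x_binned, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) on discretized values."""
    mi = mutual_info_score(x_binned, y)
    hx, hy = entropy(np.bincount(x_binned)), entropy(np.bincount(y))
    return 2.0 * mi / (hx + hy) if hx + hy > 0 else 0.0

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))            # few samples, many genes (toy data)
y = rng.integers(0, 2, size=60)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

root_counts, node_counts = Counter(), Counter()
for est in forest.estimators_:
    feats = est.tree_.feature              # split feature per node; -2 marks leaves
    if feats[0] >= 0:
        root_counts[feats[0]] += 1         # node 0 is the root of the tree
    node_counts.update(f for f in feats[1:] if f >= 0)

# Candidate genes: frequent as root splits or as other internal splits (illustrative thresholds).
candidates = {f for f, c in root_counts.items() if c >= 2}
candidates |= {f for f, c in node_counts.items() if c >= 5}

# Rank candidates by symmetrical uncertainty with the class and keep the top ones.
bin_feature = lambda col: np.digitize(col, np.quantile(col, [0.25, 0.5, 0.75]))
ranked = sorted(candidates, key=lambda f: symmetrical_uncertainty(bin_feature(X[:, f]), y), reverse=True)
print(ranked[:20])
```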
    To break the in-memory bottleneck and facilitate online sampling in cluster computing frameworks, we propose a new sampling-based system for approximate big data analysis on computing clusters. We address both computational and statistical aspects of big data across the main layers of cluster computing frameworks: big data storage, big data management, big data online sampling, big data processing, and big data exploration and analysis. We use the new Random Sample Partition (RSP) distributed data model to store a big data set as a set of ready-to-use random sample data blocks in Hadoop Distributed File System (HDFS), called RSP blocks. With this system, only a few RSP blocks are selected and processed using a sequential algorithm in a distributed data-parallel manner to produce approximate results for the entire data set. In this paper, we present a prototype RSP-based system and demonstrate its advantages. Our experiments show that RSP blocks can be used to get approximate models and summary statistics as well as estimate the proportions of inconsistent values without computing the entire data or running expensive online sampling operations. This new system enables big data exploration and analysis where the entire data set cannot be computed.
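    A minimal, in-memory sketch of the block-level sampling idea follows; the block layout and the statistic being estimated are illustrative stand-ins for RSP blocks stored in HDFS.

```python
# Illustrative sketch of block-level sampling: estimate a statistic from a few
# "RSP blocks" only. Blocks here are in-memory arrays standing in for HDFS files.
import numpy as np

rng = np.random.default_rng(0)
full_data = rng.normal(loc=3.0, scale=2.0, size=1_000_000)

# Pretend the data set is stored as 100 RSP blocks, each a random sample of the whole.
blocks = np.array_split(rng.permutation(full_data), 100)

# Block-level sample: pick a handful of blocks and estimate from them alone.
chosen = rng.choice(len(blocks), size=5, replace=False)
approx = np.mean([blocks[i].mean() for i in chosen])

print("approximate mean:", approx)
print("exact mean:      ", full_data.mean())
```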
    This paper presents a text clustering system built on a k-means type subspace clustering algorithm to cluster large, high dimensional and sparse text data. In this algorithm, a new step is added to the k-means clustering process to automatically calculate the weights of keywords in each cluster so that the important words of a cluster can be identified by their weight values. For understanding and interpretation of the clustering results, a few keywords that best represent the semantic topic are extracted from each cluster. Two methods are used to extract the representative words. The candidate words are first selected according to their weights calculated by our new algorithm. Then, the candidates are fed to WordNet to identify the nouns and to consolidate synonyms and hyponyms. Experimental results show that the clustering algorithm is superior to other subspace clustering algorithms, such as PROCLUS and HARP, and k-means type algorithms, e.g....
    Mining frequent itemsets in transaction databases is an important task in many applications. This task becomes challenging when dealing with a very large transaction database because traditional algorithms are not scalable due to the memory limit. In this paper, we propose a new approach for approximate mining of frequent itemsets in a transaction database. First, we partition the set of transactions in the database into disjoint subsets such that the distribution of frequent itemsets in each subset is similar to that of the entire database. Then, we randomly select a set of subsets and independently mine the frequent itemsets in each of them. After that, the frequent itemsets discovered from these subsets are voted on, and those appearing in a majority of the subsets are determined to be frequent itemsets, called popular frequent itemsets. All popular frequent itemsets are compared with the frequent itemsets discovered directly from the entire database using the same frequency threshold. The r...
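    Below is a toy sketch of the partition-and-vote idea, assuming made-up transactions, a size-2 itemset limit, and an illustrative support threshold; it is not the paper's algorithm verbatim.

```python
# Toy sketch: mine frequent itemsets independently in a few random subsets and
# keep those frequent in a majority of them ("popular" frequent itemsets).
from collections import Counter
from itertools import combinations
import random

def frequent_itemsets(transactions, min_support, max_size=2):
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(t), k))
    n = len(transactions)
    return {iset for iset, c in counts.items() if c / n >= min_support}

random.seed(0)
items = "abcdef"
transactions = [set(random.sample(items, random.randint(2, 4))) for _ in range(10_000)]

# Partition transactions into disjoint subsets, then mine a random selection of them.
random.shuffle(transactions)
subsets = [transactions[i::10] for i in range(10)]
mined = [frequent_itemsets(s, min_support=0.2) for s in random.sample(subsets, 5)]

# Vote: an itemset is popular if it is frequent in a majority of the mined subsets.
votes = Counter(iset for result in mined for iset in result)
popular = {iset for iset, v in votes.items() if v > len(mined) / 2}
print(sorted(popular))
```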
    Data classification is an important research topic in the field of data mining. With the rapid development of social media sites and IoT devices, data have grown tremendously in volume and complexity, resulting in many large and complex high-dimensional data sets. Classifying such high-dimensional complex data with a large number of classes has been a great challenge for current state-of-the-art methods. This paper presents a novel, hierarchical, gamma mixture model-based unsupervised method for classifying high-dimensional data with a large number of classes. In this method, we first partition the features of the dataset into feature strata by using k-means. Then, a set of subspace data sets is generated from the feature strata by using the stratified subspace sampling method. After that, the GMM Tree algorithm is used to identify the number of clusters and the initial clusters in each subspace dataset, and these initial cluster centers are passed to k-means to generate base subspa...
    In this paper, we propose a latent feature group learning (LFGL) algorithm to discover the feature grouping structures and subspace clusters for high-dimensional data. The feature grouping structures, which are learned in an analytical way, can enhance the accuracy and efficiency of high-dimensional data clustering. In the LFGL algorithm, the Darwinian evolutionary process is used to explore the optimal feature grouping structures, which are coded as chromosomes in the genetic algorithm. The feature grouping weighting k-means algorithm is used as the fitness function to evaluate the chromosomes or feature grouping structures in each generation of evolution. To better handle the diverse densities of clusters in high-dimensional data, the original feature grouping weighting k-means is revised with the mass-based dissimilarity measure rather than the Euclidean distance measure, and the feature weights are optimized as a nonnegative matrix factorization problem under the orthogonal constraint of featu...
    Selection and identification of single-nucleotide polymorphisms (SNPs) are among the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data is very high dimensional and a large portion of the SNPs in the data is irrelevant to the disease. Advanced machine learning methods have been used successfully in genome-wide association studies (GWAS) to identify genetic variants that have relatively large effects in some common, complex diseases. Among them, the most successful is Random Forests (RF). Despite performing well in terms of prediction accuracy on some data sets of moderate size, RF still struggles in GWAS with selecting informative SNPs and building accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies p-value assessment to find a cut-off point that sep...
    High dimensional data is a common phenomenon in real-world data mining applications. Text data is a typical example. In text mining, a text document is viewed as a vector of terms whose dimension is equal to the total number of unique terms in a data set, which is usually in the thousands. High dimensional data occurs in business as well. In retail, for example, to effectively manage supplier relationships, suppliers are often categorized according to their business behaviors (Zhang, Huang, Qian, Xu, & Jing, 2006). The supplier behavior data is high dimensional, containing thousands of attributes that describe supplier behaviors, including product items, ordered amounts, order frequencies, product quality and so forth. One more example is DNA microarray data. Clustering high-dimensional data requires special treatment (Swanson, 1990; Jain, Murty, & Flynn, 1999; Cai, He, & Han, 2005; Kontaki, Papadopoulos, & Manolopoulos, 2007), although various methods for clustering are available (J...
    We propose a sampling-based method, called RSP-Hist, to construct approximate equi-width histograms and help data scientists explore the probability distribution of big data on Hadoop clusters. In RSP-Hist, the Random Sample Partition (RSP) model is used to store a big data set as ready-to-use random sample data blocks, called RSP blocks, in the Hadoop Distributed File System (HDFS). An approximate histogram is computed by applying a sequential histogram algorithm in parallel to each block in a block-level sample of RSP blocks. Local histograms from individual RSP blocks are combined to produce an approximate histogram for the entire data. We tested RSP-Hist on four data sets using a small computing cluster. In this paper, we demonstrate the effect of the sampling rate and the number of buckets on the histogram accuracy and show that RSP-based approximate histograms are equivalent to the exact histograms computed from the entire data. RSP-Hist can avoid the data correlation issue in HDFS blocks and significantly reduce both computation and communication costs. It enables iterative and interactive exploration of big data sets on small computing clusters and can be used for multivariate data exploration.
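    A minimal sketch of the merge step follows: per-block equi-width histograms computed over shared bucket edges are summed and scaled up. The value range, bucket count, and sampling rate below are assumptions.

```python
# Minimal sketch: per-block equi-width histograms over shared bucket edges are
# summed and scaled up to approximate the full-data histogram.
import numpy as np

rng = np.random.default_rng(0)
blocks = [rng.exponential(scale=2.0, size=100_000) for _ in range(20)]  # stand-ins for RSP blocks

# All local histograms must share the same bucket edges to be mergeable.
edges = np.linspace(0.0, 20.0, num=51)           # 50 equi-width buckets (assumed range)

sampled = blocks[:4]                             # a block-level sample of RSP blocks
local = [np.histogram(b, bins=edges)[0] for b in sampled]

# Combine local counts and scale up by the inverse sampling rate.
approx = sum(local) * (len(blocks) / len(sampled))
exact = np.histogram(np.concatenate(blocks), bins=edges)[0]
print("relative L1 error:", np.abs(approx - exact).sum() / exact.sum())
```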
    To enable the individual data block files of a distributed big data set to be used as random samples for big data analysis, a two-stage data processing (TSDP) algorithm is proposed in this paper to convert a big data set into a random sample partition (RSP) representation, which ensures that each individual data block in the RSP is a random sample of the big data and can therefore be used to estimate the statistical properties of the big data. The first stage of this algorithm is to sequentially chunk the big data set into non-overlapping subsets and distribute these subsets as data block files to the nodes of a cluster. The second stage is to take a random sample without replacement from each subset to form a new subset, which is saved as an RSP data block file; this random sampling step is repeated until all data records in all subsets are used up and a new set of RSP data block files has been created to form an RSP of the big data. It is formally proved that the expectation of the sample distribution function (s.d.f.) of each RSP data block equals the s.d.f. of the big data set; therefore, each RSP data block is a random sample of the big data set. An implementation of the TSDP algorithm on Apache Spark and HDFS is presented. Performance evaluations on terabyte data sets show the efficiency of this algorithm in converting HDFS big data files into HDFS RSP big data files. We also show an example that uses only a small number of RSP data blocks to build ensemble models which perform better than a single model built from the entire data set.
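    The following in-memory sketch captures the two-stage idea under simplifying assumptions (the actual TSDP runs on Spark and HDFS block files); stage two is expressed here by slicing each shuffled subset across all output blocks, which is one simple way to realize repeated sampling without replacement.

```python
# Simplified in-memory sketch of the two-stage idea (the real TSDP operates on
# distributed block files; sizes here are toy assumptions).
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1_000_000)                 # records of a "big" data set in non-random order

# Stage 1: sequentially chunk the data into non-overlapping subsets (block files).
n_blocks = 100
subsets = np.array_split(data, n_blocks)

# Stage 2: shuffle each subset, split it into n_blocks slices, and give one slice
# to every output block, so each RSP block mixes records from all subsets.
slices = [np.array_split(rng.permutation(s), n_blocks) for s in subsets]
rsp_blocks = [np.concatenate([s[i] for s in slices]) for i in range(n_blocks)]

# Each RSP block should now behave like a random sample of the whole data set.
print(rsp_blocks[0].mean(), data.mean())
```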
    In many application areas, the data being generated and processed goes beyond the petabyte scale. Analyzing such an ever-increasing volume of data poses computational as well as statistical challenges. To address these challenges, distributed and parallel processing frameworks have been used to implement scalable data analysis algorithms. Nevertheless, processing the whole big data set at one time may exceed the available computing resources and the time requirements of some applications. Thus, approximate approaches can be used to achieve asymptotic analysis results, especially when data analysis algorithms are amenable to an approximate result rather than an exact one. However, most approximation approaches require taking a random sample of the data, which is a nontrivial task when working with big data sets. In this paper, we employ ensemble learning as an approach to asymptotic analysis using randomly selected subsets (i.e. data blocks) of a big data set. We propose an asymptotic ensemble learning framework which depends on block-based sampling rather than record-based sampling. To demonstrate the feasibility and performance of this framework, we present an empirical analysis on real data sets. In addition to the scalability advantage, the experimental results show that several blocks of a data set are enough to obtain approximately the same results as those from using the whole data set.
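    A small sketch of block-based ensemble learning under illustrative assumptions follows (the data, base model, and block count are made up): train one base model per sampled block and average their predictions.

```python
# Small sketch of block-based ensemble learning: one base model per sampled
# data block, predictions averaged. Data, model, and block count are made up.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# Treat the data set as 50 blocks; train a base model on each of a few sampled blocks.
blocks = np.array_split(np.random.default_rng(0).permutation(len(X)), 50)
sampled = blocks[:5]
models = [LogisticRegression(max_iter=200).fit(X[i], y[i]) for i in sampled]

# Ensemble by averaging predicted probabilities over the block models.
proba = np.mean([m.predict_proba(X) for m in models], axis=0)
print("ensemble accuracy:", ((proba[:, 1] > 0.5) == y).mean())
```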
    Search interface detection is an essential task for extracting information from the hidden Web. The challenge for this task is that search interface data is represented in high-dimensional and sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to solving this problem. In this approach, we have extended the random forest algorithm with a weighted feature selection method to build the individual classifiers. With this improved random forest algorithm (IRFA), each classifier can be learned from a weighted subset of the feature space so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We have compared our ensemble approach with other well-known classification algorithms, such as SVM, C4.5, Naive Bayes, and the original random forest algorithm (RFA). The experimental results show that our method is more effective in detecting search interfaces of the hidden Web.
    This paper proposes a multi-agent Q-learning algorithm called meta-game-Q learning that is developed from the meta-game equilibrium concept. Different from Nash equilibrium, meta-game equilibrium can achieve the optimal joint action in a general-sum game by deliberating on each agent's preferences and predicting the other agents' policies. A distributed negotiation algorithm is used to solve the meta-game equilibrium problem instead of centralized linear programming algorithms. We use the repeated prisoner's dilemma example to empirically demonstrate that the algorithm converges to meta-game equilibrium.
    This paper proposes C3, a new learning scheme to improve classification performance of rare category emails in the early stage of incremental learning. C3 consists of three components: the chief-learner, the co-learners and the combiner. The chief-learner is an ...
    Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This leads to poor accuracy when RFs are applied to high-dimensional data. In addition, RFs are biased in the feature selection process, favoring multivalued features. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features when learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionali...
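    The sketch below illustrates the two ingredients named above under stated assumptions: a p-value filter to drop uninformative features (a chi-squared test is used here purely for illustration) followed by weighted sampling of split candidates. It is not the exact statistical procedure of xRF.

```python
# Illustrative sketch: p-value filtering of features followed by weighted
# sampling of split candidates. The chi-squared test and the 0.05 cut-off are
# assumptions for illustration, not the exact statistics used in xRF.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(300, 5000)))     # chi2 requires non-negative features
y = rng.integers(0, 2, size=300)

# Remove uninformative features via p-value assessment.
scores, pvals = chi2(X, y)
informative = np.where(pvals < 0.05)[0]

# Weighted feature sampling: stronger features are more likely to be drawn as
# split candidates when growing a tree.
weights = scores[informative] / scores[informative].sum()
candidates = rng.choice(informative, size=int(np.sqrt(len(informative))), replace=False, p=weights)
print(len(informative), candidates[:10])
```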
    Transaction processing in the Grid ensures reliable execution of inherently distributed Grid applications. This paper proposes coordination algorithms for handling short-lived and long-lived Grid transactions, and models and analyzes these algorithms with Petri nets. The cohesion transaction can coordinate long-lived business Grid applications by automatically generating and executing compensation transactions to semantically undo committed sub-transactions. From analysis of the
    Fuzzy Soft Subspace Clustering Method for Gene Co-expression Network Analysis. Qiang Wang* and Yunming Ye, Shenzhen Graduate School, Harbin Institute of Technology, Xili, Shenzhen 518055, China (mikewq@yahoo.cn, yeyunming@hit.edu.cn). *Corresponding author. ...
    Recently, a few Continuous Query systems have been developed to cope with applications involving continuous data streams. At the same time, numerous algorithms have been proposed for better performance. A recent work on this subject defined scheduling strategies on shared window joins over data streams from multiple query expressions. In these strategies, a tuple with the highest priority is selected for processing from multiple candidates. However, the performance of these static strategies degrades significantly when data arrival is bursty, because the priority is determined only by static information, such as the query windows, arrival order, etc. In this paper, we propose a novel adaptive strategy in which the priority of a tuple incorporates real-time information. A thorough experimental evaluation has demonstrated that this new strategy outperforms the existing strategies.
    Clustering is one of the fundamental operations in data mining. Clustering is widely used in solving business problems such as customer segmentation and fraud detection. In real applications of clustering, we are required to perform three tasks: partitioning data sets into clusters, validating the clustering results and interpreting the clusters. Various clustering algorithms have been designed for the first task. Few techniques are available for cluster validation in data mining. The third task is application dependent and needs domain knowledge to understand the clusters. In this paper, we present a few techniques for the first two tasks. We first discuss the family of the k-means type algorithms, which are mostly used in data mining. Then we present a visual method for cluster validation. This method is based on the Fastmap data projection algorithm and its enhancement. Finally, we present a method to combine a clustering algorithm and the visual cluster validation method to interactively build classification models.
    A lot of data in real world databases are categorical. For example, gender, profession, position, and hobby of customers are usually defined as categorical attributes in the CUSTOMER table. Each categorical attribute is represented with a small set of unique categorical values such as {Female, Male} for the gender attribute. Unlike numeric data, categorical values are discrete and unordered. Therefore, the clustering algorithms for numeric data cannot be used to cluster categorical data that exists in many real world applications. In data mining research, much effort has been put on development of new techniques for clustering categorical data (Huang, 1997b; Huang, 1998; Gibson, Kleinberg, & Raghavan, 1998; Ganti, Gehrke, & Ramakrishnan, 1999; Guha, Rastogi, & Shim, 1999; Chaturvedi, Green, Carroll, & Foods, 2001; Barbara, Li, & Couto, 2002; Andritsos, Tsaparas, Miller, & Sevcik, 2003; Li, Ma, & Ogihara, 2004; Chen, & Liu, 2005; Parmar, Wu, & Blackhurst, 2007). The k-modes clusterin...
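    For illustration, a tiny sketch of the simple-matching dissimilarity that k-modes style algorithms use for categorical records follows; the attribute values are made up, and the mode-update and assignment steps of k-modes are not reproduced.

```python
# Tiny sketch of the simple-matching dissimilarity used by k-modes style
# algorithms for categorical records (attribute values are made up).
def matching_dissimilarity(record_a, record_b):
    """Count the attributes on which two categorical records disagree."""
    return sum(a != b for a, b in zip(record_a, record_b))

customer_1 = ("Female", "Engineer", "Manager", "Tennis")
customer_2 = ("Male",   "Engineer", "Clerk",   "Tennis")
print(matching_dissimilarity(customer_1, customer_2))  # 2 mismatching attributes
```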
    An outlier in a dataset is an observation or a point that is considerably dissimilar to or inconsistent with the remainder of the data. Detection of such outliers is important for many applications and has recently attracted much attention in the data mining research community. In this paper, we present a new method to detect outliers by discovering frequent patterns (or frequent itemsets) in the data set. Outliers are defined as the transactions that contain few of the frequent patterns in their itemsets. We define a measure called FPOF (Frequent Pattern Outlier Factor) to detect the outlier transactions and propose the FindFPOF algorithm to discover outliers. The experimental results show that our approach outperformed the existing methods in identifying interesting outliers.
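    A minimal sketch of a frequent-pattern outlier factor in the spirit of FPOF follows; the transactions, support threshold, and itemset size limit are illustrative, and the full FindFPOF algorithm is not reproduced.

```python
# Minimal sketch of a frequent-pattern outlier factor: transactions containing
# few of the mined frequent patterns score low and are flagged as outliers.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "milk", "eggs"},
    {"caviar", "truffle"},              # an unusual transaction
]

# Mine frequent itemsets (sizes 1-2) together with their supports.
counts = Counter()
for t in transactions:
    for k in (1, 2):
        counts.update(combinations(sorted(t), k))
n = len(transactions)
frequent = {iset: c / n for iset, c in counts.items() if c / n >= 0.4}

def fpof(t):
    """Average support of the frequent itemsets contained in transaction t."""
    return sum(s for iset, s in frequent.items() if set(iset) <= t) / len(frequent)

for t in transactions:
    print(sorted(t), round(fpof(t), 3))
```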
    Incremental Learning from Positive Samples. Yunming Ye, Fanyuan Ma, Yiming Lu, Matthew Chiu, and Joshua Huang. ...
    Motivated by the insufficiency of existing frameworks, which cannot model real-world privacy requirements for data publishing when multiple attributes have different sensitivity requirements, we present a novel method, rating, for publishing sensitive data. Rating releases an AT (Attribute Table) and an IDT (ID Table) based on different sensitivity coefficients for different attributes. This approach not only protects privacy for multiple sensitive attributes, but also preserves a large portion of the correlations in the microdata. We develop algorithms for computing the AT and IDT that obey the privacy requirements for multiple sensitive attributes and maximize the utility of the published data. We show both theoretically and experimentally that our method performs better than conventional privacy-preserving methods in protecting privacy and maximizing the utility of published data. To quantify the utility of published data, we propose a new measure named the classification measurement.
    The rapid growth of data has provided us with more information, yet it challenges traditional techniques for extracting useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K nearest neighbor method and cluster based classification method,
