
    Joshua Zhexue Huang

    In this paper, an Observation Points Classifier Ensemble (OPCE) algorithm is proposed to deal with High‐Dimensional Imbalanced Classification (HDIC) problems based on data processed using the Multi‐Dimensional Scaling (MDS) feature extraction technique. First, dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible. Second, a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low‐dimensional data space. Third, optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples. Exhaustive experiments have been conducted to evaluate the feasibility, rationality, and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets. Experimental results show that (1) the OPCE algorithm can be trained faster on low‐dimensional imbalanced data than on high‐dimensional data; (2) the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased; and (3) statistical analysis reveals that OPCE yields better HDIC performances on the selected data sets in comparison with eight other HDIC algorithms. This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.
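    As a rough illustration of the MDS preprocessing step described above, the sketch below reduces a toy high-dimensional imbalanced data set with scikit-learn's MDS implementation; the data, sizes, and library choice are assumptions for illustration, and the observation-point ensemble itself is not reproduced.

```python
# Hedged sketch: MDS-based dimensionality reduction as a preprocessing step.
# Data sizes and the use of scikit-learn are assumptions for illustration.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))            # toy high-dimensional data
y = (rng.random(200) < 0.1).astype(int)    # ~10% minority class (imbalanced labels)

# Reduce dimensionality while preserving pairwise distances as well as possible.
mds = MDS(n_components=10, dissimilarity="euclidean", random_state=0)
X_low = mds.fit_transform(X)

print(X_low.shape)  # (200, 10) -- low-dimensional data for the downstream classifier ensemble
```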
    Gene selection is an important step in the analysis of gene data sets in which the number of genes greatly exceeds the number of samples. In this paper, we propose a new method that uses a random forest model to select genes from high dimensional gene data sets. In this method, Breiman's random forest algorithm is first used to generate a random forest model from a high dimensional data set. Then, features appearing in the component tree models of the random forest are analyzed using measures of feature correlation. Features are divided into two sets: those appearing in the roots of the component trees and those appearing in other nodes of the trees. The frequency of the features is calculated, and the features whose frequency is greater than given thresholds are selected as candidates. Finally, the correlation of the candidate features with the class feature is measured with symmetrical uncertainty, and the top features (with the highest symmetrical uncertainty values) are selected. Nineteen gene data sets were used to evaluate the new gene selection method. The comparison results show that the models built with the gene features selected by the new method outperformed other random forest models in classification accuracy.
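    The sketch below illustrates, under stated assumptions, the kind of pipeline described: fit a random forest, count how often each feature splits the root versus other nodes, take the frequent ones as candidates, and rank them by symmetrical uncertainty with the class. The data, thresholds, and discretization are illustrative, not the paper's values.

```python
# Hedged sketch of the described pipeline; thresholds, binning, and data are illustrative.
import numpy as np
from collections import Counter
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mutual_info_score

def symmetrical_uncertainty(x_binned, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) on discretized values."""
    mi = mutual_info_score(x_binned, y)
    hx, hy = entropy(np.bincount(x_binned)), entropy(np.bincount(y))
    return 2.0 * mi / (hx + hy) if hx + hy > 0 else 0.0

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))            # few samples, many genes (toy data)
y = rng.integers(0, 2, size=60)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

root_counts, node_counts = Counter(), Counter()
for est in forest.estimators_:
    feats = est.tree_.feature              # split feature per node; -2 marks leaves
    if feats[0] >= 0:
        root_counts[feats[0]] += 1         # node 0 is the root of the tree
    node_counts.update(f for f in feats[1:] if f >= 0)

# Candidate genes: frequent as root splits or as other internal splits (illustrative thresholds).
candidates = {f for f, c in root_counts.items() if c >= 2}
candidates |= {f for f, c in node_counts.items() if c >= 5}

# Rank candidates by symmetrical uncertainty with the class and keep the top ones.
bin_feature = lambda col: np.digitize(col, np.quantile(col, [0.25, 0.5, 0.75]))
ranked = sorted(candidates, key=lambda f: symmetrical_uncertainty(bin_feature(X[:, f]), y), reverse=True)
print(ranked[:20])
```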
    To break the in-memory bottleneck and facilitate online sampling in cluster computing frameworks, we propose a new sampling-based system for approximate big data analysis on computing clusters. We address both computational and statistical aspects of big data across the main layers of cluster computing frameworks: big data storage, big data management, big data online sampling, big data processing, and big data exploration and analysis. We use the new Random Sample Partition (RSP) distributed data model to store a big data set as a set of ready-to-use random sample data blocks in Hadoop Distributed File System (HDFS), called RSP blocks. With this system, only a few RSP blocks are selected and processed using a sequential algorithm in a distributed data-parallel manner to produce approximate results for the entire data set. In this paper, we present a prototype RSP-based system and demonstrate its advantages. Our experiments show that RSP blocks can be used to get approximate models and summary statistics as well as estimate the proportions of inconsistent values without computing the entire data or running expensive online sampling operations. This new system enables big data exploration and analysis where the entire data set cannot be computed.
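    A minimal, in-memory sketch of the block-level sampling idea follows; the block layout and the statistic being estimated are illustrative stand-ins for RSP blocks stored in HDFS.

```python
# Illustrative sketch of block-level sampling: estimate a statistic from a few
# "RSP blocks" only. Blocks here are in-memory arrays standing in for HDFS files.
import numpy as np

rng = np.random.default_rng(0)
full_data = rng.normal(loc=3.0, scale=2.0, size=1_000_000)

# Pretend the data set is stored as 100 RSP blocks, each a random sample of the whole.
blocks = np.array_split(rng.permutation(full_data), 100)

# Block-level sample: pick a handful of blocks and estimate from them alone.
chosen = rng.choice(len(blocks), size=5, replace=False)
approx = np.mean([blocks[i].mean() for i in chosen])

print("approximate mean:", approx)
print("exact mean:      ", full_data.mean())
```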
    This paper presents a text clustering system built on a k-means type subspace clustering algorithm to cluster large, high dimensional and sparse text data. In this algorithm, a new step is added to the k-means clustering process to automatically calculate the weights of keywords in each cluster so that the important words of a cluster can be identified by their weight values. For understanding and interpretation of the clustering results, a few keywords that best represent the semantic topic are extracted from each cluster. Two methods are used to extract the representative words. The candidate words are first selected according to their weights calculated by our new algorithm. Then, the candidates are fed to WordNet to identify the nouns and to consolidate synonyms and hyponyms. Experimental results show that the clustering algorithm is superior to other subspace clustering algorithms, such as PROCLUS and HARP, and k-means type algorithms, e.g....
    Mining frequent itemsets in transaction databases is an important task in many applications. This task becomes challenging when dealing with a very large transaction database because traditional algorithms are not scalable due to the memory limit. In this paper, we propose a new approach for approximate mining of frequent itemsets in a transaction database. First, we partition the set of transactions in the database into disjoint subsets such that the distribution of frequent itemsets in each subset is similar to that of the entire database. Then, we randomly select a set of subsets and independently mine the frequent itemsets in each of them. After that, the frequent itemsets discovered from these subsets are voted on, and those appearing in a majority of the subsets are determined to be frequent itemsets, called popular frequent itemsets. All popular frequent itemsets are compared with the frequent itemsets discovered directly from the entire database using the same frequency threshold. The r...
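    Below is a toy sketch of the partition-and-vote idea, assuming made-up transactions, a size-2 itemset limit, and an illustrative support threshold; it is not the paper's algorithm verbatim.

```python
# Toy sketch: mine frequent itemsets independently in a few random subsets and
# keep those frequent in a majority of them ("popular" frequent itemsets).
from collections import Counter
from itertools import combinations
import random

def frequent_itemsets(transactions, min_support, max_size=2):
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(t), k))
    n = len(transactions)
    return {iset for iset, c in counts.items() if c / n >= min_support}

random.seed(0)
items = "abcdef"
transactions = [set(random.sample(items, random.randint(2, 4))) for _ in range(10_000)]

# Partition transactions into disjoint subsets, then mine a random selection of them.
random.shuffle(transactions)
subsets = [transactions[i::10] for i in range(10)]
mined = [frequent_itemsets(s, min_support=0.2) for s in random.sample(subsets, 5)]

# Vote: an itemset is popular if it is frequent in a majority of the mined subsets.
votes = Counter(iset for result in mined for iset in result)
popular = {iset for iset, v in votes.items() if v > len(mined) / 2}
print(sorted(popular))
```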
    Data classification is an important research topic in the field of data mining. With the rapid development of social media sites and IoT devices, data have grown tremendously in volume and complexity, resulting in many large and complex high-dimensional data sets. Classifying such high-dimensional complex data with a large number of classes has been a great challenge for current state-of-the-art methods. This paper presents a novel, hierarchical, gamma mixture model-based unsupervised method for classifying high-dimensional data with a large number of classes. In this method, we first partition the features of the dataset into feature strata by using k-means. Then, a set of subspace data sets is generated from the feature strata by using the stratified subspace sampling method. After that, the GMM Tree algorithm is used to identify the number of clusters and the initial clusters in each subspace dataset, and these initial cluster centers are passed to k-means to generate base subspa...
    In this paper, we propose a latent feature group learning (LFGL) algorithm to discover the feature grouping structures and subspace clusters for high-dimensional data. The feature grouping structures, which are learned in an analytical way, can enhance the accuracy and efficiency of high-dimensional data clustering. In the LFGL algorithm, the Darwinian evolutionary process is used to explore the optimal feature grouping structures, which are coded as chromosomes in the genetic algorithm. The feature grouping weighting k-means algorithm is used as the fitness function to evaluate the chromosomes or feature grouping structures in each generation of evolution. To better handle the diverse densities of clusters in high-dimensional data, the original feature grouping weighting k-means is revised with the mass-based dissimilarity measure rather than the Euclidean distance measure, and the feature weights are optimized as a nonnegative matrix factorization problem under the orthogonal constraint of featu...
    Selection and identification of single-nucleotide polymorphisms (SNPs) are among the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data is very high dimensional and a large portion of the SNPs in the data is irrelevant to the disease. Advanced machine learning methods have been used successfully in genome-wide association studies (GWAS) to identify genetic variants that have relatively large effects in some common, complex diseases. Among them, the most successful is Random Forests (RF). Despite performing well in terms of prediction accuracy on some data sets of moderate size, RF still struggles in GWAS with selecting informative SNPs and building accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies p-value assessment to find a cut-off point that sep...
    High dimensional data is a common phenomenon in real-world data mining applications. Text data is a typical example. In text mining, a text document is viewed as a vector of terms whose dimension is equal to the total number of unique terms in a data set, which is usually in the thousands. High dimensional data occurs in business as well. In retail, for example, to effectively manage supplier relationships, suppliers are often categorized according to their business behaviors (Zhang, Huang, Qian, Xu, & Jing, 2006). The supplier behavior data is high dimensional, containing thousands of attributes that describe supplier behaviors, including product items, ordered amounts, order frequencies, product quality and so forth. One more example is DNA microarray data. Clustering high-dimensional data requires special treatment (Swanson, 1990; Jain, Murty, & Flynn, 1999; Cai, He, & Han, 2005; Kontaki, Papadopoulos, & Manolopoulos, 2007), although various methods for clustering are available (J...
    We propose a sampling-based method, called RSP-Hist, to construct approximate equi-width histograms and help data scientists explore the probability distribution of big data on Hadoop clusters. In RSP-Hist, the Random Sample Partition (RSP) model is used to store a big data set as ready-to-use random sample data blocks, called RSP blocks, in the Hadoop Distributed File System (HDFS). An approximate histogram is computed by applying a sequential histogram algorithm in parallel to each block in a block-level sample of RSP blocks. Local histograms from individual RSP blocks are combined to produce an approximate histogram for the entire data. We tested RSP-Hist on four data sets using a small computing cluster. In this paper, we demonstrate the effect of the sampling rate and the number of buckets on the histogram accuracy and show that RSP-based approximate histograms are equivalent to the exact histograms computed from the entire data. RSP-Hist can avoid the data correlation issue in HDFS blocks and significantly reduce both computation and communication costs. It enables iterative and interactive exploration of big data sets on small computing clusters and can be used for multivariate data exploration.
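    A minimal sketch of the merge step follows: per-block equi-width histograms computed over shared bucket edges are summed and scaled up. The value range, bucket count, and sampling rate below are assumptions.

```python
# Minimal sketch: per-block equi-width histograms over shared bucket edges are
# summed and scaled up to approximate the full-data histogram.
import numpy as np

rng = np.random.default_rng(0)
blocks = [rng.exponential(scale=2.0, size=100_000) for _ in range(20)]  # stand-ins for RSP blocks

# All local histograms must share the same bucket edges to be mergeable.
edges = np.linspace(0.0, 20.0, num=51)           # 50 equi-width buckets (assumed range)

sampled = blocks[:4]                             # a block-level sample of RSP blocks
local = [np.histogram(b, bins=edges)[0] for b in sampled]

# Combine local counts and scale up by the inverse sampling rate.
approx = sum(local) * (len(blocks) / len(sampled))
exact = np.histogram(np.concatenate(blocks), bins=edges)[0]
print("relative L1 error:", np.abs(approx - exact).sum() / exact.sum())
```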
    To enable the individual data block files of a distributed big data set to be used as random samples for big data analysis, a two-stage data processing (TSDP) algorithm is proposed in this paper to convert a big data set into a random sample partition (RSP) representation, which ensures that each individual data block in the RSP is a random sample of the big data and can therefore be used to estimate the statistical properties of the big data. The first stage of this algorithm is to sequentially chunk the big data set into non-overlapping subsets and distribute these subsets as data block files to the nodes of a cluster. The second stage is to take a random sample without replacement from each subset to form a new subset, which is saved as an RSP data block file; this random sampling step is repeated until all data records in all subsets are used up and a new set of RSP data block files has been created to form an RSP of the big data. It is formally proved that the expectation of the sample distribution function (s.d.f.) of each RSP data block equals the s.d.f. of the big data set; therefore, each RSP data block is a random sample of the big data set. An implementation of the TSDP algorithm on Apache Spark and HDFS is presented. Performance evaluations on terabyte data sets show the efficiency of this algorithm in converting HDFS big data files into HDFS RSP big data files. We also show an example that uses only a small number of RSP data blocks to build ensemble models which perform better than a single model built from the entire data set.
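    The following in-memory sketch captures the two-stage idea under simplifying assumptions (the actual TSDP runs on Spark and HDFS block files); stage two is expressed here by slicing each shuffled subset across all output blocks, which is one simple way to realize repeated sampling without replacement.

```python
# Simplified in-memory sketch of the two-stage idea (the real TSDP operates on
# distributed block files; sizes here are toy assumptions).
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1_000_000)                 # records of a "big" data set in non-random order

# Stage 1: sequentially chunk the data into non-overlapping subsets (block files).
n_blocks = 100
subsets = np.array_split(data, n_blocks)

# Stage 2: shuffle each subset, split it into n_blocks slices, and give one slice
# to every output block, so each RSP block mixes records from all subsets.
slices = [np.array_split(rng.permutation(s), n_blocks) for s in subsets]
rsp_blocks = [np.concatenate([s[i] for s in slices]) for i in range(n_blocks)]

# Each RSP block should now behave like a random sample of the whole data set.
print(rsp_blocks[0].mean(), data.mean())
```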
    In many application areas, the data being generated and processed goes beyond the petabyte scale. Analyzing such an ever-increasing volume of data poses computational as well as statistical challenges. To address these challenges, distributed and parallel processing frameworks have been used to implement scalable data analysis algorithms. Nevertheless, processing the whole big data set at one time may exceed the available computing resources and the time requirements of some applications. Thus, approximate approaches can be used to achieve asymptotic analysis results, especially when data analysis algorithms are amenable to an approximate result rather than an exact one. However, most approximation approaches require taking a random sample of the data, which is a nontrivial task when working with big data sets. In this paper, we employ ensemble learning as an approach to asymptotic analysis using randomly selected subsets (i.e. data blocks) of a big data set. We propose an asymptotic ensemble learning framework which depends on block-based sampling rather than record-based sampling. To demonstrate the feasibility and performance of this framework, we present an empirical analysis on real data sets. In addition to the scalability advantage, the experimental results show that several blocks of a data set are enough to obtain approximately the same results as those from using the whole data set.
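    A small sketch of block-based ensemble learning under illustrative assumptions follows (the data, base model, and block count are made up): train one base model per sampled block and average their predictions.

```python
# Small sketch of block-based ensemble learning: one base model per sampled
# data block, predictions averaged. Data, model, and block count are made up.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# Treat the data set as 50 blocks; train a base model on each of a few sampled blocks.
blocks = np.array_split(np.random.default_rng(0).permutation(len(X)), 50)
sampled = blocks[:5]
models = [LogisticRegression(max_iter=200).fit(X[i], y[i]) for i in sampled]

# Ensemble by averaging predicted probabilities over the block models.
proba = np.mean([m.predict_proba(X) for m in models], axis=0)
print("ensemble accuracy:", ((proba[:, 1] > 0.5) == y).mean())
```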
    Search interface detection is an essential task for extracting information from the hidden Web. The challenge for this task is that search interface data is represented in high-dimensional and sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to solving this problem. In this approach, we have extended the random forest algorithm with a weighted feature selection method to build the individual classifiers. With this improved random forest algorithm (IRFA), each classifier can be learned from a weighted subset of the feature space so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We have compared our ensemble approach with other well-known classification algorithms, such as SVM, C4.5, Naive Bayes, and the original random forest algorithm (RFA). The experimental results show that our method is more effective in detecting search interfaces of the hidden Web.
    This paper proposes a multi-agent Q-learning algorithm called meta-game-Q learning that is developed from the meta-game equilibrium concept. Different from Nash equilibrium, meta-game equilibrium can achieve the optimal joint action in a general-sum game by deliberating on each agent's preferences and predicting the other agents' policies. A distributed negotiation algorithm is used to solve the meta-game equilibrium problem instead of centralized linear programming algorithms. We use the repeated prisoner's dilemma example to empirically demonstrate that the algorithm converges to meta-game equilibrium.
    This paper proposes C3, a new learning scheme to improve classification performance of rare category emails in the early stage of incremental learning. C3 consists of three components: the chief-learner, the co-learners and the combiner. The chief-learner is an ...
    Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This leads to poor accuracy when RFs are applied to high-dimensional data. In addition, RFs are biased in the feature selection process, favoring multivalued features. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features when learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionali...
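    The sketch below illustrates the two ingredients named above under stated assumptions: a p-value filter to drop uninformative features (a chi-squared test is used here purely for illustration) followed by weighted sampling of split candidates. It is not the exact statistical procedure of xRF.

```python
# Illustrative sketch: p-value filtering of features followed by weighted
# sampling of split candidates. The chi-squared test and the 0.05 cut-off are
# assumptions for illustration, not the exact statistics used in xRF.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(300, 5000)))     # chi2 requires non-negative features
y = rng.integers(0, 2, size=300)

# Remove uninformative features via p-value assessment.
scores, pvals = chi2(X, y)
informative = np.where(pvals < 0.05)[0]

# Weighted feature sampling: stronger features are more likely to be drawn as
# split candidates when growing a tree.
weights = scores[informative] / scores[informative].sum()
candidates = rng.choice(informative, size=int(np.sqrt(len(informative))), replace=False, p=weights)
print(len(informative), candidates[:10])
```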
    Transaction processing in the Grid ensures reliable execution of inherently distributed Grid applications. This paper proposes coordination algorithms for handling short-lived and long-lived Grid transactions, and models and analyzes these algorithms with Petri nets. The cohesion transaction can coordinate long-lived business Grid applications by automatically generating and executing compensation transactions to semantically undo committed sub-transactions. From analysis of the
    Fuzzy Soft Subspace Clustering Method for Gene Co-expression Network Analysis. Qiang Wang* and Yunming Ye, Shenzhen Graduate School, Harbin Institute of Technology, Xili, Shenzhen 518055, China (mikewq@yahoo.cn, yeyunming@hit.edu.cn). *Corresponding author. ...
    Recently, a few Continuous Query systems have been developed to cope with applications involving continuous data streams. At the same time, numerous algorithms have been proposed for better performance. A recent work on this subject defined scheduling strategies on shared window joins over data streams from multiple query expressions. In these strategies, a tuple with the highest priority is selected for processing from multiple candidates. However, the performance of these static strategies degrades significantly when data arrival is bursty, because the priority is determined only by static information, such as the query windows, arrival order, etc. In this paper, we propose a novel adaptive strategy in which the priority of a tuple incorporates real-time information. A thorough experimental evaluation has demonstrated that this new strategy outperforms the existing strategies.
    Clustering is one of the fundamental operations in data mining. Clustering is widely used in solving business problems such as customer segmentation and fraud detection. In real applications of clustering, we are required to perform three tasks: partitioning data sets into clusters, validating the clustering results and interpreting the clusters. Various clustering algorithms have been designed for the first task. Few techniques are available for cluster validation in data mining. The third task is application dependent and needs domain knowledge to understand the clusters. In this paper, we present a few techniques for the first two tasks. We first discuss the family of the k-means type algorithms, which are mostly used in data mining. Then we present a visual method for cluster validation. This method is based on the Fastmap data projection algorithm and its enhancement. Finally, we present a method to combine a clustering algorithm and the visual cluster validation method to interactively build classification models.
    A lot of data in real world databases are categorical. For example, gender, profession, position, and hobby of customers are usually defined as categorical attributes in the CUSTOMER table. Each categorical attribute is represented with a small set of unique categorical values such as {Female, Male} for the gender attribute. Unlike numeric data, categorical values are discrete and unordered. Therefore, the clustering algorithms for numeric data cannot be used to cluster categorical data that exists in many real world applications. In data mining research, much effort has been put on development of new techniques for clustering categorical data (Huang, 1997b; Huang, 1998; Gibson, Kleinberg, & Raghavan, 1998; Ganti, Gehrke, & Ramakrishnan, 1999; Guha, Rastogi, & Shim, 1999; Chaturvedi, Green, Carroll, & Foods, 2001; Barbara, Li, & Couto, 2002; Andritsos, Tsaparas, Miller, & Sevcik, 2003; Li, Ma, & Ogihara, 2004; Chen, & Liu, 2005; Parmar, Wu, & Blackhurst, 2007). The k-modes clusterin...
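    For illustration, a tiny sketch of the simple-matching dissimilarity that k-modes style algorithms use for categorical records follows; the attribute values are made up, and the mode-update and assignment steps of k-modes are not reproduced.

```python
# Tiny sketch of the simple-matching dissimilarity used by k-modes style
# algorithms for categorical records (attribute values are made up).
def matching_dissimilarity(record_a, record_b):
    """Count the attributes on which two categorical records disagree."""
    return sum(a != b for a, b in zip(record_a, record_b))

customer_1 = ("Female", "Engineer", "Manager", "Tennis")
customer_2 = ("Male",   "Engineer", "Clerk",   "Tennis")
print(matching_dissimilarity(customer_1, customer_2))  # 2 mismatching attributes
```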
    An outlier in a dataset is an observation or a point that is considerably dissimilar to or inconsistent with the remainder of the data. Detection of such outliers is important for many applications and has recently attracted much attention in the data mining research community. In this paper, we present a new method to detect outliers by discovering frequent patterns (or frequent itemsets) in the data set. Outliers are defined as the transactions that contain few of the frequent patterns in their itemsets. We define a measure called FPOF (Frequent Pattern Outlier Factor) to detect the outlier transactions and propose the FindFPOF algorithm to discover outliers. The experimental results show that our approach outperformed the existing methods in identifying interesting outliers.
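    A minimal sketch of a frequent-pattern outlier factor in the spirit of FPOF follows; the transactions, support threshold, and itemset size limit are illustrative, and the full FindFPOF algorithm is not reproduced.

```python
# Minimal sketch of a frequent-pattern outlier factor: transactions containing
# few of the mined frequent patterns score low and are flagged as outliers.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "milk", "eggs"},
    {"caviar", "truffle"},              # an unusual transaction
]

# Mine frequent itemsets (sizes 1-2) together with their supports.
counts = Counter()
for t in transactions:
    for k in (1, 2):
        counts.update(combinations(sorted(t), k))
n = len(transactions)
frequent = {iset: c / n for iset, c in counts.items() if c / n >= 0.4}

def fpof(t):
    """Average support of the frequent itemsets contained in transaction t."""
    return sum(s for iset, s in frequent.items() if set(iset) <= t) / len(frequent)

for t in transactions:
    print(sorted(t), round(fpof(t), 3))
```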
    Incremental Learning from Positive Samples. Yunming Ye, Fanyuan Ma, Yiming Lu, Matthew Chiu, and Joshua Huang. ...
    Motivated by the insufficiency of existing frameworks, which cannot model real-world privacy requirements for data publishing when multiple attributes have different sensitivity requirements, we present a novel method, rating, for publishing sensitive data. Rating releases an AT (Attribute Table) and an IDT (ID Table) based on different sensitivity coefficients for different attributes. This approach not only protects privacy for multiple sensitive attributes, but also preserves a large portion of the correlations in the microdata. We develop algorithms for computing the AT and IDT that obey the privacy requirements for multiple sensitive attributes and maximize the utility of the published data. We show both theoretically and experimentally that our method performs better than conventional privacy-preserving methods in protecting privacy and maximizing the utility of published data. To quantify the utility of published data, we propose a new measure named the classification measurement.
    The rapid growth of data has provided us with more information, yet it challenges traditional techniques for extracting useful knowledge. In this paper, we propose MCMM, a Minimum spanning tree (MST) based Classification model for Massive data with a MapReduce implementation. It can be viewed as an intermediate model between the traditional K nearest neighbor method and cluster based classification method,
