Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general
International Journal of Digital Crime and Forensics, 2013
In this study, a novel algorithm for recognizing pornographic images based on the analysis of skin color regions is presented. The skin color information essentially provides Regions of Interest (ROIs). It is demonstrated that the convex hull of these ROIs provides semantically useful information for pornographic image detection. Based on these convex hulls, the authors extract a small set of low-level visual features that are empirically proven to possess discriminative power for pornographic image classification. In this study, the authors consider multi-class pornographic image classification, where the “nude” and “benign” image classes are further split into two specialized sub-classes, namely “bikini”/“porn” and “skin”/“non-skin”, respectively. The extracted feature vectors are fed to an ensemble of random forest classifiers for image classification. Each classifier is trained on a partition of the training set and solves a binary classification problem. In this sense, the mode...
2013 IEEE 29th International Conference on Data Engineering (ICDE), 2013
Cloud key-value stores are becoming increasingly important. Challenging applications, requiring efficient and scalable access to massive data, arise every day. We focus on supporting interval queries (which are prevalent in several data-intensive applications, such as temporal querying for temporal analytics), an efficient solution for which is lacking. We contribute a compound interval index structure comprising two tiers: (i) the MRSegmentTree (MRST), a key-value representation of the Segment Tree, and (ii) the Endpoints Index (EPI), a column family index that stores information for interval endpoints. In addition, our contributions include: (i) algorithms for efficiently constructing and populating our indices using MapReduce jobs, (ii) techniques for efficient and scalable index maintenance, and (iii) algorithms for processing interval queries. We have implemented all algorithms using HBase and Hadoop, and conducted a detailed performance evaluation. We quantify the costs associated with the construction of the indices, and evaluate our query processing algorithms using queries on real data sets. We compare the performance of our approach to two alternatives: the native support for interval queries provided in HBase, and the execution of such queries using the Hive query execution tool. Our results show a significant speedup, far outperforming the state of the art.
Papers by Nikos Ntarmos