Category suggestions or recommendations for customers or users have become an essential feature for commerce and leisure websites. This is a growing topic that follows users’ activity in social networks, which generates a huge quantity of information about their interests, contacts, and many other aspects. These data are usually collected to analyze people’s behavior and trends, and to build a complete user profile. In this sense, we analyze a dataset collected from Pinterest to predict gender and age by processing input images using a Convolutional Neural Network. Our method is based on the meaning of the image rather than its visual content. Additionally, we propose a heuristic-based approach for text analysis to predict users’ age and gender from Twitter. The two classifiers, one based on text and one on images, are compared with similar approaches from the state of the art. Suggested categories are based on association rules formed by the activity of thousands of users in ord...
Early detection of different levels of tremors helps to obtain a more accurate diagnosis of Parkinson's disease and to increase the therapy options for a better quality of life for patients. This work proposes a non-invasive strategy to measure the severity of tremors with the aim of diagnosing one of the first three levels of Parkinson's disease on the Unified Parkinson's Disease Rating Scale (UPDRS). Since a tremor is an involuntary motion that mainly appears in the hands, the dataset is acquired using a Leap Motion controller that measures the 3D coordinates of each finger and the palmar region. Texture features are computed using sum and difference of histograms (SDH) to characterize the dataset, varying the window size; however, only the most fundamental elements are used in the classification stage. A machine learning classifier provides the final classification of the tremor level. The effectiveness of our approach is assessed with a set of performance metrics, which are also used to compare the different proposed designs.
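The SDH texture computation mentioned above can be sketched in a few lines; the displacement vector, the number of bins and the particular statistics kept are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def sdh_features(window, dx=1, dy=0):
    """Sum and difference of histograms (SDH) texture features over a 2D
    window of measurements. For a displacement (dx, dy), pair each value
    with its displaced neighbor, histogram the pairwise sums and
    differences, and derive a few standard statistics. The displacement,
    bin count and statistic set are illustrative choices."""
    w = np.asarray(window, dtype=float)
    h, wd = w.shape
    a = w[:h - dy, :wd - dx]       # reference values
    b = w[dy:, dx:]                # displaced neighbors
    s = (a + b).ravel()            # pairwise sums
    d = (a - b).ravel()            # pairwise differences
    hs, _ = np.histogram(s, bins=16)
    hd, _ = np.histogram(d, bins=16)
    ps = hs / hs.sum()             # normalized sum histogram
    pd = hd / hd.sum()             # normalized difference histogram
    eps = 1e-12
    return {
        "mean": float(s.mean() / 2.0),
        "energy_sum": float((ps ** 2).sum()),
        "energy_diff": float((pd ** 2).sum()),
        "entropy_sum": float(-(ps * np.log(ps + eps)).sum()),
        "entropy_diff": float(-(pd * np.log(pd + eps)).sum()),
    }
```

Such a feature dictionary would be computed per window (varying the window size, as the abstract describes) and fed to the downstream classifier.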
2019 International Conference on Electronics, Communications and Computers (CONIELECOMP), 2019
Nowadays, people share a lot of information in social media in the form of videos, news, photos, posts, likes, etc. This large amount of generated information reflects the opinions, emotions and preferences of the users. As an example, Pinterest is a popular social network where users show their interests in the form of pins, which are information units formed by a short text comment and an image. In this research, we study the problem of building a model to characterize users of Pinterest with two demographic variables, age and gender, using the textual information they post in the network. To do that, we introduce a dataset formed by the English texts of 548,761 pins corresponding to 264 users. This dataset is imbalanced and reflects the actual distribution of the social network for gender and age, with a dominant presence of women over men, and of middle-aged persons over young persons. With this dataset, we conducted experiments with a diversity of machine learning models and a variety of features, considering a set of performance metrics. Our results provide interesting insights about the problem.
2019 International Conference on Electronics, Communications and Computers (CONIELECOMP), 2019
The globalization of the economy has forced society to maintain a constant evolution in marketing techniques. It is thus very important to design tools and methods that allow knowing and characterizing individuals in groups to develop effective marketing strategies. In this context, any company would be interested in finding the tastes and preferences of people regarding the products and services offered in the global market. One technique that could help with this is the analysis of the personality of each individual to identify their tastes and preferences. In this way, we can offer products and services that meet their needs through advertising appropriate for each type of personality. In this work, we propose the use of latent features, extracted with a diversity of dimensionality reduction methods, to infer the personality of Twitter users from textual content-based features, and we compare the performance of the different techniques. For our experiments, we use the PAN CLEF 2015 dataset, consisting of 14,166 tweets in English from 152 different users, and a diversity of classification methods. Our results show interesting insights about the personality prediction task.
We propose using text matching to measure the technological similarity between patents. Technology experts from different fields validate the new similarity measure and its improvement over measures based on the United States Patent Classification System, and identify its limitations. As an application, we replicate prior findings on the localization of knowledge spillovers by constructing a case-control group of text-matched patents. We also provide open access to the code and data to calculate the similarity between any two utility patents granted by the United States Patent and Trademark Office between 1976 and 2013, or between any two patent portfolios.
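As a minimal illustration of measuring similarity between two patent texts, a bag-of-words cosine measure can be computed as below; the article's actual text-matching procedure is more elaborate, so `cosine_text_similarity` is only a hypothetical stand-in:

```python
import math
import re
from collections import Counter

def _term_counts(text):
    """Lowercased alphabetic tokens and their frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_text_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two
    patent texts: 1.0 for identical vocabularies with identical
    proportions, 0.0 for disjoint vocabularies. A simplified stand-in
    for the text-matching measure; real systems typically add stop-word
    removal and term weighting (e.g. TF-IDF)."""
    a, b = _term_counts(text_a), _term_counts(text_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```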
This paper presents an evolutionary method for learning lists of metarules for generalizing the selection of the best classifier for a given text dataset. The method builds rules based on features of a set of training text datasets, and evolves them using special crossover and mutation operators. Once the rules are learned, they are tested on a different set of datasets to demonstrate their accuracy and generality. Our experiments show encouraging results.
In this article, a multi-objective evolutionary framework to build selection hyper-heuristics for solving instances of the 2D bin packing problem is presented. The approach consists of a multi-objective evolutionary learning process, using specifically tailored genetic operators, to produce sets of variable-length rules representing hyper-heuristics. Each hyper-heuristic builds a solution to a given problem instance by sensing the state of the instance and deciding which single heuristic to apply at each decision point. The hyper-heuristics consider the minimization of two conflicting objectives when building a solution: the number of bins used to accommodate the pieces and the total time required to do the job. The proposed framework integrates three well-studied multi-objective evolutionary algorithms to produce sets of Pareto-approximated hyper-heuristics: the Non-dominated Sorting Genetic Algorithm-II, the Strength Pareto Evolutionary Algorithm 2, and the Generalized Differential Evolution Algorithm 3. We conduct an extensive experimental analysis using a large set of 2D bin packing problem instances containing convex and non-convex irregular pieces, under many conditions and settings, and using several performance metrics. The analysis assesses the robustness and flexibility of the proposed approach, providing encouraging results when compared against a set of well-known baseline single heuristics.
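The decision step of a selection hyper-heuristic can be sketched as follows, assuming a simplified rule encoding (a condition vector of problem-state features plus a heuristic label) and nearest-rule matching; the framework's actual representation and matching criterion may differ:

```python
import math

def select_heuristic(rules, state):
    """One decision point of a selection hyper-heuristic: given
    variable-length condition-action rules, each a pair of
    (condition feature vector, heuristic name), apply the action of
    the rule whose condition is closest (Euclidean distance) to the
    current problem-state feature vector. Rule encoding and distance
    are illustrative simplifications."""
    def distance(condition):
        return math.sqrt(sum((c - s) ** 2 for c, s in zip(condition, state)))
    best_rule = min(rules, key=lambda rule: distance(rule[0]))
    return best_rule[1]
```

A solver would call this repeatedly, recomputing the state features (e.g. fraction of pieces placed, remaining sheet area) after each placement, so different single heuristics fire at different stages of the same instance.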
When we refer to an image that attracts our attention, it is natural to mention not only what is literally depicted in the image, but also the sentiments, thoughts and opinions that it invokes in us. In this work we deviate from the standard mainstream tasks of associating tags or keywords to an image, or generating content-based image descriptions, and we introduce the novel task of automatically generating user comments for an image. We present a new dataset collected from the social network Pinterest and we propose a strategy based on building joint textual and visual user models, tailored to the specificity of the mentioned task. We conduct an extensive experimental analysis of our approach in both qualitative and quantitative terms, which allows assessing the value of the proposed approach and shows its encouraging results against several existing image-to-text methods.
In this paper, we focus on cross-modal (visual and textual) e-commerce search within the fashion domain. Particularly, we investigate two tasks: 1) given a query image, we retrieve textual descriptions that correspond to the visual attributes in the query; and 2) given a textual query that may express an interest in specific visual product characteristics, we retrieve relevant images that exhibit the required visual attributes. To this end, we introduce a new dataset that consists of 53,689 images coupled with textual descriptions in natural language. The images contain fashion garments that display a great variety of visual attributes, such as different shapes, colors and textures. Unlike previous datasets, the text provides a rough and noisy description of the item in the image. We extensively analyze this dataset in the context of cross-modal e-commerce search. We investigate two state-of-the-art latent variable models to bridge between textual and visual data: bilingual latent Dirichlet allocation and canonical correlation analysis. We use state-of-the-art visual and textual features and report promising results.
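One of the two latent variable models mentioned, canonical correlation analysis, can be sketched compactly: it learns one projection per modality so that matching text/image pairs land close together in a shared space. The feature pipelines and regularization value below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def cca_projections(X, Y, k=2, reg=1e-3):
    """Canonical correlation analysis between textual features X and
    visual features Y (one row per item, rows aligned across
    modalities). Returns projection matrices (A, B) mapping each
    modality into a shared k-dimensional space where paired samples
    are maximally correlated. Minimal whitening + SVD formulation;
    the regularization constant is an illustrative choice."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # within-modality covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n                              # cross-modality covariance
    # Whiten each modality via Cholesky factors, then SVD the
    # whitened cross-covariance to get the canonical directions.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, _, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k]
```

At retrieval time, both the query and the candidates are projected into the shared space and ranked by similarity there, regardless of which modality the query came from.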
In the present article we introduce and validate an approach for single-label multi-class document categorization based on text content features. The introduced approach uses a statistical property of Principal Component Analysis, which minimizes the reconstruction error of the training documents used to compute a low-rank category transformation matrix. This matrix transforms the original set of training documents from a given category to a new low-rank space and then optimally reconstructs them back to the original space with a minimum reconstruction error. The proposed method, called the Minimizer of the Reconstruction Error (mRE) classifier, uses this property, and extends and applies it to new unseen test documents. Several experiments on four multi-class datasets for text categorization are conducted in order to show the stable and generally better performance of the proposed approach in comparison with other popular classification methods.
In this article we build multi-objective hyper-heuristics (MOHHs) using the multi-objective evolutionary algorithm NSGA-II for solving irregular 2D cutting stock problems under a bi-objective minimization schema, having a trade-off between the number of sheets used to fit a finite number of pieces and the time required to perform the placement of these pieces. We solve this problem using a multi-objective variation of hyper-heuristics called MOHH, whose main idea consists of finding a set of simple heuristics which can be combined to find a general solution, where a single heuristic is applied depending on the current condition of the problem instead of applying a unique single heuristic during the whole placement process. MOHHs are built after going through a learning process using the NSGA-II, which evolves combinations of condition-action rules, producing at the end a set of Pareto-optimal MOHHs. We test the approximated MOHHs on several sets of benchmark problems and present the results.
This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first one is a classic classification scenario using a 10-fold cross validation technique for several corpora, including four ground truth standard corpora: Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus, and one proprietary corpus. In the second schema we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods in the frame of email classification on different benchmarking corpora, and the evidence that the technique of Biased Discriminant Analysis in particular offers better discriminative features for the classification, gives stable classification results regardless of the number of features chosen, and robustly retains their discriminative value over time and data setups. These findings are especially useful in a commercial setting, where short profile rules are built based on a limited number of features for filtering emails.
This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier which uses Principal Component Analysis Document Reconstruction (PCADR), where the idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that are used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. Thus, the classifier computes the PCA separately for each document class, and when a new instance arrives to be classified, it is projected into each class's set of computed PCs and then reconstructed using the same PCs. The reconstruction error is computed, and the classifier assigns the instance to the class with the smallest error or divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g. spam from ham, or phishing from ham). The experiments show that PCADR is able to obtain very good results on the different validation datasets employed, reaching a better performance than the popular Support Vector Machine classifier.
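The per-class reconstruction idea behind PCADR can be sketched as follows; the feature representation, the number of components, and the class-Python names are illustrative choices, not the paper's exact pipeline:

```python
import numpy as np

class PCADRClassifier:
    """Per-class PCA reconstruction classifier, following the scheme
    described above: fit a separate set of principal components per
    class, then assign a new document vector to the class whose
    components reconstruct it with the smallest error. The number of
    components per class is an illustrative parameter."""

    def __init__(self, n_components=2):
        self.k = n_components
        self.models = {}

    def fit(self, X, y):
        for c in set(y):
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            # Top-k principal components of this class only.
            _, _, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
            self.models[c] = (mu, Vt[: self.k])
        return self

    def predict_one(self, x):
        errors = {}
        for c, (mu, V) in self.models.items():
            # Project onto the class PCs, then reconstruct.
            reconstruction = mu + (x - mu) @ V.T @ V
            errors[c] = np.linalg.norm(x - reconstruction)
        return min(errors, key=errors.get)
```

Documents drawn from the class used to fit the PCs reconstruct almost exactly, while out-of-class documents lose energy in the projection, which is what drives the decision rule.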
Hidden salting in digital media involves the intentional addition or distortion of content patterns with the purpose of evading content filtering. We propose a method to detect portions of a digital text source which are invisible to the end user when they are rendered on a visual medium (like a computer monitor). The method consists of "tapping" into the rendering process and analyzing the rendering commands to identify portions of the source text (plaintext) which will be invisible to a human reader, using criteria based on text character and background colors, font size, overlapping characters, etc. Moreover, text deemed visible (covertext) is reconstructed from the rendering commands, and the character reading order is identified, which may differ from the rendering order. The detection and resolution of hidden salting is evaluated on two e-mail corpora, and the effectiveness of this method in a spam filtering task is assessed. We provide a solution to a relevant open problem in content filtering applications, namely the presence of tricks aimed at circumventing automatic filters.
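The visibility criteria mentioned above (text color matching the background, unreadably small fonts) can be sketched over a simplified stream of rendering commands; real rendering streams, and the paper's full set of criteria (overlapping characters, reading-order reconstruction), are considerably richer:

```python
def invisible_spans(render_commands):
    """Flag rendered text spans that a human reader cannot see.
    Each command here is a simplified dict with the drawn text, its
    foreground and background colors, and its font size; the field
    names and thresholds are illustrative assumptions."""
    hidden = []
    for cmd in render_commands:
        same_color = cmd["fg"] == cmd["bg"]   # text painted in the background color
        tiny_font = cmd["size"] < 2           # below any readable size
        if same_color or tiny_font:
            hidden.append(cmd["text"])
    return hidden
```

In a filtering pipeline, the spans returned here would be excluded from the covertext before the spam classifier runs, defeating the salting trick.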
In this work we employ Evolution Strategies (ES) to automatically extract a set of physical parameters, corresponding to stellar population synthesis, from a sample of galaxy spectra taken from the Sloan Digital Sky Survey (SDSS). We pose this parameter extraction as an optimization problem and then solve it using ES. The idea is to reconstruct each galaxy spectrum by means of a linear combination of three different theoretical models for stellar population synthesis. This combination produces a model spectrum that is compared with the original spectrum using a simple difference function. The goal is to find a model that minimizes this difference, using ES as the algorithm to explore the parameter space. We present experimental results using a set of 100 spectra from SDSS Data Release 2, showing that ES are very well suited to extracting stellar population parameters from galaxy spectra. Additionally, in order to better understand the performance of ES on this problem, we present a comparison with two other well-known stochastic search algorithms: Genetic Algorithms (GA) and Simulated Annealing (SA).
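The optimization loop can be sketched with a minimal (1+1) evolution strategy over the linear-combination weights; the ES variant, step-size rule and parameters below are illustrative assumptions, not the paper's setup:

```python
import random

def evolve_weights(models, target, generations=500, seed=0):
    """A (1+1) evolution strategy searching for the weights of a linear
    combination of model spectra that minimizes the summed squared
    difference to the observed spectrum. `models` is a list of model
    spectra and `target` the observed spectrum, all sampled on the same
    wavelength bins. Step-size adaptation is a simple 1/5-rule-style
    heuristic chosen for illustration."""
    rng = random.Random(seed)

    def error(w):
        return sum(
            (t - sum(wj * m[i] for wj, m in zip(w, models))) ** 2
            for i, t in enumerate(target)
        )

    parent = [rng.random() for _ in models]   # random initial weights
    best = error(parent)
    sigma = 0.3                               # mutation step size
    for _ in range(generations):
        child = [w + rng.gauss(0.0, sigma) for w in parent]
        e = error(child)
        if e < best:
            parent, best = child, e
            sigma *= 1.1                      # success: widen the search
        else:
            sigma *= 0.98                     # failure: narrow it
    return parent, best
```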
In this era of "big data", hundreds or even thousands of patent applications arrive every day at patent offices around the world. One of the first tasks of the professional analysts in patent offices is to assign classification codes to those patents based on their content. Such classification codes are usually organized in hierarchical structures of concepts. Traditionally, the classification task has been done manually by professional experts. However, given the large amount of documents, patent professionals are becoming overwhelmed. Considering additionally that the hierarchical classification structures are very complex (containing thousands of categories), reliable, fast and scalable methods and algorithms are needed to help the experts in patent classification tasks. This chapter describes, analyzes and reviews systems that, based on the textual content of patents, automatically classify such patents into a hierarchy of categories. This chapter focuses especially on the patent classification task applied to the International Patent Classification (IPC) hierarchy. The IPC is the most widely used classification structure to organize patents; it is recognized worldwide, and several other structures use it or are based on it to ensure interoperability between offices.
In this demo we focus on cross-modal (visual and textual) e-commerce search within the fashion domain. Particularly, we demonstrate two tasks: 1) given a query image (without any accompanying text), we retrieve textual descriptions that correspond to the visual attributes in the visual query; and 2) given a textual query that may express an interest in specific visual characteristics, we retrieve relevant images (without leveraging textual meta-data) that exhibit the required visual attributes. The first task is especially useful for managing image collections by online stores who might want to automatically organize and mine predominantly visual items according to their attributes without human input. The second task is useful for users to find items with specific visual characteristics, in the case where there is no text available describing the target image. We use state-of-the-art visual and textual features, as well as a state-of-the-art latent variable model to bridge between textual and visual data: bilingual latent Dirichlet allocation. Unlike traditional search engines, we demonstrate a truly cross-modal system, where we can directly bridge between visual and textual content without relying on pre-annotated meta-data.
In this paper we focus on cross-modal (visual and textual) attribute recognition within the fashion domain. Particularly, we investigate two tasks: 1) given a query image, we retrieve textual descriptions that correspond to the visual attributes in the query; and 2) given a textual query that may express visual characteristics, we retrieve relevant images that exhibit the required visual attributes. To this end, we collected a dataset that consists of 53,689 images coupled with textual descriptions in natural language. The images contain fashion garments that display a great variety of visual attributes, such as colors, shapes and textures. Unlike previous datasets, the text provides a rough and noisy description of the item in the image. We extensively analyze this dataset in the context of cross-modal attribute recognition. We investigate two latent variable models to bridge between textual and visual data: bilingual latent Dirichlet allocation and canonical correlation analysis. We use visual and textual features and report promising results.
The labeling of discussion forums using the cognitive levels of Bloom's taxonomy is a time-consuming and very expensive task, due to the large amount of information that needs to be labeled and the need for an expert in the educational field to apply the taxonomy to the messages of the forums. In this paper we present a framework to automatically label messages from discussion forums using the categories of Bloom's taxonomy. Several models were created using three kinds of machine learning approaches: linear, rule-based and combined classifiers. The models are evaluated using the accuracy, the F1-measure and the area under the ROC curve. Additionally, the statistical significance of the results is assessed using McNemar's test in order to validate them. The results show that the combination of a linear classifier with a rule-based classifier yields very good and promising results for this difficult task.
Category suggestions or recommendations for customers or users have become an essential feature f... more Category suggestions or recommendations for customers or users have become an essential feature for commerce or leisure websites. This is a growing topic that follows users’ activity in social networks generating a huge quantity of information about their interests, contacts, among many others. These data are usually collected to analyze people’s behavior, trends, and integrate a complete user profile. In this sense, we analyze a dataset collected from Pinterest to predict the gender and age by processing input images using a Convolutional Neural Network. Our method is based on the meaning of the image rather than the visual content. Additionally, we propose a heuristic-based approach for text analysis to predict users’ age and gender from Twitter. Both of the classifiers are based on text and images and they are compared with various similar approaches in the state of the art. Suggested categories are based on association rules conformed by the activity of thousands of users in ord...
Early detection of different levels of tremors helps to obtain a more accurate diagnosis of Parki... more Early detection of different levels of tremors helps to obtain a more accurate diagnosis of Parkinson's disease and to increase the therapy options for a better quality of life for patients. This work proposes a non-invasive strategy to measure the severity of tremors with the aim of diagnosing one of the first three levels of Parkinson's disease by the Unified Parkinson's Disease Rating Scale (UPDRS). A tremor being an involuntary motion that mainly appears in the hands; the dataset is acquired using a leap motion controller that measures 3D coordinates of each finger and the palmar region. Texture features are computed using sum and difference of histograms (SDH) to characterize the dataset, varying the window size; however, only the most fundamental elements are used in the classification stage. A machine learning classifier provides the final classification results of the tremor level. The effectiveness of our approach is obtained by a set of performance metrics, which are also used to show a comparison between different proposed designs.
2019 International Conference on Electronics, Communications and Computers (CONIELECOMP), 2019
Nowadays, people share a lot of information in social media in the form of videos, news, photos, ... more Nowadays, people share a lot of information in social media in the form of videos, news, photos, posts, likes, etc. This large amount of generated information reflects the opinions, emotions and preferences of the users. As an example of the previous, Pinterest is a popular social network where the users show their interests in the form of pins, which are information units formed by a short text comment and an image. In this research, we study the problem of building a model to characterize users of Pinterest with two demographic variables, age and gender, using their textual information post in the network. To do that, we introduce a dataset formed by the texts in English from 548,761 pins corresponding to 264 users. This dataset is imbalanced and reflects the actual distribution of the social network for gender and age, with a dominant presence of women over men, and of middle age persons over young persons. With this dataset, we conducted experiments with a diversity of machine learning models, a variety of features and considering a set of performance metrics. Our results produce interesting insights about the problem.
2019 International Conference on Electronics, Communications and Computers (CONIELECOMP), 2019
The globalization of Economy has forced the society to maintain a constant evolution in marketing... more The globalization of Economy has forced the society to maintain a constant evolution in marketing techniques. It is thus very important to design tools and methods that allow knowing and characterize individuals in groups to develop effective marketing strategies. In this context, any company would be interested in finding the tastes and preferences of people regarding the products and services offered in the global market. One technique that could help in this, is the analysis of the personality of each individual to identify their tastes and preferences. In this way we can offer products and services that meet their needs through appropriate advertising for each type of personality. In this work, we propose the use of latent features, extracted with a diversity of dimensionality reduction methods, to infer the personality of Twitter users using textual content-based features, and we compare the performance of the different techniques. For conducting our experiments, we use the PAN CLEF 2015 dataset consisting of 14,166 tweets in English of 152 different users, and a diversity of classification methods. Our results shows interesting insight about the personality prediction task.
We propose using text matching to measure the technological similarity between patents. Technolog... more We propose using text matching to measure the technological similarity between patents. Technology experts from different fields validate the new similarity measure and its improvement on measures based on the United States Patent Classification System, and identify its limitations. As an application, we replicate prior findings on the localization of knowledge spillovers by constructing a case-control group of text-matched patents. We also provide open access to the code and data to calculate the similarity between any two utility patents granted by the United States Patent and Trademark Office between 1976 and 2013, or between any two patent portfolios.
This paper presents an evolutionary method for learning lists of metarules for generalizing the s... more This paper presents an evolutionary method for learning lists of metarules for generalizing the selection of the best classifier for a given text dataset. The method builds rules based on features of a set of training text datasets, and evolves them using special crossover and mutation operators. Once the rules are learned, they are tested in a different set of datasets to demonstrate their accuracy and generality. Our experiments show encouraging results.
In this article, a multi-objective evolutionary framework to build selection hyper-heuristics for... more In this article, a multi-objective evolutionary framework to build selection hyper-heuristics for solving instances of the 2D bin packing problem is presented. The approach consists of a multi-objective evolutionary learning process, using specific tailored genetic operators, to produce sets of variable length rules representing hyper-heuristics. Each hyper-heuristic builds a solution to a given problem instance by sensing the state of the instance , and deciding which single heuristic to apply at each decision point. The hyper-heuristics consider the minimization of two conflicting objectives when building a solution: the number of bins used to accommodate the pieces and the total time required to do the job. The proposed framework integrates three well-studied multi-objective evolutionary algorithms to produce sets of Pareto-approximated hyper-heuristics: the Non-dominated Sorting Genetic Algorithm-II, the Strength Pareto Evolutionary Algorithm 2, and the Generalized Differential Evolution Algorithm 3. We conduct an extensive experimental analysis using a large set of 2D bin packing problem instances containing convex and non-convex irregular pieces, under many conditions, settings and using several performance metrics. The analysis assesses the ro-bustness and flexibility of the proposed approach, providing encouraging results when compared against a set of well-known baseline single heuristics.
When we refer to an image that attracts our attention, it is natural to mention not only what is ... more When we refer to an image that attracts our attention, it is natural to mention not only what is literally depicted in the image, but also the sentiments, thoughts and opinions that it invokes in ourselves. In this work we deviate from the standard mainstream tasks of associating tags or keywords to an image, or generating content image descriptions, and we introduce the novel task of automatically generate user comments for an image. We present a new dataset collected from the social media Pinterest and we propose a strategy based on building joint textual and visual user models, tailored to the specificity of the mentioned task. We conduct an extensive experimental analysis of our approach on both qualitative and quantitative terms, which allows assessing the value of the proposed approach and shows its encouraging results against several existing image-to-text methods.
In this paper, we focus on cross-modal (visual and textual) e-commerce search within the fashion ... more In this paper, we focus on cross-modal (visual and textual) e-commerce search within the fashion domain. Particularly, we investigate two tasks: 1) given a query image, we retrieve textual descriptions that correspond to the visual attributes in the query; and 2) given a textual query that may express an interest in specific visual product characteristics, we retrieve relevant images that exhibit the required visual attributes. To this end, we introduce a new dataset that consists of 53,689 images coupled with textual descriptions. The images contain fashion garments that display a great variety of visual attributes, such as different shapes, colors and textures in natural language. Unlike previous datasets, the text provides a rough and noisy description of the item in the image. We extensively analyze this dataset in the context of cross-modal e-commerce search. We investigate two state-of-the-art latent variable models to bridge between textual and visual data: bilingual latent Dirichlet allocation and canonical correlation analysis. We use state-of-the-art visual and textual features and report promising results.
In the present article we introduce and validate an approach for single-label multi-class document categorization based on text content features. The approach exploits a statistical property of Principal Component Analysis: it minimizes the reconstruction error of the training documents used to compute a low-rank category transformation matrix. Such a matrix transforms the original set of training documents from a given category into a new low-rank space and then optimally reconstructs them in the original space with a minimum reconstruction error. The proposed method, called Minimizer of the Reconstruction Error (mRE) classifier, uses this property and extends and applies it to new unseen test documents. Several experiments on four multi-class datasets for text categorization are conducted in order to demonstrate the stable and generally better performance of the proposed approach in comparison with other popular classification methods.
In this article we build multi-objective hyper-heuristics (MOHHs) using the multi-objective evolutionary algorithm NSGA-II for solving irregular 2D cutting stock problems under a bi-objective minimization schema, with a trade-off between the number of sheets used to fit a finite number of pieces and the time required to perform the placement of these pieces. The main idea of a MOHH is to find a set of simple heuristics that can be combined into a general solution, where a single heuristic is applied depending on the current condition of the problem, instead of applying one unique heuristic during the whole placement process. MOHHs are built through a learning process using NSGA-II, which evolves combinations of condition-action rules, producing at the end a set of Pareto-optimal MOHHs. We test the approximated MOHHs on several sets of benchmark problems and present the results.
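A minimal sketch of the condition-action idea behind a hyper-heuristic: the current problem state selects which simple placement heuristic to apply next. The state features, rule conditions and heuristic names below are hypothetical; in the paper, the rule sets themselves are what NSGA-II evolves:

```python
import math

# Each rule pairs a condition (a point in a simple problem-state space,
# here: fraction of pieces remaining, mean piece area relative to the sheet)
# with an action naming a placement heuristic. Values are illustrative.
RULES = [
    ((0.9, 0.6), "first_fit_decreasing"),
    ((0.5, 0.3), "best_fit"),
    ((0.1, 0.1), "bottom_left"),
]

def select_heuristic(state, rules=RULES):
    """Apply the rule whose condition is closest (Euclidean) to the state."""
    _, action = min(rules, key=lambda rule: math.dist(rule[0], state))
    return action
```

During placement, this selection runs every time the problem state changes, so different heuristics fire at different stages of filling the sheets.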
This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first one is a classic classification scenario using 10-fold cross-validation on several corpora, including four ground-truth standard corpora: Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus, as well as one proprietary corpus. In the second schema we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods in the frame of email classification on different benchmarking corpora, and the evidence that the technique of Biased Discriminant Analysis in particular offers better discriminative features for the classification, gives stable classification results notwithstanding the number of features chosen, and robustly retains its discriminative value over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules are built based on a limited number of features for filtering emails.
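The first evaluation schema (extract a few discriminative features, then classify under 10-fold cross-validation) can be sketched as follows. Since Biased Discriminant Analysis has no off-the-shelf scikit-learn implementation, plain Linear Discriminant Analysis stands in for it here, and synthetic features replace the email corpora; the pipeline shape, not the specific method, is the point:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic two-class features stand in for spam/ham content features.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)

# Reduce to a small number of discriminative features, then classify,
# evaluated with 10-fold cross-validation. (LDA is a stand-in for BDA.)
pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=1), GaussianNB())
scores = cross_val_score(pipe, X, y, cv=10)
mean_acc = scores.mean()
```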
This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier which uses Principal Component Analysis Document Reconstruction (PCADR). The idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that are used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. Thus, the classifier computes the PCA separately for each document class; when a new instance arrives to be classified, it is projected onto each set of computed PCs corresponding to each class and then reconstructed using the same PCs. The reconstruction error is computed and the classifier assigns the instance to the class with the smallest error or divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g. spam from ham, or phishing from ham). The experiments show that PCADR is able to obtain very good results with the different validation datasets employed, reaching a better performance than the popular Support Vector Machine classifier.
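The PCADR decision rule described above (fit one PCA per class, assign a new instance to the class whose components reconstruct it with the smallest error) can be sketched as follows. This is a minimal illustration on raw feature vectors; the text feature extraction and tuning of the actual experiments are omitted:

```python
import numpy as np
from sklearn.decomposition import PCA

class PCADRClassifier:
    """Per-class PCA; predict the class with the smallest reconstruction
    error. A minimal sketch of the PCADR idea, not the paper's full setup."""

    def __init__(self, n_components=2):
        self.n_components = n_components
        self.models = {}

    def fit(self, X, y):
        # One PCA model per class, fitted only on that class's documents.
        for label in np.unique(y):
            pca = PCA(n_components=self.n_components)
            pca.fit(X[y == label])
            self.models[label] = pca
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Project onto each class's PCs, reconstruct, and measure the error.
            errors = {}
            for label, pca in self.models.items():
                z = pca.transform(x.reshape(1, -1))
                x_hat = pca.inverse_transform(z).ravel()
                errors[label] = np.linalg.norm(x - x_hat)
            preds.append(min(errors, key=errors.get))
        return np.array(preds)
```

On data where each class lives in its own low-dimensional subspace, the wrong class's components discard most of an instance's signal, so the error gap is large.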
Hidden salting in digital media involves the intentional addition or distortion of content patterns with the purpose of evading content filtering. We propose a method to detect portions of a digital text source which are invisible to the end user when they are rendered on a visual medium (such as a computer monitor). The method consists of "tapping" into the rendering process and analyzing the rendering commands to identify portions of the source text (plaintext) which will be invisible to a human reader, using criteria based on text character and background colors, font size, overlapping characters, etc. Moreover, the text deemed visible (covertext) is reconstructed from the rendering commands and the character reading order is identified, which may differ from the rendering order. The detection and resolution of hidden salting is evaluated on two e-mail corpora, and the effectiveness of this method in the spam filtering task is assessed. We provide a solution to a relevant open problem in content filtering applications, namely the presence of tricks aimed at circumventing automatic filters.
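The visibility criteria can be sketched as a filter over rendered characters. The data structure below is a hypothetical simplification of real rendering commands, keeping only fields the criteria mention (foreground/background colors, font size, and a position used to recover reading order); overlap detection and the full criteria set are omitted:

```python
from dataclasses import dataclass

@dataclass
class RenderedChar:
    """One character as seen by the rendering process (fields illustrative)."""
    char: str
    fg_color: tuple   # (r, g, b) text color
    bg_color: tuple   # (r, g, b) background color
    font_size: float
    x: float          # horizontal position, used to recover reading order

def visible_text(chars, min_font_size=2.0):
    """Keep only characters a human reader could see: the text color must
    differ from the background and the font must not be vanishingly small.
    Sorting by position restores reading order, which may differ from the
    order in which characters were rendered."""
    seen = [c for c in chars
            if c.fg_color != c.bg_color and c.font_size >= min_font_size]
    return "".join(c.char for c in sorted(seen, key=lambda c: c.x))
```

A salted message thus yields two strings: the covertext returned here, and the hidden remainder, which can be flagged for the spam filter.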
In this work we employ Evolution Strategies (ES) to automatically extract a set of physical parameters, corresponding to stellar population synthesis, from a sample of galaxy spectra taken from the Sloan Digital Sky Survey (SDSS). We pose this parameter extraction as an optimization problem and then solve it using ES. The idea is to reconstruct each galaxy spectrum by means of a linear combination of three different theoretical models for stellar population synthesis. This combination produces a model spectrum that is compared with the original spectrum using a simple difference function. The goal is to find a model that minimizes this difference, using ES as the algorithm to explore the parameter space. We present experimental results using a set of 100 spectra from SDSS Data Release 2, which show that ES are very well suited to extracting stellar population parameters from galaxy spectra. Additionally, in order to better understand the performance of ES on this problem, we present a comparison with two other well-known stochastic search algorithms: Genetic Algorithms (GA) and Simulated Annealing (SA).
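A minimal sketch of the optimization loop, using synthetic stand-ins for the three theoretical model spectra and a simple (1+1)-ES with 1/5 success-rule step adaptation; the paper's exact ES variant, parameterization and difference function may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three synthetic "model spectra" and an observed spectrum built from them
# with known mixing weights (illustrative stand-ins for the stellar
# population synthesis models and an SDSS galaxy spectrum).
wavelengths = np.linspace(0.0, 1.0, 100)
models = np.array([np.sin(3 * wavelengths),
                   np.cos(5 * wavelengths),
                   wavelengths ** 2])
true_weights = np.array([0.5, 0.3, 0.2])
observed = true_weights @ models

def difference(w):
    """Squared difference between the observed and the mixed model spectrum."""
    return float(np.sum((observed - w @ models) ** 2))

# (1+1)-ES: mutate the weight vector, keep the child only if it improves,
# and adapt the mutation step toward a ~1/5 success rate.
w = rng.random(3)
sigma, successes = 0.1, 0
for i in range(1, 3001):
    child = w + sigma * rng.normal(size=3)
    if difference(child) < difference(w):
        w, successes = child, successes + 1
    if i % 50 == 0:
        sigma *= 1.5 if successes > 10 else 0.6
        successes = 0
best_error = difference(w)
```

Because the three model spectra here are linearly independent, the minimizer is unique, so the loop recovers the true mixing weights to high precision.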
In this era of "big data", hundreds or even thousands of patent applications arrive every day at patent offices around the world. One of the first tasks of the professional analysts in patent offices is to assign classification codes to those patents based on their content. Such classification codes are usually organized in hierarchical structures of concepts. Traditionally the classification task has been done manually by professional experts. However, given the large number of documents, the patent professionals are becoming overwhelmed. Since the hierarchical classification structures are also very complex (containing thousands of categories), reliable, fast and scalable methods and algorithms are needed to help the experts in patent classification tasks. This chapter describes, analyzes and reviews systems that, based on the textual content of patents, automatically classify such patents into a hierarchy of categories. This chapter focuses especially on the patent classification task applied to the International Patent Classification (IPC) hierarchy. The IPC is the most widely used classification structure for organizing patents; it is recognized worldwide, and several other structures use it or are based on it to ensure interoperability between offices.
In this demo we focus on cross-modal (visual and textual) e-commerce search within the fashion domain. Particularly, we demonstrate two tasks: 1) given a query image (without any accompanying text), we retrieve textual descriptions that correspond to the visual attributes in the visual query; and 2) given a textual query that may express an interest in specific visual characteristics, we retrieve relevant images (without leveraging textual meta-data) that exhibit the required visual attributes. The first task is especially useful for online stores managing image collections, which might want to automatically organize and mine predominantly visual items according to their attributes without human input. The second task is useful for users looking for items with specific visual characteristics when no text describing the target image is available. We use state-of-the-art visual and textual features, as well as a state-of-the-art latent variable model to bridge between textual and visual data: bilingual latent Dirichlet allocation. Unlike traditional search engines, we demonstrate a truly cross-modal system, which can directly bridge between visual and textual content without relying on pre-annotated meta-data.
In this paper we focus on cross-modal (visual and textual) attribute recognition within the fashion domain. Particularly, we investigate two tasks: 1) given a query image, we retrieve textual descriptions that correspond to the visual attributes in the query; and 2) given a textual query that may express visual characteristics, we retrieve relevant images that exhibit the required visual attributes. To this end, we collected a dataset that consists of 53,689 images coupled with textual descriptions in natural language. The images contain fashion garments that display a great variety of visual attributes, such as colors, shapes and textures. Unlike previous datasets, the text provides a rough and noisy description of the item in the image. We extensively analyze this dataset in the context of cross-modal attribute recognition. We investigate two latent variable models to bridge between textual and visual data: bilingual latent Dirichlet allocation and canonical correlation analysis. We use visual and textual features and report promising results.
The labeling of discussion forums with the cognitive levels of Bloom's taxonomy is a time-consuming and very expensive task, due to the large amount of information that needs to be labeled and the need for an expert in the educational field to apply the taxonomy to the messages of the forums. In this paper we present a framework to automatically label messages from discussion forums with the categories of Bloom's taxonomy. Several models were created using three kinds of machine learning approaches: linear, rule-based and combined classifiers. The models are evaluated using accuracy, the F1-measure and the area under the ROC curve. Additionally, the statistical significance of the results is assessed using McNemar's test in order to validate them. The results show that the combination of a linear classifier with a rule-based classifier yields very good and promising results for this difficult task.
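A minimal sketch of a combined classifier in the spirit described above: keyword rules take precedence, and a linear bag-of-words model handles the remaining messages. The example texts, labels (two Bloom levels) and keyword rules are invented for illustration, not taken from the paper's forum data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus: forum messages labeled with two Bloom levels.
texts = ["define the term recursion", "list the sorting algorithms",
         "state the main theorem", "name the design patterns",
         "justify your choice of model", "critique this proof",
         "argue for one approach", "assess the evidence given"]
labels = ["remember"] * 4 + ["evaluate"] * 4

# Hand-written keyword rules: Bloom's taxonomy associates characteristic
# verbs with each cognitive level.
RULES = {"define": "remember", "list": "remember",
         "justify": "evaluate", "critique": "evaluate"}

vec = CountVectorizer()
linear = LogisticRegression().fit(vec.fit_transform(texts), labels)

def classify(message):
    """Combined classifier: rules take precedence, then the linear model."""
    for keyword, label in RULES.items():
        if keyword in message.lower():
            return label
    return linear.predict(vec.transform([message]))[0]
```

The precedence order is one possible combination scheme; the paper evaluates several combinations and validates the differences with McNemar's test.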
Papers by Juan Carlos Gomez