-
Large Scale Generative AI Text Applied to Sports and Music
Authors:
Aaron Baughman,
Stephen Hammer,
Rahul Agarwal,
Gozde Akay,
Eduardo Morales,
Tony Johnson,
Leonid Karlinsky,
Rogerio Feris
Abstract:
We address the problem of scaling up the production of media content, including commentary and personalized news stories, for large-scale sports and music events worldwide. Our approach relies on generative AI models to transform a large volume of multimodal data (e.g., videos, articles, real-time scoring feeds, statistics, and fact sheets) into coherent and fluent text. Based on this approach, we introduce, for the first time, an AI commentary system, which was deployed to produce automated narrations for highlight packages at the 2023 US Open, Wimbledon, and Masters tournaments. In the same vein, our solution was extended to create personalized content for ESPN Fantasy Football and stories about music artists for the Grammy awards. These applications were built using a common software architecture that achieved a 15x speed improvement with an average Rouge-L of 82.00 and perplexity of 6.6. Our work was successfully deployed at the aforementioned events, supporting 90 million fans around the world with 8 billion page views, continuously pushing the bounds on what is possible at the intersection of sports, entertainment, and AI.
Submitted 27 February, 2024; v1 submitted 31 January, 2024;
originally announced February 2024.
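The abstract above reports an average Rouge-L and a perplexity without saying how either was computed. As a rough illustration only, the sketch below shows one common way to compute both: ROUGE-L as an F-measure over the longest common subsequence of whitespace tokens, and perplexity as the exponential of the mean negative log-probability per token. The function names, the whitespace tokenization, and the 0-to-1 scale are assumptions made for this example and are not taken from the paper.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F-measure on whitespace tokens (assumed tokenization), in [0, 1]."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A score reported as 82.00 would correspond to 0.82 on this 0-to-1 scale if the same definition were used, which the abstract does not confirm.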
-
Parameterized Complexity of Weighted Local Hamiltonian Problems and the Quantum Exponential Time Hypothesis
Authors:
Michael J. Bremner,
Zhengfeng Ji,
Xingjian Li,
Luke Mathieson,
Mauro E. S. Morales
Abstract:
We study a parameterized version of the local Hamiltonian problem, called the weighted local Hamiltonian problem, where the relevant quantum states are superpositions of computational basis states of Hamming weight $k$. The Hamming weight constraint can have a physical interpretation as a constraint on the number of excitations allowed or particle number in a system. We prove that this problem is in QW[1], the first level of the quantum weft hierarchy, and that it is hard for QM[1], the quantum analogue of M[1]. Our results show that this problem cannot be fixed-parameter quantum tractable (FPQT) unless a certain natural quantum analogue of the exponential time hypothesis (ETH) is false.
Submitted 9 November, 2022;
originally announced November 2022.
-
Quantum Parameterized Complexity
Authors:
Michael J. Bremner,
Zhengfeng Ji,
Ryan L. Mann,
Luke Mathieson,
Mauro E. S. Morales,
Alexis T. E. Shaw
Abstract:
Parameterized complexity theory was developed in the 1990s to enrich the complexity-theoretic analysis of problems that depend on a range of parameters. In this paper we establish a quantum equivalent of classical parameterized complexity theory, motivated by the need for new tools for the classifications of the complexity of real-world problems. We introduce the quantum analogues of a range of parameterized complexity classes and examine the relationship between these classes, their classical counterparts, and well-studied problems. This framework exposes a rich classification of the complexity of parameterized versions of QMA-hard problems, demonstrating, for example, a clear separation between the Quantum Circuit Satisfiability problem and the Local Hamiltonian problem.
Submitted 15 March, 2022;
originally announced March 2022.
-
Deep Artificial Intelligence for Fantasy Football Language Understanding
Authors:
Aaron Baughman,
Micah Forester,
Jeff Powell,
Eduardo Morales,
Shaun McPartlin,
Daniel Bohm
Abstract:
Fantasy sports allow fans to manage a team of their favorite athletes and compete with friends. The fantasy platform aligns the real-world statistical performance of athletes to fantasy scoring and has steadily risen in popularity to an estimated 9.1 million players per month with 4.4 billion player card views on the ESPN Fantasy Football platform from 2018-2019. In parallel, the sports media community produces news stories, blogs, forum posts, tweets, videos, podcasts and opinion pieces that are both within and outside the context of fantasy sports. However, human fantasy football players can only analyze an average of 3.9 sources of information. Our work discusses the results of a machine learning pipeline to manage an ESPN Fantasy Football team. The use of trained statistical entity detectors and document2vector models applied to over 100,000 news sources and 2.3 million articles, videos and podcasts each day enables the system to comprehend natural language with an analogy test accuracy of 100% and keyword test accuracy of 80%. Deep learning feedforward neural networks provide player classifications such as if a player will be a bust, boom, play with a hidden injury or play meaningful touches with a cumulative 72% accuracy. Finally, a multiple regression ensemble uses the deep learning output and ESPN projection data to provide a point projection for each of the top 500+ fantasy football players in 2018. The point projection maintained a RMSE of 6.78 points. The best fit probability density function from a set of 24 is selected to visualize score spreads. Within the first 6 weeks of the product launch, the total number of users spent a cumulative time of over 4.6 years viewing our AI insights. The training data for our models was provided by a 2015 to 2016 web archive from Webhose, ESPN statistics, and Rotowire injury reports. We used 2017 fantasy football data as a test set.
Submitted 4 November, 2021;
originally announced November 2021.
-
Large Scale Diverse Combinatorial Optimization: ESPN Fantasy Football Player Trades
Authors:
Aaron Baughman,
Daniel Bohm,
Micah Forster,
Eduardo Morales,
Jeff Powell,
Shaun McPartlin,
Raja Hebbar,
Kavitha Yogaraj,
Yoshika Chhabra,
Sudeep Ghosh,
Rukhsan Ul Haq,
Arjun Kashyap
Abstract:
Even skilled fantasy football managers can be disappointed by their mid-season rosters as some players inevitably fall short of draft day expectations. Team managers can quickly discover that their team has a low score ceiling even if they start their best active players. A novel and diverse combinatorial optimization system proposes high volume and unique player trades between complementary teams to balance trade fairness. Several algorithms create the valuation of each fantasy football player with an ensemble of computing models: Quantum Support Vector Classifier with Permutation Importance (QSVC-PI), Quantum Support Vector Classifier with Accumulated Local Effects (QSVC-ALE), Variational Quantum Circuit with Permutation Importance (VQC-PI), Hybrid Quantum Neural Network with Permutation Importance (HQNN-PI), eXtreme Gradient Boosting Classifier (XGB), and Subject Matter Expert (SME) rules. The valuation of each player is personalized based on league rules, roster, and selections. The cost of trading away a player is related to a team's roster, such as the depth at a position, slot count, and position importance. Teams are paired together for trading based on a cosine dissimilarity score so that teams can offset their strengths and weaknesses. A knapsack 0-1 algorithm computes outgoing players for each team. Postprocessors apply analytics and deep learning models to measure 6 different objective measures about each trade. Over the 2020 and 2021 National Football League (NFL) seasons, a group of 24 experts from IBM and ESPN evaluated trade quality through 10 Football Error Analysis Tool (FEAT) sessions. Our system started with 76.9% of high-quality trades and was deployed for the 2021 season with 97.3% of high-quality trades. To increase trade quantity, our quantum, classical, and rules-based computing have 100% trade uniqueness. We use Qiskit's quantum simulators throughout our work.
Submitted 18 April, 2022; v1 submitted 4 November, 2021;
originally announced November 2021.
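The abstract above names two standard building blocks: a cosine dissimilarity score for pairing teams and a 0-1 knapsack for choosing outgoing players. The sketch below is a generic textbook version of each, not the authors' system. The feature vectors, the integer "cost" and "value" inputs, and the function names are all hypothetical, and the abstract does not say how the real system defines them.

```python
import math


def cosine_dissimilarity(u, v):
    """1 - cosine similarity; larger means the two vectors point in more different directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat a zero vector as maximally uninformative
    return 1.0 - dot / (norm_u * norm_v)


def knapsack_01(values, weights, capacity):
    """Classic 0-1 knapsack by dynamic programming over integer weights.

    Returns (best total value, sorted indices of the chosen items).
    """
    n = len(values)
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, w = values[i - 1], weights[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip item i-1
            if w <= c:
                best[i][c] = max(best[i][c], best[i - 1][c - w] + v)  # take it
    # Walk back through the table to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return best[n][capacity], sorted(chosen)
```

For instance, `knapsack_01([60, 100, 120], [1, 2, 3], 5)` selects the second and third items, and two teams with orthogonal strength vectors get a dissimilarity of 1.0, which would make them candidates for pairing under a "most complementary" rule.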
-
Knowledge-Based Hierarchical POMDPs for Task Planning
Authors:
Sergio A. Serrano,
Elizabeth Santiago,
Jose Martinez-Carranza,
Eduardo Morales,
L. Enrique Sucar
Abstract:
The main goal in task planning is to build a sequence of actions that takes an agent from an initial state to a goal state. In robotics, this is particularly difficult because actions usually have several possible results, and sensors are prone to produce measurements with error. Partially observable Markov decision processes (POMDPs) are commonly employed, thanks to their capacity to model the uncertainty of actions that modify and monitor the state of a system. However, since solving a POMDP is computationally expensive, their usage becomes prohibitive for most robotic applications. In this paper, we propose a task planning architecture for service robotics. In the context of service robot design, we present a scheme to encode knowledge about the robot and its environment, that promotes the modularity and reuse of information. Also, we introduce a new recursive definition of a POMDP that enables our architecture to autonomously build a hierarchy of POMDPs, so that it can be used to generate and execute plans that solve the task at hand. Experimental results show that, in comparison to baseline methods, by following a recursive hierarchical approach the architecture is able to significantly reduce the planning time, while maintaining (or even improving) the robustness under several scenarios that vary in uncertainty and size.
Submitted 9 April, 2021; v1 submitted 19 March, 2021;
originally announced March 2021.
-
Discovering Clinically Meaningful Shape Features for the Analysis of Tumor Pathology Images
Authors:
Esteban Fernández Morales,
Cong Zhang,
Guanghua Xiao,
Chul Moon,
Qiwei Li
Abstract:
With advanced imaging technology, digital pathology imaging of tumor tissue slides is becoming a routine clinical procedure for cancer diagnosis. This process produces massive imaging data that capture histological details in high resolution. Recent developments in deep-learning methods have enabled us to automatically detect and characterize the tumor regions in pathology images at large scale. From each identified tumor region, we extracted 30 well-defined descriptors that quantify its shape, geometry, and topology. We demonstrated how those descriptor features were associated with patient survival outcome in lung adenocarcinoma patients from the National Lung Screening Trial (n=143). In addition, a descriptor-based prognostic model was developed and validated in an independent patient cohort from The Cancer Genome Atlas Program (n=318). This study offers new insights into the relationship between tumor shape, geometrical, and topological features and patient prognosis. We provide software in the form of R code on GitHub: https://github.com/estfernandez/Slide_Image_Segmentation_and_Extraction.
Submitted 9 December, 2020;
originally announced December 2020.
-
High precision indoor positioning by means of LiDAR
Authors:
Eduardo Sánchez Morales,
Michael Botsch,
Bertold Huber,
Andrés García Higuera
Abstract:
The trend towards autonomous driving and the continuous research in the automotive area, like Advanced Driver Assistance Systems (ADAS), require accurate localization under all circumstances. An accurate estimation of the vehicle state is a basic requirement for any trajectory-planning algorithm. Still, even when the introduction of the GPS L5 band promises lane-accuracy, coverage limitations in roofed areas have to be addressed. In this work, a method for high precision indoor positioning using a LiDAR is presented. The method is based on the combination of motion models with LiDAR measurements, and uses infrastructural elements as positioning references. This allows the orientation, velocity over ground, and position of a vehicle to be estimated in a Local Tangent Plane (LTP) reference frame. When the outputs of the proposed method are compared to those of an Automotive Dynamic Motion Analyzer (ADMA), mean errors of 1 degree, 0.1 m/s, and 4.7 cm, respectively, are obtained. The method can be implemented by using a LiDAR sensor as a stand-alone unit. A median runtime of 40.77 µs on an Intel i7-6820HQ CPU signals the possibility of real-time processing.
Submitted 14 May, 2020;
originally announced May 2020.
-
High Precision Indoor Navigation for Autonomous Vehicles
Authors:
Eduardo Sánchez Morales,
Michael Botsch,
Bertold Huber,
Andrés García Higuera
Abstract:
Autonomous driving is an important trend of the automotive industry. The continuous research towards this goal requires a precise reference vehicle state estimation under all circumstances in order to develop and test autonomous vehicle functions. However, even when lane-accurate positioning is expected from upcoming technologies, like the L5 GPS band, the question of accurate positioning in roofed areas, e.g., tunnels or parking garages, still has to be addressed.
In this paper, a novel procedure for a reference vehicle state estimation is presented. The procedure includes three main components. First, a robust standstill detection based purely on signals from an Inertial Measurement Unit. Second, a vehicle state estimation by means of statistical filtering. Third, a high accuracy LiDAR-based positioning method that delivers velocity, position, and orientation correction data with a mean error of 0.1 m/s, 4.7 cm, and 1$^\circ$, respectively. Runtime tests on a CPU indicate the possibility of real-time implementation.
Submitted 14 May, 2020;
originally announced May 2020.
-
Parallel Multi-Hypothesis Algorithm for Criticality Estimation in Traffic and Collision Avoidance
Authors:
Eduardo Sánchez Morales,
Richard Membarth,
Andreas Gaull,
Philipp Slusallek,
Tobias Dirndorfer,
Alexander Kammenhuber,
Christoph Lauer,
Michael Botsch
Abstract:
Due to the current developments towards autonomous driving and vehicle active safety, there is an increasing necessity for algorithms that are able to perform complex criticality predictions in real-time. Being able to process multi-object traffic scenarios aids the implementation of a variety of automotive applications such as driver assistance systems for collision prevention and mitigation as well as fall-back systems for autonomous vehicles.
We present a fully model-based algorithm with a parallelizable architecture. The proposed algorithm can evaluate the criticality of complex, multi-modal (vehicles and pedestrians) traffic scenarios by simulating millions of trajectory combinations and detecting collisions between objects. The algorithm is able to estimate upcoming criticality at very early stages, demonstrating its potential for vehicle safety-systems and autonomous driving applications. An implementation on an embedded system in a test vehicle proves in a prototypical manner the compatibility of the algorithm with the hardware possibilities of modern cars. For a complex traffic scenario with 11 dynamic objects, more than 86 million pose combinations are evaluated in 21 ms on the GPU of a Drive PX 2.
Submitted 14 May, 2020;
originally announced May 2020.
-
Vehicle Position Estimation with Aerial Imagery from Unmanned Aerial Vehicles
Authors:
Friedrich Kruber,
Eduardo Sánchez Morales,
Samarjit Chakraborty,
Michael Botsch
Abstract:
The availability of real-world data is a key element for novel developments in the fields of automotive and traffic research. Aerial imagery has the major advantage of recording multiple objects simultaneously and overcomes limitations such as occlusions. However, there are only a few data sets available. This work describes a process to estimate a precise vehicle position from aerial imagery. A robust object detection is crucial for reliable results, hence the state-of-the-art deep neural network Mask-RCNN is applied for that purpose. Two training data sets are employed: the first one is optimized for detecting the test vehicle, while the second one consists of randomly selected images recorded on public roads. To reduce errors, several aspects are accounted for, such as the drone movement and the perspective projection from a photograph. The estimated position is compared with a reference system installed in the test vehicle. It is shown that a mean accuracy of 20 cm can be achieved with flight altitudes up to 100 m, Full-HD resolution, and a frame-by-frame detection. A reliable position estimation is the basis for further data processing, such as obtaining additional vehicle state variables. The source code, training weights, labeled data, and example videos are made publicly available. This helps researchers create new traffic data sets with specific local conditions.
Submitted 13 May, 2020; v1 submitted 17 April, 2020;
originally announced April 2020.
-
Unsupervised and Supervised Learning with the Random Forest Algorithm for Traffic Scenario Clustering and Classification
Authors:
Friedrich Kruber,
Jonas Wurst,
Eduardo Sánchez Morales,
Samarjit Chakraborty,
Michael Botsch
Abstract:
The goal of this paper is to provide a method, which is able to find categories of traffic scenarios automatically. The architecture consists of three main components: A microscopic traffic simulation, a clustering technique and a classification technique for the operational phase. The developed simulation tool models each vehicle separately, while maintaining the dependencies between each other. The clustering approach consists of a modified unsupervised Random Forest algorithm to find a data adaptive similarity measure between all scenarios. As part of this, the path proximity, a novel technique to determine a similarity based on the Random Forest algorithm is presented. In the second part of the clustering, the similarities are used to define a set of clusters. In the third part, a Random Forest classifier is trained using the defined clusters for the operational phase. A thresholding technique is described to ensure a certain confidence level for the class assignment. The method is applied for highway scenarios. The results show that the proposed method is an excellent approach to automatically categorize traffic scenarios, which is particularly relevant for testing autonomous vehicle functionality.
Submitted 5 April, 2020;
originally announced April 2020.
-
Towards AutoML in the presence of Drift: first results
Authors:
Jorge G. Madrid,
Hugo Jair Escalante,
Eduardo F. Morales,
Wei-Wei Tu,
Yang Yu,
Lisheng Sun-Hosoya,
Isabelle Guyon,
Michele Sebag
Abstract:
Research progress in AutoML has led to state-of-the-art solutions that can cope quite well with supervised learning tasks, e.g., classification with AutoSklearn. However, so far these systems do not take into account the changing nature of evolving data over time (i.e., they still assume i.i.d. data), even though such domains are increasingly common in real applications (e.g., spam filtering, user preferences, etc.). We describe a first attempt to develop an AutoML solution for scenarios in which data distribution changes relatively slowly over time and in which the problem is approached in a lifelong learning setting. We extend Auto-Sklearn with sound and intuitive mechanisms that allow it to cope with this sort of problem. The extended Auto-Sklearn is combined with concept drift detection techniques that allow it to automatically determine when the initial models have to be adapted. We report experimental results in benchmark data from AutoML competitions that adhere to this scenario. Results demonstrate the effectiveness of the proposed methodology.
Submitted 24 July, 2019;
originally announced July 2019.
-
Reachability Deficits in Quantum Approximate Optimization
Authors:
V. Akshay,
H. Philathong,
M. E. S. Morales,
J. Biamonte
Abstract:
The quantum approximate optimization algorithm (QAOA) has rapidly become a cornerstone of contemporary quantum algorithm development. Despite a growing range of applications, only a few results have been developed towards understanding the algorithm's ultimate limitations. Here we report that QAOA exhibits a strong dependence on a problem instance's constraint-to-variable ratio: this problem density places a limiting restriction on the algorithm's capacity to minimize a corresponding objective function (and hence solve optimization problem instances). Such reachability deficits persist even in the absence of barren plateaus [McClean et al., 2018] and are outside of the recently reported level-1 QAOA limitations [Hastings 2019]. These findings are among the first to determine strong limitations on variational quantum approximate optimization.
Submitted 24 October, 2019; v1 submitted 26 June, 2019;
originally announced June 2019.
-
Meta-learning of textual representations
Authors:
Jorge Madrid,
Hugo Jair Escalante,
Eduardo Morales
Abstract:
Recent progress in AutoML has led to state-of-the-art methods (e.g., AutoSKLearn) that can be readily used by non-experts to approach any supervised learning problem. Whereas these methods are quite effective, they are still limited in the sense that they work for tabular (matrix formatted) data only. This paper describes one step forward in trying to automate the design of supervised learning methods in the context of text mining. We introduce a meta-learning methodology for automatically obtaining a representation for text mining tasks starting from raw text. We report experiments considering 60 different textual representations and more than 80 text mining datasets associated with a wide variety of tasks. Experimental results show the proposed methodology is a promising solution to obtain highly effective off-the-shelf text classification pipelines.
Submitted 19 July, 2019; v1 submitted 20 June, 2019;
originally announced June 2019.
-
Entanglement Scaling in Quantum Advantage Benchmarks
Authors:
Jacob D. Biamonte,
Mauro E. S. Morales,
Dax Enshan Koh
Abstract:
A contemporary technological milestone is to build a quantum device performing a computational task beyond the capability of any classical computer, an achievement known as quantum adversarial advantage. In what ways can the entanglement realized in such a demonstration be quantified? Inspired by the area law of tensor networks, we derive an upper bound for the minimum random circuit depth needed to generate the maximal bipartite entanglement correlations between all problem variables (qubits). This bound is (i) lattice geometry dependent and (ii) makes explicit a nuance implicit in other proposals with physical consequence. The hardware itself should be able to support super-logarithmic ebits of entanglement across some poly($n$) number of qubit-bipartitions, otherwise the quantum state itself will not possess volumetric entanglement scaling and full-lattice-range correlations. Hence, as we present a connection between quantum advantage protocols and quantum entanglement, the entanglement implicitly generated by such protocols can be tested separately to further ascertain the validity of any quantum advantage claim.
Submitted 31 December, 2019; v1 submitted 1 August, 2018;
originally announced August 2018.
-
Term-Weighting Learning via Genetic Programming for Text Classification
Authors:
Hugo Jair Escalante,
Mauricio A. García-Limón,
Alicia Morales-Reyes,
Mario Graff,
Manuel Montes-y-Gómez,
Eduardo F. Morales
Abstract:
This paper describes a novel approach to learning term-weighting schemes (TWSs) in the context of text classification. In text mining a TWS determines the way in which documents will be represented in a vector space model, before applying a classifier. Whereas acceptable performance has been obtained with standard TWSs (e.g., Boolean and term-frequency schemes), the definition of TWSs has been traditionally an art. Further, it is still a difficult task to determine the best TWS for a particular problem, and it is not yet clear whether better schemes than those currently available can be generated by combining known TWSs. We propose in this article a genetic program that aims at learning effective TWSs that can improve the performance of current schemes in text classification. The genetic program learns how to combine a set of basic units to give rise to discriminative TWSs. We report an extensive experimental study comprising data sets from thematic and non-thematic text classification as well as from image classification. Our study shows the validity of the proposed method; in fact, we show that TWSs learned with the genetic program outperform traditional schemes and other TWSs proposed in recent works. Further, we show that TWSs learned from a specific domain can be effectively used for other tasks.
Submitted 6 October, 2014; v1 submitted 2 October, 2014;
originally announced October 2014.