Search | arXiv e-print repository

doi 10.3390/s19143155

Benchmarking Particle Filter Algorithms for Efficient Velodyne-Based Vehicle Localization

Authors: Jose Luis Blanco-Claraco, Francisco Mañas-Alvarez, Jose Luis Torres-Moreno, Francisco Rodriguez, Antonio Gimenez-Fernandez

Abstract: Keeping a vehicle well-localized within a prebuilt map is at the core of any autonomous vehicle navigation system. In this work, we show that both standard SIR sampling and rejection-based optimal sampling are suitable for efficient (10 to 20 ms) real-time pose tracking without feature detection, that is, using raw point clouds from a 3D LiDAR. Motivated by the large amount of information captured by these sensors, we perform a systematic statistical analysis of how many points are actually required to reach an optimal ratio between efficiency and positioning accuracy. Furthermore, for initialization under adverse conditions, e.g., poor GPS signal in urban canyons, we also identify the optimal particle filter settings required to ensure convergence. Our findings include that a decimation factor between 100 and 200 on incoming point clouds provides large savings in computational cost with a negligible loss in localization accuracy for a VLP-16 scanner. Furthermore, an initial density of $\sim$2 particles/m$^2$ is required to achieve 100% convergence success for large-scale ($\sim$100,000 m$^2$), outdoor global localization without any additional hint from GPS or magnetic field sensors. All implementations have been released as open-source software.

Submitted 16 January, 2024; originally announced January 2024.

Comments: 24 pages, 13 figures

arXiv:2112.13241 [pdf, other]

A Preliminary Study for Literary Rhyme Generation based on Neuronal Representation, Semantics and Shallow Parsing

Authors: Luis-Gil Moreno-Jiménez, Juan-Manuel Torres-Moreno, Roseli S. Wedemann

Abstract: In recent years, researchers in the area of Computational Creativity have studied the human creative process, proposing different approaches to reproduce it with a formal procedure. In this paper, we introduce a model for the generation of literary rhymes in Spanish, combining structures of language and neural network models (Word2vec) into a structure for semantic assimilation. The results obtained with a manual evaluation of the texts generated by our algorithm are encouraging.

Submitted 25 December, 2021; originally announced December 2021.

Comments: 7 pages, 2 figures

Journal ref: STIL 2021 - Symposium in Information and Human Language Technology / Bracis

arXiv:2112.10189 [pdf, ps, other]

LUC at ComMA-2021 Shared Task: Multilingual Gender Biased and Communal Language Identification without using linguistic features

Authors: Rodrigo Cuéllar-Hidalgo, Julio de Jesús Guerrero-Zambrano, Dominic Forest, Gerardo Reyes-Salgado, Juan-Manuel Torres-Moreno

Abstract: This work aims to evaluate the ability that both probabilistic and state-of-the-art vector space modeling (VSM) methods provide to well-known machine learning algorithms to identify social network documents to be classified as aggressive, gender biased or communally charged. To this end, an exploratory stage was performed first in order to find relevant settings to test, i.e., by using training and development samples, we trained multiple algorithms using multiple vector space modeling and probabilistic methods and discarded the less informative configurations. These systems were submitted to the competition of the ComMA@ICON'21 Workshop on Multilingual Gender Biased and Communal Language Identification.

Submitted 19 December, 2021; originally announced December 2021.

Comments: 6 pages

Journal ref: ComMA-2021 Shared Task: Multilingual Gender Biased and Communal Language Identification

arXiv:2005.08223 [pdf, ps, other]

LiSSS: A toy corpus of Spanish Literary Sentences for Emotions detection

Authors: Juan-Manuel Torres-Moreno, Luis-Gil Moreno-Jiménez

Abstract: In this work we present a new small data-set in the Computational Creativity (CC) field, the Spanish Literary Sentences for emotions detection corpus (LISSS). We address this corpus of literary sentences in order to evaluate or design algorithms for emotion classification and detection. We constituted this corpus by manually classifying the sentences into a set of emotions: Love, Fear, Happiness, Anger and Sadness/Pain. We also present some baseline classification algorithms applied to our corpus. The LISSS corpus will be available to the community as a free resource to evaluate or create CC-like algorithms.

Submitted 6 June, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

Comments: 8 pages, 3 tables

arXiv:2005.00468 [pdf]

Automatic Discourse Segmentation: Review and Perspectives

Authors: Iria da Cunha, Juan-Manuel Torres-Moreno

Abstract: Multilingual discourse parsing is a very prominent research topic. The first stage for discourse parsing is discourse segmentation. The study reported in this article addresses a review of two on-line available discourse segmenters (for English and Portuguese). We evaluate the possibility of developing similar discourse segmenters for Spanish, French and African languages.

Submitted 1 May, 2020; originally announced May 2020.

Comments: 5 pages, 1 figure

Journal ref: International Workshop on African Human Language Technologies. 17-20 Jan 2010

arXiv:2004.06747 [pdf, other]

Extending Text Informativeness Measures to Passage Interestingness Evaluation (Language Model vs. Word Embedding)

Authors: Carlos-Emiliano González-Gallardo, Eric SanJuan, Juan-Manuel Torres-Moreno

Abstract: Standard informativeness measures used to evaluate Automatic Text Summarization mostly rely on n-gram overlapping between the automatic summary and the reference summaries. These measures differ in the metric they use (cosine, ROUGE, Kullback-Leibler, Logarithm Similarity, etc.) and the bag of terms they consider (single words, word n-grams, entities, nuggets, etc.). Recent word embedding approaches offer a continuous alternative to discrete approaches based on the presence/absence of a text unit. Informativeness measures have been extended to Focus Information Retrieval evaluation involving a user's information need represented by short queries. In particular, for the task of CLEF-INEX Tweet Contextualization, tweet contents have been considered as queries. In this paper we define the concept of Interestingness as a generalization of Informativeness, whereby the information need is diverse and formalized as an unknown set of implicit queries. We then study the ability of state-of-the-art Informativeness measures to cope with this generalization. Next, we show that with this new framework, standard word embeddings outperform discrete measures only on uni-grams; however, bi-grams seem to be a key point of interestingness evaluation. Lastly, we prove that the CLEF-INEX Tweet Contextualization 2012 Logarithm Similarity measure provides the best results.

Submitted 14 April, 2020; originally announced April 2020.

arXiv:2004.04468 [pdf, other]

A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

Authors: Elvys Linhares Pontes, Stéphane Huet, Juan-Manuel Torres-Moreno, Thiago G. da Silva, Andréa Carneiro Linhares

Abstract: Multi-Sentence Compression (MSC) aims to generate a short sentence with the key information from a cluster of similar sentences. MSC enables summarization and question-answering systems to generate outputs combining fully formed sentences from one or several documents. This paper describes an Integer Linear Programming method for MSC using a vertex-labeled graph to select different keywords, with the goal of generating more informative sentences while maintaining their grammaticality. Our system is of good quality and outperforms the state of the art in evaluations conducted on news datasets in three languages: French, Portuguese and Spanish. We conducted both automatic and manual evaluations to determine the informativeness and the grammaticality of compressions for each dataset. In additional tests, which take advantage of the fact that the length of compressions can be modulated, we still improve ROUGE scores with shorter output sentences.

Submitted 9 April, 2020; originally announced April 2020.

Comments: Preprint version

Journal ref: Computación y Sistemas Vol. 24, No. 2, 2020

arXiv:2002.04095 [pdf, other]

Automatic Discourse Segmentation: an evaluation in French

Authors: Rémy Saksik, Alejandro Molina-Villegas, Andréa Carneiro Linhares, Juan-Manuel Torres-Moreno

Abstract: In this article, we describe some discursive segmentation methods as well as a preliminary evaluation of the segmentation quality. Although our experiments were carried out on documents in French, we have developed three discursive segmentation models solely based on resources simultaneously available in several languages: marker lists and statistical POS labeling. We have also carried out automatic evaluations of these systems against the Annodis corpus, which is a manually annotated reference. The results obtained are very encouraging.

Submitted 11 June, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

Comments: 7 pages, 2 figures, 2 tables

arXiv:2001.11382 [pdf, ps, other]

Intweetive Text Summarization

Authors: Jean Valère Cossu, Juan-Manuel Torres-Moreno, Eric SanJuan, Marc El-Bèze

Abstract: The amount of user-generated content from various social media allows analysts to get a wide view of conversations on several topics related to their business. Nevertheless, keeping up-to-date with this amount of information is not humanly feasible. Automatic Summarization then provides an interesting means to digest the dynamics and the mass volume of content. In this paper, we address the issue of tweet summarization, which remains scarcely explored. We propose to automatically generate summaries of Micro-Blog conversations dealing with public figures' E-Reputation. These summaries are generated using key-word queries or sample tweets and offer a focused view of the whole Micro-Blog network. Since the state of the art is lacking on this point, we conduct and evaluate our experiments over the multilingual CLEF RepLab Topic-Detection dataset according to an experimental evaluation process.

Submitted 16 January, 2020; originally announced January 2020.

Comments: 8 pages, 4 tables

Journal ref: International Journal of Computational Linguistics and Applications vol. 7, no. 1, 2016, pp. 67-83

arXiv:2001.11381 [pdf, other]

Generación automática de frases literarias en español

Authors: Luis-Gil Moreno-Jiménez, Juan-Manuel Torres-Moreno, Roseli S. Wedemann

Abstract: In this work we present a state of the art in the area of Computational Creativity (CC). In particular, we address the automatic generation of literary sentences in Spanish. We propose three models of text generation based mainly on statistical algorithms and shallow parsing analysis. We also present some rather encouraging preliminary results.

Submitted 17 January, 2020; originally announced January 2020.

Comments: 13 pages, in Spanish, 6 figures, 3 tables

arXiv:2001.10613 [pdf, other]

Predicting Personalized Academic and Career Roads: First Steps Toward a Multi-Uses Recommender System

Authors: Alexandre Nadjem, Juan-Manuel Torres-Moreno, Marc El-Bèze, Guillaume Marrel, Benoît Bonte

Abstract: Nobody knows what one will do in the future, and everyone will have a different answer to the question: how do you see yourself five years after your current job/diploma? In this paper we introduce concepts, large categories of fields of study or job domains, in order to represent the vision of the future of the user's trajectory. Then, we show how they can influence the prediction when proposing to the user a set of next steps to take.

Submitted 3 January, 2020; originally announced January 2020.

Comments: 4 pages, 3 figures, 4 tables

Journal ref: Digital Tools & Uses Congress (DTUC '18), pp. 1-4, 2018, Paris, France

arXiv:2001.07098 [pdf, other]

Audio Summarization with Audio Features and Probability Distribution Divergence

Authors: Carlos-Emiliano González-Gallardo, Romain Deveaud, Eric SanJuan, Juan-Manuel Torres-Moreno

Abstract: The automatic summarization of multimedia sources is an important task that facilitates the understanding of an individual by condensing the source while maintaining relevant information. In this paper we focus on audio summarization based on audio features and probability distribution divergence. Our method, based on an extractive summarization approach, aims to select the most relevant segments until a time threshold is reached. It takes into account the segment's length, position and informativeness value. Informativeness of each segment is obtained by mapping a set of audio features issued from its Mel-frequency Cepstral Coefficients and their corresponding Jensen-Shannon divergence score. Results over a multi-evaluator scheme show that our approach provides understandable and informative summaries.

Submitted 2 April, 2020; v1 submitted 20 January, 2020; originally announced January 2020.

Comments: 20th International Conference on Computational Linguistics and Intelligent Text Processing

arXiv:2001.06190 [pdf]

Visual Simplified Characters' Emotion Emulator Implementing OCC Model

Authors: Ana Lilia Laureano-Cruces, Laura Hernández-Domínguez, Martha Mora-Torres, Juan-Manuel Torres-Moreno, Jaime Enrique Cabrera-López

Abstract: In this paper, we present a visual emulator of the emotions seen in characters in stories. This system is based on a simplified view of the cognitive structure of emotions proposed by Ortony, Clore and Collins (OCC Model). The goal of this paper is to provide a visual platform that allows us to observe changes in the characters' different emotions, and the intricate interrelationships between: 1) each character's emotions, 2) their affective relationships and actions, 3) the events that take place in the development of a plot, and 4) the objects of desire that make up the emotional map of any story. This tool was tested on stories with a contrasting variety of emotional and affective environments: Othello, Twilight, and Harry Potter, behaving sensibly and in keeping with the atmosphere in which the characters were immersed.

Submitted 17 January, 2020; originally announced January 2020.

Comments: 7 pages, 14 figures, 2 tables

Journal ref: CGST Conference on Computer Science and Engineering, Istanbul, Turkey, 19-21 December 2011

arXiv:2001.05285 [pdf, other]

Detecting New Word Meanings: A Comparison of Word Embedding Models in Spanish

Authors: Andrés Torres-Rivera, Juan-Manuel Torres-Moreno

Abstract: Semantic neologisms (SN) are defined as words that acquire a new word meaning while maintaining their form. Given the nature of this kind of neologism, the task of identifying these new word meanings is currently performed manually by specialists at observatories of neology. To detect SN in a semi-automatic way, we developed a system that implements a combination of the following strategies: topic modeling, keyword extraction, and word sense disambiguation. The role of topic modeling is to detect the themes that are treated in the input text. Themes within a text give clues about the particular meaning of the words that are used, for example: viral has one meaning in the context of computer science (CS) and another when talking about health. To extract keywords, we used TextRank with POS tag filtering. With this method, we can obtain relevant words that are already part of the Spanish lexicon. We use a deep learning model to determine if a given keyword could have a new meaning. Embeddings that are different from all the known meanings (or topics) indicate that a word might be a valid SN candidate. In this study, we examine the following word embedding models: Word2Vec, Sense2Vec, and FastText. The models were trained with equivalent parameters using Wikipedia in Spanish as the corpus. Then we used a list of words and their concordances (obtained from our database of neologisms) to show the different embeddings that each model yields. Finally, we present a comparison of these outcomes with the concordances of each word to show how we can determine if a word could be a valid candidate for SN.

Submitted 12 January, 2020; originally announced January 2020.

Comments: 16 pages, 3 figures

Journal ref: COnference en Recherche d'Informations et Applications (CORIA) 2019, France

arXiv:1912.09558 [pdf, ps, other]

RIMAX: Ranking Semantic Rhymes by calculating Definition Similarity

Authors: Alfonso Medina-Urrea, Juan-Manuel Torres-Moreno

Abstract: This paper presents RIMAX, a new system for detecting semantic rhymes, using a Comprehensive Mexican Spanish Dictionary (DEM) and its Rhyming Dictionary (REM). We use the Vector Space Model to calculate the similarity of the definition of a query with the definitions corresponding to the assonant and consonant rhymes of the query. The preliminary results using a manual evaluation are very encouraging.

Submitted 25 December, 2019; v1 submitted 19 December, 2019; originally announced December 2019.

Comments: 5 pages

arXiv:1903.07397 [pdf, other]

Un duel probabiliste pour départager deux présidents (LIA @ DEFT'2005)

Authors: Marc El-Bèze, Juan-Manuel Torres-Moreno, Frédéric Béchet

Abstract: We present a set of probabilistic models applied to binary classification as defined in the DEFT'05 challenge. The challenge consisted of a mixture of two different problems in Natural Language Processing: identification of author (a sequence of François Mitterrand's sentences might have been inserted into a speech of Jacques Chirac) and thematic break detection (the subjects addressed by the two authors are supposed to be different). Markov chains, Bayes models and an adaptive process have been used to identify the paternity of these sequences. A probabilistic model of the internal coherence of speeches has been employed to identify thematic breaks. Adding this model has been shown to improve the quality of the results. A comparison with different approaches demonstrates the superiority of a strategy that combines learning, coherence and adaptation. Applied to the DEFT'05 test data, the results in terms of precision (0.890), recall (0.955) and F-score (0.925) are very promising.

Submitted 11 March, 2019; originally announced March 2019.

Comments: 27 figures, 1 table (in French)

Journal ref: RNTI (E10)776:1889-1918, 2007

arXiv:1810.10641 [pdf, other]

Predicting the Semantic Textual Similarity with Siamese CNN and LSTM

Authors: Elvys Linhares Pontes, Stéphane Huet, Andréa Carneiro Linhares, Juan-Manuel Torres-Moreno

Abstract: Semantic Textual Similarity (STS) is the basis of many applications in Natural Language Processing (NLP). Our system combines convolution and recurrent neural networks to measure the semantic similarity of sentences. It uses a convolution network to take account of the local context of words and an LSTM to consider the global context of sentences. This combination of networks helps to preserve the relevant information of sentences and improves the calculation of the similarity between sentences. Our model has achieved good results and is competitive with the best state-of-the-art systems.

Submitted 24 October, 2018; originally announced October 2018.

arXiv:1810.10639 [pdf, ps, other]

A Multilingual Study of Compressive Cross-Language Text Summarization

Authors: Elvys Linhares Pontes, Stéphane Huet, Juan-Manuel Torres-Moreno

Abstract: Cross-Language Text Summarization (CLTS) generates summaries in a language different from the language of the source documents. Recent methods use information from both languages to generate summaries with the most informative sentences. However, the performance of these methods can vary according to the languages involved, which can reduce the quality of summaries. In this paper, we propose a compressive framework to generate cross-language summaries. In order to analyze performance and especially stability, we tested our system and extractive baselines on a dataset available in four languages (English, French, Portuguese, and Spanish) to generate English and French summaries. An automatic evaluation showed that our method outperformed extractive state-of-the-art CLTS methods with better and more stable ROUGE scores for all languages.

Submitted 24 October, 2018; originally announced October 2018.

arXiv:1809.00994 [pdf, other]

Étude de l'informativité des transcriptions : une approche basée sur le résumé automatique

Authors: Carlos-Emiliano González-Gallardo, Malek Hajjem, Eric SanJuan, Juan-Manuel Torres-Moreno

Abstract: In this paper we propose a new approach to evaluate the informativeness of transcriptions coming from Automatic Speech Recognition systems. This approach, based on the notion of informativeness, is focused on the framework of Automatic Text Summarization performed over these transcriptions. First, we estimate the informative content of the various automatic transcriptions; then we explore the capacity of Automatic Text Summarization to overcome the informative loss. To do this we use an automatic summary evaluation protocol without reference (based on the informative content), which computes the divergence between probability distributions of different textual representations: manual and automatic transcriptions and their summaries. After a set of evaluations, this analysis allowed us both to judge the quality of the transcriptions in terms of informativeness and to assess the ability of automatic text summarization to compensate for the problems raised during the transcription phase.

Submitted 4 September, 2018; originally announced September 2018.

Comments: in French, 15e Conférence en Recherche d'Information et Applications (CORIA)

arXiv:1808.08850 [pdf, ps, other]

WiSeBE: Window-based Sentence Boundary Evaluation

Authors: Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno

Abstract: Sentence Boundary Detection (SBD) has been a major research topic since Automatic Speech Recognition transcripts have been used for further Natural Language Processing tasks like Part of Speech Tagging, Question Answering or Automatic Summarization. But what about evaluation? Are standard evaluation metrics like precision, recall, F-score or classification error, and more importantly, evaluating an automatic system against a unique reference, enough to conclude how well an SBD system is performing given the final application of the transcript? In this paper we propose Window-based Sentence Boundary Evaluation (WiSeBE), a semi-supervised metric for evaluating Sentence Boundary Detection systems based on multi-reference (dis)agreement. We evaluate and compare the performance of different SBD systems over a set of YouTube transcripts using WiSeBE and standard metrics. This double evaluation gives an understanding of how WiSeBE is a more reliable metric for the SBD task.

Submitted 27 August, 2018; originally announced August 2018.

Comments: In proceedings of the 17th Mexican International Conference on Artificial Intelligence (MICAI), 2018

arXiv:1802.04559 [pdf, other]

Sentence Boundary Detection for French with Subword-Level Information Vectors and Convolutional Neural Networks

Authors: Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno

Abstract: In this work we tackle the problem of sentence boundary detection applied to French as a binary classification task ("sentence boundary" or "not sentence boundary"). We combine convolutional neural networks with subword-level information vectors, which are word embedding representations learned from Wikipedia that take advantage of the words' morphology, so each word is represented as a bag of its character n-grams. We decided to use a big written dataset (French Gigaword) instead of standard-size transcriptions to train and evaluate the proposed architectures, with the intention of using the trained models on subsequent real-life ASR transcriptions. Three different architectures are tested, showing similar results; general accuracy for all models exceeds 0.96. All three models have good F1 scores, reaching values over 0.97 for the "not sentence boundary" class. However, the "sentence boundary" class shows lower scores, with the F1 metric decreasing to 0.778 for one of the models. Using subword-level information vectors seems to be very effective, leading us to conclude that the morphology of words encoded in the embedding representations behaves like pixels in an image, making feasible the use of convolutional neural network architectures.

Submitted 13 February, 2018; originally announced February 2018.

Comments: In proceedings of the International Conference on Natural Language, Signal and Speech Processing (ICNLSSP) 2017

arXiv:1710.06524 [pdf, ps, other]

Unsupervised Sentence Representations as Word Information Series: Revisiting TF--IDF

Authors: Ignacio Arroyo-Fernández, Carlos-Francisco Méndez-Cruz, Gerardo Sierra, Juan-Manuel Torres-Moreno, Grigori Sidorov

Abstract: Sentence representation at the semantic level is a challenging task for Natural Language Processing and Artificial Intelligence. Despite the advances in word embeddings (i.e. word vector representations), capturing sentence meaning is an open question due to the complexity of semantic interactions among words. In this paper, we present an embedding method aimed at learning unsupervised sentence representations from unlabeled text. We propose an unsupervised method that models a sentence as a weighted series of word embeddings. The weights of the word embeddings are fitted by using Shannon's word entropies provided by the Term Frequency-Inverse Document Frequency (TF-IDF) transform. The hyperparameters of the model can be selected according to the properties of the data (e.g. sentence length and textual genre). Hyperparameter selection involves word embedding methods and dimensionalities, as well as weighting schemata. Our method offers advantages over existing methods: identifiable modules, short-term training, online inference of (unseen) sentence representations, as well as independence from domain, external knowledge and language resources. Results showed that our model outperformed the state of the art in well-known Semantic Textual Similarity (STS) benchmarks. Moreover, our model reached state-of-the-art performance when compared to supervised and knowledge-based STS systems.

Submitted 19 October, 2017; v1 submitted 17 October, 2017; originally announced October 2017.

arXiv:1703.06630 [pdf, other]

Automatic Text Summarization Approaches to Speed up Topic Model Learning Process

Authors: Mohamed Morchid, Juan-Manuel Torres-Moreno, Richard Dufour, Javier Ramírez-Rodríguez, Georges Linarès

Abstract: The number of documents available on the Internet grows every day. For this reason, processing this amount of information effectively and expressibly becomes a major concern for companies and scientists. Methods that represent a textual document by a topic representation are widely used in Information Retrieval (IR) to process big data such as Wikipedia articles. One of the main difficulties in using topic models on huge data collections is related to the material resources (CPU time and memory) required for model estimation. To deal with this issue, we propose to build topic spaces from summarized documents. In this paper, we present a study of topic space representation in the context of big data. The behavior of the topic space representation is analyzed on different languages. Experiments show that topic spaces estimated from text summaries are as relevant as those estimated from the complete documents. The real advantage of such an approach is the processing time gain: we showed that the processing time can be drastically reduced using summarized documents (by more than 60% in general). Finally, this study points out the differences between thematic representations of documents depending on the targeted languages, such as English or Latin languages.

Submitted 20 March, 2017; originally announced March 2017.

Comments: 16 pages, 4 tables, 8 figures

Journal ref: International Journal of Computational Linguistics and Applications, 7(2):87-109, 2016

arXiv:1703.06501 [pdf, other]

Métodos de Otimização Combinatória Aplicados ao Problema de Compressão MultiFrases

Authors: Elvys Linhares Pontes, Thiago Gouveia da Silva, Andréa Carneiro Linhares, Juan-Manuel Torres-Moreno, Stéphane Huet

Abstract: The Internet has led to a dramatic increase in the amount of available information. In this context, reading and understanding this flow of information have become costly tasks. In recent years, to help people understand textual data, various Natural Language Processing (NLP) applications based on Combinatorial Optimization have been devised. However, for Multi-Sentence Compression (MSC), a method which reduces sentence length without removing core information, the insertion of optimization methods requires further study to improve the performance of MSC. This article describes a method for MSC using Combinatorial Optimization and Graph Theory to generate more informative sentences while maintaining their grammaticality. An experiment conducted on a corpus of 40 clusters of sentences shows that our system achieves very good quality and is better than the state of the art.

Submitted 19 March, 2017; originally announced March 2017.

Comments: 12 pages, 1 figure, 3 tables (paper in Portuguese), Preprint of XLVIII Simpósio Brasileiro de Pesquisa Operacional, 2016, Vitória, ES, (Brazil)

arXiv:1703.04718 [pdf, ps, other]

Extending Automatic Discourse Segmentation for Texts in Spanish to Catalan

Authors: Iria da Cunha, Eric SanJuan, Juan-Manuel Torres-Moreno, Irene Castellón

Abstract: At present, automatic discourse analysis is a relevant research topic in the field of NLP. However, discourse is one of the most difficult phenomena to process. Although discourse parsers have already been developed for several languages, no such tool exists for Catalan. In order to implement this kind of parser, the first step is to develop a discourse segmenter. In this article we present the first discourse segmenter for texts in Catalan. This segmenter is based on Rhetorical Structure Theory (RST) for Spanish, and uses lexical and syntactic information to translate rules valid for Spanish into rules for Catalan. We have evaluated the system using a gold standard corpus of manually segmented texts, and the results are promising.

Submitted 11 March, 2017; originally announced March 2017.

Journal ref: Proceedings of the First Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016), 38th European Conference on Information Retrieval (ECIR 2016)

arXiv:1703.03923 [pdf, other]

A German Corpus for Text Similarity Detection Tasks

Authors: Juan-Manuel Torres-Moreno, Gerardo Sierra, Peter Peinl

Abstract: Text similarity detection aims at measuring the degree of similarity between a pair of texts. Corpora available for text similarity detection are designed to evaluate algorithms that assess the paraphrase level among documents. In this paper we present a textual German corpus for similarity detection. The purpose of this corpus is to automatically assess the similarity between a pair of texts and to evaluate different similarity measures, both for whole documents and for individual sentences. Therefore we have calculated several simple measures on our corpus based on a library of similarity functions.

Submitted 11 March, 2017; originally announced March 2017.

Comments: 1 figure; 13 pages

Journal ref: Preprint of International Journal of Computational Linguistics and Applications, vol. 5, no. 2, 2014, pp. 9-24

arXiv:1702.06510 [pdf, ps, other]

Algorithmes de classification et d'optimisation: participation du LIA/ADOC á DEFT'14

Authors: Luis Adrián Cabrera-Diego, Stéphane Huet, Bassam Jabaian, Alejandro Molina, Juan-Manuel Torres-Moreno, Marc El-Bèze, Barthélémy Durette

Abstract: This year, the DEFT campaign (Défi Fouilles de Textes) incorporates a task which aims at identifying the session in which articles of previous TALN conferences were presented. We describe the three statistical systems developed at LIA/ADOC for this task. A fusion of these systems enables us to obtain interesting results (micro-precision score of 0.76 measured on the test corpus).

Submitted 21 February, 2017; originally announced February 2017.

Comments: 8 pages, 3 tables, Conference paper (in French)

arXiv:1702.06478 [pdf, ps, other]

Systèmes du LIA à DEFT'13

Authors: Xavier Bost, Ilaria Brunetti, Luis Adrián Cabrera-Diego, Jean-Valère Cossu, Andréa Linhares, Mohamed Morchid, Juan-Manuel Torres-Moreno, Marc El-Bèze, Richard Dufour

Abstract: The 2013 Défi de Fouille de Textes (DEFT) campaign focuses on two types of language analysis tasks, document classification and information extraction, in the specialized domain of cooking recipes. We present the systems that the LIA has used in DEFT 2013. Our systems show interesting results, despite the complexity of the proposed tasks.

Submitted 21 February, 2017; originally announced February 2017.

Comments: 12 pages, 3 tables, (Paper in French)

Journal ref: Proceedings of the Ninth DEFT Workshop, DEFT2013, Les Sables-d'Olonne, France, 21st June 2013

arXiv:1702.06467 [pdf, other]

Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization

Authors: Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno, Azucena Montes Rendón, Gerardo Sierra

Abstract: In this paper we describe a dynamic normalization process applied to multilingual social network documents (Facebook and Twitter) to improve the performance of the author profiling task for short texts. After the normalization process, $n$-grams of characters and $n$-grams of POS tags are obtained to extract all the possible stylistic information encoded in the documents (emoticons, character flooding, capital letters, references to other users, hyperlinks, hashtags, etc.). Experiments with SVM showed up to 90% performance.

Submitted 21 February, 2017; originally announced February 2017.

Comments: 8 pages, 6 figures, Conference paper

Journal ref: Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Vol 1: KDIR, 307-314, 2016, Porto, Portugal

arXiv:1601.07124 [pdf, other]

LIA-RAG: a system based on graphs and divergence of probabilities applied to Speech-To-Text Summarization

Authors: Elvys Linhares Pontes, Juan-Manuel Torres-Moreno, Andréa Carneiro Linhares

Abstract: This paper introduces a new algorithm for automatic speech-to-text summarization based on statistical divergences of probabilities and graphs. The input is a text from noisy speech conversations, and the output is a compact text summary. Our results on the pilot task CCCS Multiling 2015 French corpus are very encouraging.

Submitted 26 January, 2016; originally announced January 2016.

Comments: 7 pages, 2 figures, CCCS Multiling 2015 Workshop

arXiv:1506.06205 [pdf, other]

Trivergence of Probability Distributions, at glance

Authors: Juan-Manuel Torres-Moreno

Abstract: In this paper we introduce the intuitive notion of trivergence of probability distributions (TPD). This notion allows us to calculate the similarity among triplets of objects. For this computation, we can use well-known measures of probability divergence such as Kullback-Leibler and Jensen-Shannon. Divergence measures may be used in Information Retrieval tasks such as Automatic Text Summarization and Text Classification, among many others.

Submitted 20 June, 2015; originally announced June 2015.

Comments: 10 pages, 1 figure

arXiv:1501.04920 [pdf]

Regroupement sémantique de définitions en espagnol

Authors: Gerardo Sierra, Juan-Manuel Torres-Moreno, Alejandro Molina

Abstract: This article focuses on the description and evaluation of a new unsupervised learning method for clustering definitions in Spanish according to their semantics. Textual Energy was used as a clustering measure, and we study an adaptation of Precision and Recall to evaluate our method.

Submitted 20 January, 2015; originally announced January 2015.

Comments: 11 pages, in French, 5 figures. Workshop Evaluation des méthodes d'Extraction de Connaissances dans les Données EvalECD EGC'10, 2010 Tunis

arXiv:1501.01252 [pdf, other]

doi 10.15439/2014F336

Optimisation using Natural Language Processing: Personalized Tour Recommendation for Museums

Authors: Mayeul Mathias, Assema Moussa, Fen Zhou, Juan-Manuel Torres-Moreno, Marie-Sylvie Poli, Didier Josselin, Marc El-Bèze, Andréa Carneiro Linhares, Francoise Rigat

Abstract: This paper proposes a new method to provide personalized tour recommendations for museum visits. It combines an optimization of visitors' preference criteria with an automatic extraction of artwork importance from museum information, based on Natural Language Processing using textual energy. This project includes researchers from computer and social sciences. Some results are obtained with numerical experiments. They show that our model clearly improves the satisfaction of the visitor who follows the proposed tour. This work foreshadows some interesting outcomes and applications for on-demand personalized museum visits in the very near future.

Submitted 6 January, 2015; originally announced January 2015.

Comments: 8 pages, 4 figures; Proceedings of the 2014 Federated Conference on Computer Science and Information Systems pp. 439-446

arXiv:1501.01243 [pdf]

Un résumeur à base de graphes, indépéndant de la langue

Authors: Juan-Manuel Torres-Moreno, Javier Ramirez, Iria da Cunha

Abstract: In this paper we present REG, a graph-based approach to a fundamental problem of Natural Language Processing (NLP): automatic text summarization. The algorithm represents a document as a graph, then computes the weight of its sentences. We have applied this approach to summarize documents in three languages.

Submitted 6 January, 2015; originally announced January 2015.

Comments: 8 pages, in French, 2 figures; International Workshop on African Human Language Technologies

arXiv:1212.3493 [pdf, ps, other]

Sentence Compression in Spanish driven by Discourse Segmentation and Language Models

Authors: Alejandro Molina, Juan-Manuel Torres-Moreno, Iria da Cunha, Eric SanJuan, Gerardo Sierra

Abstract: Previous works demonstrated that Automatic Text Summarization (ATS) by sentence extraction may be improved using sentence compression. In this work we present a sentence compression approach guided by sentence-level discourse segmentation and probabilistic language models (LM). The results presented here show that the proposed solution is able to generate coherent summaries with grammatical compressed sentences. The approach is simple enough to be transposed into other languages.

Submitted 17 December, 2012; v1 submitted 14 December, 2012; originally announced December 2012.

Comments: 7 pages, 3 tables

arXiv:1212.1918 [pdf]

Condensés de textes par des méthodes numériques

Authors: Juan-Manuel Torres-Moreno, Patricia Velázquez-Morales, Jean-Guy Meunier

Abstract: Since information in electronic form is already standard, and the variety and quantity of information are growing ever larger, automatic summarization or condensation of texts is a critical phase of text analysis. This article describes CORTEX, a system based on numerical methods that produces a condensation of a text independently of the topic and of the length of the text. The structure of the system enables it to find the abstracts in French or Spanish in very short times.

Submitted 9 December, 2012; originally announced December 2012.

Comments: Conférence JADT 2002, Saint-Malo/France. 12 pages, 7 figures

arXiv:1210.3312 [pdf, other]

Artex is AnotheR TEXt summarizer

Authors: Juan-Manuel Torres-Moreno

Abstract: This paper describes Artex, another algorithm for Automatic Text Summarization. In order to rank sentences, a simple inner product is calculated between each sentence, a document vector (text topic) and a lexical vector (vocabulary used by a sentence). Summaries are then generated by assembling the highest ranked sentences. No rule-based linguistic post-processing is necessary in order to obtain summaries. Tests over several datasets (coming from Document Understanding Conferences (DUC), Text Analysis Conferences (TAC), evaluation campaigns, etc.) in French, English and Spanish have shown that the summarizer achieves interesting results.

Submitted 11 October, 2012; originally announced October 2012.

Comments: 11 pages, 5 figures. arXiv admin note: substantial text overlap with arXiv:1209.3126

arXiv:1209.3126 [pdf, other]

Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

Authors: Juan-Manuel Torres-Moreno

Abstract: In Automatic Text Summarization, preprocessing is an important phase to reduce the space of textual representation. Classically, stemming and lemmatization have been widely used for normalizing words. However, even using normalization on large texts, the curse of dimensionality can disturb the performance of summarizers. This paper describes a new method for normalization of words to further reduce the space of representation. We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of summaries produced by this representation, but can often dramatically improve the performance of the systems. Summaries on trilingual corpora were evaluated automatically with Fresa. Results confirm an increase in performance, regardless of the summarizer system used.

Submitted 14 September, 2012; originally announced September 2012.

Comments: 22 pages, 12 figures, 9 tables
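
Illustrative sketch (not taken from the paper): the ultra-stemming idea described in the abstract above, reducing each word to its initial letters, can be approximated in a few lines. The function names `ultra_stem` and `vocabulary_size`, the default prefix length `n=1`, and the simple letter-run tokenizer are all assumptions made for illustration; the paper's actual preprocessing may differ.

```python
import re
from typing import List, Optional


def ultra_stem(text: str, n: int = 1) -> List[str]:
    """Reduce every word to its first n letters (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Assumed tokenizer: lowercase, then take runs of Unicode letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [tok[:n] for tok in tokens]


def vocabulary_size(text: str, n: Optional[int] = None) -> int:
    """Count distinct word forms, optionally after ultra-stemming to n letters."""
    if n is None:
        tokens = re.findall(r"[^\W\d_]+", text.lower())
    else:
        tokens = ultra_stem(text, n)
    return len(set(tokens))


if __name__ == "__main__":
    sample = "Summarization of summaries and summarizers"
    print(ultra_stem(sample, 3))
    print(vocabulary_size(sample), vocabulary_size(sample, 3))
```

Truncating to a short prefix merges many surface forms into one symbol, which is the reduction of the representation space that the abstract refers to.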

arXiv:1004.3371 [pdf, other]

Improving Update Summarization by Revisiting the MMR Criterion

Authors: Florian Boudin, Juan-Manuel Torres-Moreno, Marc El-Bèze

Abstract: This paper describes a method for multi-document update summarization that relies on a double maximization criterion. A modified Maximal Marginal Relevance-like criterion, called Smmr, is used to select sentences that are close to the topic and, at the same time, distant from sentences used in already read documents. Summaries are then generated by assembling the high-ranked material and applying some rule-based linguistic post-processing in order to obtain length reduction and maintain coherency. Through our participation in the Text Analysis Conference (TAC) 2008 evaluation campaign, we have shown that our method achieves promising results.

Submitted 20 April, 2010; originally announced April 2010.

Comments: 20 pages, 3 figures and 8 tables.

ACM Class: I.2.7
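
Illustrative sketch (not the paper's Smmr): the abstract above describes selecting sentences that are close to the topic and distant from already read material. The code below shows a textbook greedy Maximal Marginal Relevance selection over bag-of-words cosine similarity, which is a generic stand-in. The function names, the whitespace tokenizer, and the trade-off parameter `lam` are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mmr_select(
    sentences: List[str],
    topic: str,
    already_read: List[str],
    k: int = 2,
    lam: float = 0.7,
) -> List[str]:
    """Greedily pick k sentences: relevant to the topic, not redundant."""
    bags = [Counter(s.lower().split()) for s in sentences]
    topic_bag = Counter(topic.lower().split())
    seen = [Counter(s.lower().split()) for s in already_read]
    chosen: List[int] = []
    while len(chosen) < min(k, len(sentences)):
        best, best_score = -1, -math.inf
        for i, bag in enumerate(bags):
            if i in chosen:
                continue
            # Redundancy is measured against read material and earlier picks.
            others = seen + [bags[j] for j in chosen]
            redundancy = max((cosine(bag, o) for o in others), default=0.0)
            score = lam * cosine(bag, topic_bag) - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return [sentences[i] for i in chosen]
```

With `lam` close to 1 the selection favors topical relevance; lowering it penalizes overlap with previously read sentences more strongly.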

arXiv:1001.1093 [pdf, other]

Solving the Frequency Assignment Problem by Site Availability and Constraint Programming

Authors: Andrea Carneiro Linhares, Juan-Manuel Torres-Moreno, Peter Peinl, Philippe Michelon

Abstract: The efficient use of bandwidth for radio communications becomes more and more crucial when developing new information technologies and their applications. The core issues are addressed by the so-called Frequency Assignment Problems (FAP). Our work investigates static FAP, where an attempt is first made to configure a kernel of links. We study the problem based on the concepts and techniques of Constraint Programming and integrate the site availability concept. Numerical simulations conducted on scenarios provided by CELAR are very promising.

Submitted 7 January, 2010; originally announced January 2010.

Comments: 11 pages, 1 figure and 3 tables

arXiv:0906.0470 [pdf, ps, other]

An optimal linear separator for the Sonar Signals Classification task

Authors: Juan-Manuel Torres-Moreno, Mirta B. Gordon

Abstract: The problem of classifying sonar signals from rocks and mines, first studied by Gorman and Sejnowski, has become a benchmark against which many learning algorithms have been tested. We show that both the training set and the test set of this benchmark are linearly separable, although with different hyperplanes. Moreover, the complete set of learning and test patterns together is also linearly separable. We give the weights that separate these sets, which may be used to compare results found by other algorithms.

Submitted 2 June, 2009; originally announced June 2009.

Comments: 8 pages, 6 tables

arXiv:0905.2990 [pdf, other]

Automatic Summarization System coupled with a Question-Answering System (QAAS)

Authors: Juan-Manuel Torres-Moreno, Pier-Luc St-Onge, Michel Gagnon, Marc El-Bèze, Patrice Bellot

Abstract: To select the most relevant sentences of a document, the Cortex summarizer uses an optimal decision algorithm that combines several metrics. The metrics process, weight and extract pertinent sentences using statistical and informational algorithms. This technique might improve a Question-Answering system, whose function is to provide an exact answer to a question in natural language. In this paper, we present the results obtained by coupling the Cortex summarizer with a Question-Answering system (QAAS). Two configurations have been evaluated. In the first one, a low compression level is selected and the summarization system is only used as a noise filter. In the second configuration, the system actually functions as a summarizer, with a very high level of compression. Our results on a French corpus demonstrate that coupling an Automatic Summarization system with a Question-Answering system is promising. The system has then been adapted to generate a customized summary depending on the specific question. Tests on a French multi-document corpus have been carried out, and the personalized QAAS system obtains the best performance.

Submitted 18 May, 2009; originally announced May 2009.

Comments: 28 pages, 11 figures

arXiv:0905.2347 [pdf, other]

Combining Supervised and Unsupervised Learning for GIS Classification

Authors: Juan-Manuel Torres-Moreno, Laurent Bougrain, Frédéric Alexandre

Abstract: This paper presents a new hybrid learning algorithm for unsupervised classification tasks. We combined the Fuzzy c-means learning algorithm and a supervised version of Minimerror to develop a hybrid incremental strategy allowing unsupervised classifications. We applied this new approach to a real-world database in order to determine whether the information contained in unlabeled features of a Geographic Information System (GIS) allows it to be classified well. Finally, we compared our results to a classical supervised classification obtained by a multilayer perceptron.

Submitted 14 May, 2009; originally announced May 2009.

Comments: 8 pages, 3 figures

arXiv:0905.1130 [pdf, other]

Statistical Automatic Summarization in Organic Chemistry

Authors: Florian Boudin, Patricia Velazquez-Morales, Juan-Manuel Torres-Moreno

Abstract: We present an oriented numerical summarizer algorithm, applied to producing automatic summaries of scientific documents in Organic Chemistry. We present its implementation named Yachs (Yet Another Chemistry Summarizer) that combines a specific document pre-processing with a sentence scoring method relying on the statistical properties of documents. We show that Yachs achieves the best results among several other summarizers on a corpus of Organic Chemistry articles.

Submitted 7 May, 2009; originally announced May 2009.

Comments: 10 pages, 3 figures

arXiv:0904.4587 [pdf, ps, other]

Adaptive Learning with Binary Neurons

Authors: Juan-Manuel Torres-Moreno, Mirta B. Gordon

Abstract: An efficient incremental learning algorithm for classification tasks, called NetLines, well adapted for both binary and real-valued input patterns, is presented. It generates small, compact feedforward neural networks with one hidden layer of binary units and binary output units. A convergence theorem ensures that solutions with a finite number of hidden units exist for both binary and real-valued input patterns. An implementation for problems with more than two classes, valid for any binary classifier, is proposed. The generalization error and the size of the resulting networks are compared to the best published results on well-known classification benchmarks. Early stopping is shown to decrease overfitting, without improving the generalization performance.

Submitted 29 April, 2009; originally announced April 2009.

Comments: 29 pages, 7 figures

Showing 1–45 of 45 results for author: Torres-Moreno, J