Saul Gutierrez
Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning
Proceedings of the 20th Python in Science Conference, 2021
Most areas of Python data science have standardized on using Pandas DataFrames for representing and manipulating structured data in memory. Natural Language Processing (NLP), not so much. We believe that Pandas has the potential to serve as a universal data structure for NLP data. DataFrames could make every phase of NLP easier, from creating new models, to evaluating their effectiveness, to building applications that integrate those models. However, Pandas currently lacks important data types and operations for representing and manipulating crucial types of data in many of these NLP tasks. This paper describes Text Extensions for Pandas, a library of extensions to Pandas that make it possible to build end-to-end NLP applications while representing all of the applications’ internal data with DataFrames. We leverage the extension points built into the Pandas library to add new data types, and we provide important NLP-specific operations over these data types and integrations with po...
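To make the idea of holding NLP data in a DataFrame concrete, here is a minimal sketch that stores token spans as rows. It uses plain Pandas and a simple regular-expression tokenizer only; it does not use the Text Extensions for Pandas API, and the column names `begin`, `end`, `covered_text`, and `length` are assumptions chosen for illustration.

```python
import re

import pandas as pd

text = "Pandas can hold NLP data."

# One row per token: character offsets into `text` plus the covered text.
# The tokenizer is a naive regex (words or single punctuation marks).
rows = [
    {"begin": m.start(), "end": m.end(), "covered_text": m.group()}
    for m in re.finditer(r"\w+|[^\w\s]", text)
]
tokens = pd.DataFrame(rows)

# Ordinary DataFrame operations then apply to the token table.
tokens["length"] = tokens["end"] - tokens["begin"]
long_tokens = tokens[tokens["length"] > 3]

print(long_tokens["covered_text"].tolist())
```

Once spans are rows, filtering, joining, and aggregating them uses the same operations as any other tabular data, which is the convenience the abstract argues for.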
SoftwareX, 2022
Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, normally obtained from sources such as, but not limited to, web scraping, scanned documents or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited for text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. The TextCL package also offers an outlier detection module, which identifies and filters out texts that differ from the main topic distribution of the data set. This module lets the user select one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and apply it to the text data.
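Two of the preprocessing steps named above, sentence splitting and duplicate removal, can be sketched with the Python standard library alone. This is an illustrative sketch and not the TextCL API: the function names `split_sentences` and `drop_duplicates` are hypothetical, and the splitting rule (break after `.`, `!`, or `?` followed by whitespace) is a simplifying assumption that will mis-split abbreviations.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def drop_duplicates(sentences):
    """Keep the first occurrence of each sentence.

    Sentences are compared case-insensitively with whitespace collapsed.
    """
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


raw = "Text data is noisy. It needs cleaning!  Text data is noisy. Is it ready?"
cleaned = drop_duplicates(split_sentences(raw))
print(cleaned)
```

Normalizing case and whitespace before comparing is a design choice: it catches near-identical repeats that are common in scraped text, at the cost of treating sentences that differ only in capitalization as duplicates.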
Informatica (Slovenia), 2013
Text is one of the traditional ways of communication between people. With the growing availability of text data in electronic form, the handling and analysis of text by means of computers gained popularity. Handling text data with machine learning methods brought interesting challenges to the area, which were further extended by the incorporation of some natural language specifics. As the methods became capable of addressing more complex problems related to text data, expectations grew, calling for more sophisticated methods, in particular a combination of methods from different research areas including information retrieval, machine learning, statistical data analysis, data mining, natural language processing, and semantic technologies. Automatic text analysis became an integral part of many systems, pushing the boundaries of research capabilities towards what one can refer to as an artificial intelligence dream: never-ending learning from text, aiming at mimicking the ways of human learning. The...
Abstract: The analysis of the text content in emails, blogs, tweets, forums and other forms of textual communication constitutes what we call text analytics. Text analytics is applicable to most industries: it can help analyze millions of emails; you can analyze customers’ comments and questions in forums; you can perform sentiment analysis using text analytics by measuring positive or negative perceptions of a company, brand, or product. Text Analytics has also been called text mining, and is a subcategory of the Natural Language Processing (NLP) field, one of the founding branches of Artificial Intelligence, dating back to the 1950s, when an interest in understanding text first developed. Currently, Text Analytics is often considered the next step in Big Data analysis. Text Analytics has a number of subdivisions: Information Extraction, Named Entity Recognition, Semantic Web annotated domain’s representation, and many more. Several techniques are currently used, and some of them, such as Machine Learning, have gained a lot of attention for showing a semi-supervised enhancement of systems, but they also present a number of limitations which make them not always the only or the best choice. We conclude with current and near-future applications of Text Analytics. Keywords: Big Data Analysis, Information Extraction, Text Analytics
Syllabus for introduction to computational text analysis, a Digital Humanities course focused on text analysis using Python.
INTERNATIONAL JOURNAL OF ADVANCE RESEARCH, IDEAS AND INNOVATIONS IN TECHNOLOGY
As the amount of information on the World Wide Web grows, it becomes increasingly burdensome to find just what we want. While general-purpose search engines such as Ask.com and Bing offer high coverage, they often provide only low precision compared to others, even for detailed and relevant queries. When we know that we want information of a certain type, or on a certain topic, a domain-specific search engine can be a powerful tool. For example, www.campsearch.com allows complex queries over summer camps by age group, size, location, and cost. Domain-specific search engines are becoming increasingly popular because they offer increased accuracy not possible with general, Web-wide search engines. Unfortunately, they are also burdensome and time-consuming to maintain. In this paper, we use machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. The paper describes new research in semi-supervised learning, text classification, and information extraction. We have built a demonstration system using techniques such as web scraping, Fuzzy C-Means, and hierarchical clustering for a search engine that gives accurate results, which is an advantage over other search engines. Performing such searches with a traditional, general-purpose search engine would be extremely tedious or impossible. For this reason, domain-specific search engines are becoming popular. This article concentrates mainly on an effort to automate many aspects of creating and maintaining domain-specific search engines by using machine learning techniques. These techniques permit search engines to be created quickly with less effort and are suited for re-use across many domains.