
Volume 7, Issue 9, September – 2022 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Building Business Intelligence Data Extractor using NLP and Python
Tamilselvan Arjunan,
Assistant Manager, Ernst and Young Strategy,
Data Science and Analytics

Abstract:- The goal of the Business Intelligence data extractor (BID-Extractor) tool is to offer high-quality, usable data that is freely available to the public. To assist companies across all industries in achieving their objectives, we prefer cutting-edge, business-focused web scraping solutions. The World Wide Web contains all kinds of information of different origins, including social, financial, security, and academic. Most people access information through the internet for educational purposes. Information on the web is available in different formats and through different access interfaces; therefore, indexing or semantic processing of data from websites can be cumbersome. Web scraping (data extraction) is the technique that aims to address this issue. Web scraping is used to transform unstructured data on the web into structured data that can be stored and analyzed in a central local database or spreadsheet. There are various web scraping techniques, including traditional copy-and-paste, text capturing and regular-expression matching, HTTP programming, HTML parsing, DOM parsing, vertical aggregation platforms, semantic annotation recognition, and computer-vision webpage analyzers. Traditional copy-and-paste is the most basic and tiresome technique, as a person must manually copy large numbers of records. Web scraping software is the easiest technique, since all the others except traditional copy-and-paste require some form of technical expertise. Even though much web scraping software is available today, most of it is designed to serve one specific purpose, and businesses cannot make decisions from the raw data alone. This research focused on building web scraping software using Python and NLP: the unstructured data is converted to structured data using NLP, and a custom NLP NER model can also be trained. The study's findings provide a way to effectively gauge business impact.

The solution has a greater impact when applied to:
 Analyzing companies’ fundamentals
 Analyzing better deal opportunities

Keywords:- Web Scraping, Information Extraction.

I. INTRODUCTION

The Business Intelligence data extractor can be used by many of the world's leading industries to convert millions of web pages into meaningful information daily. To effectively gauge the impact on business, this solution might be made available as a service.

The following factors increase the impact of the solution:
 Fundamental analysis of companies
 Analysis of prospects for better deals

Data as a Service (DaaS) enables intelligent decision-making by providing high-quality structured data to improve business outcomes and acquire useful insight for any research, whether academic, marketing-related, or scientific.

People may wish to gather and examine information from several websites. The websites that hold information for a particular category present it in varied formats, and it may not be possible to view all the information on even a single website at once; the data may span several pages and different topics. Without tooling, the only available method is manually copying the website's data into a local file on your computer, which is an extremely time-consuming and laborious task.

II. OVERVIEW OF WEB DATA EXTRACTION

Web data extraction is a method for extracting unstructured data from websites and converting it into structured information that may be stored and analyzed in a database. Web scraping also goes by the names web harvesting, web data extraction, web data scraping, and screen scraping; data collection by web scraping is also called web mining. The process of web scraping is intended to extract information from websites and transform it into a logical structure such as a database, a spreadsheet, or a CSV (comma-separated values) file.

A. Challenges


Targeting websites, such as the "top 100 search results for this phrase" or "these 3 e-commerce websites for this product category," is the first step in web scraping. On the surface this may seem simple, but the next step requires finding precise URLs that match these targets, which is difficult for a web scraper. To create the target URLs for the required pages, a web scraper must locate the source URL. Broken links and websites with irrelevant information cause the algorithm to waste time and data storage while creating

IJISRT22SEP1100 www.ijisrt.com 1146


thousands of URLs for content that has no commercial value to the consumer.

To avoid having their services interrupted by heavy traffic, websites may try to prevent web scrapers. They accomplish this by "fingerprinting" the scraper in order to identify its origin and behavior. Examples include determining whether the same IP address is repeatedly attempting to scrape the same website, the scraper's device and operating system, and the speed at which requests are sent. According to a study, these fingerprints can be tracked by websites for an average of 54 days once recognized. This necessitates the use of unique origins in each scraping request and requires web scrapers to anonymize themselves by behaving like human users while scraping a website.

We will use spaCy to train a custom NER model. spaCy is an open-source software library for advanced natural language processing, written in Python and Cython. To train our custom named entity recognition model, we need relevant text data with proper annotations. We will use open-source US data to train the NER model.

In contrast to NLTK, which is frequently used for research and education, spaCy concentrates on offering software for use in actual production. By integrating statistical models trained with well-known machine learning libraries such as TensorFlow, PyTorch, or MXNet through its own machine learning library Thinc, spaCy supports deep learning workflows as of version 1.0.
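As a sketch of the kind of annotated training data spaCy's NER expects (raw text paired with character-offset entity spans), the example below is hand-made and purely illustrative; the sentences and labels are not from the paper's actual US dataset:

```python
# spaCy-style NER training data: (text, {"entities": [(start, end, label), ...]}).
# Offsets are character indices into the text; annotations here are invented.
TRAIN_DATA = [
    ("Ernst and Young opened an office in Chennai",
     {"entities": [(0, 15, "ORG"), (36, 43, "GPE")]}),
]

# Sanity-check the annotations by slicing each span back out of the text.
text, ann = TRAIN_DATA[0]
spans = [(label, text[start:end]) for start, end, label in ann["entities"]]
print(spans)
```

Checking spans this way before training catches off-by-one annotation errors early, which is a frequent source of silent NER quality problems.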
B. Solution
Web scraping is made easier by AI in two ways:

Algorithms for classifying data: Algorithms that have been trained on large data sets obtained via web scraping can recognize and categorize inactive URLs. This enables web scraping algorithms to focus their efforts on only a small fraction of potentially useful websites.

Algorithms for natural language processing: A recent study recommends enhancing web scraping algorithms to use natural language processing to scan the scraped data and determine the content's relevance. In this method, the effort required for data processing and storage is optimized, because data that is below the relevancy level is not saved at all.
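The relevance-filtering idea above can be sketched in a few lines; the scoring function, threshold, and topic terms below are illustrative assumptions, not part of the paper's tool:

```python
def relevance(text, topic_terms):
    """Fraction of topic terms that appear in the scraped text (0.0 to 1.0).

    A crude stand-in for a real NLP relevance model, used only to show
    the filtering step: score each page, keep it only above a threshold.
    """
    words = set(text.lower().split())
    hits = sum(1 for term in topic_terms if term in words)
    return hits / len(topic_terms)

THRESHOLD = 0.5  # pages scoring below this are discarded, not stored

pages = {
    "page1": "quarterly revenue and profit figures for the company",
    "page2": "celebrity gossip and entertainment news roundup",
}
topic = ["revenue", "profit", "company"]

# Only pages above the relevancy level are kept for storage.
kept = {url: txt for url, txt in pages.items() if relevance(txt, topic) >= THRESHOLD}
print(sorted(kept))
```

A production system would replace the word-overlap score with a trained classifier or embedding similarity, but the storage-saving control flow is the same.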

Dynamic proxies, which require the web scraper to dynamically alter its IP address with each scraping request, are a frequent solution to this problem. Other factors do, however, still help websites identify automated web scrapers. Dynamic proxy technology is supported by AI solutions that optimize the other parameters. Because each attempt at web scraping generates a fingerprint on the scraper's end, web scrapers can use this training data to make sure the new parameters they employ differ considerably from the fingerprints they generated in the past.

AI techniques can produce adaptive parsing models that improve with practice. By using parsed data as a training set, parsing models can learn how to effectively classify distinct sections of the scraped data and weed out unneeded pieces. Some of these features might also be present on related websites, despite their separate structures. For instance, because many e-commerce websites use similar layouts to display a product's image and details, such as price, a data parsing algorithm may identify the approximate location of the image and details on one site and use that as a proxy for where to look in a different dataset.

Fig. 1: Custom NER

Now, the major part is to create custom entity data for the input text in which the named entities are to be identified by the model during testing.

At its core, every entity recognition system has two steps:
 Detecting the entities in the text
 Categorizing the entities into named classes

In the first step, the NER detects the location of the token or series of tokens that form an entity. Inside-outside-beginning (IOB) chunking is a common method for finding the starting and ending indices of entities. The second step assigns entity categories. These categories change depending on the use case, but some of the most common entity classes are:
 Person
 Organization
 Location
 Time
 Measurements or quantities
 String patterns like email addresses, phone numbers, or IP addresses
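The two NER steps above can be illustrated without any NLP library; this hypothetical helper converts character-offset entity spans (the same shape spaCy's training data uses) into per-token IOB tags:

```python
def iob_tags(tokens, entities):
    """Convert (start, end, label) character spans into per-token IOB tags.

    tokens   -- list of (text, start_offset) pairs
    entities -- list of (start, end, label) character spans
    """
    tags = []
    for text, start in tokens:
        end = start + len(text)
        tag = "O"  # "outside": token belongs to no entity
        for ent_start, ent_end, label in entities:
            if start >= ent_start and end <= ent_end:
                # "B-" marks the first token of an entity, "I-" the rest.
                tag = ("B-" if start == ent_start else "I-") + label
                break
        tags.append(tag)
    return tags

# Hand-tokenized example sentence with made-up annotations.
tokens = [("Apple", 0), ("hired", 6), ("John", 12), ("Smith", 17), ("in", 23), ("London", 26)]
entities = [(0, 5, "ORG"), (12, 22, "PERSON"), (26, 32, "GPE")]
print(iob_tags(tokens, entities))
```

The B/I distinction is what lets the second step recover exact entity boundaries even when two entities of the same class are adjacent.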


Below are the steps to be performed to get this tool running:

 Select the type of web scraping – text, geo-coordinates from maps, or images.
 Input the web URL from which data is to be extracted.
 The tool navigates to the web URL and displays the web page in the display panel.
 The user selects the elements/links from which data is to be extracted.
 Select whether data is to be extracted from multiple pages.
 Run the tool and extract the data.
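The multiple-pages option in the steps above can be sketched as a simple loop over a chain of pages; the in-memory "site" dict and the selection callback below are illustrative stand-ins for real fetching and element selection:

```python
def extract_all(start_url, pages, select):
    """Walk a chain of pages and collect the selected items from each.

    pages  -- dict simulating the web: url -> (items_on_page, next_url or None)
    select -- predicate applied to each item (stands in for the user's
              element/link selection in the tool's display panel)
    """
    url, results = start_url, []
    while url is not None:
        items, url = pages[url]  # advance to the "next page" link
        results.extend(item for item in items if select(item))
    return results

# Two fake result pages, the first linking to the second.
site = {
    "p1": (["AAPL 103.1", "ad banner"], "p2"),
    "p2": (["AAPL 104.7"], None),
}
print(extract_all("p1", site, select=lambda i: i.startswith("AAPL")))
```

In the real tool the dict lookup would be an HTTP request and the `next_url` would come from a pagination link found in the page's HTML.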

The following example collects historical stock prices using web scraping. Data points such as the daily opening, daily high, daily low, and daily close will be collected as well. Thankfully, numerous websites provide such data, and it is usually presented conveniently in a table. Typically, you will see the HTML code that renders these tables, as in the following image.
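A minimal sketch of pulling such a price table, using only Python's standard-library HTML parser and a made-up HTML fragment (a real script would first fetch the page, for example with `requests`, or hand the page to `pandas.read_html`):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the cell text of every <tr> row in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Illustrative fragment of the kind of table a stock-price page renders.
page = """
<table>
  <tr><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th></tr>
  <tr><td>2022-09-01</td><td>100.5</td><td>104.2</td><td>99.8</td><td>103.1</td></tr>
</table>
"""
parser = TableParser()
parser.feed(page)
print(parser.rows)
```

Each row comes out as a list of strings, ready to be written to a CSV file or loaded into a spreadsheet, matching the structured-output goal described earlier.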

Fig. 2: NLP NER Model building

Web scraping's significance in machine learning

Web scraping in machine learning is primarily focused on the fundamental issue of obtaining high-quality data. Although the internal data gathered on routine business operations can offer insightful information, such data is insufficient. Therefore, even though it is a more difficult process, getting information from outside sources is crucial. When scraping, accuracy and poor data quality become major issues. As a result, every scraping project must always include a final clean-up process, which is covered in more detail later in this guide.

Dynamic Fingerprinting Powered by AI

How might AI- and ML-based anti-bot algorithms best be defeated? By developing a crawling method that itself uses AI and ML. Finding reliable training data is not difficult, because the indicators of success and failure are clear-cut. Anyone who has previously engaged in web scraping ought to already have a sizable collection of fingerprints that could be valuable. These fingerprints might be tagged, saved in a database, and used as training data.

Testing and validation, however, will be slightly more challenging. Some fingerprints may experience blocks more frequently than others, because not all fingerprints are created equal. The AI will be significantly improved over time by gathering information on success rates per fingerprint and developing a feedback loop.
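The per-fingerprint feedback loop described above might be sketched like this; the fingerprint labels, the statistics kept, and the neutral prior are all illustrative assumptions:

```python
class FingerprintPool:
    """Tracks per-fingerprint success rates and favors the best performers.

    Fingerprints here are just opaque labels; a real scraper's fingerprint
    would bundle user agent, IP origin, request timing, and so on.
    """

    def __init__(self, fingerprints):
        self.stats = {fp: {"ok": 0, "blocked": 0} for fp in fingerprints}

    def record(self, fp, blocked):
        """Feed back one scraping attempt's outcome for this fingerprint."""
        self.stats[fp]["blocked" if blocked else "ok"] += 1

    def success_rate(self, fp):
        s = self.stats[fp]
        total = s["ok"] + s["blocked"]
        return s["ok"] / total if total else 0.5  # unseen -> neutral prior

    def best(self):
        """Pick the fingerprint with the highest observed success rate."""
        return max(self.stats, key=self.success_rate)

pool = FingerprintPool(["fp-A", "fp-B"])
pool.record("fp-A", blocked=False)
pool.record("fp-A", blocked=False)
pool.record("fp-B", blocked=True)
print(pool.best())
```

A fuller system would use these recorded outcomes as labeled training data for an ML model rather than a raw success-rate ranking, but the feedback loop is the same.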

III. DESIGN OF SOFTWARE

Fig. 3: Technical Architecture and Workflow

Step 1: Enter the URL and click on Search.
Step 2: Select the type of data to be extracted.

Step 3: Click on the download button. The output is provided in an Excel file.

Fig. 4: Service data extraction
Fig. 5: Location data extraction

IV. CONCLUSION

At some time soon, applying AI and machine learning to unstructured data will become inevitable. This business intelligence data extraction can help create a financial news sentiment analysis to assess the effect on market value and other drivers, aiding strategic planning and assisting management in identifying key strategic levers.

Building AI and machine learning models could appear to be a difficult undertaking to some people. Web crawling, however, is a game with a lot of moving components. It is not necessary to develop a single, all-encompassing ML model that can perform every task. Attend to the smaller chores first (such as dynamic user-agent creation); small ML-based models will eventually allow you to construct the whole web crawling system.

It can also help businesses create an intelligent search engine to gain visibility into multiple competitors' products, the services they offer, and their presence in different regions, based on up-to-date and comprehensive data, in order to make better deals and gain a competitive edge.

REFERENCES

[1.] Acar, G., Juarez, M., Nikiforakis, N., Diaz, C., Gürses, S., Piessens, F., & Preneel, B. (2013). FPDetective: Dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. New York: ACM.
[2.] Bar-Ilan, J. (2001). Data collection methods on the web for infometric purposes – A review and analysis. Scientometrics, 50(1), 7–32.
[3.] Butler, J. (2007). Visual web page analytics. Google Patents.
[4.] Doran, D., & Gokhale, S. S. (2011). Web robot detection techniques: Overview and limitations. Data Mining and Knowledge Discovery, 22(1), 183–210.
[5.] Yi, J., Nasukawa, T., Bunescu, R., & Niblack, W. (2003). Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003). Melbourne, Florida, USA: IEEE.

IJISRT22SEP1100 www.ijisrt.com 1149


Volume 7, Issue 9, September – 2022 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165
Biography of Author

Tamilselvan Arjunan is working as an Assistant Manager at Ernst and Young Strategy. He has a total of 7 years of hands-on experience in machine learning, data science, and Python. He has built many AI-based products for clients. He is certified in Data Science and Python. He completed a bachelor's degree in mechanical engineering from Anna University.
