
Irs Unit5


Heterogeneous Data:

Heterogeneous data are any data with high variability of data types and formats. They are possibly
ambiguous and low quality due to missing values, high data redundancy, and untruthfulness. It is
difficult to integrate heterogeneous data to meet the business information demands.

For example, heterogeneous data are often generated from Internet of Things (IoT).

Data generated from IoT often has the following four features.

1. First, they are heterogeneous. Because the data acquisition devices vary, the acquired data also
differ in type and format.
2. Second, they are large-scale. Massive amounts of data are used and distributed; not only the currently
acquired data but also the historical data within a certain time frame must be stored.
3. Third, there is a strong correlation between time and space. Every data acquisition device is placed
at a specific geographic location and every piece of data has a time stamp.
4. Fourth, effective data accounts for only a small portion of the big data. A great deal of noise
may be collected during the acquisition and transmission of data in IoT.

There are the following types of data heterogeneity:

• Syntactic heterogeneity occurs when two data sources are not expressed in the same language.

• Conceptual heterogeneity, also known as semantic heterogeneity or logical mismatch, denotes the
differences in modelling the same domain of interest.

• Terminological heterogeneity stands for variations in names when referring to the same entities
from different data sources.

• Semiotic heterogeneity, also known as pragmatic heterogeneity, stands for different interpretations
of entities by people.

Data Processing Methods for Heterogeneous Data:


Data Cleaning: Data cleaning is the process of identifying incomplete, inaccurate, or unreasonable data
and then modifying or deleting such data to improve data quality. For example, the multisource and
multimodal nature of healthcare data results in high complexity and noise problems. In addition, there
are also problems of missing values and impurity in high-volume data. Since data quality determines
information quality, which eventually affects the decision-making process, it is critical to develop
efficient big data cleansing approaches that improve data quality for making accurate and effective
decisions.
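The identify-then-modify-or-delete idea can be sketched in a few lines. This is only an illustration: the sensor records, the `temp_c` field, and the plausible range of -40 to 60 are all assumptions made for the example, not part of any particular cleaning tool.

```python
# Hypothetical sensor readings; the field names and the valid range are assumptions.
readings = [
    {"device": "a1", "temp_c": 21.5},
    {"device": "a2", "temp_c": None},    # incomplete: missing value
    {"device": "a3", "temp_c": 999.0},   # unreasonable: outside the plausible range
    {"device": "a1", "temp_c": 21.5},    # redundant: exact duplicate
    {"device": "a4", "temp_c": 19.0},
]

def clean(records, low=-40.0, high=60.0):
    """Drop incomplete, out-of-range, and duplicate records."""
    seen = set()
    cleaned = []
    for rec in records:
        value = rec.get("temp_c")
        if value is None or not (low <= value <= high):
            continue  # delete incomplete or unreasonable data
        key = (rec["device"], value)
        if key in seen:
            continue  # delete redundant data
        seen.add(key)
        cleaned.append(rec)
    return cleaned

cleaned = clean(readings)
print(cleaned)
```

Here bad records are simply deleted; a real pipeline might instead modify them, for instance by filling a missing value from neighbouring readings.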

Data Integration: In data integration or aggregation, datasets are matched and merged
on the basis of shared variables and attributes. Advanced data processing and analysis techniques make
it possible to mix structured and unstructured data for new insights; however, this requires “clean” data.
Data fusion techniques are used to match and aggregate heterogeneous datasets to create and enhance
a representation of the data that helps data mining.
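Matching and merging on a shared attribute can be illustrated with a small join. The two datasets and the shared key `patient_id` are hypothetical, and `merge_on` is a helper written for this sketch rather than a library function.

```python
# Hypothetical datasets that share the attribute "patient_id".
visits = [
    {"patient_id": 1, "visit_date": "2024-01-05"},
    {"patient_id": 2, "visit_date": "2024-01-09"},
    {"patient_id": 3, "visit_date": "2024-02-11"},
]
labs = [
    {"patient_id": 1, "glucose": 5.4},
    {"patient_id": 3, "glucose": 7.1},
]

def merge_on(left, right, key):
    """Inner join: keep records whose key value appears in both datasets."""
    index = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            merged.append({**row, **match})  # combine attributes from both sources
    return merged

merged = merge_on(visits, labs, "patient_id")
print(merged)
```

Patient 2 has no lab record, so it drops out of the merged result; whether to keep such unmatched records is a design choice that depends on the analysis.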

New Section 1 Page 1


RECOMMENDER SYSTEM:

• Searching for the right information has always been difficult. Not so long ago, relevant documents
were stored in physical libraries and discovering them was a lengthy and complicated process.
When documents became available through online repositories, the number of indexed
documents started to grow beyond physical storage limits.

• The same applies to the number of products offered by e-commerce sites and to the content available
through online streaming services. Users tend to prefer finding everything in one place, and most
of them enjoy choosing from a larger set of relevant alternatives, so service providers need to adapt.
Global services (such as Google, Amazon, Netflix, Spotify), where users can find almost anything, are on
the rise. One of the most powerful tools driving their global dominance is their highly advanced
personalization powered by machine learning techniques. These techniques are recommender
systems and personalized search.

• Recommender systems use the history of users' interactions with items to produce a ranked list of the
most relevant items for any given user. A search engine ranks items by their similarity to a given query,
regardless of user history.

• Recommender systems enable users to discover relevant documents, products, or content online.
Often, items that users might like the most are hidden deep among millions of other items. Users
are not able to find such items directly through the search engine because they rarely know their
label or might not even be aware of their existence.

• On the other hand, sometimes users are looking for a specific item and are willing to help the
online system by expressing their needs in order to reduce number of possible items that can be
recommended.
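As a rough illustration of ranking items from interaction history, the sketch below scores items a user has not seen by how much the users who have them overlap with that user. The history data and the scoring rule are assumptions chosen for simplicity; production recommenders use far richer models.

```python
from collections import Counter

# Hypothetical interaction history: user -> set of items they interacted with.
history = {
    "ann": {"book_a", "book_b", "book_c"},
    "bob": {"book_a", "book_b", "book_d"},
    "cat": {"book_b", "book_d"},
    "dan": {"book_e"},
}

def recommend(user, history, top_n=3):
    """Rank unseen items by the overlap between `user` and the users who have them."""
    seen = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(seen & items)
        if overlap == 0:
            continue  # users with nothing in common contribute no evidence
        for item in items - seen:
            scores[item] += overlap
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]

print(recommend("ann", history))
print(recommend("cat", history))
```

No query is involved: the ranking comes entirely from user history, which is the contrast with a search engine drawn above.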

Two general ways of personalizing search results:

Query expansion:
● Modify or augment user query
● E.g., query term “IR” can be augmented with either “information retrieval” or “Ice-rum” depending on user
interest
● Ensures that there are enough personalized results
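Query expansion can be sketched as a lookup that adds an interest-specific phrase to an ambiguous term. The expansion table and the interest labels (`computing`, `drinks`) are hypothetical; only the "IR" example itself comes from the notes.

```python
# Hypothetical expansion table: ambiguous term -> expansion per user interest.
EXPANSIONS = {
    "IR": {
        "computing": "information retrieval",
        "drinks": "Ice-rum",
    },
}

def expand_query(query, interest, expansions=EXPANSIONS):
    """Append an interest-specific expansion after each ambiguous query term."""
    terms = []
    for term in query.split():
        terms.append(term)
        extra = expansions.get(term, {}).get(interest)
        if extra:
            terms.append(extra)
    return " ".join(terms)

print(expand_query("IR tutorial", "computing"))
```

A user with no known interest gets the query back unchanged, so expansion never removes what the user typed.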

Reranking:
● Issue the same query and fetch the same results,
● but rerank the results based on a user profile
● Allows both personalized and globally relevant results
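A minimal sketch of reranking, assuming each result carries a global relevance score and a topic, and that the user profile is a map from topic to interest weight. The blending weight `alpha` and all of the data are assumptions for illustration.

```python
# Hypothetical results for one query: (document, global relevance score, topic).
results = [
    ("doc_a", 0.90, "sports"),
    ("doc_b", 0.80, "finance"),
    ("doc_c", 0.70, "sports"),
]
# Hypothetical user profile: topic -> interest weight.
profile = {"finance": 0.5}

def rerank(results, profile, alpha=0.5):
    """Order the same results by a blend of global score and profile interest."""
    def blended(result):
        _, score, topic = result
        return (1 - alpha) * score + alpha * profile.get(topic, 0.0)
    return sorted(results, key=blended, reverse=True)

print([doc for doc, _, _ in rerank(results, profile)])
```

With `alpha=0` the original global order is returned, and larger values of `alpha` pull the profile's topics upward, which is how both personalized and globally relevant results can coexist.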

Text Similarity: In Natural Language Processing (NLP), the question “how similar are two
words, phrases, or documents?” is a crucial topic for research and applications. Text
similarity measures how close two words, phrases, or documents are to each other. That closeness may be
lexical or in meaning.
Semantic similarity is about closeness of meaning, and lexical similarity is about closeness of the word sets.
Let’s check the following two phrases as an example:
● The dog bites the man
● The man bites the dog
According to the lexical similarity, those two phrases are very close and almost identical because they have the
same word set. For semantic similarity, they are completely different because they have different meanings
despite the similarity of the word set.
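The word-set comparison can be made concrete with the Jaccard coefficient (size of the intersection divided by size of the union). Jaccard is one common choice of lexical measure, used here as an assumption; the notes do not name a specific one.

```python
def jaccard(a, b):
    """Lexical similarity of two phrases as overlap of their lowercase word sets."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty phrases are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

print(jaccard("The dog bites the man", "The man bites the dog"))  # 1.0
```

Both phrases reduce to the word set {the, dog, bites, man}, so the score is 1.0 even though the meanings differ. Word order is invisible to a purely lexical measure.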



Semantic Similarity:
Semantic similarity is the similarity between two classes of objects in a taxonomy (the scientific
process of arranging things into groups). A class C1 in the taxonomy is considered to be a subclass of
C2 if all the members of C1 are also members of C2. Therefore, the similarity between two classes is
based on how closely they are related in the taxonomy.

The main objective of semantic similarity is to measure the distance between the semantic meanings of
a pair of words, phrases, sentences, or documents. For example, the word “car” is more similar to “bus”
than it is to “cat”. There are two main approaches to measuring semantic similarity:
● knowledge-based approaches
● corpus-based, distributional methods

Semantic similarity between two pieces of text measures how close their meanings are. This measure
is usually a score between 0 and 1: 0 means not close at all, and 1 means the meanings are almost
identical.

Types of Semantic Similarity:

• Knowledge-Based Similarity:

We use this type to determine the semantic similarity between concepts. This type represents each
concept by a node in a graph. This method is also called the topological method because the graph is
used as a representation for the collection of corpus concepts. The fewer edges there are on the
shortest path between two concepts (nodes), the closer they are in meaning. In the accompanying example
graph, the concepts form a topology in which “coin” is closer to “money” than “credit card”.
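The minimum-edge idea can be sketched with a breadth-first search. The original figure is not available here, so the concept graph below is a hypothetical stand-in whose edges were chosen only to illustrate the method.

```python
from collections import deque

# Hypothetical undirected concept graph; the edges are illustrative only.
GRAPH = {
    "money": ["coin", "credit"],
    "coin": ["money"],
    "credit": ["money", "credit card"],
    "credit card": ["credit"],
}

def edge_distance(graph, start, goal):
    """Return the minimum number of edges between two concepts, or None."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the concepts are not connected

print(edge_distance(GRAPH, "coin", "money"))
print(edge_distance(GRAPH, "credit card", "money"))
```

In this toy graph “coin” is one edge from “money” while “credit card” is two edges away, so “coin” counts as the closer concept.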

• Statistical-Based Similarity:

This type calculates semantic similarity from features learned from the corpus. In this
type, most of the previous techniques can be combined with word embeddings for better results
because word embeddings capture the semantic relations between words.
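To illustrate how vectors learned from a corpus can express semantic relations, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are invented for the example; real embeddings are learned from data and have many more dimensions.

```python
import math

# Toy 3-dimensional vectors (made up); real embeddings are learned from a corpus.
EMBEDDINGS = {
    "car": [0.9, 0.8, 0.1],
    "bus": [0.8, 0.9, 0.2],
    "cat": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

car_bus = cosine(EMBEDDINGS["car"], EMBEDDINGS["bus"])
car_cat = cosine(EMBEDDINGS["car"], EMBEDDINGS["cat"])
print(car_bus > car_cat)
```

With these assumed vectors, “car” scores higher against “bus” than against “cat”, matching the earlier example. The function requires non-zero vectors because it divides by their lengths.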

• String-Based Similarity:

String-based measures are not used on their own to measure semantic similarity; they are combined
with other types, for example when measuring the distance between non-zero vectors.



• Language Model-Based Similarity:

This is a novel type of semantic similarity measurement between two English phrases, with the
assumption that they are syntactically correct. This type has five main steps:
1. Remove stop words
2. Tag the two phrases using any Part of Speech (POS) algorithm
3. From the tagging output, form a structure tree (parse tree) for each phrase
4. Build an undirected weighted graph from the parse trees
5. Finally, calculate the similarity as the minimum-distance path between nodes (words)

Applications of Semantic Text Similarity:


For natural language processing (NLP):

We use the semantic similarity in many applications, like sentiment analysis, natural language
understanding, machine translation, question answering, chatbots, search engines, and information
retrieval.

For informatics sciences:

We have applications in the biomedical field and in geo-informatics. Biomedical informatics builds
biomedical ontologies mainly using semantic similarity methods.

Web Mining:
Web mining is the application of data mining techniques to automatically discover and extract information
from Web documents and services. The main purpose of web mining is to discover useful information
from the World Wide Web and its usage patterns.

Applications of Web Mining:


1. Web mining helps to improve the power of web search engines by classifying web documents and
identifying web pages.
2. It is used for web searching (e.g., Google, Yahoo) and vertical searching (e.g., Fat Lens).
3. Web mining is used to predict user behaviour.
4. Web mining is very useful for a particular website or e-service, e.g., for landing page optimization.

Web mining can be broadly divided into three different types of mining techniques:

• Web Content Mining
• Web Structure Mining
• Web Usage Mining



1. Web Content Mining: Web content mining is the extraction of useful
information from the content of web documents. Web content consists of several types of
data: text, image, audio, video, etc. Content data is the collection of facts that a web page is
designed to convey. It can reveal effective and interesting patterns about user needs. Mining text
documents draws on text mining, machine learning, and natural language processing, so this kind of
mining is also known as text mining. It scans and mines the text, images, and groups of web pages
according to the content that the input calls for.
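A small sketch of mining page content, using Python's standard `html.parser` to pull out visible text and count terms. The sample page is made up, and counting word frequencies is just one simple stand-in for "extracting useful information".

```python
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

def term_counts(html):
    """Return word frequencies for the visible text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).lower().split()
    return Counter(words)

# Hypothetical page used only for this example.
page = ("<html><body><h1>Web mining</h1><p>Mining web content</p>"
        "<script>var x = 1;</script></body></html>")
counts = term_counts(page)
print(counts.most_common(2))
```

The script body is ignored because it is not content a reader sees; images, audio, and video would need their own extraction steps.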

2. Web Structure Mining: Web structure mining is the discovery of structure
information from the web. The web graph consists of web pages as nodes and
hyperlinks as edges connecting related pages. Structure mining produces a structured
summary of a particular website. It identifies relationships between web pages that are linked by
shared information or by a direct link. Web structure mining can be very useful, for example, for
determining the connection between two commercial websites.
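Treating pages as nodes and hyperlinks as edges, one of the simplest structural summaries is how many other pages link to each page. The link graph below is hypothetical, and in-link counting is only a basic stand-in for fuller link-analysis methods.

```python
from collections import Counter

# Hypothetical web graph: page -> pages it links to.
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "blog": ["home"],
}

def in_link_counts(links):
    """Count, for every page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # ignore repeated links from the same page
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts

in_links = in_link_counts(LINKS)
print(in_links.most_common())
```

Pages with many in-links, such as `home` here, are structurally central, while a page with none, such as `blog`, is reachable only if a visitor already knows its address.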

3. Web Usage Mining: Web usage mining is the discovery of
interesting usage patterns from large data sets. These patterns make it possible to understand
user behaviour. In web usage mining, users access data on the web and their accesses are
recorded in the form of logs, so usage mining is also called log mining.
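Log mining can be sketched by counting page views and page-to-page transitions. The log format used here (`<timestamp> <user> <path>`) is a simplified assumption; real server logs carry more fields and need a proper parser.

```python
from collections import Counter, defaultdict

# Hypothetical log lines: "<timestamp> <user> <path>"; real server logs differ.
LOG = """\
2024-01-01T10:00:00 u1 /home
2024-01-01T10:00:05 u1 /products
2024-01-01T10:01:00 u2 /home
2024-01-01T10:01:30 u2 /products
2024-01-01T10:02:00 u1 /checkout
"""

def mine_usage(log_text):
    """Return page-view counts and page-to-page transition counts."""
    page_views = Counter()
    paths_by_user = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue
        _, user, path = line.split()
        page_views[path] += 1
        paths_by_user[user].append(path)
    # Count consecutive page pairs within each user's click sequence.
    transitions = Counter()
    for paths in paths_by_user.values():
        transitions.update(zip(paths, paths[1:]))
    return page_views, transitions

page_views, transitions = mine_usage(LOG)
print(page_views.most_common(1))
print(transitions.most_common(1))
```

The transition counts show that both users moved from `/home` to `/products`, which is the kind of usage pattern that supports predictions about user behaviour.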



Challenges in Web Mining:
The web poses great challenges for resource and knowledge discovery, based
on the following observations:

● The complexity of web pages: Web pages do not have a unifying structure. They
are far more complex than traditional text documents. The web's digital library holds an
enormous number of documents, and they are not organized in any specific order.

● The web is a dynamic data source: The data on the internet is updated quickly, for
example news, climate, shopping, financial news, sports, and so on.

● Diversity of client networks: The client network on the web is expanding quickly.
These clients have different interests, backgrounds, and usage purposes. Over
a hundred million workstations are connected to the internet, and the number is still
increasing rapidly.

● Relevancy of data: A given person is generally interested in only a small portion
of the web; the rest of the web contains data that is unfamiliar to the user and may
lead to unwanted results.

● The web is too broad: The size of the web is tremendous and rapidly increasing. The
web appears to be too huge for data warehousing and data mining.

Applications of Web Mining:

● Marketing and conversion tool
● Data analysis of website and application performance
● Audience behaviour analysis
● Advertising and campaign performance analysis
● Testing and analysis of a site
