Module 1 and NoSQL
• affect the ability to parse text, detect sentiment, and identify new
patterns.
• New data types include content, geospatial, hardware data points, location-based,
log data, machine data, metrics, mobile, physical data points, process, RFID, search,
sentiment, streaming data, social, text, and web.
• The addition of unstructured data such as speech, text, images, and video
increasingly complicates the ability to categorize data.
• Some technologies that deal with unstructured data include data mining, text
analytics, and noisy text analytics.
• Today, data is being produced in large volumes, but merely collecting it is of
no use. Instead, we have to look for data from which business insights can be
generated, which adds “value” to the company.
• This is where Big data analytics comes into the picture. Some companies have
invested in data and data storage infrastructure but fail to understand that
aggregating data does not by itself add value.
• Data analytics helps to derive useful insights from the collected data. These
insights, in turn, add value to the decision-making process.
• The Validity and Veracity of Big data can be described as the assurance of quality
or credibility of the collected data.
• Since Big data is vast and involves so many data sources, it is possible that
not all the collected data is accurate and of good quality.
• Hence, when processing big data sets, it is important to check the validity of the
data before proceeding with further analysis.
• Questions such as "Can you trust the data that you have collected?" and "Is the
data reliable enough?" need to be answered before the data is processed for
further analysis.
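A validity check of the kind described above can be sketched in Python. This is only an illustration: the record fields (`sensor_id`, `temperature_c`) and the plausible temperature range are hypothetical and do not come from the notes.

```python
def is_valid(record, low=-50.0, high=60.0):
    """Return True when a record has an ID and a plausible temperature.

    The field names and the [low, high] range are assumptions made
    for this example only.
    """
    sensor_id = record.get("sensor_id")
    temp = record.get("temperature_c")
    if not sensor_id:
        return False  # missing or empty source identifier
    if not isinstance(temp, (int, float)):
        return False  # missing or non-numeric measurement
    return low <= temp <= high


def split_by_validity(records):
    """Separate records that pass the check from those that fail it."""
    valid = [r for r in records if is_valid(r)]
    rejected = [r for r in records if not is_valid(r)]
    return valid, rejected


records = [
    {"sensor_id": "s1", "temperature_c": 21.5},
    {"sensor_id": "", "temperature_c": 19.0},     # no source ID
    {"sensor_id": "s3", "temperature_c": 900.0},  # implausible value
    {"sensor_id": "s4", "temperature_c": None},   # missing value
]
valid, rejected = split_by_validity(records)
```

Only the records in `valid` would go on to further analysis; the rejected ones can be logged and inspected to find unreliable sources.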
• ii. Web log data: When servers, applications, networks, and so on operate,
they capture all kinds of data about their activity.
• iii. Point-of-sale data: Generated when the cashier scans the bar code of any
product that you are purchasing.
• iv. Financial data: such as the company symbol and dollar value.
• ii. Click-stream data: Data is generated every time you click a
link on a website.
Machine-generated:
• i. Satellite images: includes weather data or the data that the government captures in its satellite
surveillance imagery.
• ii. Scientific data: includes seismic imagery, atmospheric data and high energy physics.
• iii. Photographs and video: includes security, surveillance, and traffic video.
• iv. Radar or sonar data: includes vehicular, meteorological, and oceanographic data.
• Graph stores are highly optimized to store graph nodes and the links
between them efficiently, and they allow you to query these graphs.
• Graph databases are useful for any business problem that has
complex relationships between objects, such as social networking,
rules-based engines, and creating mashups.
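To make the idea of nodes, links, and graph queries concrete, here is a minimal in-memory sketch. It assumes a tiny, made-up "follows" network stored as an adjacency dictionary; a real graph store uses its own storage format and query language, which this example does not try to reproduce.

```python
from collections import deque

# Hypothetical social graph: each node maps to the nodes it links to.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin"},
    "dave": set(),
    "erin": set(),
}


def within_hops(graph, start, max_hops):
    """Return the nodes reachable from start by following at most
    max_hops links, using a breadth-first traversal."""
    seen = {start}
    reached = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached


direct = within_hops(follows, "alice", 1)
two_hops = within_hops(follows, "alice", 2)
```

A query like "who is within two links of alice" is the sort of relationship question that becomes awkward with joins in a relational model but is a simple traversal over nodes and links.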
• These are important NoSQL data architecture patterns because they can
scale to manage large volumes of data.
• In the MapReduce framework, the map operation has a master node that
breaks up an operation into subparts and distributes each subpart to
another node for processing. Reduce is the process in which the master
node collects the results from the other nodes and combines them into the
answer to the original problem.
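The split, process, and combine flow described above can be sketched as a word count. This is a single-process simulation under stated assumptions: the "nodes" are ordinary function calls, and the function names are invented for this example rather than taken from any MapReduce framework.

```python
from collections import Counter


def split_into_chunks(lines, n_workers):
    """Master step: break the input into one subpart per worker."""
    return [lines[i::n_workers] for i in range(n_workers)]


def map_chunk(chunk):
    """Worker step: count the words in one subpart independently."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Master step: combine the partial results into the final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


def word_count(lines, n_workers=3):
    chunks = split_into_chunks(lines, n_workers)
    # In a real cluster each call below would run on a separate node.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(partials)


result = word_count(["big data", "data store", "big big"])
```

Because each `map_chunk` call touches only its own subpart, the work can be spread across as many nodes as there are chunks, which is what lets this pattern scale to large volumes of data.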
• Column family stores use row and column identifiers as general-purpose
keys for data lookup. They’re sometimes referred to as data stores rather
than databases.
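The lookup-by-identifiers idea can be illustrated with a small in-memory sketch. It assumes that a cell is addressed by a (row ID, column family, column name) tuple; the class and method names are hypothetical and are not the API of any particular product.

```python
class ColumnFamilyStore:
    """Toy store whose only index is the (row, family, column) key."""

    def __init__(self):
        self._cells = {}

    def put(self, row_id, family, column, value):
        # The row and column identifiers together form the lookup key.
        self._cells[(row_id, family, column)] = value

    def get(self, row_id, family, column, default=None):
        return self._cells.get((row_id, family, column), default)

    def row(self, row_id):
        """Collect every cell stored under one row identifier."""
        return {
            (family, column): value
            for (r, family, column), value in self._cells.items()
            if r == row_id
        }


store = ColumnFamilyStore()
store.put("user:1", "profile", "name", "Ann")
store.put("user:1", "profile", "city", "Pune")
store.put("user:1", "activity", "last_login", "2024-01-05")
store.put("user:2", "profile", "name", "Raj")

name = store.get("user:1", "profile", "name")
missing = store.get("user:2", "profile", "city")
user1 = store.row("user:1")
```

Rows do not need to share the same columns: `user:2` simply has no `city` cell, so nothing is stored for it.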
• Document stores work in the opposite manner: the key may be a simple ID
• But you can get almost any item out of a document store by querying any
value or content within the document.
• Each branch has a related path expression that shows you how to
navigate from the root of the tree to any given branch, sub-branch or
value.
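Querying by content and navigating by path expression can both be sketched with nested Python dictionaries standing in for documents. The slash-separated path syntax and the sample documents are assumptions for illustration only; real document stores define their own query and path languages.

```python
def walk(node, path=""):
    """Yield (path, value) for every leaf value in a nested document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, f"{path}/{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, f"{path}/{index}")
    else:
        yield path, node


def find_documents(store, value):
    """Return the IDs of documents that contain value at any leaf."""
    return [
        doc_id
        for doc_id, doc in store.items()
        if any(leaf == value for _, leaf in walk(doc))
    ]


def get_by_path(doc, path):
    """Follow a path such as '/author/city' from the root to a value."""
    node = doc
    for step in path.strip("/").split("/"):
        node = node[int(step)] if isinstance(node, list) else node[step]
    return node


# Hypothetical documents keyed by a simple ID.
store = {
    "doc1": {
        "title": "NoSQL notes",
        "author": {"name": "Ann", "city": "Pune"},
        "tags": ["graph", "document"],
    },
    "doc2": {
        "title": "SQL notes",
        "author": {"name": "Raj", "city": "Delhi"},
        "tags": ["relational"],
    },
}

matches = find_documents(store, "Pune")
city = get_by_path(store["doc1"], "/author/city")
second_tag = get_by_path(store["doc1"], "/tags/1")
```

Here `find_documents` never uses the document ID to locate a match; it searches the content itself, while `get_by_path` shows how a path identifies a single branch or value in the document tree.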