BDA ESE


Module 1: Big Data Introduction

1. What are the characteristics of Big Data?


The characteristics of Big Data are often described using the 5Vs:

1. Volume:
• Big Data is characterized by a massive volume of data.
• The size of data is crucial in determining its value.
• Examples include the vast number of messages and posts generated by platforms
like Facebook, the data generated by a Boeing 737 during a single flight, and the
daily ingestion of 500 terabytes of new data by Facebook.

2. Velocity:
• Refers to the speed of generation and processing of data.
• It deals with how fast data is generated and processed to meet demands.
• Examples include real-time personalization with streaming analytics, high-
frequency stock trading algorithms, and machine-to-machine processes exchanging
data between billions of devices.

3. Variety:
• Variety refers to the heterogeneous sources and nature of data, including both
structured and unstructured data.
• Unlike earlier days when spreadsheets and databases were the main sources, today's
data includes emails, photos, videos, PDFs, audio, etc.
• This variety of unstructured data poses challenges for storage, mining, and analysis.

4. Veracity:
• Veracity is about the trustworthiness of data.
• Data involves uncertainty and ambiguities, and mistakes can be introduced by both
humans and machines.
• Data quality is crucial for reliable analytics and conclusions.

5. Value:
• The raw data of Big Data has low value on its own.
• The value increases through analytics and the development of theories about the
data.
• Analytics transform Big Data into smart data, making it useful and valuable.
2. Define Big Data and Hadoop. How are they interconnected?
Big Data:
Big Data refers to large and complex datasets that cannot be effectively managed, processed,
and analysed using traditional data processing tools. These datasets typically have high
volumes, high velocity (generated or updated rapidly), and high variety (structured and
unstructured data). The challenges of Big Data include capturing, storing, analysing, and
visualizing this massive amount of data to derive meaningful insights and make informed
decisions.

Hadoop:
Hadoop is an open-source framework for distributed storage and processing of large
datasets. It was developed by Doug Cutting and Michael J. Cafarella in 2005 to support the
Nutch search engine project, with Yahoo later becoming a major sponsor of its development.
Hadoop became an Apache Software Foundation project in 2006. It provides a scalable,
fault-tolerant, and cost-effective solution for handling Big Data.

Interconnection:
Hadoop addresses the challenges posed by Big Data through its distributed storage and
processing model. Here are some key points on how Big Data and Hadoop are
interconnected:

1. Scalability: Hadoop is designed to scale from single servers to thousands of machines. This
scalability is essential for handling the vast volumes of data in Big Data.

2. Distributed Processing: Hadoop allows developers to use multiple machines for a single
task, which is crucial for processing large datasets. It employs a "shared nothing"
architecture, where nodes communicate as little as possible.

3. Data Storage and Computation: Hadoop's approach involves breaking Big Data into
smaller pieces, distributing these pieces across a cluster of machines, and performing
computations where the data is already stored. This contrasts with traditional approaches
where powerful computers process centralized data.
4. MapReduce: Hadoop utilizes the MapReduce programming model, which enables parallel
processing of data across distributed nodes. This approach facilitates the efficient processing
of large-scale data.

5. Hadoop Distributed File System (HDFS): Hadoop's architecture includes HDFS, a
distributed file system that stores data across multiple nodes. It ensures fault tolerance by
maintaining multiple copies of each file.

6. Fault Tolerance: Hadoop features built-in fault tolerance mechanisms. It maintains multiple
copies of data to handle hardware failures, ensuring that data is not lost if a machine or
component fails.

7. Easy Programming: Hadoop provides a programming model that allows developers to focus
on writing scalable programs. It abstracts the complexities of distributed computing, making it
easier for programmers to work with large-scale datasets.
3. Explain the three Vs of Big Data and provide examples of case studies, specifying which
Vs are satisfied by each case.

Big Data Exploration: Customer Example (Case Study 1)

• Volume: The case involves exploring 4 TB of data for various business solutions
such as a supplier portal and call center applications.
• Velocity: Single-point data fusion is utilized, suggesting a need for real-time or near-
real-time access to data.
• Variety: The exploration involves navigating and exploring all enterprise and
external content in a single user interface, indicating the diverse nature of the data
sources.

Enhanced 360º Customer View: Customer Example (Case Study 2)

• Volume: Creating a "Facebook" for customers implies dealing with a substantial volume of
data, especially if 200+ different customer profiles are being identified.
• Velocity: The creation of a complete customer view and leveraging new data types
for customer analysis implies a need for real-time or dynamic updates.
• Variety: The case involves combining structured and unstructured data to run
analytics, indicating a variety of data types, including social data, surveys, and
support emails.

Security/Intelligence Extension: Needs (Case Study 3)

• Volume: Analyzing all Internet traffic, including social media and email, suggests
dealing with a massive volume of data.
• Velocity: Tracking persons of interest and civil/border activity indicates a need for
real-time or near-real-time monitoring.
• Variety: The diverse sources of data, such as social media and email, highlight the
variety of data types being analyzed for security and intelligence.
Operations Analysis: Needs (Case Study 4)

• Volume: Gaining real-time visibility into operations, customer experience, transactions, and
behavior implies dealing with a substantial volume of operational data.
• Velocity: Proactively planning to increase operational efficiency and monitoring
end-to-end infrastructure in real-time indicate a need for rapid data processing.
• Variety: Analyzing a variety of machine data for improved business results involves
handling diverse types of data generated by machines.

Data Warehouse Augmentation: Needs (Case Study 5)

• Volume: The need to leverage a variety of data and optimize storage and
maintenance costs suggests dealing with large volumes of data.
• Velocity: Low latency requirements and the processing of streaming data imply a
need for timely access and analysis.
• Variety: Structured, unstructured, and streaming data sources are required, indicating
the need to handle diverse data types.
4. Describe the HDFS architecture with a diagram.

The Hadoop architecture packages together the Hadoop Distributed File System (HDFS) and a
MapReduce engine. The MapReduce engine can be MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes
Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode
and TaskTracker.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the distributed file system of Hadoop. It follows
a master/slave architecture, consisting of a single NameNode that performs the role of master
and multiple DataNodes that perform the role of slaves.

Both the NameNode and the DataNodes can run on commodity machines. HDFS is developed
in Java, so any machine that supports the Java language can run the NameNode and DataNode
software.

NameNode

o It is the single master server in the HDFS cluster.
o Because it is a single node, it can become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming,
and closing files.
o It simplifies the architecture of the system.
DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks, which are used to store the data.
o It is the responsibility of the DataNodes to serve read and write requests from the file
system's clients.
o They perform block creation, deletion, and replication upon instruction from the NameNode.

Job Tracker

o The role of the Job Tracker is to accept MapReduce jobs from clients and to schedule their
processing, consulting the NameNode for the location of the data.
o In response, the NameNode provides the required metadata to the Job Tracker.

Task Tracker

o It works as a slave node to the Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file; this process
can also be called a Mapper.
5. What are the limitations of Hadoop?

The limitations of Hadoop are summarized below:

1. Not Suitable for Small Data: Hadoop, especially Hadoop Distributed File System
(HDFS), is not well-suited for managing small amounts of data or a large number of
small files. The high-capacity design of HDFS is more efficient when dealing with a
smaller number of larger files.

2. Dependency Between Data: Hadoop may not be suitable for scenarios where there are
dependencies between data. The MapReduce paradigm, which Hadoop is based on,
divides jobs into smaller tasks that are processed independently. If there are dependencies
between data, this approach may not be optimal.

3. Job Division Limitation: Jobs in Hadoop need to be divided into small, independent
chunks to take advantage of the distributed processing capabilities. If a job cannot be
effectively divided into smaller tasks, it may not be suitable for Hadoop.

4. No Support for Real-time and Stream Processing: Hadoop, by default, is designed for
batch processing rather than real-time or stream-based processing. It is not suitable for
scenarios where immediate or continuous data processing is required.

5. Issue with Small Files: HDFS, the file system component of Hadoop, has limitations
with small files. Storing a large number of small files can overload the NameNode, which
manages the metadata. HDFS is more efficient when dealing with a smaller number of
large files.

6. Support for Batch Processing Only: Hadoop primarily supports batch processing. It is
not designed for real-time data processing, and the MapReduce framework may not fully
leverage the memory of the Hadoop cluster.

7. No Real-time Data Processing: Hadoop is not suitable for real-time data processing. While
batch processing is efficient for high volumes of data, the time it takes to process and produce
results can be significant, making it unsuitable for scenarios requiring real-time insights.

8. No Delta Iteration: Hadoop is not as efficient for iterative processing, particularly for
tasks involving cyclic data flow where the output of one stage is the input to the next.
This limitation can impact the performance of iterative algorithms.
Module 2: Big Data Processing Algorithms

6. Describe Two-stage matrix multiplication using MapReduce.

Two-stage matrix multiplication using MapReduce is a technique commonly employed in
distributed computing to efficiently compute the product of two matrices.

Stage 1: Map Phase


1. Input Distribution: The matrices are divided into smaller blocks and distributed across
the nodes in the MapReduce cluster.
2. Mapper Tasks: Each mapper task processes a subset of the blocks. For each element in
the input matrices, the mapper emits intermediate key-value pairs. The key is the index of
the result element, and the value includes the matrix identifier and the actual element
value.

Stage 2: Reduce Phase


1. Shuffling and Sorting: The MapReduce framework groups the intermediate key-value
pairs by key, ensuring that all values corresponding to a particular key are sent to the
same reducer.
2. Reducer Tasks: Each reducer receives a set of key-value pairs representing the
elements that contribute to a specific element in the resulting matrix.
3. Matrix Multiplication: The reducer performs the actual multiplication for each element
by accumulating the products of the corresponding elements from the input matrices. The
final result is the product of the two matrices.

Example:
Consider matrices A (elements A_ij) and B (elements B_jk), where C = A × B. For each A_ij the
mapper emits ((i, k), ("A", j, A_ij)) for every column k of B, and for each B_jk it emits
((i, k), ("B", j, B_jk)) for every row i of A. The reducer for key (i, k) matches values with the
same j and computes C_ik = Σ_j A_ij · B_jk. In the two-job variant, the first job instead keys on
the shared index j to produce the partial products A_ij · B_jk, and the second job sums them per
output cell (i, k).
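
A minimal plain-Python simulation of the two-job formulation (dictionaries stand in for the
distributed key-value streams; the matrices are small made-up examples):

```python
# Plain-Python simulation of two-stage (two-job) matrix multiplication.
# A is an m x n matrix, B is an n x p matrix; both are given as dicts of
# non-zero entries, which mirrors how sparse matrices are fed to MapReduce.
from collections import defaultdict

A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}   # A[i][j]
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}   # B[j][k]

# --- Job 1: join A and B on the shared index j ---
job1_map = defaultdict(list)
for (i, j), a in A.items():
    job1_map[j].append(("A", i, a))
for (j, k), b in B.items():
    job1_map[j].append(("B", k, b))

partial = []                                        # ((i, k), A_ij * B_jk)
for j, values in job1_map.items():
    a_vals = [(i, a) for tag, i, a in values if tag == "A"]
    b_vals = [(k, b) for tag, k, b in values if tag == "B"]
    for i, a in a_vals:
        for k, b in b_vals:
            partial.append(((i, k), a * b))

# --- Job 2: group the partial products by output cell and sum them ---
C = defaultdict(int)
for (i, k), prod in partial:
    C[(i, k)] += prod

print(dict(C))   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```
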
7. Define combiners and discuss when one should use a combiner in MapReduce.


Combiners:
Combiners in MapReduce are essentially "mini-reducers" that run on the mapper nodes
after the map phase. They perform a local aggregation of the intermediate key-value pairs
generated by the mappers before sending the data to the full reducer. The use of
combiners is particularly beneficial in scenarios where the reduction operation is
commutative and associative.

When to Use Combiners:

1. Bandwidth Optimization: Combiners are employed to reduce the volume of data that
needs to be transmitted from the mappers to the reducers. By aggregating and
compressing the intermediate data locally on the mapper nodes, combiners save
bandwidth and improve the overall efficiency of the MapReduce job.

2. Commutative and Associative Operations: Combiners are most effective when the
reduce operation is commutative (order of operands doesn't affect the result) and
associative (grouping of operands doesn't affect the result). In such cases, the order in
which the intermediate key-value pairs are combined does not impact the final result,
allowing for local aggregation without affecting the correctness of the computation.

3. Reduction in Shuffle and Sort Phase: The use of combiners can significantly reduce the
amount of data shuffled and sorted during the MapReduce process. This is particularly
advantageous in scenarios where network bandwidth is a limiting factor, as it minimizes
the amount of data that needs to be transferred between the map and reduce phases.

4. Resource Efficiency: Combiners help in utilizing system resources more efficiently by
performing partial aggregation on the mapper nodes. This local aggregation can be seen as a
form of "pre-reduction," which contributes to a more streamlined and optimized MapReduce
execution.
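
A small sketch (plain Python, not a framework API) of the effect of a combiner on one mapper's
word-count output:

```python
# Sketch of what a combiner does: local aggregation of one mapper's output
# before anything is sent over the network (word count: summation is
# associative and commutative, so combining locally does not change the result).
from collections import Counter

mapper_output = [("hello", 1), ("world", 1), ("hello", 1), ("hello", 1)]

# Without a combiner, all four pairs would be shuffled to the reducers.
# The combiner collapses them on the mapper node:
combined = Counter()
for word, count in mapper_output:
    combined[word] += count

print(list(combined.items()))   # [('hello', 3), ('world', 1)] -- only 2 pairs shipped
```
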
8. Explain MapReduce, shuffling in MapReduce, and provide a working example.

MapReduce:
MapReduce is a programming model and processing technique for handling large-scale
data processing tasks in a distributed and parallel fashion. It consists of two main phases:
the Map phase and the Reduce phase.

1. Map Phase:
- Input data is divided into smaller chunks and distributed to different nodes in a
computing cluster.
- Each node independently processes its chunk of data using a map function, generating
key-value pairs as intermediate outputs.

2. Shuffling Phase:
- After the Map phase, the system performs a shuffling phase to redistribute and group
the intermediate key-value pairs across the nodes based on their keys.
- The objective is to ensure that all values associated with a particular key end up at the
same reducer during the Reduce phase.

3. Reduce Phase:
- The output of the shuffling phase is then processed by a reduce function.
- Each reducer gets a set of key-value pairs grouped by key and performs the final
computation to produce the output.

Shuffling in MapReduce:
Shuffling is a critical phase in the MapReduce process where the intermediate data
produced by the mappers is transferred and rearranged to the reducers. This phase
involves network communication and coordination to ensure that all values associated
with a particular key are sent to the same reducer.

The shuffling phase includes three main steps:


1. Partitioning: The system partitions the intermediate key-value pairs based on the keys.
Each partition corresponds to a reducer.
2. Sorting: Within each partition, the intermediate key-value pairs are sorted by key. This
step is crucial for the efficiency of the subsequent reduction.
3. Data Movement: The sorted data is then transferred to the reducers, ensuring that each
reducer receives all values for a specific key.
Working Example:
Let's consider a simple word count example.

Map Phase:
- Input: "Hello world, hello Mumbai!"
- Mapper 1: {("Hello", 1), ("world", 1)}
- Mapper 2: {("hello", 1), ("Mumbai", 1)}

Shuffling Phase:
- Partitioning: Keys are assigned to partitions by hashing; suppose "Hello" and "hello" land in
partition 1 and "world" and "Mumbai" in partition 2 (note that "Hello" and "hello" remain
distinct keys).
- Sorting: Within each partition, sort by key.
- Data Movement: Send the sorted data to the reducers.

Reduce Phase:
- Reducer 1: {("Hello", [1]), ("hello", [1])}
- Reducer 2: {("Mumbai", [1]), ("world", [1])}
- Final output: {("Hello", 1), ("hello", 1), ("Mumbai", 1), ("world", 1)}
(If the mapper lowercased every word first, the output would instead contain ("hello", 2).)
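
The same example can be simulated in a few lines of plain Python, with itertools.groupby
standing in for the framework's shuffle-and-sort:

```python
# Plain-Python walk-through of map -> shuffle (group by key) -> reduce for
# the word-count example above (case-sensitive keys, as in the example).
from itertools import groupby

line = "Hello world, hello Mumbai!"

# Map: emit (word, 1) for every token.
pairs = [(word.strip(",!"), 1) for word in line.split()]

# Shuffle: sort by key so equal keys are adjacent, then group them.
pairs.sort(key=lambda kv: kv[0])
grouped = [(key, [v for _, v in group]) for key, group in groupby(pairs, key=lambda kv: kv[0])]

# Reduce: sum the values for each key.
counts = [(key, sum(values)) for key, values in grouped]
print(counts)   # [('Hello', 1), ('Mumbai', 1), ('hello', 1), ('world', 1)]
```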

OR
9. List relational algebra operations and explain any four using MapReduce.

Relational algebra operations are fundamental operations used in database query languages to
manipulate and retrieve data from relational databases. MapReduce, a programming model for
processing and generating large datasets, can be used to implement these operations in a
distributed computing environment. Here's an explanation of four relational algebra operations
and how they can be implemented using MapReduce:

1. Selection:
- Relational Algebra Operation:
- σ_condition(R): Selects tuples from relation R that satisfy a given condition.
- MapReduce Implementation:
- Map phase: Each mapper filters tuples based on the selection condition.
- Reduce phase: The reducer is effectively the identity function; it simply merges and passes
through the filtered tuples (no further aggregation is needed).

2. Projection:
- Relational Algebra Operation:
- π_attribute1, attribute2, ..., attributeN(R): Selects specific columns (attributes) from
relation R.
- MapReduce Implementation:
- Map phase: For each tuple, the mapper emits a key-value pair whose key is the tuple
restricted to the selected attributes (the value can be empty).
- Reduce phase: Reducers output one copy of each key, eliminating the duplicates that
projection can introduce.

3. Union, Intersection, and Difference:


- Relational Algebra Operations:
- Union: R ∪ S - Combines tuples from relations R and S without duplicates.
- Intersection: R ∩ S - Returns tuples common to both relations R and S.
- Difference: R - S - Returns tuples that are in relation R but not in relation S.
- MapReduce Implementation:
- For Union: Concatenate the tuples from both relations and perform a distinct
operation.
- For Intersection: Map tuples from both relations with a flag indicating the source
relation, and then reduce by selecting tuples with occurrences from both relations.
- For Difference: Map tuples from R with a flag, map tuples from S with a different
flag, and then reduce by selecting tuples with occurrences only from R.
4. Natural Join:
- Relational Algebra Operation:
- R ⨝ S: Returns the combination of tuples from relations R and S that have equal
values on their common attributes.
- MapReduce Implementation:
- Map phase: For each relation, emit key-value pairs with the join attribute as the key
and the entire tuple as the value.
- Reduce phase: For each key (join attribute value), merge tuples from both relations
that share the key.

Note:
- Grouping and Aggregation are also common relational algebra operations, but they are
typically more complex to implement in a MapReduce framework. They often require
multiple MapReduce jobs, as the aggregation needs to be done globally across the entire
dataset.

Implementing relational algebra operations in MapReduce requires careful design to distribute
the computation across multiple nodes in a cluster, taking advantage of the parallel processing
capabilities of the framework. Additionally, optimizations such as data partitioning and
combiners can be employed to improve efficiency.
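
As an illustration of the natural-join case described above, here is a plain-Python sketch of a
reduce-side join (the relation contents are made up):

```python
# Sketch of a reduce-side natural join R(a, b) ⨝ S(b, c) on attribute b,
# simulated with plain Python (relation names and tuples are made up).
from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]     # tuples (a, b)
S = [("b1", "c1"), ("b2", "c2"), ("b3", "c3")]     # tuples (b, c)

# Map: key every tuple by the join attribute b, tagging its source relation.
shuffle = defaultdict(list)
for a, b in R:
    shuffle[b].append(("R", a))
for b, c in S:
    shuffle[b].append(("S", c))

# Reduce: for each join key, pair every R-value with every S-value.
joined = []
for b, values in shuffle.items():
    r_side = [a for tag, a in values if tag == "R"]
    s_side = [c for tag, c in values if tag == "S"]
    for a in r_side:
        for c in s_side:
            joined.append((a, b, c))

print(joined)   # [('a1', 'b1', 'c1'), ('a2', 'b1', 'c1'), ('a3', 'b2', 'c2')]
```
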
10.Provide MapReduce pseudocode.
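
One possible answer is word-count pseudocode; the sketch below uses the style of the
third-party mrjob Python library (any equivalent map/reduce pseudocode is acceptable):

```python
# Word-count MapReduce expressed with the third-party mrjob library
# (pip install mrjob); run locally with: python wordcount.py input.txt
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # map(key, value): emit (word, 1) for every word in the input line
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        # optional local aggregation on the mapper node
        yield word, sum(counts)

    def reducer(self, word, counts):
        # reduce(key, values): sum all partial counts for the word
        yield word, sum(counts)


if __name__ == "__main__":
    MRWordCount.run()
```
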
11.Explain PageRank and discuss whether a website's PageRank can increase or decrease.
PageRank is an algorithm developed by Larry Page and Sergey Brin, the founders of Google, to
measure the importance of web pages. It is a link analysis algorithm that assigns a numerical
weight, or "PageRank score," to each element of a hyperlinked set of documents, such as the
World Wide Web. The algorithm works on the principle that a link from one page to another can
be seen as a vote of confidence or importance.

Here's a high-level overview of how PageRank works:


1. Initialization: Initially, all pages are assigned an equal and small PageRank score.
2. Iteration: The algorithm iteratively updates the PageRank scores based on the link structure
of the web. Pages with higher PageRank scores are considered more important.
3. Calculation: The PageRank of a page is influenced by the PageRanks of the pages linking to
it. The more inbound links a page receives, the higher its PageRank. The importance of a
linking page is also a factor; a link from a high-ranking page carries more weight.
4. Damping Factor: To avoid issues with infinite loops in the web graph, a damping factor is
introduced. It typically takes a value of 0.85, meaning there's a 15% chance that a user will
randomly jump to another page rather than following a link.
5. Convergence: The iterative process continues until the PageRank scores converge to stable
values.

Can a website's PageRank increase or decrease?


Yes, a website's PageRank can both increase and decrease over time. This can happen for a
number of reasons, including:
Changes to the website's inbound links: If a website loses backlinks from other websites, its
PageRank will decrease. Conversely, if a website gains backlinks from high-quality websites, its
PageRank will increase.
Changes to other websites' outbound links: If a page that links to the website adds many more
outbound links, the PageRank it passes to the website decreases, because the score a page passes
on is divided among all of the links on that page.
Changes to the overall number of pages on the web: As the number of pages on the web
increases, the PageRank score of all pages will decrease. This is because the PageRank score is
a measure of the relative importance of a page, and the importance of a page decreases as there
are more pages to compete with.
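
A minimal Python sketch of the iterative computation described above (the link graph is a
made-up example; real systems compute this over sparse matrices or with MapReduce):

```python
# Minimal PageRank iteration sketch (damping factor 0.85), assuming a tiny
# hypothetical link graph given as {page: [pages it links to]}.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}               # equal initial scores
    for _ in range(iterations):
        new_rank = {p: (1 - d) / n for p in pages}   # random-jump share
        for page, outlinks in links.items():
            if not outlinks:                         # dangling page: spread evenly
                for p in pages:
                    new_rank[p] += d * rank[page] / n
            else:
                share = d * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))   # C accumulates the highest score for this graph
```
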
Module 3: NoSQL Databases
12.Explain the NoSQL data architecture pattern.
An architecture pattern is a logical way of categorizing how data will be stored in the
database. NoSQL is a type of database that supports operations on big data and stores it in a
suitable format. It is widely used because of its flexibility and the wide variety of services built
on it. Data is stored in a NoSQL database following one of the four data architecture patterns
below.

1. Key-Value Store Database:


This model is one of the most basic models of NoSQL databases. As the name suggests,
the data is stored in the form of Key-Value Pairs. The key is usually a sequence of strings,
integers, or characters but can also be a more advanced data type. The value is typically
linked or correlated to the key. The key-value pair storage databases generally store data
as a hash table where each key is unique. The value can be of any type (JSON,
BLOB(Binary Large Object), strings, etc). This type of pattern is usually used in
shopping websites or e-commerce applications.

Advantages:
- Can handle large amounts of data and heavy load.
- Easy retrieval of data by keys.

Limitations:
- Complex queries that involve multiple key-value pairs can be slow, since data can only be
looked up by key.
- Many-to-many relationships are awkward to model and can lead to conflicting or duplicated
data.

Examples:
- DynamoDB
- Berkeley DB
2. Column Store Database:

Rather than storing data in relational tuples, the data is stored in individual cells that are
further grouped into columns. Column-oriented databases work on columns, storing large
amounts of related data together in a column. The format and titles of the columns can differ
from one row to another, and every column is treated separately; an individual column family
may in turn contain multiple columns, somewhat like tables in traditional databases. Basically,
columns are the unit of storage in this type.

Advantages:
- Data is readily available.
- Queries like SUM, AVERAGE, COUNT can be easily performed on columns.

Examples:
- HBase
- Bigtable by Google
- Cassandra
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs but
here, the values are called Documents. Document can be stated as a complex data
structure. Document here can be a form of text, arrays, strings, JSON, XML, or any such
format. The use of nested documents is also very common. It is very effective as most of
the data created is usually in the form of JSONs and is unstructured.

Advantages:
- This type of format is very useful and apt for semi-structured data.
- Storage retrieval and managing of documents is easy.

Limitations:
- Queries that span multiple documents (joins or multi-document transactions) are challenging.
- Aggregation operations across many documents can be inefficient.

Examples:
- MongoDB
- CouchDB
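
For illustration, a small sketch contrasting a flat key-value record with a nested document
(field names are made up):

```python
# A key-value store only sees an opaque value for a key ...
kv_record = ("user:1001", '{"name": "Asha", "city": "Mumbai"}')

# ... whereas a document database stores and indexes the nested structure itself,
# so individual fields (e.g. orders.item) can be queried directly.
document = {
    "_id": 1001,
    "name": "Asha",
    "city": "Mumbai",
    "orders": [
        {"item": "laptop", "qty": 1},
        {"item": "mouse", "qty": 2},
    ],
}
print(document["orders"][0]["item"])   # "laptop"
```
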
4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data in
graphs. Graphs are basically structures that depict connections between two or more
objects in some data. The objects or entities are called nodes and are joined together by
relationships called Edges. Each edge has a unique identifier. Each node serves as a point
of contact for the graph. This pattern is very commonly used in social networks where
there are a large number of entities and each entity has one or many characteristics which
are connected by edges. The relational database pattern has tables that are loosely
connected, whereas graphs are often very strong and rigid in nature.

Advantages:
- Fastest traversal because of connections.
- Spatial data can be easily handled.

Limitations:
- Wrong connections may lead to infinite loops.

Examples:
- Neo4J
- FlockDB (Used by Twitter)
13.Define NoSQL and outline the business drivers. Discuss any two architecture patterns.
Definition:
• NoSQL database stands for “Not Only SQL” or “NOT SQL”
• Traditional RDBMS uses SQL syntax and queries to analyze and get the data for
further insights.
• NoSQL is a Database Management System that provides mechanism for
storage and retrieval of massive amount of unstructured data in distributed
environment.
Business Drivers:
1. Volume:
- The ability to handle large amounts of data.
- Important for businesses dealing with massive data volumes, such as ecommerce
companies, social media firms, and telecommunications companies.
- Examples include e-commerce companies managing terabytes or petabytes of
clickstream data and social media companies handling even larger user generated content.

2. Velocity:
- The ability to handle high-velocity data.
- Crucial for businesses requiring real-time data processing, such as financial
trading firms and fraud detection systems.
- Examples include financial trading companies processing millions of
transactions per second in real time and fraud detection systems analyzing credit
card transactions on the fly.

3. Variability:
- The ability to handle highly variable data.
- Essential for businesses collecting data from diverse sources like sensors, social
media, and web logs.
- Examples include manufacturing plants managing sensor data with varying
formats, frequencies, and volumes, and social media companies dealing with
user-generated content of varying lengths, formats, and content.

4. Agility:
- The ability to quickly adapt to changes in data or requirements.
- Vital for businesses needing to respond promptly to changing market conditions
or customer demands.
- Examples include retailers regularly adding new products or adjusting pricing
strategies, social media companies introducing new platform features or altering
content recommendation algorithms, and telecommunications companies
upgrading networks or adding new services.
14.Clarify how agility serves as a non-SQL business driver.

Agility serves as a non-SQL business driver in several ways:

1. Rapid application development: NoSQL databases are schema-less, meaning they don't
require a predefined structure for data storage. This allows developers to rapidly build and
iterate on applications without the constraints imposed by traditional relational databases.

2. Scalability: NoSQL databases are designed to handle large volumes of data and can
easily scale horizontally by adding more nodes to the database cluster. This makes
them well-suited for applications that experience rapid growth or have
unpredictable data volumes.

3. Flexibility: NoSQL databases support a variety of data models, including key-value,
document, and graph stores. This flexibility allows developers to choose the data model that
best suits the application's needs.

4. Performance: NoSQL databases are often faster than relational databases for
certain types of queries, such as those that involve unstructured data or complex
relationships.

5. Cost-effectiveness: NoSQL databases are often more cost-effective than relational databases,
especially for large-scale applications.

Here are some specific examples of how agility has driven the adoption of NoSQL
databases:

• Netflix: Netflix uses NoSQL databases to store its vast library of movies and TV
shows. This allows Netflix to quickly add new content and make changes to
existing content without disrupting the user experience.
• Amazon: Amazon uses NoSQL databases to power many of its core services,
including its e-commerce platform and its cloud computing services. This allows
Amazon to handle massive amounts of data and traffic without compromising
performance.
• Facebook: Facebook uses NoSQL databases to store social graph data for its
billions of users. This allows Facebook to quickly connect users with friends and
family, and to provide personalized recommendations.
15.Describe the characteristics of a NoSQL database.

1. Non-relational:
• NoSQL databases do not adhere to the traditional relational model.
• Tables with fixed-column records are not provided.
• Self-contained aggregates or Binary Large Objects (BLOBs) are commonly used.
• Object-relational mapping and data normalization are not required.
• Complex features like query languages, query planners, referential integrity joins,
and ACID (Atomicity, Consistency, Isolation, Durability) properties are absent.

2. Schema-free:
• NoSQL databases are schema-free or have relaxed schemas.
• No predefined schema is necessary for data storage.
• They allow heterogeneous structures of data within the same domain, providing
flexibility in data representation.

3. Simple API:
• NoSQL databases provide user-friendly interfaces for storage and data querying.
• APIs support low-level data manipulation and selection methods.
• Text-based protocols, such as HTTP and REST, are commonly used, often with
JSON (JavaScript Object Notation).
• Standard-based query languages are generally lacking, and many NoSQL databases
operate as web-enabled services accessible over the internet.

4. Distributed:
• NoSQL databases operate in a distributed fashion.
• They offer features like auto-scaling and fail-over capabilities for enhanced
performance and resilience.
• In pursuit of scalability and throughput, the ACID concept may be sacrificed.
• NoSQL databases often adopt a Shared Nothing Architecture, reducing
coordination and facilitating high distribution, essential for managing substantial
data volumes and high traffic levels.
Module 4: Mining Data Streams
16.Discuss various issues and challenges in data streaming and query processing.

1. Unbounded Memory Requirement:


• Challenge: Dealing with potentially unbounded data streams requires addressing
the growing storage needs for computing exact answers to data stream queries.
• Issue: External memory algorithms designed for large datasets may not be suitable
for real-time data stream applications. Continuous queries, crucial for timely
responses, demand low computation time per data element to keep up with the
constant arrival of new data.

2. Approximate Query Answering:


• Challenge: Limited memory may prevent the production of exact answers,
necessitating the use of high-quality approximate answers.
• Issue: Approximation algorithms, such as sketches, random sampling, histograms,
and wavelets, offer solutions for data reduction. However, ensuring the reliability
and accuracy of these approximations becomes a critical concern.

3. Sliding Windows:
• Challenge: Producing approximate answers using sliding windows focuses on
recent data, but defining an appropriate window size poses a challenge.
• Issue: While deterministic and well-defined, determining the optimal window size
and ensuring the relevance of recent data can be challenging. Expressing sliding
windows as part of desired query semantics requires careful consideration.

4. Batch Processing, Sampling, and Synopses:


• Challenge: Slow data structure operations relative to data arrival rates hinder the
ability to provide continually up-to-date answers.
• Issues:
Batch Processing: Computation in batches sacrifices timeliness for accuracy, which
is suitable for bursty data streams. Balancing accuracy and timeliness through
periodic computation poses challenges.
Sampling: Sampling introduces approximation, but for certain queries, reliable
approximation guarantees may be hard to achieve.
Synopsis Data Structures: Designing approximate structures for specific query
classes is challenging but necessary to maintain low computation per data element.
5. Blocking Operations:
• Challenge: Blocking query operators, unable to produce output until the entire
input is seen, pose challenges in the data stream computation model.
• Issue: Dealing with blocking operators is challenging as continuous data streams
may be infinite. Blocking operators may lead to delayed or unreliable results,
requiring effective strategies for handling changes in results over time until all data
is processed. Balancing the efficiency of sorted data with the challenges posed by
blocking operators is a complex aspect of data stream computation.
17.Clarify DGIM for counting ones in a stream with an example.

DGIM (Datar-Gionis-Indyk-Motwani) is an algorithm designed for approximating the number
of 1s in the last N bits of a binary stream (a sliding window). The algorithm works efficiently in
limited memory, using only O(log² N) bits.

The basic idea of DGIM is to summarize the window with a small set of buckets. Each bucket
records the timestamp of its right (most recent) end and the number of 1s it covers, which is
always a power of 2. There are at most one or two buckets of any given size, and bucket sizes
never decrease as we move back in time.

Let's clarify DGIM for counting ones in a stream with an example:

1. The stream and window:
- Suppose the window length is N = 10 and the current timestamp is t = 100, so we are
interested in the bits that arrived at times 91 to 100.

2. Buckets:
- The recent part of the stream is summarized by buckets, each recorded as
(timestamp of right end, size), with sizes that are powers of 2 and at most two buckets of
any size:

(93, 4)   (96, 2)   (98, 2)   (100, 1)

- The bucket (100, 1) covers the most recent 1; (98, 2) and (96, 2) each cover two 1s; and
(93, 4) covers four 1s and may extend partly beyond the left edge of the window.

3. Updating:
- When a new bit arrives, the current timestamp advances; if the bit is 1, a new bucket of
size 1 is created. Whenever three buckets of the same size exist, the two oldest of them are
merged into one bucket of twice the size. Buckets whose right-end timestamp falls outside
the window are dropped.

4. Estimating the count of 1s in the window:
- Add the sizes of all buckets that lie entirely inside the window, plus half the size of the
oldest (possibly partially overlapping) bucket:

Estimated count of 1s in the last N = 10 bits = 1 + 2 + 2 + 4/2 = 7

- The true count is guaranteed to be within 50% of this estimate.


18.Explain Bloom’s filter and FM algo with examples.

Bloom’s Filter:

Bloom's Filter is a probabilistic data structure used to test whether an element is a member of a
set. It supports adding an element to the set and testing membership: a "definitely not a member"
answer is always correct, while a "maybe a member" answer can be a false positive. It is
characterized by two parameters, `m` (the length of the filter) and `k` (the number of different
hash functions).

Bloom Filter Algorithm:


1. Initialization: Bloom filter is a bit array of `m` bits, initially set to 0.
2. Insertion of Element:
- Calculate values of all `k` hash functions for the element.
- Set the bits at the indices obtained from the hash functions to 1.
3. Testing Membership:
- Calculate all `k` hash functions for the element.
- Check bits in all corresponding indices.
- If all bits are set, the answer is "maybe"; if at least one bit isn't set, the answer is
"definitely not."

Example:
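A minimal Python sketch, assuming two toy hash functions:

```python
# Minimal Bloom filter sketch: a bit array of m bits and k simple hash
# functions (illustrative hashes only; real filters use better ones).
m = 10
bits = [0] * m
hash_fns = [lambda x: hash(x) % m, lambda x: hash("salt" + x) % m]

def add(item):
    for h in hash_fns:
        bits[h(item)] = 1

def might_contain(item):
    # True  -> "maybe in the set" (false positives possible)
    # False -> "definitely not in the set"
    return all(bits[h(item)] for h in hash_fns)

add("apple")
add("banana")
print(might_contain("apple"))    # True (maybe)
print(might_contain("grape"))    # likely False (definitely not), unless bits collide
```
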
Properties and Applications:

- False Positives and Negatives:


- False positives are possible (indicating membership when not a member).
- False negatives are not possible (only returns not a member if it's not a member).

- Hash Functions:
- Hash functions should be independent, uniformly distributed, and fast.
- Cryptographic hash functions like SHA1 are not recommended.

- Applications:
- Used in Google BigTable, Apache HBase, and Apache Cassandra to reduce disk
lookups for non-existent rows or columns.
- Applied by Medium to avoid recommending articles a user has previously read.
- Formerly used by Google Chrome to identify malicious URLs.
Flajolet-Martin (FM) Algorithm:

The Flajolet-Martin algorithm is a probabilistic algorithm used for estimating the number
of distinct elements in a stream of data. It's particularly useful when counting exact
distinct elements is impractical for large datasets.

How it Works:
1. Hashing: Use a hash function to map elements to binary strings.
2. Counting Trailing Zeros: Count the number of trailing zeros in each binary string.
3. Estimation: Use the maximum count of trailing zeros to estimate the number of distinct
elements.

Example:
Consider a stream of elements: "apple," "banana," "orange," "apple," "banana," "grape."

1. Hashing to Binary: Hash the elements to binary strings: "001," "010," "011," "001,"
"010," "100."

2. Counting Trailing Zeros: The trailing-zero counts are 0, 1, 0, 0, 1, and 2 respectively
(for example, "100" ends in two zeros).

3. Estimation: The maximum count is 2.

4. Final Estimation: The estimated number of distinct elements is (2^2 = 4).

Properties and Considerations:


- Approximation: The algorithm provides an approximation of the number of distinct
elements.
- Accuracy Improvement: The accuracy of the estimation improves as more elements are
processed.
- Parameter Tuning: Parameters like the hash function used can impact the algorithm's
performance.

Applications:
- Large Datasets: Suitable for scenarios where counting exact distinct elements in a large
dataset is impractical.
- Data Stream Analysis: Useful in streaming applications where data is continuously
arriving.
- Probabilistic Counting: Offers a trade-off between accuracy and computational
efficiency.
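
A small Python sketch of the estimate, assuming a single toy hash function (real implementations
combine many hash functions and take an average or median of the estimates):

```python
# Flajolet-Martin sketch: estimate the number of distinct elements in a stream
# as 2**R, where R is the maximum number of trailing zeros seen in the hashed
# values (single hash function; real uses combine many hash functions).
def trailing_zeros(n, width=32):
    if n == 0:
        return width          # convention: an all-zero hash counts as `width` zeros
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream, hash_fn):
    r = 0
    for element in stream:
        r = max(r, trailing_zeros(hash_fn(element)))
    return 2 ** r

stream = ["apple", "banana", "orange", "apple", "banana", "grape"]
# A stand-in hash; Python's built-in hash() is randomized between runs.
h = lambda s: sum(ord(c) * 31 ** i for i, c in enumerate(s)) % (2 ** 16)
print(fm_estimate(stream, h))   # a power of 2 approximating the 4 distinct elements
```
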
19.Describe the updating bucket of DGIM.

The DGIM algorithm uses buckets to represent the binary stream and efficiently estimate
the count of 1's in a sliding window. The updating bucket mechanism is essential for
managing these buckets and maintaining the accuracy of the estimation.

Bucket Representation:

1. Timestamp and Size:


- Each bucket is associated with:
- The timestamp of its right (most recent) end.
- The number of 1's in the bucket, represented as a power of 2 (size of the bucket).

2. Logarithmic Representation:
- Timestamps are represented modulo N (the length of the window), so they can be encoded
with log₂ N bits.
- The number of 1's in a bucket is represented with log₂ log₂ N bits. This is possible because
the size is a power of 2, so only its logarithm needs to be stored.

Rules for Representing Buckets:

To maintain the integrity of the representation, several rules are followed:

1. Right End of a Bucket: The right end of a bucket always corresponds to a position with
a 1. This ensures that each bucket contains at least one 1.

2. Every 1 in a Bucket: Every position with a 1 in the stream is in some bucket. This rule
ensures that the algorithm keeps track of all 1's in the stream.

3. Non-Overlap: No position in the stream is in more than one bucket. This prevents
double counting of 1's.

4. Bucket Sizes: There are one or two buckets of any given size, up to some maximum
size. This rule controls the size distribution of buckets.

5. Power of 2 Sizes: All sizes of buckets must be a power of 2. This simplifies the
representation and ensures a consistent structure.

6. Non-Decreasing Sizes: Buckets cannot decrease in size as we move to the left (back in
time). This rule ensures that the algorithm maintains a non-decreasing size of buckets as
time progresses.
Example:

Suppose we have a window of length N, and we represent the stream using DGIM buckets.
Each bucket follows the rules mentioned above, with timestamps represented modulo N and
bucket sizes as powers of 2.

- A bucket represents a range of timestamps, and its size indicates the count of 1's in that
range.
- The algorithm periodically merges adjacent buckets with the same size, ensuring a
logarithmic number of buckets.

This updating bucket mechanism allows DGIM to efficiently estimate the count of 1's in
the sliding window with an error of no more than 50%. It provides a memory-efficient
solution for real-time counting of 1's in a binary stream.
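
A hedged Python sketch of the bucket maintenance and the query described above:

```python
# Sketch of DGIM bucket maintenance and querying for a window of size N.
# Each bucket is (timestamp_of_right_end, size); size is always a power of 2.
class DGIM:
    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []                 # newest first

    def add_bit(self, bit):
        self.time += 1
        # Drop buckets that have fallen completely out of the window.
        self.buckets = [(t, s) for (t, s) in self.buckets
                        if t > self.time - self.window]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            # Whenever three buckets share a size, merge the two oldest of them.
            size = 1
            while [s for (_, s) in self.buckets].count(size) == 3:
                idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
                older, oldest = idx[1], idx[2]
                t_new = self.buckets[older][0]      # right end of the newer of the two
                del self.buckets[oldest]
                self.buckets[older] = (t_new, size * 2)
                size *= 2

    def estimate_ones(self):
        if not self.buckets:
            return 0
        total = sum(s for (_, s) in self.buckets[:-1])
        return total + self.buckets[-1][1] // 2     # half of the oldest bucket


dgim = DGIM(window=10)
for b in [1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]:
    dgim.add_bit(b)
print(dgim.estimate_ones())   # approximate count of 1s in the last 10 bits
```
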
20.Explain DSMS (Data Stream Management System).
A Data Stream Management System (DSMS) is a specialized type of database management
system designed to handle continuous streams of data in real-time. Unlike traditional database
management systems (DBMS), which store and manage static data in persistent relations,
DSMS is tailored to handle rapidly changing and often unbounded data streams. DSMS is
particularly well-suited for applications where data arrives continuously and needs to be
analyzed on-the-fly.

Key Characteristics of DSMS:


1. Dynamic and Continuous:
- DSMS is designed to process and analyze data streams continuously and dynamically as
they arrive. It is well-suited for scenarios where data is constantly changing, such as network
monitoring, financial trading, or web clickstreams.
2. Transient Streams: Unlike traditional DBMS that store persistent relations, DSMS deals with
transient streams of data. The focus is on processing the data as it flows rather than storing it for
long-term use.
3. Continuous Queries: DSMS supports continuous queries, allowing users to define queries
that are continuously applied to the incoming data streams. These queries are persistent and are
evaluated in real-time as data flows through the system.
4. Bounded Main Memory: DSMS systems typically operate with a bounded main memory.
They are optimized for efficient processing of data streams within the available memory
constraints.
5. Random Access: While traditional DBMS often supports random access to data, DSMS is
optimized for sequential access due to the nature of continuous data streams.
Applications of DSMS:

1. Network Management and Traffic Engineering: Analyzing streams of measurements and
packet traces to detect anomalies and adjust routing in networks.
2. Telecom Call Data: Processing streams of call records for tasks like fraud detection,
analyzing customer call patterns, and billing.
3. Network Security: Analyzing network packet streams and user session information to
implement URL filtering and detect intrusions, denial-of-service attacks, and viruses.
4. Financial Applications: Analyzing streams of trading data, stock tickers, and news feeds for
tasks like identifying arbitrage opportunities, analytics, and patterns.
5. Web Tracking and Personalization: Analyzing clickstreams, user query streams, and log
records for monitoring, analysis, and personalization on platforms like Yahoo, Google, and
Akamai.
6. Massive Databases: Handling truly massive databases, such as astronomy archives, where
data is streamed once or repeatedly, and queries operate to the best of their ability.
Abstract Architecture for a DSMS
1. Input Buffer: Captures streaming inputs; an optional input monitor collects statistics or drops
data as needed.
2. Working Storage: Temporarily stores recent stream portions and necessary summary data
structures for queries.
3. Memory Management: Varied based on arrival rates, ranging from fast RAM counters to
memory-resident sliding windows.
4. Local Storage: Stores metadata such as foreign key mappings; users update this metadata
directly, and it is used during query processing.
5. Continuous Queries & Execution Plans: Queries registered and converted into execution
plans; similar queries may be grouped for shared processing. Requires buffers, inter-operator
queues, and scheduling for streaming data processing.
6. Query Processor: Communicates with the input monitor; adjusts query plans based on
workload and input rate changes.
7. Results: Streams results to users, alerting applications, or a Storage and Data Warehouse
(SDW) for storage and further analysis.
21.Differentiate between DBMS and DSMS.
22.Explain how to count distinct elements in a string.

The task is to count the number of distinct elements in a data stream, and the traditional
approach of maintaining the set of elements seen may not be feasible due to space
constraints. The Flajolet-Martin approach is introduced to estimate the count in an
unbiased way, even when complete sets cannot be stored. This approach is useful in
scenarios where there is limited space, or when counting multiple sets simultaneously.

Flajolet-Martin (FM) Algorithm:

The Flajolet-Martin algorithm is a probabilistic algorithm used for estimating the number
of distinct elements in a stream of data. It's particularly useful when counting exact
distinct elements is impractical for large datasets.

How it Works:
1. Hashing: Use a hash function to map elements to binary strings.
2. Counting Trailing Zeros: Count the number of trailing zeros in each binary string.
3. Estimation: Use the maximum count of trailing zeros to estimate the number of distinct
elements.

Example:
Consider a stream of elements: "apple," "banana," "orange," "apple," "banana," "grape."

1. Hashing to Binary: Hash the elements to binary strings: "001," "010," "011," "001,"
"010," "100."

2. Counting Trailing Zeros: The trailing-zero counts are 0, 1, 0, 0, 1, and 2 respectively
(for example, "100" ends in two zeros).

3. Estimation: The maximum count is 2.

4. Final Estimation: The estimated number of distinct elements is (2^2 = 4).

Properties and Considerations:


- Approximation: The algorithm provides an approximation of the number of distinct
elements.
- Accuracy Improvement: The accuracy of the estimation improves as more elements are
processed.
- Parameter Tuning: Parameters like the hash function used can impact the algorithm's
performance.
23.For FM data string given as 3,1,4,1,5,9,2,6,5, let the hash function h(x)= 3x+1 mod (5).
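
A worked sketch for this exercise (the treatment of a hash value of 0 is a convention; here it is
simply skipped):

```python
# Worked sketch for Q23: stream 3,1,4,1,5,9,2,6,5 with h(x) = (3x + 1) mod 5.
# Convention assumed here: an all-zero hash value is skipped (conventions vary).
stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]

def trailing_zeros(n):
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

max_r = 0
for x in stream:
    h = (3 * x + 1) % 5
    if h == 0:
        continue                          # h(3) = 10 mod 5 = 0 is skipped
    r = trailing_zeros(h)
    print(x, h, format(h, "03b"), r)      # element, hash, binary, trailing zeros
    max_r = max(max_r, r)

print("estimated distinct elements =", 2 ** max_r)   # 2**2 = 4 (true count is 7)
```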

24.Given x = 1,3,2,1,2,3,4,3,1,2,3,1, f(x) = 6x+1 mod (5) and f(x) = 1,4,2,1,2,4,4,4,4,1,2,4,1,7,
explain the FM algorithm.

25.Explain Bloom's filter:


m=5
h(x) = x mod 5
h2(x) = (2x+3) mod 5
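
A worked sketch using the given parameters; note that the question does not list the stream
elements, so the inserted keys (8 and 17) and the probe key (6) below are purely hypothetical:

```python
# Worked sketch for Q25 with m = 5, h(x) = x mod 5, h2(x) = (2x + 3) mod 5.
# The question does not list the stream elements, so the inserted keys (8, 17)
# and the probe key (6) are hypothetical illustrations only.
m = 5
bits = [0] * m
h1 = lambda x: x % m              # the question's h(x)
h2 = lambda x: (2 * x + 3) % m

for key in (8, 17):               # hypothetical insertions
    bits[h1(key)] = 1             # 8 -> 3, 17 -> 2
    bits[h2(key)] = 1             # 8 -> 4, 17 -> 2
print(bits)                       # [0, 0, 1, 1, 1]

probe = 6                         # hypothetical membership test
print(bool(bits[h1(probe)] and bits[h2(probe)]))   # False: bit 1 (and 0) unset -> definitely not
```
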
26.Discuss Bloom's filter and FM for specific scenarios.

Bloom Filters:

1. Scenario: Set Membership Testing


- Use Case: In applications where there is a need to check whether an element is a
member of a set or not.
- Example: Checking if a given URL is present in a list of malicious websites.

2. Scenario: Reducing Database Lookups


- Use Case: Minimizing the number of expensive database lookups by quickly filtering
out elements that are not in the dataset.
- Example: Determining if a username is available in a large user database before
querying the database.

3. Scenario: Caching and Duplicates Elimination


- Use Case: Avoiding redundant work by quickly checking if an item is already in a
cache or if it's a duplicate.
- Example: Web caching systems can use Bloom Filters to quickly identify whether a
requested webpage is in the cache.

4. Scenario: Load Balancing in Distributed Systems


- Use Case: Distributing data evenly among multiple nodes without requiring
centralized tracking of elements.
- Example: Assigning tasks to distributed nodes based on whether a particular item falls
within their responsibility range.
Flajolet-Martin Algorithm:

1. Scenario: Distinct Element Counting in Data Streams


- Use Case: Counting the number of distinct elements in a continuous data stream
without storing the entire stream.
- Example: Analyzing user clicks on a website to estimate the number of unique
visitors.

2. Scenario: Web Analytics and Clickstream Analysis


- Use Case: Approximating the cardinality of user activities in real-time data streams.
- Example: Counting the number of unique searches made by users within a given time
window.

3. Scenario: Network Monitoring and Anomaly Detection


- Use Case: Detecting unusual patterns or identifying outliers in network traffic.
- Example: Monitoring the flow of distinct IP addresses to identify potential security
threats.

4. Scenario: Social Network Analysis


- Use Case: Estimating the number of unique users or communities in a social network.
- Example: Analyzing interactions between users to understand the reach and diversity
of a social media platform.
Module 5: Real-Time Big Data Models

27.Explain various distance measures with examples.


28.Calculate Euclidean distance for points with attributes:
point    attribute 1    attribute 2
x1           1              2
x2           3              5
x3           2              0
x4           4              5
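
A short Python sketch that computes the pairwise distances for these points, as one way to
check a hand calculation:

```python
# Sketch: pairwise Euclidean distances for the four points listed above.
from math import sqrt
from itertools import combinations

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

for (name_a, a), (name_b, b) in combinations(points.items(), 2):
    d = sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    print(f"d({name_a}, {name_b}) = {d:.3f}")
# d(x1, x2) = 3.606, d(x1, x3) = 2.236, d(x1, x4) = 4.243,
# d(x2, x3) = 5.099, d(x2, x4) = 1.000, d(x3, x4) = 5.385
```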

29.Solve Jaccard distance:


a. 1 2 3 4 and 2 3 5 7
b. aaab aabbc
c. find 110011 and 010101
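
A sketch for parts (a) and (c), interpreting the lists as sets and the bit strings as sets of positions
that hold a 1 (the bag-Jaccard convention needed for part (b) varies between textbooks):

```python
# Sketch: Jaccard similarity/distance for parts (a) and (c) above.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# (a) sets {1,2,3,4} and {2,3,5,7}
sim_a = jaccard({1, 2, 3, 4}, {2, 3, 5, 7})
print(sim_a, 1 - sim_a)        # similarity 2/6 = 0.333..., distance 0.666...

# (c) bit vectors, compared as the sets of positions holding a 1
v1, v2 = "110011", "010101"
s1 = {i for i, bit in enumerate(v1) if bit == "1"}
s2 = {i for i, bit in enumerate(v2) if bit == "1"}
sim_c = jaccard(s1, s2)
print(sim_c, 1 - sim_c)        # similarity 2/5 = 0.4, distance 0.6
```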

33. Solve 11001 and 01011:


a. 1254 2357
b. Compute the cosine distance between d1 and d2:
i. d1 = 5510000137
ii. d2 = 2210012202

30.Solve:
a. d1 = 1112210000
b. d2 = 0111101100
c. d3 = 0111100011
d. d4 = 0102201000

31.Compute the cosine of the angle between the two vectors:
a. vector 1: (3, -1, 2)
b. vector 2: (-2, 3, 1)
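
Assuming the two vectors are (3, -1, 2) and (-2, 3, 1) as reconstructed above, a short check:

```python
# Sketch: cosine of the angle between the two vectors above.
from math import sqrt

v1, v2 = (3, -1, 2), (-2, 3, 1)
dot = sum(a * b for a, b in zip(v1, v2))                  # -6 - 3 + 2 = -7
norm = lambda v: sqrt(sum(a * a for a in v))              # both norms are sqrt(14)
print(dot / (norm(v1) * norm(v2)))                        # -0.5  (angle = 120 degrees)
```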

32.For the graph below, use the betweenness factor to find all communities (shape is a
triangle).
33.Elaborate on the social network graph, clustering algorithm, and mining.

Social Network Graph:


1. Definition:
- A social network graph represents the relationships and interactions between entities in a
social system, typically individuals or organizations.
- Nodes in the graph represent entities (e.g., people), and edges represent connections or
relationships between them.

2. Characteristics:
- Nodes: Individuals or entities in the social network.
- Edges: Connections or relationships between nodes, indicating social interactions.
- Attributes: Additional information associated with nodes or edges, such as user profiles or
interaction strengths.

3. Representation:
- People are represented as nodes, and relationships are represented as edges.
- Allows for the application of mathematical graph theory tools for analysis.
Clustering Algorithm:

1. Definition:
- Clustering algorithms aim to group nodes in a graph into clusters or communities based on
certain criteria, often focusing on maximizing intra-cluster connectivity and minimizing inter-
cluster connectivity.

2. Types of Clustering Algorithms:


- K-means Clustering: Assigns nodes to clusters based on similarity measures.
- Hierarchical Clustering: Builds a tree of clusters, merging nodes progressively based on
similarity.
- Modularity-based Clustering: Maximizes the modularity of network partitions.
- Louvain Method: Optimizes modularity through an iterative process of node movement
between communities.

3. Community Detection:
- Identifying groups of tightly connected nodes within a network, often referred to as
communities.

4. Applications:
- Community detection helps in understanding the structure of social networks, identifying
influential nodes, and improving recommendation systems.
Mining in Social Networks:

1. Definition:
- Mining in social networks involves extracting meaningful patterns, insights, or knowledge
from large volumes of social network data.

2. Mining Techniques:
- Link Prediction: Predicting future connections between nodes in a social network.
- Anomaly Detection: Identifying unusual patterns or outliers in network behavior.
- Influence Analysis: Determining the impact of individuals or groups on the network.
- Sentiment Analysis: Analyzing text data to determine the sentiment expressed by users.

3. Challenges:
- Large-scale Data: Social networks often involve massive amounts of data.
- Dynamic Nature: Social networks evolve over time, requiring algorithms that can adapt to
changes.
- Privacy Concerns: Ensuring the ethical use of data and protecting user privacy.

4. Applications:
- Social network mining is applied in recommendation systems, targeted advertising, fraud
detection, understanding the dynamics of information spread, and more.
34.Explain one algorithm for finding a community in a social graph.

The Clique Percolation Method (CPM) is a community detection algorithm that focuses
on identifying overlapping communities in a graph. The algorithm uses the concept of
cliques, which are subsets of nodes where each node is connected to every other node in
the subset. Here's an explanation of how the Clique Percolation Method works:

Clique Percolation Method (CPM):

1. Input:
- The algorithm takes as input a parameter (k) and a network (graph).

2. Algorithm Steps:

- Step 1: Find Cliques of Size (k): Identify all cliques of size (k) in the given network. A
clique is a complete subgraph where every node is connected to every other node.

- Step 2: Construct Clique Graph:


- Create a clique graph where each node represents a clique of size (k), and there is an
edge between two nodes if the corresponding cliques share (k-1) nodes.

- Step 3: Identify Communities:


- Each connected component in the clique graph forms an overlapping community.
- Two cliques are considered adjacent if they share (k-1) nodes.

4. Advantages:
- CPM allows nodes to belong to multiple communities, capturing the inherent overlap
in real-world networks.
- It provides a flexible approach for detecting communities with varying degrees of
overlap.

5. Applications:
- CPM has been applied in various domains, including social network analysis,
biological network analysis, and citation networks.

6. Limitations:
- The choice of the parameter (k) influences the granularity of the communities, and
there is no universally optimal value.
- The algorithm may not scale well for very large graphs.
Example:
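A small sketch using networkx's clique-percolation implementation, assuming the networkx
package is installed (the example graph is made up):

```python
# Sketch of clique-percolation community detection using networkx's
# k_clique_communities (the example graph is hypothetical).
import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3),          # triangle 1-2-3
    (3, 4),                          # bridge
    (4, 5), (4, 6), (5, 6),          # triangle 4-5-6
])

# k = 3: communities are unions of 3-cliques that share k-1 = 2 nodes.
communities = list(k_clique_communities(G, 3))
print([sorted(c) for c in communities])   # two communities: {1, 2, 3} and {4, 5, 6}
```
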
35.Elaborate on the Girvan-Newman algorithm.

The Girvan-Newman algorithm is a divisive method for detecting communities in a social
network graph. Instead of searching for strongly connected groups directly, it progressively
removes the edges that are most "between" communities, so that the network gradually falls
apart into its communities.

The algorithm relies on edge betweenness: the betweenness of an edge (a, b) is the number of
pairs of nodes x and y such that the edge (a, b) lies on a shortest path between x and y (when
several shortest paths exist, the edge is credited with the fraction of those paths that pass
through it). Edges connecting different communities tend to have high betweenness, because
many shortest paths between the two communities must cross them.

The algorithm proceeds as follows:
1. Compute the betweenness of every edge in the graph (typically with a breadth-first search
from each node).
2. Remove the edge (or edges) with the highest betweenness.
3. Recompute the betweenness of all remaining edges.
4. Repeat steps 2 and 3. Each time the graph splits into more connected components, those
components are candidate communities; the process can be stopped at the desired number of
communities or at the partition with the highest modularity.

The result is a hierarchical decomposition of the network: removing a few edges yields a few
large communities, while removing more edges yields many small ones. The main drawback is
cost, since betweenness must be recomputed after every removal, which is expensive for very
large graphs.
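
A small sketch using networkx's implementation of Girvan-Newman, assuming networkx is
installed (the two-community example graph is made up):

```python
# Sketch using networkx's Girvan-Newman implementation (example graph is made up).
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),      # community 1
    ("D", "E"), ("D", "F"), ("E", "F"),      # community 2
    ("C", "D"),                              # high-betweenness bridge
])

levels = girvan_newman(G)                    # generator of successive splits
first_split = next(levels)                   # communities after removing the bridge
print([sorted(c) for c in first_split])      # two communities: A-B-C and D-E-F
```
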
36.Explain how to extract features from a document in a content-based system. Discuss
document similarity.

To build a content-based recommendation system for textual documents, you typically follow
these steps:

1. Text Preprocessing:
• Remove stop words: Stop words are common words like "and," "the," "of," etc.,
which do not contribute much to the meaning of the document.
• Tokenization: Split the text into individual words or tokens.
• Lowercasing: Convert all words to lowercase to ensure consistency.
• Stemming or Lemmatization: Reduce words to their root or base form to capture
the core meaning.

2. Feature Extraction:
Represent each document as a feature vector. The choice of features depends on the
specific characteristics of your documents.
Common methods include Bag-of-Words (BoW) and Term Frequency-Inverse Document
Frequency (TF-IDF).

3. Similarity Measures: Use similarity measures to compare documents and determine how
closely related they are. For TF-IDF vectors the usual choice is cosine similarity, and the
documents most similar to items the user already liked are recommended first. A short
sketch of these steps is given below.
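
A minimal Python sketch of these steps using scikit-learn is shown below; the three documents
are made up, and TfidfVectorizer handles the tokenization, lowercasing, and stop-word removal
described above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Big data systems store and process huge volumes of data",
    "Hadoop stores and processes big data on a cluster of machines",
    "This recipe uses tomatoes, basil and olive oil",
]

# Represent each document as a TF-IDF feature vector
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between the document vectors
print(cosine_similarity(tfidf).round(2))
# The two big-data documents score far higher with each other
# than either does with the cooking document.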
37.Define the nearest neighbour problem, illustrate how finding plagiarism in a document is
a nearest neighbour problem, and identify similarity measures that can be used.

Nearest Neighbour Problem:


The nearest neighbor problem is a type of pattern recognition and classification problem. Given
a set of data points in a multidimensional space, the goal is to find the data point(s) that are
closest or most similar to a given query point. Nearest neighbor algorithms are commonly used
for tasks such as classification, regression, and anomaly detection.

Illustration: Finding Plagiarism as a Nearest Neighbour Problem:

In the context of finding plagiarism in a document, the nearest neighbor problem arises when
you want to identify whether a given document is similar to any other documents in a corpus.
The idea is to treat each document as a point in a high-dimensional space, where each
dimension corresponds to a feature or characteristic of the document. The task is then to find the
nearest neighbors (documents) to the query document.

Working:
1. Representation:
- Represent each document as a feature vector. This could be done using methods like Bag-of-
Words, TF-IDF, or other vectorization techniques.
2. Distance Metric:
- Define a distance metric or similarity measure to quantify the similarity between two
documents. The smaller the distance, the more similar the documents are.
3. Nearest Neighbor Search:
- Given a query document, find its nearest neighbors in the feature space. These are the
documents that are most similar to the query.
4. Thresholding:
- Set a similarity threshold to determine when a document is considered plagiarized. If the
similarity between the query document and its nearest neighbors exceeds the threshold, it
suggests potential plagiarism.
Similarity Measures for Plagiarism Detection:

Several similarity measures can be employed for identifying plagiarism:

1. Jaccard Similarity:
- Measures the similarity between two sets by calculating the size of their intersection divided
by the size of their union.

2. Cosine Similarity:
- Represents documents as vectors and calculates the cosine of the angle between them. It's
effective for measuring the similarity of documents regardless of their length.

3. Dice Coefficient:
- Similar to Jaccard similarity but gives more weight to the intersection of sets.

4. Euclidean Distance:
- Measures the straight-line distance between two points in a multidimensional space.

5. Hamming Distance:
- Measures the number of positions at which corresponding symbols are different between
two strings of equal length.

6. Levenshtein Distance (Edit Distance):


- Measures the minimum number of single-character edits required to change one string into
the other.
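
A minimal Python sketch of this workflow, with made-up texts and an assumed similarity
threshold, treats each document as a set of word 3-shingles and uses Jaccard similarity for the
nearest-neighbour search:

def shingles(text, k=3):
    # Break a document into overlapping word k-shingles
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Size of the intersection divided by the size of the union
    return len(a & b) / len(a | b) if (a | b) else 0.0

corpus = {
    "doc1": "big data refers to large and complex datasets that need new tools",
    "doc2": "the weather today is sunny with a light breeze in the afternoon",
}
query = "big data refers to very large and complex datasets that need new tools"

q = shingles(query)
scores = {name: jaccard(q, shingles(text)) for name, text in corpus.items()}
nearest = max(scores, key=scores.get)
print(scores, "nearest neighbour:", nearest)

THRESHOLD = 0.4   # assumed cut-off for flagging potential plagiarism
if scores[nearest] >= THRESHOLD:
    print(nearest, "is a likely source of plagiarism")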
38.Explain collaborative filtering.
Collaborative filtering is a technique used in recommendation systems to make predictions
about the interests of a user by collecting preferences from multiple users (or items) and
leveraging the patterns and similarities among them. The main idea is to recommend items that
similar users have liked or that are similar to items the target user has already shown interest in.

There are two primary types of collaborative filtering:

1. User-Based Collaborative Filtering:


- Find Similar Users:
- For a target user (x), identify a set (N) of other users whose preferences are similar to (x)'s
preferences. This set (N) is often referred to as the user's neighborhood.
- Estimate User (x)'s Preferences:
- Predict the target user (x)'s preferences for items based on the preferences of users in the
neighborhood (N).
- Example:
- If User A and User B have similar tastes and User A liked an item that User B has not seen,
it might be recommended to User B.

2. Item-Based Collaborative Filtering:


- Find Similar Items:
- For a target item (i), identify a set (N) of other items that are similar to (i). This set (N)
represents the items that are likely to be preferred by users who liked item (i).
- Estimate User (x)'s Preferences:
- Predict the target user (x)'s preferences for items based on the preferences for similar items
in set (N).
- Example:
- If User A liked Item1 and Item2 is similar to Item1, then Item2 might be recommended to
User A.
Similarity Measures:

Collaborative filtering relies on similarity measures to determine the closeness between users or
items. Common similarity measures include:

1. Jaccard Similarity:
- Measures the size of the intersection of two sets divided by the size of their union.

2. Cosine Similarity:
- Represents users or items as vectors and calculates the cosine of the angle between them.

3. Pearson Correlation:
- Measures the linear correlation between two users or items based on their rating vectors.
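
The user-based variant can be sketched in a few lines of Python; the tiny ratings matrix below
is made up, and unrated items are stored as 0 for simplicity.

import numpy as np

# Rows = users, columns = items; 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],   # user 0 (item 2 is unrated and will be predicted)
    [4, 5, 2, 1],   # user 1, similar tastes to user 0
    [1, 1, 5, 4],   # user 2, different tastes
], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target, item = 0, 2
neighbours = [u for u in range(len(ratings)) if u != target and ratings[u, item] > 0]
sims = {u: cosine(ratings[target], ratings[u]) for u in neighbours}

# Predict as the similarity-weighted average of the neighbours' ratings
prediction = sum(sims[u] * ratings[u, item] for u in neighbours) / sum(sims.values())
print(round(prediction, 2))
# The prediction lands closer to the similar user's rating (2)
# than to the dissimilar user's rating (5).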
39.Define recommendation systems and discuss types with examples.
Recommendation Systems:

A recommendation system, also known as a recommender system or recommendation engine, is
a software application or algorithm designed to provide personalized suggestions or
recommendations to users. The primary goal of a recommendation system is to predict or filter
items that a user may be interested in, based on their preferences, historical behavior, or explicit
input. These systems are widely used in various domains, such as e-commerce, content
streaming, social media, and more, to enhance user experience and engagement.

Types of Recommendation Systems:

1. Collaborative Filtering: Collaborative filtering relies on the preferences and behaviors of a
group of users to make recommendations. It assumes that users who have similar preferences in
the past will continue to have similar preferences in the future.
- Examples:
- User-Based Collaborative Filtering: Recommends items based on the preferences of users
with similar tastes.
- Item-Based Collaborative Filtering: Recommends items similar to those the user has liked
or interacted with.

2. Content-Based Filtering: Content-based filtering recommends items to users based on the
characteristics of the items and the user's preferences. It focuses on the features of items and
the user's profile.
- Examples:
- Recommending movies based on genres, actors, or directors the user has shown interest in.
- Recommending articles based on the topics the user has read.
3. Hybrid Methods: Hybrid recommendation systems combine both collaborative and content-
based approaches to leverage the strengths of each method. This integration aims to improve the
accuracy and address limitations of individual methods.
- Examples:
- Netflix uses a hybrid approach, combining collaborative filtering with content-based
features such as user viewing history and preferences.

4. Knowledge-Based Systems: Knowledge-based systems recommend items to users based on
explicit knowledge about the user's preferences and the characteristics of items. These systems
often use rule-based or expert systems.
Examples: Educational platforms recommending courses based on a user's academic history
and career goals.

5. Context-Aware Recommendation: Context-aware recommendation systems take into account
contextual information, such as location, time, and device, to provide more relevant and timely
recommendations.
Examples: Mobile apps recommending nearby restaurants based on the user's location and
preferences.

6. Matrix Factorization: Matrix factorization techniques decompose the user-item interaction
matrix into latent factors, capturing hidden patterns and relationships in the data to make
personalized recommendations (a short SVD sketch is given just after this list).
Examples: Collaborative filtering algorithms like Singular Value Decomposition (SVD) and
Alternating Least Squares (ALS).

7. Association Rule Mining: Association rule mining identifies patterns and relationships
between different items in a dataset, allowing the system to make recommendations based on
frequent co-occurrences.
Examples: Recommending products frequently purchased together in an e-commerce setting.
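
As noted under point 6 above, the matrix-factorization idea can be sketched with a plain
truncated SVD in Python; the small ratings matrix is made up, and real systems (for example
ALS) handle missing ratings more carefully, so this only illustrates the latent-factor idea.

import numpy as np

# Rows = users, columns = items; 0 means "not rated"
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 1],
    [1, 1, 0, 5],
    [1, 2, 5, 4],
], dtype=float)

# Factorize into latent factors and keep only the top-2 components
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstructed matrix provides predicted scores for the 0 (unrated) cells
print(np.round(R_hat, 2))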
Examples of Recommendation Systems:

1. Amazon:
- Recommends products based on a user's purchase history, browsing behavior, and the
preferences of users with similar tastes.

2. Netflix:
- Uses a combination of collaborative filtering and content-based filtering to recommend
movies and TV shows based on user ratings, viewing history, and content features.

3. Spotify:
- Recommends music based on a user's listening history, favorite genres, and the preferences
of similar users.

4. YouTube:
- Recommends videos based on a user's watch history, search queries, and content features
such as genre and tags.

5. LinkedIn:
- Suggests professional connections based on a user's job history, skills, and connections of
connections.

6. Google:
- Recommends search results, news articles, and ads based on a user's search history, location,
and preferences.
Module 6: Data Visualization
40.Explain handling basic expressions in R, variables in R, working with vectors, storing
and calculating values in R.
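
In outline: R evaluates expressions interactively like a calculator, values are stored in
variables with the assignment operator <-, and the basic data structure is the vector, on which
arithmetic works element by element. The short R session below illustrates each point; the
variable names and values are only examples.

# Basic expressions: R evaluates them and prints the result
1 + 2 * 3          # 7
sqrt(16)           # 4

# Variables: store values with the assignment operator <-
x <- 10
y <- 5
x + y              # 15

# Vectors: created with c(); arithmetic is applied element-wise
prices <- c(10, 20, 30)
quantity <- c(1, 2, 3)
total <- prices * quantity   # 10 40 90

# Storing and calculating values
sum(total)         # 140
mean(prices)       # 20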
41.Explain executing scripts and creating plots.

Executing scripts and creating plots are two important tasks in data analysis and visualization.

Executing scripts refers to the process of running a set of instructions written in a programming
language. These instructions can be anything from simple calculations to complex data
processing tasks. Scripts are typically saved as files with a .py or .R extension, depending on the
programming language used.

Creating plots refers to the process of generating visual representations of data. Plots can be
used to summarize data, identify trends, and communicate insights to others. There are many
different types of plots, including line plots, bar charts, scatter plots, and histograms.

To execute scripts and create plots, you will need a programming language and a plotting
library. Some popular programming languages for data analysis and visualization include
Python and R. Some popular plotting libraries include Matplotlib (Python) and ggplot2 (R).

Here is an example of how to execute a script and create a plot in Python; the code below would
be saved in a file such as plot_script.py and run from the command line with python plot_script.py:

import pandas as pd
import matplotlib.pyplot as plt

# Load data from a CSV file
data = pd.read_csv('data.csv')

# Create a scatter plot of 'x' vs. 'y'
plt.scatter(data['x'], data['y'])

# Save the plot as a PNG image
plt.savefig('plot.png')
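
For comparison, a minimal R version of the same workflow is sketched below; it assumes a file
data.csv with columns x and y and that the ggplot2 package is installed, and the script could be
run from the command line with Rscript.

library(ggplot2)

# Load data from a CSV file
data <- read.csv("data.csv")

# Create a scatter plot of 'x' vs. 'y'
p <- ggplot(data, aes(x = x, y = y)) + geom_point()

# Save the plot as a PNG image
ggsave("plot.png", plot = p)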
42.Explain reading datasets and exporting data from R.
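
R reads external data into data frames and writes them back out with a family of import/export
functions. For text files, read.csv() and read.table() import data and write.csv() and
write.table() export it; packages such as readxl (Excel), haven (SPSS/SAS) and DBI (databases)
cover other formats, and saveRDS()/readRDS() store R objects directly. A minimal sketch with
made-up file names:

# Reading a dataset into a data frame
sales <- read.csv("sales.csv", header = TRUE, stringsAsFactors = FALSE)
head(sales)            # inspect the first rows
str(sales)             # inspect column types

# Exporting data from R
write.csv(sales, "sales_copy.csv", row.names = FALSE)

# Saving and restoring an R object directly
saveRDS(sales, "sales.rds")
sales_again <- readRDS("sales.rds")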
43.Explain features of R.

R is a powerful and widely used programming language and environment for statistical
computing and data analysis. Here are some key features of R:

1. Open Source:
- R is an open-source language, which means that it is freely available for anyone to
use, modify, and distribute. This has contributed to its widespread adoption in academia,
industry, and research.

2. Statistical Computing:
- R was specifically designed for statistical computing and data analysis. It provides a
rich set of statistical and mathematical functions that make it well-suited for a wide range
of statistical tasks.

3. Extensive Libraries and Packages:


- R has a vast collection of packages and libraries contributed by the R community.
These packages cover a broad spectrum of domains, including machine learning, data
visualization, time series analysis, and more. Popular packages include ggplot2, dplyr,
tidyr, caret, and many others.

4. Data Manipulation and Transformation:
- R excels in data manipulation and transformation tasks. The `dplyr` and `tidyr`
packages provide a convenient and expressive syntax for filtering, summarizing, and
transforming data (a short dplyr/ggplot2 sketch is given at the end of this answer).

5. Graphics and Data Visualization:


- R offers powerful tools for creating high-quality visualizations. The base graphics
system provides a flexible way to create a wide range of plots, and the ggplot2 package
offers a layered grammar of graphics for constructing complex and customized
visualizations.

6. Community Support:
- R has a vibrant and active community of users and developers. This community
support is valuable for troubleshooting issues, sharing knowledge, and collaborating on
the development of new packages and functionalities.

7. Reproducibility:
- Reproducibility is a key principle in scientific research, and R provides features to
support this. R scripts can be easily shared, allowing others to reproduce analyses and
results.
8. Integration with Other Languages:
- R can be easily integrated with other programming languages, such as C, C++, and
Java. This flexibility allows users to leverage existing code written in other languages.

9. Cross-Platform Compatibility:
- R is cross-platform and can run on various operating systems, including Windows,
macOS, and Linux. This makes it accessible to a wide range of users regardless of their
preferred operating system.

10. Interactive Environment:


- R provides an interactive environment where users can execute commands and see
the results immediately. This is particularly helpful for exploratory data analysis and
iterative development.

11. Data Import and Export:


- R supports a variety of data import and export formats, including CSV, Excel, SPSS,
SAS, and more. This makes it easy to work with data from different sources.

12. Package Management:


- R's package management system, including CRAN (Comprehensive R Archive
Network), makes it easy to discover, install, and update packages. This facilitates the
extension of R's functionality.
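
As mentioned under point 4, the following is a minimal sketch of R's data-manipulation and
visualization style using dplyr and ggplot2; the small data frame is made up and both packages
are assumed to be installed.

library(dplyr)
library(ggplot2)

# A small made-up data set of exam scores
scores <- data.frame(
  subject = c("Maths", "Maths", "Physics", "Physics"),
  marks   = c(78, 85, 62, 70)
)

# Data manipulation with dplyr: group and summarise
summary_tbl <- scores %>%
  group_by(subject) %>%
  summarise(avg_marks = mean(marks))
print(summary_tbl)

# Data visualization with ggplot2: a bar chart of the averages
p <- ggplot(summary_tbl, aes(x = subject, y = avg_marks)) +
  geom_col()
ggsave("avg_marks.png", plot = p)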
