
Machine Learning for Data Science

Unit-2
Graphs

Definition:
A graph is a mathematical structure consisting of nodes (or vertices) connected by edges. Graphs are used to
model relationships and connections in various fields such as social networks, transportation systems, and
computer networks.

Types of Graphs:

1. Based on Edges:
o Undirected Graph: Edges have no direction. Example: Social networks where friendships are
mutual.
o Directed Graph (Digraph): Edges have directions. Example: Webpages linked by hyperlinks.
2. Based on Weight:
o Weighted Graph: Edges have weights representing costs, distances, or capacities. Example:
Road networks with distances.
o Unweighted Graph: Edges have no weights.
3. Special Graphs:
o Tree: A connected acyclic graph.
o Bipartite Graph: Vertices can be divided into two sets, and edges only connect vertices from
different sets.
o Complete Graph: Every vertex is connected to every other vertex.
o Sparse and Dense Graphs: Based on the number of edges relative to vertices.

Representation of Graphs:

1. Adjacency Matrix:
o A 2D array where cell G[i][j] indicates the presence (and weight) of an edge between vertices i and j.
o Space Complexity: O(V²), where V is the number of vertices.
2. Adjacency List:
o Each vertex stores a list of its adjacent vertices.
o Space Complexity: O(V + E), where E is the number of edges.
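For illustration, a minimal Python sketch of both representations for a small undirected graph (the variable names are illustrative, not from any particular library):

# Undirected graph with 4 vertices (0..3) and edges (0,1), (0,2), (1,2), (2,3)
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
V = 4

# Adjacency matrix: O(V²) space, O(1) edge lookup
matrix = [[0] * V for _ in range(V)]
for u, v in edges:
    matrix[u][v] = 1
    matrix[v][u] = 1  # mirror the entry because the graph is undirected

# Adjacency list: O(V + E) space, iterates only over actual neighbours
adj = {u: [] for u in range(V)}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

print(matrix[0][2])  # 1 -> edge exists
print(adj[2])        # [0, 1, 3]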

Graph Traversal Algorithms:

1. Depth-First Search (DFS):
o Explores as far as possible along a branch before backtracking.
o Applications: Pathfinding, cycle detection.
2. Breadth-First Search (BFS):
o Explores all neighbors of a vertex before moving deeper.
o Applications: Shortest path in unweighted graphs, level-order traversal.
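A compact sketch of both traversals over the adjacency-list representation above (assuming adj maps each vertex to its list of neighbours):

from collections import deque

def dfs(adj, start):
    # Iterative depth-first search; returns vertices in visit order.
    visited, order, stack = set(), [], [start]
    while stack:
        u = stack.pop()
        if u not in visited:
            visited.add(u)
            order.append(u)
            stack.extend(v for v in adj[u] if v not in visited)
    return order

def bfs(adj, start):
    # Breadth-first search; visits all neighbours before going deeper.
    visited, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return order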

Applications of Graphs:

1. Computer Networks:
o Represent network topologies, routing protocols (e.g., Dijkstra's shortest path).
2. Social Networks:
o Model friendships, followers, or interactions. Algorithms analyze connectivity and communities.
3. Transportation Systems:
o Represent cities as vertices and roads as edges, finding optimal routes (e.g., A* algorithm).
4. Machine Learning:
o Used in Graph Neural Networks (GNNs) for recommendations and predictions.
5. Biology:
o Represent gene or protein interactions.
6. Web Search:
o PageRank algorithm models the web as a graph of pages and hyperlinks.

Conclusion:
Graphs are versatile structures that provide a foundation for solving real-world problems involving
relationships and connections. Their efficient representations and algorithms make them essential in computer
science, data analysis, and artificial intelligence.

Maps and Map Searching

Definition:
A map in computer science is a data structure or a conceptual representation used to associate keys with values.
In a broader sense, maps also refer to geographical maps, represented digitally for navigation and location-
based searches. Map searching involves querying these structures or representations for specific information
efficiently.

Types of Maps:

1. Geographical Maps:
o Represent locations, routes, and regions, often as graphs where nodes are locations and edges are
routes with weights like distance or time.
o Example: Google Maps.
2. Data Structure Maps:
o Implemented using hash tables, binary search trees, or specialized structures like tries.
o Example: Python’s dictionary, Java’s HashMap.

Map Searching Techniques for Geographical Maps:

1. Graph-Based Searching:
o Represent locations as graph nodes and paths as edges.
o Algorithms for map searching:
▪ Dijkstra’s Algorithm: Finds the shortest path from a source to all nodes in a weighted graph with non-negative edge weights (a sketch follows this list).
▪ A* Algorithm: An optimized shortest-path search that guides Dijkstra-style exploration using heuristics such as Euclidean or Manhattan distance.
▪ Bellman-Ford Algorithm: Handles graphs with negative edge weights.
2. Spatial Indexing:
o Structures like Quadtrees, R-Trees, or KD-Trees index spatial data for efficient range queries
and nearest-neighbor searches.
o Example: Finding all restaurants within 5 km.
3. Geocoding and Reverse Geocoding:
o Geocoding: Converts addresses into geographic coordinates.
o Reverse Geocoding: Converts geographic coordinates into addresses.
4. Point of Interest (POI) Search:
o Uses keywords or categories to search for locations of interest on a map.
o Example: Searching for “gas stations near me.”
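As referenced above, a minimal sketch of Dijkstra's algorithm using a binary heap; the road network shown is an illustrative dict mapping each node to (neighbour, weight) pairs:

import heapq

def dijkstra(graph, source):
    # Shortest distances from source; assumes non-negative edge weights.
    dist = {source: 0}
    heap = [(0, source)]  # (distance so far, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

roads = {"A": [("B", 4), ("C", 1)], "C": [("B", 2)], "B": []}
print(dijkstra(roads, "A"))  # {'A': 0, 'C': 1, 'B': 3}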

Map Searching Techniques for Data Structure Maps:

1. Hash Maps:
o Provide O(1) average-time complexity for searches.
o Example: Searching for a user’s details using their ID as the key.
2. Tree-Based Maps:
o Implemented using binary search trees, red-black trees, or AVL trees.
o Provide O(log n) search time in balanced trees.
o Example: C++ std::map.
3. Tries (Prefix Trees):
o Efficient for string searches, such as autocomplete in search engines.
o Example: Searching for a city name prefix like “San” (see the sketch after this list).
4. Sorted Maps:
o Support ordered traversal and range queries.
o Example: Java’s TreeMap.
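As referenced above, a minimal trie sketch for prefix search (the city names are illustrative):

class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def starts_with(self, prefix):
        # Return all stored words beginning with the given prefix.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        def collect(n, path):
            if n.is_word:
                results.append(prefix + path)
            for ch, child in n.children.items():
                collect(child, path + ch)
        collect(node, "")
        return results

t = Trie()
for city in ["San Diego", "San Jose", "Santiago", "Boston"]:
    t.insert(city)
print(t.starts_with("San"))  # ['San Diego', 'San Jose', 'Santiago']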

Applications of Map Searching:

1. Navigation Systems:
o GPS-based services like Google Maps and Waze use map searching to find optimal routes and
nearby facilities.
2. Search Engines:
o Indexing and searching massive datasets use hash maps and tries for speed and efficiency.
3. Geospatial Analysis:
o Applications in urban planning, disaster management, and logistics use map searching to analyze
spatial data.
4. E-commerce:
o Matches customer addresses to delivery zones using geocoding.
5. Games:
o AI in games searches maps for paths or explores territories.

Conclusion:
Maps and map searching are integral to solving real-world problems involving spatial and key-value
associations. Advanced algorithms and data structures ensure efficient searching, enabling their applications in
navigation, AI, e-commerce, and beyond.

Application of Algorithm: Stable Marriages Problem

Introduction:
The Stable Marriages Problem (also called the Stable Matching Problem) involves finding a stable matching
between two equally sized sets, typically referred to as "men" and "women," based on their preferences for
members of the opposite set. The classical solution, the Gale-Shapley algorithm, ensures that no pair of
individuals would both rather be with each other than with their current partners, avoiding any "instability."

Problem Definition:

• Input:
o Two sets, M (men) and W (women), each containing n elements.
o Each man ranks all women in order of preference, and each woman ranks all men.
• Output:
o A stable matching, where each man and woman are paired based on mutual preference and no
one would prefer to switch partners with someone else.

Concept of Stability:
A matching is stable if there is no pair of man and woman who:

1. Are not matched to each other,
2. Prefer each other over their current partners, and
3. Would therefore break their current matches and form a new pair.

In simpler terms, there must be no "blocking pairs."


The Gale-Shapley Algorithm (Deferred Acceptance Algorithm):
This is the classical algorithm used to solve the Stable Marriages Problem. The algorithm proceeds as follows:

1. Initialization:
o Each man proposes to his top-choice woman.
o Each woman reviews the proposals she receives and tentatively accepts the one she prefers most, rejecting the rest. Acceptance is "deferred": if a better proposal arrives later, she drops her current partner in favour of it.
2. Iterative Proposals:
o Each rejected man proposes to the next woman on his preference list.
o The process repeats until every man is matched.
3. Termination:
o The algorithm ends when every man is matched with a woman, and the resulting matching is stable (a sketch follows).
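A minimal Python sketch of this deferred-acceptance procedure (the preference lists at the bottom are illustrative; each man's list is ordered best-first, and rank tables make each woman's comparison O(1)):

def gale_shapley(men_prefs, women_prefs):
    # Returns a man-optimal stable matching as a {man: woman} dict.
    # rank[w][m] = position of m in w's preference list (lower = better)
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # next woman each man proposes to
    engaged = {}                             # woman -> man
    free_men = list(men_prefs)
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if w not in engaged:
            engaged[w] = m                   # w tentatively accepts
        elif rank[w][m] < rank[w][engaged[w]]:
            free_men.append(engaged[w])      # w trades up; her old partner is freed
            engaged[w] = m
        else:
            free_men.append(m)               # w rejects; m proposes again later
    return {m: w for w, m in engaged.items()}

men = {"a": ["x", "y"], "b": ["y", "x"]}
women = {"x": ["b", "a"], "y": ["a", "b"]}
print(gale_shapley(men, women))  # a pairs with x, b pairs with y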

Time Complexity:
The Gale-Shapley algorithm runs in O(n²) time, where n is the number of men (or women). This is because each
man proposes to each woman at most once, giving at most n proposals for each of the n men.

Applications of Stable Marriages Algorithm:

1. Marriage Market Simulation:
o Originally formulated to model marriages, it provides a theoretical framework for stable
matching in real-world scenarios where preferences matter, such as arranged marriages or
partnership allocations.
2. College Admissions and Job Matching:
o College Admissions: Universities and students are matched based on mutual preferences. For
example, the algorithm can be applied to match students to universities, ensuring no student or
university prefers another match over the one they’ve been given.
o Job Matching: Used to match job seekers with employers, such as matching medical residents
to hospitals in residency programs or matching internship applicants to companies.
3. Kidney Exchange Programs:
o In kidney transplant programs, patients in need of a kidney may be matched with donors based
on preferences and compatibility. The Gale-Shapley algorithm helps in creating stable, mutually
beneficial exchanges between donors and recipients, minimizing the possibility of individuals
finding better matches.
4. Stable Roommates Problem:
o A variant of the stable marriages problem, where the goal is to match people to roommates based
on preferences, ensuring no two individuals would prefer each other over their current matches.
5. Organ Donation Networks:
o The algorithm has been applied in systems for matching organ donors and recipients, where the
goal is to ensure the most stable and beneficial matching based on compatibility and preference
criteria.
6. Online Dating Systems:
o Online dating platforms use variations of the Gale-Shapley algorithm to match users based on
mutual preferences, ensuring that the pairings are stable and no two users would prefer to switch
partners.

Advantages of the Gale-Shapley Algorithm:

1. Guaranteed Stability:
o The algorithm always produces a stable matching, preventing any potential "instability" or
reallocation of partners.
2. Fairness in Matching:
o It ensures that no individual can do better by pairing with someone else, improving the quality of
the solution.
3. Efficient and Scalable:
o The algorithm runs efficiently in O(n²) time, making it feasible for practical applications
with large datasets.

Limitations:

1. Preference Bias:
o The Gale-Shapley algorithm guarantees a stable matching, but the stability may be biased
towards the group that proposes (usually the men in the standard algorithm), leading to an
outcome where men are happier on average than women.
2. Not Always Optimal for Both Sides:
o The matching produced is optimal for the proposing side (men in the classic version) but not
necessarily for the receiving side (women).

Conclusion:
The Stable Marriages Problem and its solution via the Gale-Shapley algorithm have broad applications beyond
marriage, particularly in scenarios where optimal pairings are essential, such as job matching, school
admissions, and organ donation. The algorithm's simplicity and guaranteed stability make it a foundational
method for matching problems in various fields.

Dictionaries and Hashing

Introduction: Dictionaries and hashing are fundamental concepts in computer science used to store and
retrieve data efficiently. A dictionary is a data structure that stores key-value pairs, while hashing is a
technique used to map these keys to specific locations in memory, facilitating fast access.

1. Dictionaries:

A dictionary (also called a map or associative array) is a collection of key-value pairs. Each key is unique, and
the dictionary allows you to store, retrieve, and modify values based on these keys.

• Keys are unique identifiers (e.g., strings, numbers), and values are the associated data.
• Operations on dictionaries typically include:
o Insertion: Adding a new key-value pair.
o Lookup: Finding the value associated with a given key.
o Deletion: Removing a key-value pair.
o Update: Modifying the value of an existing key.

Example:
In Python, a dictionary can be implemented as:

my_dict = {"apple": 3, "banana": 5, "cherry": 7}

In this example, "apple", "banana", and "cherry" are keys, and 3, 5, and 7 are their corresponding values.
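The four basic operations listed earlier, continuing the same example:

my_dict["date"] = 2        # insertion: add a new key-value pair
count = my_dict["banana"]  # lookup: returns 5
my_dict["apple"] = 4       # update: overwrite an existing key's value
del my_dict["cherry"]      # deletion: remove the key-value pair
print(my_dict)             # {'apple': 4, 'banana': 5, 'date': 2}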

2. Hashing:

Hashing is a technique that maps keys to positions in a hash table using a hash function. This process allows
for efficient data retrieval. A hash function takes an input (key) and computes an index (hash value) that
determines where to store the associated value.

• The hash table is an array-like structure that stores the values. The hash function computes an index
that directly corresponds to an array slot, where the value is stored.
• Properties of a Good Hash Function:
o Deterministic: The same input should always result in the same hash value.
o Uniform Distribution: The hash function should distribute keys evenly across the hash table to
minimize collisions.
o Efficient: The function should compute the hash value quickly.

3. Hash Table and Collisions:

A hash table is a data structure that implements a dictionary by using hashing to store and retrieve values.
However, collisions occur when two different keys hash to the same index in the hash table. To handle
collisions, several strategies are used:

1. Chaining:
o In chaining, each table index points to a linked list or another data structure to store all the
values that hash to the same index.
o Example: If two keys hash to the same index, they are stored in a linked list at that index.
2. Open Addressing:
o In open addressing, when a collision occurs, the hash table searches for the next available slot in
the table (usually following a specific probing strategy like linear probing, quadratic probing, or
double hashing).
o Example: If the slot at index 5 is occupied, the algorithm may look at index 6, 7, or use a
specific probing sequence until an empty slot is found.

4. Operations on Hash Tables:

1. Insertion:
o Compute the hash of the key and place the key-value pair at the corresponding index in the hash
table.
o In case of a collision, the chosen collision resolution strategy is applied.
2. Lookup (Search):
o Compute the hash of the key and directly check the corresponding index in the table. In case of
collisions, follow the collision resolution procedure.
3. Deletion:
o Compute the hash of the key and remove the key-value pair at the corresponding index. If there
are collisions, handle it using the same resolution method.
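To make these operations concrete, a minimal hash table using the chaining strategy described above (the bucket count is illustrative, and Python's built-in hash is used for brevity):

class ChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]  # each slot holds a chain

    def _index(self, key):
        return hash(key) % len(self.buckets)  # hash function -> table slot

    def insert(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # key already present: update
                return
        chain.append((key, value))       # collision: append to the chain

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
        raise KeyError(key)

table = ChainedHashTable()
table.insert("apple", 3)
table.insert("banana", 5)
print(table.lookup("apple"))  # 3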

5. Time Complexity:

• Average Case:
o Insertion, Lookup, Deletion: O(1) if the hash table is well-designed and collisions are minimal.
• Worst Case (for Chaining):
o If many collisions occur (e.g., a poor hash function or a high load factor), the time complexity may
degrade to O(n), where n is the number of keys.
• Worst Case (for Open Addressing):
o With poor collision resolution or a high load factor, the time complexity can likewise degrade to O(n),
where n is the number of keys.

6. Applications of Hashing and Dictionaries:

1. Database Indexing:
o Hash tables are used to quickly find records in a database based on key values (e.g., a customer
ID).
2. Caching:
o Hash maps or dictionaries are used to store and retrieve previously computed results, speeding
up future lookups (e.g., memoization in dynamic programming).
3. Symbol Tables in Compilers:
o Compilers use hash tables to store variables, functions, and objects for fast lookup during
compilation.
4. Cryptography:
o Hash functions are widely used in cryptography to generate hash values for message verification,
digital signatures, and password storage.
5. Data Deduplication:
o Hashing is used to detect and remove duplicate data by comparing hash values of data blocks or
files.
6. Unique Identification:
o Hash tables can be used for managing unique keys, such as in the case of social media platforms,
where each user is associated with a unique user ID.

Conclusion:

Dictionaries and hashing are essential concepts in computer science, enabling fast and efficient data storage and
retrieval. Hashing allows for optimal key-value mappings with minimal collisions, while dictionaries provide a
versatile way to manage and access data. Together, these techniques underpin many real-world applications in
databases, caching, cryptography, and more.

Search Trees

Introduction: A search tree is a data structure used to store and organize data in a way that allows for efficient
searching, insertion, and deletion operations. These trees follow a specific order or property that ensures
optimal performance for search-related operations.

1. Binary Search Tree (BST):

A Binary Search Tree (BST) is a type of search tree where each node has at most two children, often referred
to as the left and right child. The nodes in a BST are organized in such a way that:

• For each node: The left child has a key less than the parent node, and the right child has a key greater
than the parent node.
• Inorder Traversal: When traversing the tree in an inorder manner (left subtree, root, right subtree), the
keys appear in ascending order.

Operations:

• Search: Start at the root and compare the key with the current node. If the key is less, move to the left
child; if the key is greater, move to the right child.
• Insertion: Insert a new key while maintaining the BST property (left < parent < right).
• Deletion: Removing a node requires handling three cases: node with no children, node with one child,
and node with two children.
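A minimal BST sketch covering search, insertion, and inorder traversal (deletion is omitted for brevity):

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # Insert key, preserving left < parent < right; returns the (new) root.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicate keys are ignored

def search(root, key):
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root  # None if the key is absent

def inorder(root):
    # Yields keys in ascending order.
    if root is not None:
        yield from inorder(root.left)
        yield root.key
        yield from inorder(root.right)

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
print(list(inorder(root)))          # [1, 3, 6, 8, 10]
print(search(root, 6) is not None)  # True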

Time Complexity (Average Case):

• Search, Insert, Delete: O(log n) for a balanced tree, where n is the number of nodes.

Time Complexity (Worst Case):

• O(n) in the case of an unbalanced tree (e.g., if the tree degenerates into a linked list).

2. AVL Tree:

An AVL Tree is a self-balancing binary search tree, where the height of the two child subtrees of every node
differs by no more than one. If at any point the balance factor (difference in heights of left and right subtrees)
exceeds one, a rotation is performed to balance the tree.

Operations:
• Insertion: Insert a node as in a normal BST, followed by rebalancing the tree if necessary.
• Deletion: Remove the node and rebalance the tree if required.

Time Complexity:

• Search, Insert, Delete: O(log n) in both average and worst cases, due to the self-balancing nature of the
tree.

3. Red-Black Tree:

A Red-Black Tree is another self-balancing binary search tree with extra color attributes (either red or black)
assigned to each node. It satisfies five properties that ensure the tree remains balanced:

• Every node is either red or black.
• The root is always black.
• Red nodes cannot have red children.
• Every path from a node to its descendant NIL leaves contains the same number of black nodes.
• Every leaf node (NIL) is black.

Operations:

• Insertion: Involves inserting a node as in a regular BST, followed by fixing any violations of the red-
black properties using rotations and color changes.
• Deletion: Similar to insertion, deletion involves removing a node and adjusting the tree to maintain red-
black properties.

Time Complexity:

• Search, Insert, Delete: O(log n) for all operations, thanks to the balancing properties of the red-black
tree.

4. B-Tree:

A B-Tree is a self-balancing search tree that maintains sorted data and allows searches, insertions, deletions,
and updates in logarithmic time. It is commonly used in databases and file systems due to its ability to
efficiently handle large amounts of data on disk.

Properties:

• A B-tree is a generalization of a binary search tree in which each node can have more than two children.
• Keys within each node are kept sorted, and each key separates the ranges of keys stored in the node's children.

Operations:

• Search: Traverse the tree from the root to a leaf, choosing the appropriate child node based on
comparisons.
• Insertion: Insert a new key into the appropriate node. If a node overflows (exceeds its maximum
number of keys), it splits into two nodes.
• Deletion: Remove a key, and if necessary, rebalance the tree by merging or redistributing keys between
nodes.

Time Complexity:

• Search, Insert, Delete: O(log n) for all operations, where n is the number of keys in the tree.

5. Splay Tree:

A Splay Tree is a self-adjusting binary search tree where recently accessed elements are moved to the root
using tree rotations. This operation is known as splaying and helps in optimizing access patterns for frequently
accessed elements.

Operations:

• Search: After searching for an element, it is splayed to the root.
• Insertion: Similar to a regular BST, but after insertion, the newly inserted node is splayed to the root.
• Deletion: After deleting a node, the tree is restructured by splaying a nearby node to the root.

Time Complexity:

• Search, Insert, Delete: O(log n) amortized time, meaning that over a sequence of operations, the
average time per operation is logarithmic, but individual operations may take longer in the worst case.

6. Applications of Search Trees:

1. Database Indexing:
o B-Trees are heavily used in databases for indexing purposes, allowing fast search, insertion, and
deletion of records.
2. Memory Management:
o Search trees such as AVL or Red-Black trees are used in memory management systems for
efficient allocation and deallocation of memory blocks.
3. File Systems:
o B-Trees and B+ Trees are widely used in file systems to store and retrieve file metadata
efficiently.
4. Autocompletion Systems:
o Tries (a type of tree) and other search trees are used in autocompletion and spell-checking
applications.
5. Routing Tables:
o Search trees, such as radix trees, are used in networking for routing table management.
6. Symbol Tables in Compilers:
o Red-Black trees and AVL trees are used in compilers to maintain symbol tables for efficient
variable lookup.

Conclusion:

Search trees are essential data structures for efficiently storing, searching, and manipulating data. From basic
Binary Search Trees to more advanced self-balancing trees like AVL and Red-Black trees, these structures are
fundamental in real-time systems such as databases, memory management, and file systems. Their applications
span a wide range of domains, making them crucial in both theoretical and practical computer science.

Dynamic Programming (DP)

Introduction: Dynamic Programming (DP) is a powerful algorithmic technique used for solving problems that
can be broken down into smaller overlapping subproblems. The core idea behind DP is to solve each
subproblem only once and store the result, eliminating the need for recomputation. It is widely used for
optimization problems and plays a key role in fields such as operations research, bioinformatics, and computer
science.

1. Characteristics of DP Problems:

To apply dynamic programming, the problem must have two key properties:
• Overlapping Subproblems:
o The problem can be divided into smaller subproblems, and these subproblems recur multiple
times. For example, in the Fibonacci sequence, the calculation of Fibonacci(n) involves repeated
computation of Fibonacci(n-1) and Fibonacci(n-2), which can be avoided by storing intermediate
results.
• Optimal Substructure:
o The solution to the problem can be constructed efficiently from the solutions to its subproblems.
That is, the optimal solution to the problem can be derived by combining the optimal solutions
of smaller subproblems.

2. Key Steps in Dynamic Programming:

DP generally follows a bottom-up approach where solutions to smaller subproblems are built up to solve larger
problems. The key steps in applying DP are:

1. Define the State (Subproblem):
o Clearly define what constitutes a "subproblem" and how it relates to the original problem.
o Example: In the Fibonacci sequence, the subproblem is computing the nth Fibonacci number.
2. State Transition (Recurrence Relation):
o Establish a recurrence relation or formula that defines how to compute the solution to the current
subproblem based on smaller subproblems.
o Example: F(n) = F(n−1) + F(n−2) for the Fibonacci sequence.
3. Memoization or Tabulation:
o Memoization: A top-down approach where you solve the problem recursively and store the
results of subproblems in a cache to avoid redundant calculations.
o Tabulation: A bottom-up approach where you solve all the subproblems iteratively and store
their results in a table (usually a 1D or 2D array).
4. Compute the Final Solution:
o Once all the subproblems are solved, the final solution can be obtained either directly from the
stored results or by combining the solutions of subproblems.

3. Types of Dynamic Programming:

1. Top-Down Approach (Memoization):
o In this approach, the problem is solved by starting with the original problem and recursively
solving smaller subproblems. The results are cached in a table (memoized) to prevent redundant
calculations (a Fibonacci sketch contrasting both approaches follows this list).
o Time Complexity: O(n) for problems with overlapping subproblems, where n is the number of
subproblems.
o Space Complexity: O(n) due to the recursion stack and memoization table.
2. Bottom-Up Approach (Tabulation):
o This approach solves the problem by solving all possible subproblems in a systematic manner,
starting from the smallest subproblem and working towards the original problem.
o Time Complexity: O(n), as every subproblem is solved once.
o Space Complexity: O(n), depending on the storage of the results.
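As referenced above, both approaches applied to the Fibonacci example:

from functools import lru_cache

# Top-down (memoization): recursive, with results cached automatically
@lru_cache(maxsize=None)
def fib_memo(n):
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

# Bottom-up (tabulation): iterative, table built from the smallest subproblem up
def fib_tab(n):
    if n < 2:
        return n
    table = [0] * (n + 1)
    table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_memo(30), fib_tab(30))  # 832040 832040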

4. Common DP Problems and Applications:

1. Fibonacci Sequence:
o The problem involves finding the nth Fibonacci number, where each number is the sum of the
previous two. Using DP, we can reduce the time complexity from exponential O(2^n) to
linear O(n).
2. Knapsack Problem:
o Given a set of items with weights and values, and a knapsack with a weight limit, the objective is
to find the maximum value that can be obtained by selecting a subset of items without exceeding
the weight limit.
o Time Complexity: O(nW), where n is the number of items and W is the maximum weight
capacity.
3. Longest Common Subsequence (LCS):
o The problem involves finding the longest subsequence that is common to two given strings. The
LCS problem has applications in DNA sequence comparison and file comparison.
o Time Complexity: O(mn), where m and n are the lengths of the two strings.
4. Matrix Chain Multiplication:
o The problem involves determining the most efficient way to multiply a chain of matrices. This
can be solved using DP by breaking the problem into smaller subproblems of multiplying
smaller matrix chains.
o Time Complexity: O(n³), where n is the number of matrices.
5. Coin Change Problem:
o Given a set of coin denominations, the objective is to find the minimum number of coins
required to make a certain amount. DP helps solve this by building solutions for smaller
amounts (see the sketch after this list).
o Time Complexity: O(nS), where n is the number of denominations and S is the target
amount.
6. Edit Distance (Levenshtein Distance):
o This problem involves finding the minimum number of operations (insertions, deletions, or
substitutions) required to transform one string into another. It is commonly used in text
processing, spell checking, and bioinformatics.
o Time Complexity: O(mn), where m and n are the lengths of the two strings.
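As referenced above, a tabulation sketch for the coin change problem (the denominations and target amount are illustrative):

def min_coins(coins, amount):
    # Minimum coins summing to `amount`, or -1 if impossible; O(nS) time.
    INF = float("inf")
    dp = [0] + [INF] * amount  # dp[s] = fewest coins that sum to s
    for s in range(1, amount + 1):
        for c in coins:
            if c <= s and dp[s - c] + 1 < dp[s]:
                dp[s] = dp[s - c] + 1  # use coin c on top of the best for s - c
    return dp[amount] if dp[amount] != INF else -1

print(min_coins([1, 5, 10, 25], 63))  # 6 -> 25 + 25 + 10 + 1 + 1 + 1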

5. Advantages of Dynamic Programming:

1. Optimal Solutions:
o DP guarantees an optimal solution by solving subproblems and combining them effectively,
ensuring the overall solution is optimal.
2. Efficient Computation:
o By storing intermediate results, DP avoids redundant calculations, leading to significant
reductions in time complexity.
3. Versatility:
o DP can be applied to a wide range of optimization problems, including scheduling, partitioning,
sequence alignment, and resource allocation.

6. Disadvantages of Dynamic Programming:

1. Space Complexity:
o DP can require significant memory, especially for problems with large input sizes. The storage
of subproblem solutions might become infeasible in some cases.
2. Overhead in Some Cases:
o While DP is efficient for many problems, it can sometimes introduce overhead due to recursion
or the need to store a large number of subproblem results, making it less effective for simpler
problems.

Conclusion:

Dynamic Programming is a key algorithmic technique used to solve complex problems by breaking them down
into simpler overlapping subproblems. It is used widely in various fields such as optimization, bioinformatics,
artificial intelligence, and operations research. By leveraging the principles of memoization and tabulation, DP
ensures that redundant calculations are avoided, leading to efficient and optimal solutions for a broad class of
problems.
