Notes ML for Data science
Benefits of Randomization:
Performance Improvement: Randomization can reduce the average-case complexity, leading to faster
solutions.
Simplicity and Elegance: Randomized algorithms can be simpler and more straightforward to implement
than deterministic ones.
Handling Adversarial Inputs: Randomized algorithms are effective against inputs crafted to worsen the
performance of deterministic algorithms.
Probabilistic Guarantees: Many randomized algorithms offer high-probability guarantees for correctness and performance, even though a single run may occasionally be slower than expected or return an incorrect result.
Randomized QuickSort is a variation of the QuickSort algorithm where the pivot element is chosen
randomly instead of picking a fixed position (like the first or last element). Choosing a random pivot
minimizes the likelihood of consistently encountering the worst-case partition, which can occur with
certain input sequences, improving the average-case performance.
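A minimal Python sketch of the idea, using list comprehensions for the partition step rather than the usual in-place scheme, purely for readability:

import random

def randomized_quicksort(arr):
    # Base case: arrays of length 0 or 1 are already sorted.
    if len(arr) <= 1:
        return arr
    # Choose the pivot uniformly at random to avoid consistently bad splits.
    pivot = arr[random.randrange(len(arr))]
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]
    return randomized_quicksort(less) + equal + randomized_quicksort(greater)

print(randomized_quicksort([7, 2, 9, 4, 7, 1]))  # [1, 2, 4, 7, 7, 9]

Because the pivot index is drawn uniformly at random, no fixed input ordering can force the quadratic worst case on every run, and the expected running time is O(n log n).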
Divide: Split the problem into smaller subproblems of the same type.
Conquer: Solve each subproblem recursively; subproblems that are small enough are solved directly.
Combine: Combine the solutions of the subproblems to form the solution to the original problem.
This technique is efficient for a range of problems because it reduces the size of the input with each
recursive call, leading to faster solutions compared to iterative or straightforward approaches.
Parallelism: Subproblems can often be solved independently, making Divide and Conquer well-suited for
parallel processing.
Simplification: Complex problems can be more manageable when broken down into smaller
subproblems.
Merge Sort applies Divide and Conquer as follows:
Divide: Split the array into two halves.
Conquer: Recursively sort each half.
Combine: Merge the two sorted halves to produce the sorted array.
Time Complexity: Merge Sort has a time complexity of O(n log n), which is efficient compared to the O(n^2) complexity of simpler sorting algorithms like bubble sort or selection sort.
Example: Consider the array [38, 27, 43, 3, 9, 82, 10]. Using Merge Sort:
Divide the array into [38, 27, 43, 3] and [9, 82, 10].
Further divide until subarrays of one element remain.
Combine the sorted subarrays step-by-step until the full array is sorted: [3, 9, 10, 27, 38, 43, 82].
The Divide and Conquer method, as demonstrated in Merge Sort, ensures an efficient and scalable
sorting solution that works well even for large datasets.
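A short Python sketch of Merge Sort following the same Divide and Combine steps:

def merge_sort(arr):
    # Divide: split until subarrays of one element remain.
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    # Combine: merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([38, 27, 43, 3, 9, 82, 10]))  # [3, 9, 10, 27, 38, 43, 82]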
A hash function takes an input (key) and returns a fixed-size integer value, called a hash code or hash
value. This hash value determines the location in the hash table where the data should be stored.
A good hash function distributes keys uniformly across the table to minimize collisions and optimize
efficiency.
Hash Table:
A hash table is an array-like data structure where data is stored based on its hash value.
Each slot in the hash table corresponds to a unique hash code generated by the hash function.
Collision Handling:
Collisions occur when multiple keys hash to the same slot in the hash table. Collision-handling
techniques like chaining (linking multiple entries in the same slot) or open addressing (probing other
slots) are used to resolve these conflicts.
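A toy Python sketch of a hash table that resolves collisions by chaining; Python's built-in hash() stands in for the hash function, and a real implementation would also resize the table as it fills:

class ChainedHashTable:
    def __init__(self, size=8):
        self.size = size
        self.buckets = [[] for _ in range(size)]  # one chain (list) per slot

    def _slot(self, key):
        # Map the key's hash value to a slot index in the fixed-size table.
        return hash(key) % self.size

    def put(self, key, value):
        chain = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))      # empty slot or collision: append to the chain

    def get(self, key):
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("alice", "password123")
table.put("bob", "secure456")
print(table.get("bob"))  # secure456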
Hashing in Dictionaries
A dictionary (or map) is a data structure that stores key-value pairs, where each key is unique, and the
associated value can be any data type. Hashing is used in dictionary implementations to:
Maintain efficient performance for operations like search, insert, and delete.
Insertion: The dictionary uses a hash function on the key to determine where to store the value in the hash table. If there’s a collision, the dictionary resolves it using a collision-handling technique and stores the new pair.
Lookup: The hash function is applied to the key to determine its location in the table. The dictionary then retrieves the value directly from that location, making lookup operations fast.
Deletion: The hash function finds the key’s location in the hash table. The dictionary removes the value from that slot and adjusts accordingly, ensuring that other elements remain accessible.
Flexibility: Dictionaries allow for rapid retrieval and update of data, making them useful for a variety of
applications.
Minimal Space Usage: By using a fixed-size table and handling collisions effectively, hash tables avoid
excessive memory use.
user_data = {
    "alice": "password123",
    "bob": "secure456",
    "charlie": "qwerty789"
}
To retrieve the password for "bob", the hash function calculates the hash for "bob", directly locating "secure456" in O(1) average time.
By using hashing, dictionaries ensure quick and efficient access to data, making them one of the most
Dynamic Programming (DP) is a technique used to solve problems by breaking them down into simpler
subproblems, solving each subproblem only once, and storing their solutions to avoid redundant work.
There are two main approaches to solve DP problems:
Top-Down (Memoization): The problem is solved recursively. When a subproblem is encountered, the algorithm first checks if its solution is already stored (i.e., memoized); if it is, the stored result is reused, avoiding re-computation.
Example: Fibonacci sequence calculation, where each recursive call checks if the Fibonacci value for that index has already been computed and stored, as sketched below.
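A minimal memoized version in Python, using functools.lru_cache as the memo store:

from functools import lru_cache

@lru_cache(maxsize=None)            # memoization: cache every subproblem's result
def fib(n):
    if n < 2:                       # base cases: F(0) = 0, F(1) = 1
        return n
    return fib(n - 1) + fib(n - 2)  # each subproblem is computed only once

print(fib(50))  # 12586269025, computed in linear time thanks to memoization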
Bottom-Up (Tabulation): A table (or array) is used to store solutions to all subproblems, building up the solution to the original problem iteratively from the smallest subproblems.
This approach is often faster than the top-down approach because it avoids the overhead of recursive
calls.
Example: Solving the longest common subsequence (LCS) problem by filling out a 2D table iteratively
based on smaller subproblems.
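A small bottom-up Python sketch for the LCS length, where dp[i][j] holds the LCS length of the first i characters of a and the first j characters of b:

def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1              # extend the common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])   # drop one character from a or b
    return dp[len(a)][len(b)]

print(lcs_length("ABCBDAB", "BDCABA"))  # 4 (e.g. "BCBA")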
A problem being NP-Complete has significant implications for algorithmic efficiency and difficulty:
Solving NP-Complete problems exactly is believed to require super-polynomial (typically exponential) time in the worst case, making them computationally infeasible for large inputs.
As input size grows, the time required to solve the problem increases rapidly, limiting practical
applicability.
Difficulty and Approximation:
Since no efficient algorithm is known for NP-Complete problems, they are considered difficult to solve.
Many NP-Complete problems, like the Traveling Salesman Problem or the Knapsack Problem, are solved
by approximation algorithms or heuristic methods in practice, where a near-optimal solution is
acceptable.
The P vs NP problem, one of the most significant unsolved questions in computer science, asks whether all problems in NP can actually be solved in polynomial time. If P ≠ NP, then NP-Complete problems cannot be solved in polynomial time.
Alternative strategies, such as dynamic programming (when applicable), greedy algorithms, or even
randomization, can be used to make these problems more tractable, although they may not guarantee
an optimal solution.
In essence, NP-Complete problems are computationally intensive and difficult to solve optimally. Their study is crucial because they arise frequently in real-world applications, and finding approximate or heuristic solutions is often the best practical option.
1. Identifying Genetic Markers
Techniques like deep learning can identify patterns in genomic data that may not be detectable by traditional methods, enabling researchers to pinpoint genetic markers linked to disease susceptibility and progression.
2. Predictive Modeling for Disease Risk
Machine learning models can predict an individual's risk of developing certain diseases based on their
unique genetic profile.
Using a combination of genetic, clinical, and lifestyle data, algorithms assess probabilities of disease
onset, allowing for early interventions and preventive strategies tailored to the patient’s risk factors.
Machine learning also accelerates drug discovery by analyzing genomic and molecular data to identify
potential drug targets, enabling the design of new therapies tailored to specific genetic profiles.
Stratifying patients by their genetic and molecular profiles ensures that clinical trials enroll individuals most likely to benefit from a specific treatment, improving trial efficiency and outcome prediction.
A deeper understanding of the molecular mechanisms of disease enables the development of precise diagnostics and treatments, leading to better disease management.
Greater precision in gene editing opens up possibilities for developing therapies customized to an individual's genetic profile.
Example:
In oncology, machine learning analyzes genomic data to identify mutations in genes like BRCA1 and
BRCA2, which are linked to breast cancer risk. Based on this information, personalized treatments and
preventive measures can be recommended for patients with these genetic markers.
Overall, machine learning helps bridge the gap between genomic insights and practical clinical
applications, making precision medicine a reality by tailoring treatment plans, therapies, and preventive
measures to the genetic and molecular makeup of each individual.
Examples of linear classifiers include Logistic Regression and Support Vector Machines (SVM) with a
linear kernel.
A linear classifier creates a straight-line boundary when the classes can be separated by a linear
relationship between features.
The equation of the decision boundary in two-dimensional space is typically of the form:
w1·x1 + w2·x2 + b = 0
where w1 and w2 are weights, x1 and x2 are features, and b is the bias term. Data points on one side of this boundary belong to one class, while those on the other side belong to the other class.
In a linearly separable dataset, a linear classifier can perfectly classify the data using a single straight line
or hyperplane.
For non-linearly separable data, linear classifiers may struggle, and more complex classifiers or
transformations (such as kernels in SVM) are needed.
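A tiny sketch of the decision rule with made-up weights and bias:

# Hypothetical weights and bias for a 2-feature linear classifier.
w1, w2, b = 0.8, -1.2, 0.5

def classify(x1, x2):
    # The sign of w1*x1 + w2*x2 + b decides which side of the boundary a point falls on.
    score = w1 * x1 + w2 * x2 + b
    return 1 if score >= 0 else 0

print(classify(2.0, 1.0))  # score = 0.9  -> class 1
print(classify(0.0, 2.0))  # score = -1.9 -> class 0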
Divide the Data: Split the dataset into k equally sized folds.
Training and Validation: For each iteration, select one fold as the validation set and use the remaining k−1 folds as the training set.
Train the model on the training set and evaluate it on the validation set.
Repeat: Repeat the process k times, with each fold serving as the validation set exactly once.
Average Performance: Calculate the average performance metric (e.g., accuracy, precision) across all k
iterations. This average score is the model's overall performance estimate.
Example:
Suppose we have a dataset with 100 samples and choose 5-Fold Cross-Validation (where k = 5):
1. Divide the 100 samples into 5 folds (each fold containing 20 samples).
2. Run 5 training/validation cycles:
a) In the first cycle, use the first 4 folds (80 samples) for training and the 5th fold (20 samples) for validation.
b) In the second cycle, use the second fold as validation and the rest for training, and so on.
3. After the 5 cycles, calculate the average validation accuracy across the five runs, giving a more reliable performance metric for the model.
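A minimal sketch of 5-fold cross-validation, assuming NumPy and scikit-learn are available and using synthetic data purely for illustration:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 samples, 4 features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])      # train on 4 folds (80 samples)
    preds = model.predict(X[val_idx])          # validate on the held-out fold (20 samples)
    scores.append(accuracy_score(y[val_idx], preds))

print(f"Mean 5-fold accuracy: {np.mean(scores):.3f}")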
Maximizes Data Use: It allows each data point to be used for both training and validation, which is
especially useful for smaller datasets.
Detects Overfitting: Helps detect overfitting by testing the model on multiple subsets of the data, ensuring it generalizes well across different data points.
In summary, k-fold cross-validation is an essential technique for model evaluation that provides a more
accurate measure of a model’s generalization capability, helping to ensure robust and reliable
performance metrics.
Purpose:
Hypothesis Testing: Used in statistics to determine whether there is enough
evidence to reject a null hypothesis about a population parameter. The result is
typically a p-value, which tells us the probability of observing the data if the null
hypothesis is true.
Goal:
In hypothesis testing, the goal is to assess the validity of a specific hypothesis, such
as whether a treatment has an effect. The process is designed to control the
probability of making Type I and Type II errors.
Outcome:
Hypothesis Testing: The outcome is typically a decision (reject or fail to reject the
null hypothesis), based on whether the p-value is below a significance threshold
(e.g., 0.05).
Despite these differences, both approaches focus on assessing whether the results
observed on sample data can be generalized or trusted, providing evidence for or
against a proposed model or hypothesis.
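As a concrete illustration of the p-value decision rule, a small sketch using SciPy's one-sample t-test on made-up data:

import numpy as np
from scipy import stats

# Hypothetical sample: did a treatment shift the mean away from 0?
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.4, scale=1.0, size=30)

# One-sample t-test of H0: the population mean equals 0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 5% significance level.")
else:
    print("Fail to reject H0.")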
1. Philosophy of Probability
Frequentist Approach: Views probability as the long-run frequency of events. A probability statement is about the proportion of times an event would occur if the experiment were repeated infinitely.
Bayesian Approach: Views probability as a degree of belief, which starts from a prior distribution and is updated as new evidence (data) becomes available.
5. Example of Differences:
In hypothesis testing, a frequentist approach might calculate a p-value for a null
hypothesis. A Bayesian approach, by contrast, would calculate the posterior
probability of the null hypothesis given the data.
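A brief sketch of that contrast on a hypothetical coin-flip experiment (7 heads in 10 tosses), assuming SciPy >= 1.7 for binomtest:

from scipy import stats

heads, tosses = 7, 10

# Frequentist: p-value for H0 "the coin is fair".
p_value = stats.binomtest(heads, tosses, p=0.5).pvalue
print(f"Frequentist p-value under H0 (fair coin): {p_value:.3f}")

# Bayesian: posterior over the heads probability with a uniform Beta(1, 1) prior.
posterior = stats.beta(1 + heads, 1 + tosses - heads)
print(f"Posterior probability that the coin favours heads: {1 - posterior.cdf(0.5):.3f}")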
Bayesian: Allows for more flexible, intuitive interpretations and use of prior
knowledge, but can be computationally intensive and may be sensitive to the
choice of prior.