Internship Report 2023-24: Data Science
CHAPTER 1
INTRODUCTION
1.1 DATA SCIENCE
Data science is a multidisciplinary field that uses statistical and computational methods to
extract insights and knowledge from data. It involves a combination of skills and knowledge
from various fields such as statistics, computer science, mathematics, and domain expertise.
The process of data science involves several steps, including data collection, cleaning,
exploration, analysis, and interpretation. These steps are often iterative, and the process may
be refined based on the results obtained. Data science is used in a wide range of applications,
including business, healthcare, social science, engineering, and many others. Some examples
of data science applications include fraud detection, personalized marketing, medical
diagnosis, predictive maintenance, and recommendation systems.
2. Curiosity
Intellectual curiosity drives data scientists to look for answers to pressing business problems. Curious professionals go beyond initial assumptions and surface results that would otherwise be missed. A data scientist must be curious enough both to unlock solutions to known problems and to uncover hidden, overlooked insights. As a result, they derive higher-quality knowledge from their data sets.
3. Business Acumen
Data scientists deal with massive amounts of information. If they do not translate it effectively, this valuable information goes to waste, because upper-level management never gets to use it to make business decisions. Data scientists therefore need to understand current and upcoming industry trends and acquire a grounding in basic business concepts and tools.
4. Storytelling
Storytelling aids data scientists in conveying their results logically and clearly. It takes data
visualization to another dimension, allowing decision-makers to see things from a new
perspective. A compelling storytelling approach builds a strong data narrative where
stakeholders attain a new sense of understanding about the data presented and use it to
support their decisions going forward.
5. Team Player
Data scientists don’t work inside a bubble, and they must recognize the importance of
teamwork and collaborate effectively with others. They need to listen to other team members
and use that input to their advantage.
Eigenvalues and eigenvectors help in understanding the variability in data, influencing clustering and pattern recognition.
Solving systems of equations is crucial for optimization tasks and parameter estimation.
Furthermore, linear algebra supports the image- and signal-processing techniques that are critical in data analysis.
Proficiency in linear algebra empowers data scientists to effectively represent, manipulate, and extract insights from data, ultimately driving the development of accurate models and informed decision-making.
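As a small illustration (a sketch in NumPy with made-up numbers), the eigen-decomposition of a covariance matrix exposes the directions of greatest variability, and a linear system can be solved directly:

import numpy as np

# Made-up data: 100 points with two correlated features
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])

# Eigen-decomposition of the covariance matrix: eigenvectors give the
# directions of variability, eigenvalues give their magnitudes
cov = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

# Solving a linear system Ax = b, as arises in parameter estimation
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print("Solution of Ax = b:", np.linalg.solve(A, b))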
CHAPTER 2
INTRODUCTION TO DATA
2.1 WHAT IS DATA?
Data is information of different types, usually formatted in a particular manner. All software is divided into two major categories: programs and data. We have just defined data; programs are collections of instructions used to manipulate data.
We use data science to make it easier to work with data. Data science is defined as a field that
combines knowledge of mathematics, programming skills, domain expertise, scientific
methods, algorithms, processes, and systems to extract actionable knowledge and insights
from both structured and unstructured data, then apply the knowledge gleaned from that data
to a wide range of uses and domains.
Computers represent data (e.g., text, images, sound, video) as binary values that employ two numbers: 1 and 0. The smallest unit of data is called a “bit,” and it represents a single value.
Additionally, a byte is eight bits long. Memory and storage are measured in units such as
megabytes, gigabytes, terabytes, petabytes, and exabytes. Data scientists keep coming up
with newer, larger data measurements as the amount of data our society generates continues
to grow.
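As a small illustration in Python (binary units are commonly treated as powers of 1,024 bytes):

# The character 'A' stored as one byte: eight bits
print(format(ord("A"), "08b"))  # 01000001
# Storage units scale by factors of 1,024
kilobyte = 1024
megabyte = kilobyte ** 2
gigabyte = kilobyte ** 3
print(kilobyte, megabyte, gigabyte)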
Data can be stored in file formats, as in mainframe systems using ISAM and VSAM. Other file formats for data conversion, processing, and storage include comma-separated values (CSV). These formats are still used across a wide range of machine types, even as more structured-data-oriented approaches gain a greater foothold in today's IT world.
The field of data storage has seen greater specialization develop as the database, the database
management system, and more recently, relational database technology, each made their
debut and provided new ways to organize information.
Structured data is organised according to a predefined model, with fields of defined types (e.g. name, date, number) and restrictions on their values (e.g. number of characters). This level of organisation means that data can be entered, stored, queried, or analysed by machines.
Structured data includes:
Names
Dates
Phone numbers
Currency or Prices
Heights or Weights
Latitude and Longitude
Word count or File size of a document
3. Big data
Volume: Data sets contain vast quantities of information that put high demands on systems
used for storing, manipulating, and processing the information.
Variety: It’s common for systems to process data from many sources, including emails,
images, video, audio, readings from IoT devices, and even scanned PDF documents.
Velocity: Vast quantities of data are being generated faster than ever, presenting challenges
for analysts as more industries use this information. The ability to make instant decisions
based on up-to-date information can make or break a business.
Techniques such as outlier detection and data validation help refine the dataset. Data preprocessing involves tasks like standardization, normalization, and feature scaling, ensuring that data is prepared for downstream analysis.
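A minimal sketch of standardization and normalization, assuming scikit-learn and a made-up feature matrix:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up matrix: three samples, two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Standardization: zero mean and unit variance per feature
print(StandardScaler().fit_transform(X))

# Normalization: rescale each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))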
4. Feature Engineering
Feature engineering is the art of creating new features from existing ones to enhance the
performance of machine learning models. Techniques include dimensionality reduction,
creating interaction terms, and generating domain-specific features. Thoughtful feature
engineering can significantly impact model accuracy and interpretability.
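For instance, an interaction term and a simple transformation can be derived from existing columns; a sketch with made-up pandas data:

import numpy as np
import pandas as pd

# Made-up dataset with two base features
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [5, 3, 2]})

# Interaction term combining two existing features (a domain-specific
# "revenue" feature in this hypothetical example)
df["revenue"] = df["price"] * df["quantity"]

# A log transformation that can help linear models
df["log_price"] = np.log(df["price"])
print(df)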
5. Data Integration
Data integration involves combining data from different sources to create a unified dataset.
Techniques range from simple concatenation to more complex merging and joining
operations. Ensuring data consistency and resolving conflicts are essential aspects of
successful integration.
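A sketch of simple concatenation versus a key-based merge, using made-up pandas tables:

import pandas as pd

sales_jan = pd.DataFrame({"id": [1, 2], "amount": [100, 150]})
sales_feb = pd.DataFrame({"id": [3], "amount": [200]})
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ben", "Chloe"]})

# Concatenation: stack rows from two sources into one table
all_sales = pd.concat([sales_jan, sales_feb], ignore_index=True)

# Merge/join on a shared key to build a unified dataset
unified = all_sales.merge(customers, on="id", how="left")
print(unified)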
This process uses data to evaluate future probabilities and develop actionable analyses. The data can be structured, semi-structured, or unstructured, and can be stored in various forms such as databases, data warehouses, and data lakes.
2. Data Preparation
The most time-consuming phase, the preparation phase, consists of three steps: extraction, transformation, and loading, also referred to as ETL. First, data is extracted from various sources and deposited into a staging area. Next, during the transformation step, the data is cleaned, null values are populated, duplicate records are removed, errors are resolved, and all data is allocated into tables. In the final step, loading, the formatted data is loaded into the database for use.
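A compact sketch of the ETL pattern with pandas and SQLite (the file, column, and table names here are hypothetical):

import sqlite3
import pandas as pd

# Extract: pull raw data from a source file into a staging DataFrame
raw = pd.read_csv("raw_sales.csv")  # hypothetical source file

# Transform: remove duplicates and populate nulls
raw = raw.drop_duplicates()
raw = raw.fillna({"amount": 0})  # hypothetical column

# Load: write the formatted data into a database table
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales", conn, if_exists="replace", index=False)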
3. Modeling
Data modeling addresses the relevant data set and considers the best statistical and
mathematical approach to answering the objective question(s). There are a variety of
modeling techniques available, such as classification, clustering, and regression analysis
(more on them later). It’s also not uncommon to use different models on the same data to
address specific objectives.
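For instance, scikit-learn exposes all three families through a uniform fit/predict interface; a minimal sketch on made-up numbers:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_class = np.array([0, 0, 1, 1])         # labels for classification
y_reg = np.array([1.1, 1.9, 3.2, 3.9])   # targets for regression

print(LogisticRegression().fit(X, y_class).predict([[2.5]]))  # classification
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))         # clustering
print(LinearRegression().fit(X, y_reg).predict([[2.5]]))      # regression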
4. Evaluation
After the models are built and tested, it's time to evaluate how effectively they answer the question identified during the business understanding phase. This is a human-driven phase, as
the individual running the project must determine whether the model output sufficiently
meets their objectives. If not, a different model can be created, or different data can be
prepared.
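Evaluation is commonly supported by holding out test data and scoring the model on it; a minimal sketch with scikit-learn's bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Accuracy on unseen data helps judge whether the model meets the objective
print(accuracy_score(y_test, model.predict(X_test)))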
Data analytics is the science of analyzing raw data to make conclusions about that
information.
Data analytics helps a business optimize its performance, perform more efficiently, maximize profit, or make more strategically guided decisions.
The techniques and processes of data analytics have been automated into mechanical
processes and algorithms that work over raw data for human consumption.
Various approaches to data analytics include looking at what happened (descriptive
analytics), why something happened (diagnostic analytics), what is going to happen
(predictive analytics), or what should be done next (prescriptive analytics).
Data analytics relies on a variety of software tools, including spreadsheets, data visualization and reporting tools, data mining programs, and open-source languages for data manipulation. Examples include business intelligence and visualization software, predictive analytics platforms, and data mining tools, among others.
Interpret: This stage is where the researcher develops courses of action based on the findings. For example, here you would determine whether your clients prefer packaging that is red or green, plastic or paper, and so on. At this stage you can also identify limitations of the analysis and work to address them.
CHAPTER 3
INTRODUCTION TO MACHINE LEARNING
3.1 OVERVIEW
Machine learning (ML) is a branch of artificial intelligence (AI) that enables computers to
“self-learn” from training data and improve over time, without being explicitly programmed.
Machine learning algorithms are able to detect patterns in data and learn from them, in order
to make their own predictions. In short, machine learning algorithms and models learn
through experience. While artificial intelligence and machine learning are often used
interchangeably, they are two different concepts. AI is the broader concept – machines
making decisions, learning new skills, and solving problems in a similar way to humans –
whereas machine learning is a subset of AI that enables intelligent systems to autonomously
learn new things from data.
Qualitative data vs. quantitative data:
3. Qualitative data talks about the experience or quality and explains questions like 'why' and 'how'; quantitative data talks about the quantity and explains questions like 'how much' and 'how many'.
4. Qualitative data is analyzed by grouping it into different categories; quantitative data is analyzed by statistical methods.
5. Qualitative data is subjective and can be further open to interpretation; quantitative data is fixed and universal.
CHAPTER 4
INTRODUCTION TO R PROGRAMMING
4.1 R PROGRAMMING OVERVIEW
R is a statistical computing and graphics system. The system comprises two parts: the R language itself (which is what most people mean when they talk about R) and a run-time environment. R is an interpreted language, which means that users access its functions through a command-line interpreter.
4.1.1 R VARIABLES
Variables are used to store information to be manipulated and referenced in an R program. An R variable can store an atomic vector, a group of atomic vectors, or a combination of many R objects. There are two functions used to print the value of a variable: print() and cat(). The cat() function combines multiple values into a continuous print output.
4.1.3 R FUNCTIONS
A set of statements organized together to perform a specific task is known as a function. R provides a series of built-in functions and also allows users to create their own. Functions support a modular approach to performing tasks and are useful for encapsulating code that performs a specific task so it can be reused throughout a script. Creating a function in R involves the function keyword. Here is the syntax and an example to illustrate how to define and use a function in R:
2. Imports
import() acts as a substitute for library() with an important difference: library() has the side effect of changing the search path of the complete R session.
# Define the function
functionWithDep <- function(x) {
  median(x)
}

# Call the function with a numeric vector
result <- functionWithDep(1:10)
print(result)
3. Importing modules
To import other modules, the function use() can be called; use() essentially means 'import this module'.
m <- module({
  import("stats")
  functionWithDep <- function(x) median(x)
})

mm <- module({
  # Bring the first module into scope so its functions can be called
  m <- use(m)
  anotherFunction <- function(x) m$functionWithDep(x)
})

mm$anotherFunction(1:10)
4. Exports
Exports can be defined as a regular expression, indicated by a leading ‘^’. In this case only one export declaration should be used.
# Load the modules package
library(modules)

# Define the module
m <- module({
  export("fun")               # Export the public function
  fun <- identity             # Define the public function
  privateFunction <- identity # Define a private function (not exported)
})

# Use the public function from the module
result <- m$fun(1:10)
print(result)
CHAPTER 5
INTRODUCTION TO PYTHON
Python is a widely used, general-purpose, high-level programming language. It was created by Guido van Rossum and first released in 1991, and it has been developed further by the Python Software Foundation. It was designed with an emphasis on code readability, and its syntax allows programmers to express their concepts in fewer lines of code. Python lets you work quickly and integrate systems efficiently. There are two major Python versions, Python 2 and Python 3, and the two are quite different.
The sequence data types in Python are ordered collections of similar or different Python data types. Sequences allow storing multiple values in an organized and efficient fashion. There are several sequence data types in Python, illustrated in the sketch after this list:
Python String
Python List
Python Tuple
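A quick illustration of all three:

text = "data"           # Python String: an ordered sequence of characters
values = [1, 2, 3]      # Python List: a mutable ordered sequence
point = (12.97, 77.59)  # Python Tuple: an immutable ordered sequence

print(text[0], values[-1], point[0])  # indexing works the same way on each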
Python objects can be evaluated in a Boolean context as well and determined to be true or false. The Boolean type is denoted by the class bool.
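For example:

# Empty and zero values evaluate to False; most other objects to True
print(bool(0), bool(""), bool([]))        # False False False
print(bool(42), bool("text"), bool([1]))  # True True True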
2. Use a Module
Now we can use the module we just created by using the import statement:
import mymodule
mymodule.greeting("Jonathan")
3. Variables in Module
The module can contain functions, as already described, but also variables of all types (arrays, dictionaries, objects, etc.), as the sketch below illustrates:
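A hypothetical mymodule.py consistent with the other examples in this chapter (the original file contents are assumed here):

# mymodule.py
def greeting(name):
    print("Hello, " + name)

person1 = {"name": "Jonathan", "age": 36}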
4. Naming a Module
You can name the module file whatever you like, but it must have the file extension .py.
5. Re-naming a Module
You can create an alias when you import a module by using the as keyword:
import mymodule as mx
a = mx.person1["age"]
print(a)
6. Built-in Modules
There are several built-in modules in Python, which you can import whenever you like.
import platform
x = platform.system()
print(x)
CHAPTER 6
PROJECT
6.1 ANALYSING A DATASET OF BEST-SELLING BOOKS
Source Code
import pandas as pd

# Read the dataset of best-selling books into a DataFrame
book_data = pd.read_csv(r"C:\Users\91966\Desktop\BestSellingBooks.csv")

# Count the number of books by each author
print("\nNumber of books by each author:")
print(book_data['Author'].value_counts())

# List the unique genres in the dataset
print("\nUnique genres in the dataset:")
print(book_data['Genre'].unique())

# Flag duplicate book names (a Boolean Series: True marks a repeated title)
print("\nDuplicate book names:")
print(book_data['Name'].duplicated())
This code reads a CSV file into a DataFrame and performs three main tasks:
1. Counts the number of books by each author.
2. Lists the unique genres in the dataset.
3. Identifies duplicate book names.
Output: the console listing produced by the script above.
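Since duplicated() returns a Boolean Series, a natural follow-up (a sketch using the same DataFrame) is to display only the titles that actually repeat:

# Show only the rows whose book name appears more than once
duplicates = book_data[book_data['Name'].duplicated(keep=False)]
print(duplicates[['Name', 'Author']])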
CHAPTER 7
CONCLUSION
This Data Science internship using Python has been an invaluable learning experience. I gained
proficiency in essential libraries like NumPy, pandas, Matplotlib, and Scikit-learn, and
developed skills in data cleaning, exploratory analysis, and machine learning. Working on
real-world projects taught me the importance of data preprocessing and effective
communication of results. Collaborating with a professional team highlighted the value of
teamwork. This internship solidified my passion for data science and prepared me for future
opportunities in this dynamic field.
REFERENCES
1. VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working
with Data. O'Reilly Media.
2. McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas,
NumPy, and IPython. O'Reilly Media.
3. Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and
TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly
Media.
4. https://www.w3schools.com/python/python_intro.asp
5. https://www.w3schools.com/python/python_datatypes.asp
6. https://www.geeksforgeeks.org/loops-in-python/?ref=lbp
7. https://www.programiz.com/python-programming/function