Practical Data Science with Jupyter: Explore Data Cleaning, Pre-processing, Data Wrangling, Feature Engineering and Machine Learning using Python and Jupyter (English Edition)
Ebook, 586 pages, 4 hours

About this ebook

This book begins with an introduction to data science, followed by the Python concepts you will need. You will learn how to work with various databases and with statistics concepts through their Python implementations. You will learn how to import various types of data in Python, which is the first step of the data analysis process. Once you are comfortable importing data, you will clean the dataset and then get to know various visualization charts. The book focuses on applying feature engineering techniques to make your data more valuable to an algorithm. You will also get to know various machine learning algorithms and concepts, time series data, and a few real-world case studies. Finally, the book presents some best practices that will help you become industry-ready.

This book focuses on practicing data science techniques while learning their concepts using Python and Jupyter. It answers the most common question, how to get started with data science, rather than explaining the mathematics and statistics behind the machine learning algorithms.
Language: English
Release date: Mar 1, 2021
ISBN: 9789389898071

    Book preview

    Practical Data Science with Jupyter - Prateek Gupta

    CHAPTER 1

    Data Science Fundamentals

    Learning from data is virtually universally useful. Master it and you will be welcomed anywhere.

    – John Elder, founder of Elder Research

    Elder Research is America’s largest and most experienced analytics consultancy. John founded the company in 1995 with this vision of data, and since then finding information in data has grown from a niche into one of the most in-demand skills of the 21st century. Today data science is everywhere.

    The explosive growth of the digital world requires professionals with not just strong skills, but also adaptability and a passion for staying at the forefront of technology. A recent study projects that demand for data scientists and analysts will grow by 28 percent by 2021, on top of the current market need. According to the U.S. Bureau of Labor Statistics, demand for data science skills will grow by about 28% through 2026. Unless something changes, these skill gaps will continue to widen. In this first chapter, you will get familiar with data, your role as an aspiring data scientist, and the importance of the Python programming language in data science.

    Structure

    What is data?

    What is data science?

    What does a data scientist do?

    Real-world use cases of data science

    Why Python for data science?

    Objective

    After studying this chapter, you should be able to understand the data types, the amount of the data generated daily, and the need for data scientists with currently available real-world use cases.

    What is data?

    The best way to describe data is to understand the types of data. Data is divided into the following three categories.

    Structured data

    Well-organized data in the form of tables that can easily be operated on is known as structured data. Searching and accessing information in this type of data is very easy. For example, data stored in a relational (SQL) database takes the form of tables with multiple rows and columns. A spreadsheet is another good example of structured data. Structured data represents only 5% to 10% of all data present in the world. The following figure 1.1 is an example of SQL data, where an SQL table holds merchant-related data:

    Figure 1.1: Sample SQL Data
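
    Since the figure is not reproduced here, the idea of structured data can be sketched with Python's built-in sqlite3 module. This is only an illustration: the merchants table, its columns, and the sample rows are assumptions made for this example, not the data shown in the book's figure.

```python
import sqlite3

# In-memory database; table name, columns, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE merchants (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO merchants (id, name, city) VALUES (?, ?, ?)",
    [
        (1, "Acme Traders", "Pune"),
        (2, "Blue Mart", "Delhi"),
        (3, "City Foods", "Pune"),
    ],
)

# Every row has the same columns, so searching is a simple query.
rows = conn.execute(
    "SELECT name FROM merchants WHERE city = ? ORDER BY id", ("Pune",)
).fetchall()
pune_merchants = [name for (name,) in rows]
print(pune_merchants)

conn.close()
```

    Because each record follows the same fixed schema, finding all merchants in one city takes a single query, which is what makes structured data easy to search.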

    Unstructured data

    Unstructured data requires advanced tools and software to access information. For example, images and graphics, PDF files, Word documents, audio, video, emails, PowerPoint presentations, webpages and web content, wikis, streaming data, location coordinates, etc., fall under the unstructured data category. Unstructured data represents around 80% of all data. The following figure 1.2 shows various unstructured data types:

    Figure 1.2: Unstructured data types

    Semi-structured data

    Semi-structured data carries some organizing markers, such as keys or tags, but does not follow the fixed rows and columns of a table. Web data such as JSON (JavaScript Object Notation) files, BibTeX files, CSV files, tab-delimited text files, XML, and other markup languages are examples of semi-structured data found on the web. Semi-structured data represents only 5% to 10% of all data present in the world. The following figure 1.3 shows an example of JSON data:

    Figure 1.3: JSON data
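
    As a rough illustration of semi-structured data, the snippet below parses a small JSON document with Python's built-in json module. The record and its field names are made up for this sketch and are not taken from the book's figure.

```python
import json

# A hypothetical JSON record: keys give it some structure, but fields can be
# nested or missing, unlike the fixed columns of a table.
raw = """
{
  "merchant": "Acme Traders",
  "contact": {"email": "info@example.com"},
  "tags": ["grocery", "wholesale"]
}
"""

record = json.loads(raw)

email = record["contact"]["email"]        # nested value
phone = record["contact"].get("phone")    # absent key, so this is None
print(email, record["tags"], phone)
```

    The keys make the record self-describing, but nothing forces every record to carry the same fields, so code that reads it has to allow for missing values.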

    What is data science?

    It’s become a universal truth that modern businesses are awash with data. Last year, McKinsey estimated that Big Data initiatives in the US healthcare system could account for $300 billion to $450 billion in reduced healthcare spending, or 12-17 percent of the $2.6 trillion baseline in US healthcare costs. On the other hand, bad or unstructured data is estimated to be costing the US roughly $3.1 trillion a year.

    Data-driven decision making is increasing in popularity. Accessing and finding information in unstructured data is complex and cannot be done easily with BI tools alone; this is where data science comes into the picture.

    Data science is a field that extracts knowledge and insights from raw data. To do so, it uses mathematics, statistics, computer science, and programming knowledge. A person who has all these skills is known as a data scientist. Being a data scientist is all about being curious, self-driven, and passionate about finding answers. The following figure 1.4 shows the skills that a modern data scientist should have:

    Figure 1.4: Skills of a modern data scientist

    What does a data scientist do?

    Most data scientists in the industry have advanced training in statistics, math, and computer science. Their experience also extends to data visualization, data mining, and information management. The primary job of a data scientist is to ask the right question. It’s about surfacing hidden insights that help companies make smarter business decisions.

    The job of a data scientist is not bound to a particular domain. Apart from scientific research, they work in various domains including shipping, healthcare, e-commerce, aviation, finance, education, etc. They start by understanding the business problem and then proceed through data collection, reading the data, transforming it into the required format, visualization, modeling, model evaluation, and finally deployment. You can picture their work cycle as shown in the following figure 1.5:

    Figure 1.5: Work cycle of a data scientist

    It is often said that eighty percent of a data scientist’s time is spent simply finding, cleansing, and organizing data, leaving only 20 percent for analysis. These processes can be time-consuming and tedious. But it’s crucial to get them right, since a model is only as good as the data used to build it. And because models generally improve as they are exposed to increasing amounts of data, it’s in data scientists’ interest to include as much data as they can in their analysis.

    In the later chapters of this book, you will learn all the skills mentioned above that are required to be a data scientist.

    Real-world use cases of data science

    Information is the oil of the 21st century, and analytics is the combustion engine. Whether you are uploading a picture on Facebook, posting a tweet, emailing someone, or shopping on an e-commerce site, data science plays a role everywhere. In the modern workplace, data science is applied to many problems to predict and calculate outcomes that would otherwise have taken several times more human hours to process. The following are some real-world examples where data scientists are playing a key role:

    Google’s AI research arm is taking the help of data scientists to build the best performing algorithm for automatically detecting objects.

    Amazon has built a product recommendation system to personalize the products it shows to each customer.

    Santander Group has built a model with the help of data scientists to identify the value of transactions for each potential customer.

    Airbus in the maritime industry is taking the help of data scientists to build a model that detects all ships in satellite images as quickly as possible to increase knowledge, anticipate threats, trigger alerts, and improve efficiency at sea.

    YouTube is using an automated video classification model that works within limited memory.

    Data scientists at the Chinese internet giant Baidu Inc. released details of a new deep learning algorithm that they claim can help pathologists identify tumors more accurately.

    The Radiological Society of North America (RSNA®) is using an algorithm to detect a visual signal for pneumonia in medical images; it automatically locates lung opacities on chest radiographs.

    The Inter-American Development Bank is using an algorithm that considers a family’s observable household attributes like the material of their walls and ceiling, or the assets found in the home to classify them and predict their level of need.

    Netflix applies data science to movie viewing patterns to understand what drives user interest, and uses that to decide which Netflix original series to produce.

    Why Python for data science?

    Python is very beginner friendly. The syntax (words and structure) is extremely simple to read and follow, most of which can be understood even if you do not know any programming. Python is a multi-paradigm programming language – a sort of Swiss Army knife for the coding world. It supports object-oriented programming, structured programming, and functional programming patterns, among others. There’s a joke in the Python community that Python is generally the second-best language for everything.
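
    To make the multi-paradigm point concrete, here is a small sketch that computes the same sum of squares in three styles. The numbers and the SquareSummer class name are invented for this illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Structured (procedural) style: an explicit loop and an accumulator.
total_loop = 0
for n in numbers:
    total_loop += n * n

# Functional style: fold the list with a function and no mutable loop state.
total_functional = reduce(lambda acc, n: acc + n * n, numbers, 0)


# Object-oriented style: data and behaviour bundled in a class.
class SquareSummer:
    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(v * v for v in self.values)


total_oop = SquareSummer(numbers).total()

print(total_loop, total_functional, total_oop)
```

    All three variants produce the same result; which one reads best depends on the problem, and Python lets you mix them freely.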

    Python is free, open-source software, and consequently anyone can write a library package to extend its functionality. Data science has been an early beneficiary of these extensions, particularly Pandas, the big daddy of them all.

    Python’s inherent readability and simplicity make it relatively easy to pick up, and the number of dedicated analytical libraries available today means that data scientists in almost every sector will find packages already tailored to their needs, freely available for download.

    The following survey (figure 1.6), done by KDnuggets, a leading site on business analytics, Big Data, data mining, data science, and machine learning, shows that Python is a preferred choice for data science and machine learning:

    Figure 1.6: Survey by KDnuggets

    Conclusion

    Most people think that it is very difficult to become a data scientist. But, let me be clear, it is not tough!

    If you love making discoveries about the world, and if you are fascinated by machine learning, then you can break into the data science industry no matter what your situation is. This book will push you to learn, improve, and master data science skills on your own. There is only one thing you need to keep doing: LEARN-APPLY-REPEAT. In the next chapter, we will set up our machine and get ready for our data science journey.

    CHAPTER 2

    Installing Software and System Setup

    In the last chapter, we covered the data science fundamentals, and now we are ready to move ahead and prepare our system for data science. In this chapter, we will learn about the most popular Python data science platform – Anaconda. With this platform, you don't need to install Python explicitly – just one installation in your system (Windows, macOS, or Linux) and you are ready to use the industry-standard platform for developing, testing, and training.

    Structure

    System requirements

    Downloading Anaconda

    Installing the Anaconda on Windows

    Installing the Anaconda in Linux

    How to install a new Python library in Anaconda

    Open your notebook – Jupyter

    Know your notebook

    Objective

    After studying this chapter, you should be able to install Anaconda in your system successfully and use the Jupyter notebook. You will also run your first Python program in your notebook.

    System requirements

    System architecture: 64-bit x86, 32-bit x86 with Windows or Linux, Power8, or Power9

    Operating system: Windows Vista or newer, 64-bit macOS 10.10+, or Linux, including Ubuntu, RedHat, CentOS 6+

    Minimum 3 GB disk space to download and install
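
    Before downloading, you may want to check your machine against these requirements. The following is a minimal sketch using only the Python standard library, assuming some Python interpreter is already available on the machine; the free_disk_gb helper is a name introduced for this example, and the 3 GB threshold is taken from the list above.

```python
import platform
import shutil
from pathlib import Path

REQUIRED_GB = 3  # minimum disk space from the requirements above


def free_disk_gb(path):
    """Return the free space, in GB, on the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)


home = Path.home()  # Anaconda installs under the home folder by default
free_gb = free_disk_gb(home)

print("Operating system:", platform.system(), platform.release())
print("Architecture:", platform.machine())
print(f"Free space under {home}: {free_gb:.1f} GB")
print("Enough space for Anaconda:", free_gb >= REQUIRED_GB)
```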

    Downloading Anaconda

    You can download the Anaconda Distribution from the following link:

    https://www.anaconda.com/download/

    Once you click on the preceding link, you will see the following screen (as shown in figure 2.1):

    Figure 2.1: Anaconda Distribution download page

    Anaconda Distribution shows different OS options – Windows, macOS, and Linux. Select the option that matches your OS. For this example, I have selected the 64-Bit Graphical Installer (457 MB) option for Windows, as shown in the following figure 2.2:

    Figure 2.2: Anaconda Distribution installer versions for Windows OS

    The Python community has ended its support for Python 2.x, so it is highly recommended that you use Python 3.x. We are going to use Python 3.8 throughout this book, so I recommend downloading this version only. To download the distribution, see the two links just below the Download button; they show the Graphical Installer for each system architecture type: 64-bit or 32-bit. Click on the appropriate link, and the download will start. The download process is the same for macOS and Linux.
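
    Once Python is installed, a quick way to confirm which version you are actually running is to ask the interpreter itself. The is_python3 helper below is a name made up for this sketch.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3.x."""
    return version_info[0] == 3


# sys.version starts with the version number, e.g. "3.8.x ...".
print("Running Python", sys.version.split()[0])
print("Python 3.x:", is_python3())
```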

    Installing the Anaconda on Windows

    Once the download is complete, double-click the installer to launch it (the recommended way is to run the installer with admin privileges).

    Click Next, accept the terms, select the users – Just Me or All Users and click Next.

    Select the default destination folder or add a custom location to install Anaconda, copy this path for later use, and click Next.

    Install Anaconda to a directory path that does not contain spaces or Unicode characters.

    Deselect (uncheck) the first option, Add Anaconda to my PATH environment variable, if it is already checked. Then click Install and wait until the installation is complete.

    Click Next, click Skip, and then click Finish.

    Now open the Advanced system settings in your machine and add the following two values in your PATH environment variable:

    C:\Users\prateek\Anaconda3

    C:\Users\prateek\Anaconda3\Scripts

    Here, replace C:\Users\prateek\Anaconda3 with the actual path of your Anaconda installation folder that you copied earlier.

    Save the settings and restart your system.

    Verify your installation by clicking the Windows icon in the taskbar, or simply type Anaconda in the search bar. You will see the Anaconda Navigator option; click it, and the following screen will appear (as shown in figure 2.3):

    Figure 2.3: Anaconda Navigator

    Installing Anaconda with the Graphical Installer on macOS follows the same steps as described above for Windows.
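
    If Anaconda commands are not found after the Windows installation, the PATH entries added in the steps above are the first thing to check. The following sketch shows one way to do that from Python; missing_from_path is a helper name invented here, and the example folders are the ones used in the steps above, so substitute your own installation path.

```python
import os


def _normalize(path):
    """Normalize a folder path so equivalent spellings compare equal."""
    return os.path.normcase(os.path.normpath(path))


def missing_from_path(required_dirs, path_value=None, sep=os.pathsep):
    """Return the folders from `required_dirs` that are not listed in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = {_normalize(p) for p in path_value.split(sep) if p}
    return [d for d in required_dirs if _normalize(d) not in entries]


# Example with a made-up Windows-style PATH value (";" separates entries).
example_path = r"C:\Windows\system32;C:\Users\prateek\Anaconda3"
required = [
    r"C:\Users\prateek\Anaconda3",
    r"C:\Users\prateek\Anaconda3\Scripts",
]
missing = missing_from_path(required, example_path, sep=";")
print("Missing from PATH:", missing)
```

    On your own machine, call missing_from_path(required) without the extra arguments to check the real PATH of the current session.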

    Installing the Anaconda in Linux

    After downloading the 64-bit (x86) installer, run the following two commands to check the data integrity:

    md5sum /path/filename

    sha256sum /path/filename

    Replace /path/filename with the actual path and filename of the file you downloaded.

    Enter the following to install Anaconda for Python 3.8; just replace ~/Downloads/ with the path to the file you downloaded:

    Figure 2.4: Installing Anaconda in Linux

    Choose to install Anaconda as a user unless root privileges are required. The installer prompts – In order to continue the installation process, please review the license agreement. Press Enter to view the license terms.

    Scroll to the bottom of the license terms and enter Yes to agree. The installer prompts you to press Enter to accept the default install location, press CTRL + C to cancel the installation, or specify an alternate installation directory. If you accept the default install location, the installer displays PREFIX=/home//anaconda<3> and continues the installation. It may take a few minutes to complete.

    The installer prompts – Do you wish the installer to prepend the Anaconda<3> install location to PATH in your /home//.bashrc? Enter Yes.

    If you enter No, you must manually add the path to Anaconda or conda will not work.

    The installer describes Microsoft VS Code and asks if you would like to install the VS Code. Enter yes or no. If you select yes, follow the instructions on the screen to complete the VS Code installation.

    Installing VS Code with the Anaconda installer requires an internet connection. Offline users may be able to find an offline VS Code installer from Microsoft.

    The installer finishes and displays – Thank you for installing Anaconda<3>! Close and open your terminal window for the installation to take effect, or you can enter the command source ~/.bashrc.

    After your installation is complete, verify it by opening Anaconda Navigator, a program that is included with Anaconda – open a Terminal window and type anaconda-navigator. If Navigator opens, you have successfully installed Anaconda.

    You can find some known issues while installing Anaconda and their solutions in the following link: https://docs.anaconda.com/anaconda/user-guide/troubleshooting/
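
    The integrity check at the start of this section can also be done from Python on any operating system, which is handy where md5sum or sha256sum are not available. This is a sketch using the standard hashlib module; file_digest is a helper name chosen for this example, and the small temporary file only stands in for the real installer. You still need to compare the result with the checksum published for your installer.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large installers do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file standing in for the downloaded installer.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_sha256 = file_digest(demo_path)
os.remove(demo_path)
print(demo_sha256)
```

    For the real check, pass the path of the downloaded installer instead of the temporary file.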

    How to install a new Python library in Anaconda?

    Most of the Python libraries/packages are preinstalled with the Anaconda Distribution, which you can verify by typing the following command in an Anaconda Prompt:

    conda list

    Figure 2.5: Anaconda Prompt

    If you need a Python package that is not in the preceding list, follow these steps. In the same Anaconda Prompt terminal, type conda install followed by the name of the package.

    For example, if you want to install the scipy package, just type conda install scipy, press Enter, and then enter y to continue.

    A second recommended approach to installing a new package in Anaconda is to search Google for conda install followed by the package name, and then go to the first search result, as follows:

    In Google, I search for a package, for example imageio, by typing conda install imageio.

    Go to the first search result; this will open the Anaconda official site showing the installers of the searched package. In our example, it is like https://anaconda.org/menpo/imageio

    Now copy the text under – To install this package with conda run: and paste it in the Anaconda Prompt. In our case, the text is: conda install -c menpo imageio
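
    Besides reading the output of conda list, you can check from Python whether a package can be imported before deciding to install it. The sketch below uses the standard importlib module; is_installed is a helper name made up for this example, and it assumes the import name matches the package name, which is true for scipy and imageio but not for every package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None


for name in ("json", "scipy", "imageio"):
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing, try: conda install", name)
```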

    Open your notebook – Jupyter

    After installing Anaconda, the next step is to open the notebook, an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. To open it, launch Anaconda Navigator and click the Launch button under the Jupyter Notebook icon, or just type Jupyter Notebook in the Windows search bar and select it, as shown in the following figure 2.6:

    Figure 2.6: Windows search bar

    Once you select it, a browser window (your system's default browser; IE in this example) will open showing the notebook, as shown in the following figure 2.7:

    Figure 2.7: Browser window

    Know your notebook

    Once your notebook is opened in the browser, click on the New dropdown and select the default first option – Python 3 as shown in the following figure 2.8:

    Figure 2.8: Dropdown menu

    After clicking on Python 3 option, a new tab will be opened containing the new untitled notebook, as shown in the following figure 2.9:

    Figure 2.9: New tab

    Rename your notebook with a proper name by double-clicking on the Untitled text and then enter any new name (I have named it MyFirstNotebook) and click Rename (refer to the following figure 2.10):

    Figure 2.10: Rename

    The preceding step will rename your notebook. Now it's time to run your first Python program in your first notebook. We will print a greeting message in Python for this purpose. In the cell (text bar), just type any welcome message inside a print() call, as shown in the following figure 2.11:

    Figure 2.11: Welcome message
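
    The figure is not reproduced here, so the following is a stand-in for such a cell. The exact greeting text is an assumption; any message of your own works the same way.

```python
# A hypothetical welcome message; replace the text with anything you like.
message = "Welcome to my first notebook!"
print(message)
```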

    In the above cell, we are printing a string in Python 3. To run this program, simply press the Shift + Enter keys together or click on the Play button just below the cell column (refer to the following figure 2.12):

    Figure 2.12: Play button

    Once you run the cell, your program will run and give you the output, as shown just below the cell in the following figure 2.13:

    Figure 2.13: Output

    Congrats! You have successfully run your first program in Python 3. This is just one line of code using plain English text. Let's explore the simplicity of Python a bit more by doing some mathematical calculations.

    Let's add two numbers by entering FirstNumber + SecondNumber in a cell and then running it, as shown in the following figure 2.14:

    Figure 2.14: Simple calculation
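
    As the figure is not shown here, a cell along these lines illustrates the calculation. The two values are arbitrary examples, not the numbers from the book's figure.

```python
# Two example numbers; any values will do.
first_number = 10
second_number = 20

result = first_number + second_number
print(result)
```

    In a notebook you can also type first_number + second_number as the last line of a cell; Jupyter displays the value of the last expression without needing print.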

    Quite interesting, right! Let's move ahead and ask the user to input numbers and let Python do the homework. In the following example, you need
