Learn Data Mining Through Excel: A Step-by-Step Approach for Understanding Machine Learning Methods
By Hong Zhou
()
About this ebook
Use popular data mining techniques in Microsoft Excel to better understand machine learning methods.
Software tools and programming language packages take data input and deliver data mining results directly, presenting no insight on working mechanics and creating a chasm between input and output. This is where Excel can help.
Excel allows you to work with data in a transparent manner. When you open an Excel file, data is visible immediately and you can work with it directly. Intermediate results can be examined while you are conducting your mining task, offering a deeper understanding of how data is manipulated and results are obtained. These are critical aspects of the model construction process that are hidden in software tools and programming language packages.
This book teaches you data mining through Excel. You will learn how Excel has an advantage in data mining when the data sets are not too large. It can give you a visual representation of data mining, building confidence in your results. You will go through every step manually, which offers not only an active learning experience, but teaches you how the mining process works and how to find the internal hidden patterns inside the data.
What You Will Learn
- Comprehend data mining using a visual step-by-step approach
- Build on a theoretical introduction of a data mining method, followed by an Excel implementation
- Unveil the mystery behind machine learning algorithms, making a complex topic accessible to everyone
- Become skilled in creative uses of Excel formulas and functions
- Obtain hands-on experience with data mining and Excel
Who This Book Is For
Anyone who is interested in learning data mining or machine learning, especially data science visual learners and people skilled in Excel, who would like to explore data science topics and/or expand their Excel skills. A basic or beginner level understanding of Excel is recommended.
Related to Learn Data Mining Through Excel
Related ebooks
Mastering Excel Through Projects: A Learn-by-Doing Approach from Payroll to Crypto to Data Analysis Rating: 0 out of 5 stars0 ratingsDynamic SQL: Applications, Performance, and Security in Microsoft SQL Server Rating: 0 out of 5 stars0 ratingsAdvanced Analytics with Transact-SQL: Exploring Hidden Patterns and Rules in Your Data Rating: 0 out of 5 stars0 ratingsPractical MATLAB: With Modeling, Simulation, and Processing Projects Rating: 0 out of 5 stars0 ratingsLearn Java with Math: Using Fun Projects and Games Rating: 0 out of 5 stars0 ratingsData Scientist Roadmap Rating: 5 out of 5 stars5/5Mastering Excel: Mastering Software Series, #1 Rating: 0 out of 5 stars0 ratingsExcel 2016 Hands-On Lab Rating: 0 out of 5 stars0 ratingsEssential Excel 2019: A Step-By-Step Guide Rating: 0 out of 5 stars0 ratingsHands-on Scikit-Learn for Machine Learning Applications: Data Science Fundamentals with Python Rating: 0 out of 5 stars0 ratingsPython for Data Science Rating: 0 out of 5 stars0 ratingsMachine Learning with PySpark: With Natural Language Processing and Recommender Systems Rating: 0 out of 5 stars0 ratingsA Python Data Analyst’s Toolkit: Learn Python and Python-based Libraries with Applications in Data Analysis and Statistics Rating: 0 out of 5 stars0 ratingsBeginning MATLAB and Simulink: From Novice to Professional Rating: 0 out of 5 stars0 ratingsHandbook of Human Centric Visualization Rating: 0 out of 5 stars0 ratingsAutomated Theorem Proving in Software Engineering Rating: 0 out of 5 stars0 ratingsMATLAB Optimization Techniques Rating: 0 out of 5 stars0 ratingsPython for Probability, Statistics, and Machine Learning Rating: 0 out of 5 stars0 ratingsSelf-Service Data & Analytics Third Edition Rating: 0 out of 5 stars0 ratingsBeginning Security with Microsoft Technologies: Protecting Office 365, Devices, and Data Rating: 0 out of 5 stars0 ratingsData Governance and Data Management: Contextualizing Data Governance Drivers, Technologies, and Tools Rating: 0 out of 5 stars0 ratingsEmbedded Software Design and Programming of Multiprocessor System-on-Chip: Simulink and System C Case Studies Rating: 0 out of 5 stars0 ratingsMongoDB Recipes: With Data Modeling and Query Building Strategies Rating: 0 out of 5 stars0 ratingsBusiness Aviation Leadership: From the Traits to the Trenches Rating: 0 out of 5 stars0 ratingsBeginning T-SQL Rating: 0 out of 5 stars0 ratingsLatent Variable Models and Factor Analysis: A Unified Approach Rating: 0 out of 5 stars0 ratingsAgile Software Development: Best Practices for Large Software Development Projects Rating: 0 out of 5 stars0 ratingsOracle Enterprise Manager 12c Command-Line Interface Rating: 0 out of 5 stars0 ratingsPredictive Lead Scoring Standard Requirements Rating: 0 out of 5 stars0 ratings
Programming For You
Excel 101: A Beginner's & Intermediate's Guide for Mastering the Quintessence of Microsoft Excel (2010-2019 & 365) in no time! Rating: 0 out of 5 stars0 ratingsLearn to Code. Get a Job. The Ultimate Guide to Learning and Getting Hired as a Developer. Rating: 5 out of 5 stars5/5Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1 Rating: 5 out of 5 stars5/5Python Programming : How to Code Python Fast In Just 24 Hours With 7 Simple Steps Rating: 4 out of 5 stars4/5SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL Rating: 4 out of 5 stars4/5Learn PowerShell in a Month of Lunches, Fourth Edition: Covers Windows, Linux, and macOS Rating: 5 out of 5 stars5/5Coding All-in-One For Dummies Rating: 4 out of 5 stars4/5HTML & CSS: Learn the Fundaments in 7 Days Rating: 4 out of 5 stars4/5HTML in 30 Pages Rating: 5 out of 5 stars5/5Grokking Algorithms: An illustrated guide for programmers and other curious people Rating: 4 out of 5 stars4/5Linux: Learn in 24 Hours Rating: 5 out of 5 stars5/5C# Programming from Zero to Proficiency (Beginner): C# from Zero to Proficiency, #2 Rating: 0 out of 5 stars0 ratingsC Programming For Beginners: The Simple Guide to Learning C Programming Language Fast! Rating: 5 out of 5 stars5/5Coding with JavaScript For Dummies Rating: 0 out of 5 stars0 ratingsProgramming Arduino: Getting Started with Sketches Rating: 4 out of 5 stars4/5Python QuickStart Guide: The Simplified Beginner's Guide to Python Programming Using Hands-On Projects and Real-World Applications Rating: 0 out of 5 stars0 ratingsPython: Learn Python in 24 Hours Rating: 4 out of 5 stars4/5Mastering Windows PowerShell Scripting Rating: 4 out of 5 stars4/5JavaScript All-in-One For Dummies Rating: 5 out of 5 stars5/5Python: For Beginners A Crash Course Guide To Learn Python in 1 Week Rating: 4 out of 5 stars4/5
Reviews for Learn Data Mining Through Excel
0 ratings0 reviews
Book preview
Learn Data Mining Through Excel - Hong Zhou
© Hong Zhou 2020
H. ZhouLearn Data Mining Through Excelhttps://doi.org/10.1007/978-1-4842-5982-5_1
1. Excel and Data Mining
Hong Zhou¹
(1)
University of Saint Joseph, West Hartford, CT, USA
Let’s get right to the topic. Why do we need to learn Excel in our data mining endeavor? It is true that there are quite a few outstanding data mining software tools such as RapidMiner and Tableau that make the mining process easy and straightforward. In addition, programming languages Python and R have a large number of reliable packages dedicated to various data mining tasks. What is the purpose of studying data mining or machine learning through Excel?
Why Excel?
If you are already an experienced data mining professional, I would say that you are asking the right question and probably you should not read this book. However, if you are a beginner in data mining, or a visual learner, or want to understand the mathematical background behind some popular data mining techniques, or an educator, then this book is right for you, and probably is the first book you should read before you start your data mining journey.
Excel allows you to work with data in a transparent manner, meaning when an Excel file is opened, the data is visible immediately and every step of data processing is also visible. Intermediate results are contained in the Excel worksheet and can be examined while you are conducting your mining task. This allows you to obtain a deep and clear understanding of how the data are manipulated and how the results are obtained. Other software tools and programming languages hide critical aspects of the model construction process. For most data mining projects, the goal is to find the internal hidden patterns inside the data. Therefore, hiding the detailed process is beneficial to the users of the tools or packages. But it is not helpful for beginners, visual learners, or those who want to understand how the mining process works. Let me use k-nearest neighbors method (K-NN) to illustrate the learning differences between RapidMiner, R, and Excel. Before we do that, we need to understand several terminologies in data mining.
There are two types of data mining techniques: supervised and unsupervised. Supervised methods require the use of a training dataset to train
the software programs or algorithms (such programs or algorithms are often referred to as machines) first. Programs are trained to reach an optimal state called a model. This is why a training process is also called modeling. Data mining methods can also be categorized into parametric and nonparametric methods. For parametric methods, a model is just a set of parameters or rules obtained through the training process that are believed to allow the programs to work well with the training dataset. Nonparametric methods do not generate a set of parameters. Instead, they dynamically evaluate the incoming data based on the existing dataset. You may be confused by such definitions at this time. They will make sense soon.
What is a training dataset? In a training dataset, the target variable (also called label, target, dependent variable, outcome variable, response), the value of which to be predicted, is given or known. The value of the target variable depends on the values of other variables which are usually called attributes, predictors, or independent variables. Based on the attribute values, a supervised data mining method computes (or so-called predicts) the value of the target variable. Some computed target values might not match the known target values in the training dataset. A good model indicates an optimal set of parameters or rules that can minimize the mismatches.
A model is usually constructed to work on future datasets with unknown target values in a supervised data mining method. Such future datasets are commonly called scoring datasets. In an unsupervised data mining method , however, there is no training dataset and the model is an algorithm that can directly be applied on the scoring datasets. K-nearest neighbors method is a supervised data mining technique.
Suppose we want to predict if a person is likely to accept a credit card offer based on the person’s age, gender, income, and number of credit cards they already have. The target variable is the response to the credit card offer (assume it is either Yes or No), while age, gender, income, and number of existing credit cards are the attributes. In the training dataset, all variables including both the target and attributes are known. In such a scenario, a K-NN model is constructed through the use of the training dataset. Based on the constructed model, we can predict the responses to the credit card offer of people whose information is stored in the scoring dataset.
In RapidMiner, one of the best data mining tools, the prediction process is as follows: retrieve both the training data and scoring data from the repository ➤ set role for the training data ➤ apply the K-NN operator on the training data to construct the model ➤ connect the model and the scoring data to the Apply Model operator. That’s it! You can now execute the process and the result is obtained. Yes, very straightforward. This is shown in Figure 1-1. Be aware that there is no model validation in this simple process.
../images/494428_1_En_1_Chapter/494428_1_En_1_Fig1_HTML.jpgFigure 1-1
K-NN model in RapidMiner
Applying K-NN method is very simple in R, too. After loading the library class
, read the training data and scoring data and make use of the K-NN function, and by then we have finished our job: ready to view our result. This is demonstrated in Figure 1-2. Note that lines starting with #
are comments.
Figure 1-2
K-NN in R
The knowledge you have gained from the preceding tasks is enough to just be able to apply the data mining method K-NN. But if you are trying to understand, step by step, why and how K-NN works, you will need a lot more information. Excel can offer you the opportunity to go through a step-by-step analysis process on a dataset during which you can develop a solid understanding of the K-NN algorithm. With this solid understanding, you can then be more proficient in using other powerful tools or programming languages. Most importantly, you will have a better understanding of the quality and value of your data mining results. You will see that in later chapters.
Of course, Excel is much more limited in data mining compared to R, Python, and RapidMiner. Excel can only work with data up to a smaller size limit. Meanwhile, some data mining techniques are too complicated to be practiced through Excel. Nonetheless, Excel provides us direct and visual understanding of the data mining mechanisms. In addition, Excel is naturally suitable for data preparation.
Today, because of the software tools and other packages, most effort in a data mining task is spent on understanding the task (including the business understanding and data understanding), preparing the data, and presenting the results. Less than 10% of the effort is spent on the modeling process. The process of preparing the data for modeling is called data engineering . Excel has an advantage on data engineering when the datasets are not too large because it can give us a visual representation of data engineering, which allows us to be more confident in our data preparation process.
As an experienced educator , I realize that students can better develop a deep understanding of data mining methods if these methods are also explained through step-by-step instructions in Excel. Studying through Excel unveils the mystery behind data mining or machine learning methods and makes students more confident in applying these methods.
Did I just mention machine learning? Yes, I did. Machine learning is another buzz phrase today. What is machine learning ? What is the difference between data mining and machine learning?
Most efforts to differentiate data mining and machine learning are not successful because data mining and machine learning cannot be clearly separated per se. At this moment, I would suggest that we treat them the same. But if I must tell the difference between data mining and machine learning, I would say that machine learning is more on supervised methods, while data mining includes both supervised and unsupervised methods.
Prepare Some Excel Skills
There are quite some Excel skills to learn in this book. I will explain some of them in detail when we need to use them. However, there are several fundamental Excel skills and functions that we need to be familiar with before we start talking about data mining.
Formula
Formula is the most important feature of Excel. Writing a formula is like writing a programming statement. In Excel, a formula always starts with an equal sign (=
without quotation marks).
Upon opening an Excel file, we are greeted with a table-like worksheet. Yes, every worksheet is a huge table. One reason why Excel is naturally suitable for data storage, analysis, and mining is because data are automatically arranged in a table format in Excel. Each cell in the big table has a name or so-called reference. By default, each column is labeled by an alphabet, while each row is labeled with a number. For example, the very first cell at the top-left corner is cell A1, that is, column A and row 1. The content in a cell, whatever it is, is represented by the cell reference.
Enter number 1 in cell A1. The value of cell A1 is 1 and A1 represents 1 at this moment.
Enter the formula =A1*10
(without the double quotation marks) in cell B1 and hit the Enter key. Note that the formula starts with =
. Be aware that this is the only time a formula is presented inside a pair of double quotation marks in this book. From now on, all formulas are presented directly without quotation marks.
Enter the text A1 * 10
in cell C1. Because the text does not start with =
, it is not a formula.
Our worksheet looks like Figure 1-3.
../images/494428_1_En_1_Chapter/494428_1_En_1_Fig3_HTML.jpgFigure 1-3
Excel formula
Autofill or Copy
Autofill is another critical feature of Excel which makes Excel capable of working with a relatively large dataset. Autofill is also called copy
by many people.
Let’s learn autofill by the following experiment:
1.
Enter 1 in cell A1.
2.
Enter 2 in cell A2.
3.
Select both cells A1 and A2.
4.
Release the left mouse button.
5.
Move the mouse cursor to the right-below corner of cell A2 until the cursor becomes a black cross (shown in Figure 1-4).
../images/494428_1_En_1_Chapter/494428_1_En_1_Fig4_HTML.jpgFigure 1-4
Cursor becomes a black cross
6.
Press down the left mouse button and drag down to cell A6.
The cells A1:A6 are automatically filled with numbers 1, 2, 3, 4, 5, and 6. This process is called autofill. Some people call it copy
, too. But more precisely, this process is autofill.
Let’s conduct another experiment:
1.
Select cell B1 (make sure that B1 still has the formula =A1*10). Lift up the left mouse button.
2.
Move the mouse cursor to the left-down corner of cell B1 until the cursor becomes a black cross.
3.
Drag down the mouse cursor to cell B6. Our worksheet looks like