
Breast Cancer Classification Using Support Vector Machine (SVM)

Adebola Lamidi
Towards Data Science
8 min read · Nov 22, 2018

Background:

Breast cancer is the most common cancer among women in the world. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area.

Early diagnosis significantly increases the chances of survival. The key challenge in detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). A tumor is considered malignant if its cells can grow into surrounding tissues or spread to distant areas of the body. A benign tumor does not invade nearby tissue or spread to other parts of the body the way cancerous tumors can, but benign tumors can still be serious if they press on vital structures such as blood vessels or nerves.

Machine learning techniques can dramatically improve the level of diagnosis in breast cancer. Research shows that experienced physicians can detect cancer with 79% accuracy, while 91% (sometimes up to 97%) accuracy can be achieved using machine learning techniques.

Project Task

In this study, my task is to classify tumors as malignant (cancerous) or benign (non-cancerous) using features obtained from several cell images.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Attribute Information:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

  1. Radius (mean of distances from center to points on the perimeter)
  2. Texture (standard deviation of gray-scale values)
  3. Perimeter
  4. Area
  5. Smoothness (local variation in radius lengths)
  6. Compactness (perimeter² / area - 1.0)
  7. Concavity (severity of concave portions of the contour)
  8. Concave points (number of concave portions of the contour)
  9. Symmetry
  10. Fractal dimension (“coastline approximation” - 1)

Loading Python Libraries and Breast Cancer Dataset
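A minimal sketch of this step, assuming the Wisconsin breast cancer dataset that ships with scikit-learn and the usual pandas/NumPy/seaborn stack (variable names below are illustrative choices, not the original notebook's):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Wisconsin breast cancer dataset bundled with scikit-learn
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
```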

Let’s view the data in a dataframe
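One way to do this is to stack the feature matrix and the target column side by side:

```python
# Combine the 30 feature columns and the 0/1 target into one DataFrame
df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                         columns=np.append(cancer['feature_names'], ['target']))
df_cancer.head()
```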

Features (Columns) breakdown
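For example:

```python
print(cancer['feature_names'])  # the 30 numeric predictors
df_cancer.info()                # column dtypes and non-null counts
```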

Visualize the relationship between our features
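A pairplot over a few of the mean features, colored by diagnosis, is one way to sketch this:

```python
# Pairwise scatter plots of selected features, colored by the 0/1 diagnosis
sns.pairplot(df_cancer, hue='target',
             vars=['mean radius', 'mean texture', 'mean perimeter',
                   'mean area', 'mean smoothness'])
plt.show()
```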

Let’s check the correlation between our features
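For example, with a seaborn heatmap:

```python
# Correlation matrix of all features, annotated with the coefficients
plt.figure(figsize=(20, 10))
sns.heatmap(df_cancer.corr(), annot=True)
plt.show()
```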

There is a strong correlation between mean radius and mean perimeter, as well as mean area and mean perimeter

Let’s start by talking about modeling in Data Science.

What do we mean when we say “Modeling” ?

If we have lived in a particular place and traveled around it for long enough, we probably have a good understanding of commute times in our area. For example, we have traveled to work or school using some combination of the metro, buses, trains, Ubers, taxis, carpools, walking, and biking.

All humans naturally model the world around them.

Over time, our observations about transportation have built up a mental dataset and a mental model that helps us predict what traffic will be like at various times and locations. We probably use this mental model to help plan our days, predict arrival times, and many other tasks.

  • As data scientists, we attempt to make our understanding of relationships between different quantities more precise by using data and mathematical/statistical structures.
  • This process is called modeling.
  • Models are simplifications of reality that help us to better understand that which we observe.
  • In a data science setting, models generally consist of a dependent variable (or output) of interest and one or more independent variables (or inputs) believed to influence the dependent variable.

Model-based inference

  • We can use models to conduct inference.
  • Given a model, we can better understand relationships between an independent variable and the dependent variable or between multiple independent variables.

An example of where inference from a mental model would be valuable is:

Determining what times of the day we work best or get tired.

Prediction

  • We can use a model to make predictions, or to estimate a dependent variable’s value given at least one independent variable’s value.
  • Predictions can be valuable even if they are not exactly right.
  • Good predictions are extremely valuable for a wide variety of purposes.

An example of where prediction from a mental model could be valuable:

Predicting how long it will take to get from point A to point B.

What is the difference between model prediction and inference?

  • Inference is judging what relationship, if any, exists between the data and the output.
  • Prediction is making guesses about future scenarios based on data and a model constructed on that data.

In this project, we will be working with a machine learning model called the Support Vector Machine (SVM).

Introduction to Classification Modeling: Support Vector Machine (SVM)

What is a Support Vector Machine (SVM)?

A Support Vector Machine (SVM) is a binary linear classifier whose decision boundary is explicitly constructed to minimize generalization error. It is a very powerful and versatile machine learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection.

SVM is well suited for classification of complex but small- or medium-sized datasets.

How does SVM classify?

It’s easiest to build intuition for SVM by starting with the special case of linearly separable classification.

If the observations are “linearly separable”, SVM fits the “decision boundary” defined by the largest margin between the closest points of each class. This is commonly called the “maximum margin hyperplane” (MMH).
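As a quick illustration of the maximum margin idea (a toy example, not part of the breast cancer analysis), a linear SVM on a separable 2-D problem exposes the points that pin down the margin:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two well-separated 2-D clusters: a linearly separable toy problem
X_toy, y_toy = make_blobs(n_samples=40, centers=2, random_state=6)

# A large C approximates a hard margin
clf = SVC(kernel='linear', C=1000)
clf.fit(X_toy, y_toy)

# The support vectors are the closest points that define the MMH
print(clf.support_vectors_)
```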

The advantages of support vector machines are:

  • Effective in high dimensional spaces.
  • Still effective in cases where the number of dimensions is greater than the number of samples.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

  • If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see “Scores and probabilities” in the scikit-learn documentation).

Now that we have a better understanding of modeling and the Support Vector Machine (SVM), let’s start training our predictive model.

Model Training

From our dataset, let’s create the target vector and predictor matrix

  • “y” = the feature we are trying to predict (the output). Here we are trying to predict whether the “target” is cancerous (malignant) or not (benign), so we use the “target” column.
  • “X” = the predictors, i.e. the remaining columns (mean radius, mean texture, mean perimeter, mean area, mean smoothness, etc.), as sketched below.
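```python
# Target vector: in scikit-learn's encoding, 0 = malignant and 1 = benign
y = df_cancer['target']

# Predictor matrix: every column except the target
X = df_cancer.drop(['target'], axis=1)
```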

Create the training and testing data

Now that we’ve assigned values to our “X” and “y”, the next step is to import the Python library that will help us split our dataset into training and testing data.

  • Training data = the subset of our data used to train our model.
  • Testing data = the subset of our data that the model hasn’t seen before (We will be using this dataset to test the performance of our model).

Let’s split our data using 80% for training and the remaining 20% for testing.
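With scikit-learn this is a single call; the random_state seed here is an arbitrary choice for reproducibility:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=5)
```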

Import Support Vector Machine (SVM) Model
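The classifier comes from scikit-learn’s svm module:

```python
from sklearn.svm import SVC
```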

Now, let’s train our SVM model with our “training” dataset.
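A sketch using default hyperparameters:

```python
svc_model = SVC()
svc_model.fit(X_train, y_train)
```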

Let’s use our trained model to make a prediction using our testing data
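```python
# Predict labels for the held-out test set
y_predict = svc_model.predict(X_test)
```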

The next step is to check the accuracy of our prediction by comparing it to the output we already have (y_test). We are going to use a confusion matrix for this comparison.

Let’s create a confusion matrix for our classifier’s performance on the test dataset.
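Using scikit-learn’s metrics module:

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_predict)
print(cm)
```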

Let’s visualize our confusion matrix as a heatmap
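```python
# Annotated heatmap of the confusion matrix counts
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
```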

As we can see, our model did not do a good job with its predictions: it predicted that 48 healthy patients have cancer, and we achieved only 34% accuracy!

Let’s explore ways to improve the performance of our model.

Improving our Model

The first improvement we will try is normalizing our data.

Data normalization is a feature scaling process that brings all values into the range [0, 1]:

X’ = (X - X_min) / (X_max - X_min)

Normalize Training Data
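A manual min-max scaling of the training set might look like this:

```python
# Per-column minimum and range of the training data
min_train = X_train.min()
range_train = (X_train - min_train).max()

# Rescale every training value into [0, 1]
X_train_scaled = (X_train - min_train) / range_train
```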

Normalize Testing Data
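To avoid leaking information from the test set, this sketch reuses the training set’s minimum and range (a standard convention, assumed here rather than taken from the original cells):

```python
# Scale the test set with statistics computed on the training set
X_test_scaled = (X_test - min_train) / range_train
```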

Now, let’s train our SVM model on the scaled (normalized) datasets.
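```python
svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)
```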

Prediction with Scaled dataset
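```python
y_predict = svc_model.predict(X_test_scaled)
```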

Confusion Matrix on Scaled dataset
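```python
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True, fmt='d')
plt.show()
```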

Our predictions got a lot better, with only one false prediction (cancer predicted instead of healthy). We achieved 98% accuracy!

Summary:

This article took us through the journey of explaining what “modeling” means in data science and the difference between model prediction and inference, introducing the Support Vector Machine (SVM) along with its advantages and disadvantages, training an SVM model to make accurate breast cancer classifications, improving the model’s performance, and testing its accuracy using a confusion matrix.

If you want all the codes for this project in a Jupyter Notebook format, you can download them from my GitHub repository.

Sources:

  1. http://scikit-learn.org/stable/modules/svm.html
  2. http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf
  3. http://pyml.sourceforge.net/doc/howto.pdf
  4. https://www.bcrf.org/breast-cancer-statistics
  5. https://www.cancer.org/cancer/breast-cancer/about/what-is-breast-cancer.html
