Learning Model Building in Scikit-learn
Last Updated :
27 May, 2025
Building machine learning models from scratch can be complex and time-consuming. Scikit-learn is an open-source Python library that makes machine learning more accessible. It provides a straightforward, consistent interface for a variety of tasks such as classification, regression, clustering, data preprocessing and model evaluation. Whether we're new to machine learning or have some experience, it makes it easy to build reliable models quickly. In this article, we'll see the important features of Scikit-learn and the steps to get started with it.
Installing and Using Scikit-learn
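Before we start building models we need to install Scikit-learn. The minimum supported Python version depends on the Scikit-learn release (Python 3.8 for some older releases, newer Python versions for recent ones). It depends on two important libraries, NumPy and SciPy, which pip installs automatically if they are missing.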
To install Scikit-learn run the following command:
pip install -U scikit-learn
This will download and install the latest version of Scikit-learn along with its dependencies. Let's look at the steps involved in building a model with the Scikit-learn library.
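Before moving on, a quick way to confirm that the installation worked is to import the package and print its version. This is a minimal check and not part of the original walkthrough; the variable name installed_version is just illustrative.

```python
import sklearn

# Print the installed version to confirm the package can be imported.
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```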
Step 1: Loading a Dataset
A dataset is a collection of data used to train and test machine learning models. It has two main parts:
- Features: Also called predictors or inputs, these are the variables that describe the data. There can be multiple features, represented as a feature matrix denoted as X. The list of all feature names is known as the feature names.
- Response: Also known as the target, label or output, this is the variable we want to predict. It is a single column represented as a response vector denoted as y. All the possible values taken by the response vector are termed target names.
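Scikit-learn includes some ready-to-use example datasets, such as the Iris and Digits datasets for classification tasks and the Diabetes and California Housing datasets for regression tasks. (The Boston Housing dataset found in older tutorials was removed in Scikit-learn 1.2.) Here we will be using the Iris dataset.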
- load_iris(): Loads the Iris dataset into the variable iris.
- Features and Targets: X contains the input data (features like petal length, width etc) and y contains the target values (species of the iris flower).
- Names: feature_names and target_names provide the names of the features and the target labels respectively.
- Inspecting Data: We print the feature names and target names, check the type of X and display the first 5 rows of the feature data to understand its structure.
Python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nType of X is:", type(X))
print("\nFirst 5 rows of X:\n", X[:5])
Output:
[Figure: Loading dataset]
Sometimes we need to work with our own custom data, in which case we load an external dataset. For this we can use the pandas library, which makes loading and manipulating datasets easy.
For details, you can refer to our article on How to import csv file in pandas?
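As a rough sketch of loading custom data with pandas, the example below reads a small CSV and separates the features from the response. The CSV content and the column names (sepal_length, sepal_width, species) are made up for illustration; with a real file we would pass its path to pd.read_csv instead of the in-memory text.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
6.2,2.9,versicolor
5.9,3.0,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

# Separate the feature matrix from the response vector.
X_custom = df.drop(columns=["species"])
y_custom = df["species"]

print("Feature matrix shape:", X_custom.shape)
print("Response values:", y_custom.tolist())
```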
Step 2: Splitting the Dataset
A model that is evaluated on the same data it was trained on will usually look better than it really is. To evaluate model performance fairly we split the data into two parts: the training set and the testing set.
The training set is used to teach the model to recognize patterns while the testing set helps us check how well the model performs on new, unseen data. This separation helps in detecting overfitting and gives a more accurate measure of how the model will work in real-world situations. In Scikit-learn the train_test_split function from the sklearn.model_selection module makes this easy.
Here we are splitting the Iris dataset so that 60% of the data is used for training and 40% for testing by setting test_size=0.4. Setting random_state=1 ensures that the split remains the same every time we run the code, which is helpful for reproducibility.
After splitting, we get four subsets:
- X_train and y_train: Features and target values used to train the model.
- X_test and y_test: Features and target values reserved for testing.
Python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
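As a variation on the split above, train_test_split also accepts a stratify argument that keeps the class proportions equal in both subsets. The sketch below is an optional extra, not part of the original steps, and uses separate variable names (with an _s suffix) so it does not overwrite the split used in the rest of the article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.4, random_state=1, stratify=y
)

# Count how many samples of each class ended up in each subset.
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print("Training class counts:", train_counts)
print("Testing class counts:", test_counts)
```

Because Iris has 50 samples per class, a stratified 60/40 split places the same number of samples from each species in the training set and in the testing set.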
Now let's check the shapes of the split data to confirm that both sets have the expected proportions, which avoids errors later in training or evaluation.
Python
print("X_train Shape:", X_train.shape)
print("X_test Shape:", X_test.shape)
print("Y_train Shape:", y_train.shape)
print("Y_test Shape:", y_test.shape)
Output:
[Figure: Shape of split data]
Step 3: Handling Categorical Data
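Machine learning algorithms require numerical input so handling categorical data correctly is important. If categorical variables are left as text, most Scikit-learn estimators will raise an error, and if the values are coded carelessly the algorithms may misinterpret their meaning, which leads to poor results. To avoid this we convert categorical data into numerical form using encoding techniques such as the following:
1. Label Encoding: It converts each category into a unique integer. Scikit-learn's LabelEncoder assigns the integers in sorted order of the categories, so a column with the categories 'cat', 'dog' and 'bird' is encoded as bird = 0, cat = 1 and dog = 2. Integer codes imply an order, so this kind of encoding suits categories with a meaningful order such as “Low”, “Medium” and “High”, provided the codes follow that order. Note that LabelEncoder is designed for target labels; for input features with a defined order, OrdinalEncoder lets us specify the order explicitly.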
- LabelEncoder(): It is initialized to create an encoder object that will convert categorical values into numerical labels.
- fit_transform(): This method first fits the encoder to the categorical data and then transforms the categories into corresponding numeric labels.
Python
from sklearn.preprocessing import LabelEncoder
categorical_feature = ['cat', 'dog', 'dog', 'cat', 'bird']
encoder = LabelEncoder()
encoded_feature = encoder.fit_transform(categorical_feature)
print("Encoded feature:", encoded_feature)
Output:
Encoded feature: [1 2 2 1 0]
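To see why 'bird' received the code 0 in the output above, we can inspect the mapping the encoder learned. The sketch below also shows OrdinalEncoder with an explicit category order; the 'Low'/'Medium'/'High' values are illustrative data, not taken from a real dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder sorts the categories, so the mapping is alphabetical.
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(['cat', 'dog', 'dog', 'cat', 'bird'])
print("Learned classes:", label_encoder.classes_)
print("Decoded again:", label_encoder.inverse_transform(codes))

# For ordered categories, OrdinalEncoder accepts an explicit order.
levels = np.array(['Low', 'High', 'Medium', 'Low']).reshape(-1, 1)
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
level_codes = ordinal_encoder.fit_transform(levels)
print("Ordinal codes:", level_codes.ravel())
```

The position of each entry in classes_ is its integer code, and inverse_transform converts the codes back to the original labels.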
2. One-Hot Encoding: It creates binary columns for each category where each column represents a category. For example if we have a column with values 'cat' 'dog' and 'bird' it will create three new columns one for each category where each row will have 1 in the column corresponding to its category and 0s in the others. This method is useful for categorical variables without any order ensuring that no numeric relationships are implied between the categories.
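- OneHotEncoder: It expects the input data to be a 2D array, i.e. each sample should be a row and each feature should be a column. That is why we reshape the list into a single column.
- OneHotEncoder(sparse_output=False): It creates an encoder object that converts categorical variables into binary columns. Setting sparse_output=False returns a regular NumPy array instead of a sparse matrix (this parameter is available from Scikit-learn 1.2 onwards).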
Python
from sklearn.preprocessing import OneHotEncoder
import numpy as np
categorical_feature = ['cat', 'dog', 'dog', 'cat', 'bird']
categorical_feature = np.array(categorical_feature).reshape(-1, 1)
encoder = OneHotEncoder(sparse_output=False)
encoded_feature = encoder.fit_transform(categorical_feature)
print("OneHotEncoded feature:\n", encoded_feature)
Output:
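Besides Label Encoding and One-Hot Encoding there are other techniques such as Mean Encoding (also called target encoding), which replaces each category with the mean of the target values observed for that category.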
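The idea behind mean encoding can be sketched in plain Python. This is a simplified illustration, not a Scikit-learn API: the target values below are made up, and every variable name is hypothetical.

```python
from collections import defaultdict

# Illustrative data: a categorical feature and a made-up binary target.
animals = ['cat', 'dog', 'dog', 'cat', 'bird']
target = [1, 0, 1, 1, 0]

# Collect the target values seen for each category.
values_by_category = defaultdict(list)
for animal, value in zip(animals, target):
    values_by_category[animal].append(value)

# Replace each category with the mean target value for that category.
category_means = {
    animal: sum(values) / len(values)
    for animal, values in values_by_category.items()
}
mean_encoded = [category_means[animal] for animal in animals]
print("Mean-encoded feature:", mean_encoded)
```

In practice the category means should be computed from the training data only, otherwise information about the test targets leaks into the features.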
Step 4: Training the Model
Now that our data is ready, it’s time to train a machine learning model. Scikit-learn has many algorithms with a consistent interface for training, prediction and evaluation. Here we’ll use Logistic Regression as an example.
Note: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only.
- log_reg = LogisticRegression(max_iter=200): Creates a logistic regression classifier object. The max_iter=200 argument sets the maximum number of iterations the solver may take to converge.
- log_reg.fit(X_train, y_train): Trains the logistic regression model by adjusting its parameters to best fit the training data.
Python
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
[Figure: Training using Logistic Regression]
Step 5: Make Predictions
Once trained, we use the model to make predictions on the test data X_test by calling the predict method. This returns the predicted labels y_pred.
- log_reg.predict: It uses the trained logistic regression model to predict labels for the test data X_test.
Python
y_pred = log_reg.predict(X_test)
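Besides hard labels, a logistic regression model can also report how confident it is in each prediction through predict_proba. The sketch below is an optional addition that repeats the loading, splitting and training steps so it can run on its own; the names probabilities and labels_from_proba are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

# predict_proba returns one column per class, in the order of log_reg.classes_.
probabilities = log_reg.predict_proba(X_test)
print("Shape:", probabilities.shape)
print("First row:", np.round(probabilities[0], 3))

# The predicted label is the class with the highest probability.
labels_from_proba = log_reg.classes_[np.argmax(probabilities, axis=1)]
```

Each row sums to 1, and taking the most probable class in each row gives the same labels as calling predict.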
Next we check how well our model is performing by comparing y_test and y_pred. Here we use the accuracy_score function from the metrics module, which returns the fraction of test samples that were labelled correctly.
Python
from sklearn import metrics
print("Logistic Regression model accuracy:", metrics.accuracy_score(y_test, y_pred))
Output:
Logistic Regression model accuracy: 0.9666666666666667
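Accuracy is a single number, so it can hide which classes the model confuses. As an optional extra, the sketch below prints a confusion matrix and a per-class report using functions from sklearn.metrics. It repeats the earlier steps so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=1
)
log_reg = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1-score for each species.
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Off-diagonal entries in the matrix count the test samples that were assigned to the wrong species.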
Now we want our model to make predictions on new sample data. The sample input is passed in the same way as any feature matrix: one row per sample, with the four feature values in the same order as the training data. Here we use sample = [[3, 5, 4, 2], [2, 3, 5, 4]].
Python
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = log_reg.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
Output:
Predictions: [np.str_('virginica'), np.str_('virginica')]
Features of Scikit-learn
Scikit-learn is used because it makes building machine learning models straightforward and efficient. Here are some important reasons:
- Ready-to-Use Tools: It provides built-in functions for common tasks like data preprocessing, training models and making predictions. This saves time by avoiding the need to code algorithms from scratch.
- Easy Model Evaluation: With tools like cross-validation and performance metrics it helps to measure how well our model works and identify areas for improvement.
- Wide Algorithm Support: It offers many popular machine learning algorithms including classification, regression and clustering which gives us flexibility to choose the right model for our problem.
- Smooth Integration: Built on top of important Python libraries like NumPy and SciPy so it fits into our existing data analysis workflow.
- Simple and Consistent Interface: The same straightforward syntax works across different models, which makes it easier to learn and to switch between algorithms.
- Model Tuning Made Easy: Tools like grid search help us fine-tune our model’s settings to improve accuracy without extra hassle.
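Several of the features listed above can be seen together in a short sketch: the shared estimator interface, cross-validation and grid search. The choice of KNeighborsClassifier as the second model and the candidate n_neighbors values are arbitrary illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The shared estimator interface lets us evaluate different models in one loop.
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    cv_means[name] = scores.mean()
    print(name, "mean accuracy:", round(cv_means[name], 3))

# Grid search tries each candidate setting with cross-validation.
param_grid = {"n_neighbors": [1, 3, 5, 7]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)
print("Best n_neighbors:", grid.best_params_["n_neighbors"])
```

Swapping in another classifier only requires changing the entries of the models dictionary, because every estimator is trained and evaluated through the same calls.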
Benefits of using Scikit-learn
- User-Friendly: Scikit-learn's consistent and simple interface makes it accessible for beginners and efficient for experts.
- Time-Saving: Pre-built tools and algorithms reduce development time, which allows us to focus more on solving problems than on coding details.
- Better Model Performance: Easy-to-use tuning and evaluation tools help in improving model accuracy and reliability.
- Flexible and Scalable: It supports a wide range of algorithms and integrates smoothly with other Python libraries, which makes it suitable for projects of many sizes.
- Strong Community Support: A large, active community ensures regular updates, extensive documentation and plenty of resources to help when we get stuck.
With its accessible tools and reliable performance, Scikit-learn makes machine learning practical and achievable for everyone.