Logistic Regression using PySpark Python
Last Updated: 24 Apr, 2025
In this tutorial series, we are going to cover Logistic Regression using PySpark. Logistic Regression is one of the basic ways to perform classification (don't be confused by the word "regression"; despite its name, it is a classification method). Some common examples of classification are spam detection, churn prediction and predicting whether a patient has a disease.
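Under the hood, logistic regression passes a weighted sum of the features through the sigmoid function to produce a probability between 0 and 1, which is then thresholded (typically at 0.5) to get a class label. Below is a minimal sketch of that idea in plain Python; the weights, bias and feature values are made-up illustrative numbers, not something taken from the Titanic data.
Python3
import math

def sigmoid(z):
    # Squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, bias and feature values for illustration only
weights = [0.8, -1.2]
bias = 0.1
features = [2.0, 1.5]

z = sum(w * x for w, x in zip(weights, features)) + bias
probability = sigmoid(z)
prediction = 1 if probability >= 0.5 else 0
print(probability, prediction)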
Loading the DataFrame
We will be using the Titanic dataset, which has the columns PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked. We have to predict whether a passenger survived or not using a Logistic Regression model. To get started, open a new notebook and follow the steps in the code below:
Python3
# Starting the Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Titanic').getOrCreate()
# Reading the data
df = spark.read.csv('Titanic.csv', inferSchema=True, header=True)
# Showing the data
df.show()
Output:
Showing the data
df.printSchema()
Output:
Schema of the data
df.columns
Output:
Columns in the data
Removing Unneeded Columns and NULL Values
The next step is to drop the rows that contain null values, as shown in the output above. We also do not need the columns PassengerId, Name, Ticket and Cabin, as they are not required to train and test the model, so we keep only the remaining columns.
Python3
# Selecting the columns which are required
# to train and test the model.
rm_columns = df.select(['Survived', 'Pclass', 'Sex', 'Age',
                        'SibSp', 'Parch', 'Fare', 'Embarked'])
# Drops the data having null values
result = rm_columns.na.drop()
# Again showing the data
result.show()
Output:
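If you want to verify which columns actually contained null values (and how many), here is a minimal sketch, assuming df is the DataFrame loaded above:
Python3
from pyspark.sql.functions import col, count, when

# Count the null values in every column of df
null_counts = df.select([
    count(when(col(c).isNull(), c)).alias(c) for c in df.columns
])
null_counts.show()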
Converting String Columns to Numeric Columns
The next task is to convert the string columns Sex and Embarked to numeric columns, because VectorAssembler cannot vectorize string data directly. We first index each string column with StringIndexer and then one-hot encode the resulting index with OneHotEncoder.
Python3
# Importing the required libraries
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder

# Converting the Sex column
sexIdx = StringIndexer(inputCol='Sex', outputCol='SexIndex')
sexEncode = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

# Converting the Embarked column
embarkIdx = StringIndexer(inputCol='Embarked', outputCol='EmbarkIndex')
embarkEncode = OneHotEncoder(inputCol='EmbarkIndex', outputCol='EmbarkVec')

# Vectorizing the data into a new column "features"
# which will be our input/feature column
assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'Age',
                                       'SibSp', 'Parch',
                                       'Fare', 'EmbarkVec'],
                            outputCol='features')
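If you are curious about what StringIndexer actually does, you can fit a single indexer and inspect its labels. This is just an illustrative check, using the result DataFrame from the earlier step; the exact index assigned to each category depends on its frequency in the data.
Python3
# Fit only the Sex indexer to see how categories map to indices
sex_model = sexIdx.fit(result)
print(sex_model.labels)   # e.g. ['male', 'female'] -> indices 0 and 1
indexed = sex_model.transform(result)
indexed.select('Sex', 'SexIndex').show(5)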
Now we need a Pipeline to chain these stages together, and we import and create the Logistic Regression model.
Python3
# Importing Pipeline and Model
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression(featuresCol='features', labelCol='Survived')

# Creating the pipeline
pipe = Pipeline(stages=[sexIdx, embarkIdx,
                        sexEncode, embarkEncode,
                        assembler, log_reg])
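Note that each indexer must come before its encoder, and the assembler must come before the model, because every stage consumes columns produced by the previous ones. As a small optional check, you can list the stage order on the pipe object defined above:
Python3
# Listing the stages in the order the pipeline will run them
for stage in pipe.getStages():
    print(type(stage).__name__)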
After building the pipeline, we split the cleaned data into training and testing sets to train and evaluate the model.
Python3
# Splitting the data into train and test
train_data, test_data = result.randomSplit([0.7, 0.3])
# Fitting the pipeline on training data
fit_model = pipe.fit(train_data)
# Storing the results on test data
results = fit_model.transform(test_data)
# Showing the results
results.show()
Output:
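results.show() prints every column, which can be hard to read. As an optional step, here is a small sketch that looks at only the label and the prediction-related columns produced by the pipeline above:
Python3
# Inspecting only the label and the prediction-related columns
results.select('Survived', 'prediction', 'probability').show(5, truncate=False)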
Model evaluation using ROC-AUC
Transforming the test data adds the extra columns rawPrediction, probability and prediction to the results. After getting the results, we compute the AUC (Area Under the ROC Curve), which measures how well the model separates the two classes. For this, we use BinaryClassificationEvaluator as shown:
Python3
# Importing the evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Creating the evaluator (areaUnderROC is the default metric)
res = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',
                                    labelCol='Survived')
# Evaluating the AUC on the test results
ROC_AUC = res.evaluate(results)
ROC_AUC
Output:
Note: In general, an AUC value above 0.7 is considered good, but it's important to compare it with the performance expected for the problem and the data to determine whether it is actually good.
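ROC-AUC is not the only way to judge the model. As an optional extra (not part of the original workflow), here is a quick sketch of computing plain accuracy on the same results with MulticlassClassificationEvaluator:
Python3
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Accuracy = fraction of test rows where prediction matches Survived
acc_eval = MulticlassClassificationEvaluator(labelCol='Survived',
                                             predictionCol='prediction',
                                             metricName='accuracy')
accuracy = acc_eval.evaluate(results)
print(accuracy)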