Regression: Pyspark - SQL

This notebook walks through linear regression with PySpark on a dataset with TV, Radio, Newspaper, and Sales columns. It loads the data from a CSV file, packs the predictor columns into a feature vector, splits the data into training and test sets, and fits a linear regression model in a pipeline. It then prints the model coefficients with their standard errors, t values, and p-values, and evaluates the model on the test data with RMSE and the r2 score.

Uploaded by

Ali Abdi


regression

December 27, 2021

[2]: from pyspark.sql import SparkSession

# Bind the session to `spark`, the name the later cells use.
spark = SparkSession.builder.appName("Python Spark regression example") \
    .config("spark.some.config.option", "some-value").getOrCreate()

[3]: df = spark.read.format('csv').options(header='true', inferschema='true') \
    .load("data.csv")

[4]: import pandas as pd

pd.DataFrame(df.take(3), columns=df.columns)

[4]: TV Radio Newspaper Sales

0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3

[5]: df.describe().toPandas()

[5]: summary                  TV               Radio           Newspaper               Sales
0      count                 200                 200                 200                 200
1       mean            147.0425  23.264000000000024  30.553999999999995  14.022500000000003
2     stddev   85.85423631490805  14.846809176168728   21.77862083852283   5.217456565710477
3        min                 0.7                 0.0                 0.3                 1.6
4        max               296.4                49.6               114.0                27.0

[6]: df.printSchema()

root
|-- TV: double (nullable = true)
|-- Radio: double (nullable = true)
|-- Newspaper: double (nullable = true)
|-- Sales: double (nullable = true)

[7]: df.show(5)

+-----+-----+---------+-----+
| TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|230.1| 37.8| 69.2| 22.1|
| 44.5| 39.3| 45.1| 10.4|
| 17.2| 45.9| 69.3| 9.3|
|151.5| 41.3| 58.5| 18.5|
|180.8| 10.8| 58.4| 12.9|
+-----+-----+---------+-----+
only showing top 5 rows

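[32]: # transData is defined in a cell that is not part of this export. Judging from
# the output below, it packs the predictor columns into a "features" vector and
# exposes the Sales column as "label".
transformed = transData(df)

transformed.show(5)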

+-----------------+-----+
| features|label|
+-----------------+-----+
|[230.1,37.8,69.2]| 22.1|
| [44.5,39.3,45.1]| 10.4|
| [17.2,45.9,69.3]| 9.3|
|[151.5,41.3,58.5]| 18.5|
|[180.8,10.8,58.4]| 12.9|
+-----------------+-----+
only showing top 5 rows
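The definition of transData does not appear in this export, so the following is only a plain-Python sketch of the row-level reshaping that the output above suggests: every column except the last becomes the feature list, and the last column becomes the label. The helper name `to_features_label` is hypothetical. In PySpark itself this step is often done with `pyspark.ml.feature.VectorAssembler`, but the source does not show which approach was used.

```python
# Hypothetical helper: mimics, on plain Python rows, the reshaping that the
# notebook's (unshown) transData appears to perform on the Spark DataFrame.
def to_features_label(rows):
    """Treat the last column as the label and all earlier columns as features."""
    return [(list(row[:-1]), row[-1]) for row in rows]


# First three rows as printed by df.show(5): TV, Radio, Newspaper, Sales.
rows = [
    (230.1, 37.8, 69.2, 22.1),
    (44.5, 39.3, 45.1, 10.4),
    (17.2, 45.9, 69.3, 9.3),
]

for features, label in to_features_label(rows):
    print(features, label)
```

The sketch relies on the label being the last column, which matches the column order TV, Radio, Newspaper, Sales shown by df.show(5).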

[34]: from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=4).fit(transformed)

data = featureIndexer.transform(transformed)
data.show(5, True)

+-----------------+-----+-----------------+
| features|label| indexedFeatures|
+-----------------+-----+-----------------+
|[230.1,37.8,69.2]| 22.1|[230.1,37.8,69.2]|
| [44.5,39.3,45.1]| 10.4| [44.5,39.3,45.1]|
| [17.2,45.9,69.3]| 9.3| [17.2,45.9,69.3]|
|[151.5,41.3,58.5]| 18.5|[151.5,41.3,58.5]|
|[180.8,10.8,58.4]| 12.9|[180.8,10.8,58.4]|
+-----------------+-----+-----------------+
only showing top 5 rows

[35]: # Split the data into training and test sets (about 20% held out for testing)
(trainingData, testData) = transformed.randomSplit([0.8, 0.2])
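Spark's randomSplit samples rows independently, so the 80/20 proportions are only approximate, and the split changes between runs unless the optional `seed` argument is passed. The plain-Python sketch below illustrates that idea on row indices. It is not Spark's implementation, and the function name `random_split` and the seed value are assumptions made for illustration.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point round-off at the upper edge.
            splits[-1].append(row)
    return splits


# 200 row indices, matching the row count reported by df.describe().
train, test = random_split(list(range(200)), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Because each row is drawn independently, the two sizes rarely come out as exactly 160 and 40. Fixing the seed makes the split repeatable.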

[36]: # Import LinearRegression class

from pyspark.ml.regression import LinearRegression

# Define LinearRegression algorithm

lr = LinearRegression()

[38]: import warnings

warnings.filterwarnings('ignore')
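# Chain the indexer and the linear regression in a Pipeline.
# Note: lr was created with default arguments, so it reads the "features" column
# rather than "indexedFeatures"; both hold the same values here because every
# feature is continuous.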
pipeline = Pipeline(stages=[featureIndexer, lr])

model = pipeline.fit(trainingData)

21/12/14 14:37:13 WARN Instrumentation: [8d638690] regParam is zero, which might
cause numerical instability and overfitting.

[39]: def modelsummary(model):
    import numpy as np
    print("Note: the last rows are the information for Intercept")
    print("##", "-------------------------------------------------")
    print("##", " Estimate | Std.Error | t Values | P-value")
    # Append the intercept so coef lines up with the summary statistics,
    # which list the intercept last.
    coef = np.append(list(model.coefficients), model.intercept)
    Summary = model.summary

    for i in range(len(Summary.pValues)):
        print("##", '{:10.6f}'.format(coef[i]),
              '{:10.6f}'.format(Summary.coefficientStandardErrors[i]),
              '{:8.3f}'.format(Summary.tValues[i]),
              '{:10.6f}'.format(Summary.pValues[i]))

    print("##", '---')
    print("##", "Mean squared error: % .6f" % Summary.meanSquaredError,
          ", RMSE: % .6f" % Summary.rootMeanSquaredError)
    print("##", "Multiple R-squared: %f" % Summary.r2,
          ", Total iterations: %i" % Summary.totalIterations)

[40]: modelsummary(model.stages[-1])

Note: the last rows are the information for Intercept

## -------------------------------------------------
## Estimate | Std.Error | t Values | P-value
## 0.044758 0.001555 28.783 0.000000
## 0.186763 0.009541 19.575 0.000000
## 0.006556 0.007003 0.936 0.350575
## 2.921133 0.343975 8.492 0.000000
## ---
## Mean squared error: 2.828389 , RMSE: 1.681782
## Multiple R-squared: 0.897012 , Total iterations: 0
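The columns of this summary are related by simple arithmetic: each t value is the estimate divided by its standard error, and the RMSE is the square root of the mean squared error. The short check below recomputes both from the rounded numbers printed above, so small differences in the last digit are expected. The rows follow the feature order TV, Radio, Newspaper, with the intercept last. The third row (Newspaper) is the only one with a large p-value (0.350575).

```python
import math

# Values copied from the summary printed above: three coefficients, then the intercept.
estimates = [0.044758, 0.186763, 0.006556, 2.921133]
std_errors = [0.001555, 0.009541, 0.007003, 0.343975]

# t value = estimate / standard error
t_values = [est / se for est, se in zip(estimates, std_errors)]
for est, se, t in zip(estimates, std_errors, t_values):
    print("{:10.6f} {:10.6f} {:8.3f}".format(est, se, t))

# RMSE is the square root of the mean squared error reported in the same summary.
mse = 2.828389
rmse_from_mse = math.sqrt(mse)
print("RMSE from MSE: {:.6f}".format(rmse_from_mse))
```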

[41]: # Make predictions.

predictions = model.transform(testData)
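For a linear regression model, each prediction is the intercept plus the dot product of the coefficients and the feature vector. As a worked example under that assumption, the sketch below applies the rounded coefficients printed by modelsummary to the first row shown by df.show(5). The function name `predict` is hypothetical, and the row serves only as an illustration, since the export does not show which split it landed in.

```python
# Coefficients and intercept copied from the summary printed by modelsummary,
# in feature order (TV, Radio, Newspaper).
coefficients = [0.044758, 0.186763, 0.006556]
intercept = 2.921133


def predict(features):
    """Linear prediction: intercept + sum(coefficient * feature)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# First row shown by df.show(5): TV=230.1, Radio=37.8, Newspaper=69.2, Sales=22.1.
first_row = [230.1, 37.8, 69.2]
prediction = predict(first_row)
print("prediction = {:.4f}, actual Sales = 22.1".format(prediction))
```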

[42]: from pyspark.ml.evaluation import RegressionEvaluator

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="rmse")

rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 1.66064
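The RMSE reported by the evaluator is the square root of the mean of the squared differences between labels and predictions. A minimal plain-Python sketch of that formula follows. The function name `rmse` and the toy numbers are made up for illustration and are not taken from the notebook's test set.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((label - prediction)^2))."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only.
toy_labels = [3.0, 5.0, 7.0, 9.0]
toy_predictions = [2.0, 5.0, 8.0, 11.0]
print(rmse(toy_labels, toy_predictions))
```

RMSE is expressed in the same units as the label, so the test value above can be read directly against the Sales column, whose standard deviation is about 5.2 according to df.describe().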

[43]: y_true = predictions.select("label").toPandas()

y_pred = predictions.select("prediction").toPandas()

import sklearn.metrics
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {0}'.format(r2_score))

r2_score: 0.8900904334948799
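The r2 score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean label. The plain-Python sketch below spells that out. The function name `r2` and the toy numbers are made up for illustration. As an alternative to converting to pandas, `RegressionEvaluator` also accepts `metricName="r2"`, which would compute the same metric within Spark.

```python
def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
toy_labels = [3.0, 5.0, 7.0, 9.0]
toy_predictions = [2.0, 5.0, 8.0, 11.0]
print(r2(toy_labels, toy_predictions))
```

A model that always predicts the mean label scores 0 under this definition, and a perfect model scores 1.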
