Python - How To Transform Spark Dataframe To Polars Dataframe - Stack Overflow
I have a Spark dataframe. I can easily transform it to a pandas dataframe using .toPandas. Is there something similar in Polars, as I need to get a Polars dataframe for further processing?
AFAIK from the doc, spark does not have polars support yet. – samkart Aug 2, 2022 at 9:07
Context
PySpark uses Arrow to convert to pandas, and Polars is an abstraction over Arrow memory. So we can hijack the API that Spark uses internally to create the Arrow data and use that to create the Polars DataFrame.
TLDR
Given a SparkSession spark we can write:

import pyarrow as pa
import polars as pl

# assumes an existing SparkSession named `spark`
data = [("James", [1, 2])]
spark_df = spark.createDataFrame(data=data, schema=["name", "properties"])

# _collect_as_arrow is the private method that Spark's own toPandas uses internally
df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))
print(df)
shape: (1, 2)
┌───────┬────────────┐
│ name ┆ properties │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═══════╪════════════╡
│ James ┆ [1, 2] │
└───────┴────────────┘
Serialization steps
This will actually be faster than the toPandas provided by Spark itself, because it saves an extra copy.
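As a rough sanity check, here is a minimal timing sketch (not from the answer; it assumes an existing SparkSession spark and Spark DataFrame spark_df, and the absolute numbers will vary with your data and cluster):

import time

import pyarrow as pa
import polars as pl

# Route 1: Spark's built-in toPandas (Arrow -> pandas, with an extra copy)
t0 = time.perf_counter()
pandas_df = spark_df.toPandas()
t_pandas = time.perf_counter() - t0

# Route 2: the direct Arrow -> Polars route from the TLDR (no pandas copy)
t0 = time.perf_counter()
polars_df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))
t_polars = time.perf_counter() - t0

print(f"toPandas: {t_pandas:.3f}s, Arrow -> Polars: {t_polars:.3f}s")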
Share Improve this answer Follow
edited Aug 8, 2022 at 5:10 answered Aug 2, 2022 at 10:07
ritchie46
10.4k 1 24 43
If you want to transform your Spark DataFrame using some Polars code (Spark -> Polars -> Spark), you can do this in a distributed and scalable way using mapInArrow:
from typing import Iterator

import pyarrow as pa
import polars as pl

# Convert each part of the Spark DataFrame into a Polars DataFrame and call `polars_transform` on it
def arrow_transform(iter: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    # Transform a single RecordBatch at a time so the data fits into memory
    # Increase spark.sql.execution.arrow.maxRecordsPerBatch if batches are too small
    for batch in iter:
        polars_df = pl.from_arrow(pa.Table.from_batches([batch]))
        polars_df_2 = polars_transform(polars_df)
        for b in polars_df_2.to_arrow().to_batches():
            yield b
# Map the Spark DataFrame to Arrow, then to Polars, run the `polars_transform` on it,
# and transform everything back to a Spark DataFrame, all distributed and scalable
spark_df_2 = spark_df.mapInArrow(arrow_transform, schema='id long, value double')
spark_df_2.show()
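Note that polars_transform is left undefined in this answer; as a minimal sketch, assuming the example 'id long, value double' schema (the doubling logic is purely illustrative), it could look like:

def polars_transform(df: pl.DataFrame) -> pl.DataFrame:
    # Any Polars logic works here, as long as the output schema matches
    # the schema string passed to mapInArrow ('id long, value double')
    return df.with_columns((pl.col("value") * 2.0).alias("value"))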
Share Improve this answer Follow answered Sep 23, 2022 at 9:49
EnricoM
321 3 3
You can't directly convert from Spark to Polars. But you can go from Spark to pandas, then create a dictionary out of the pandas data, and pass it to Polars like this:
import polars as pl

pandas_df = df.toPandas()
data = pandas_df.to_dict('list')
pl_df = pl.DataFrame(data)
As @ritchie46 pointed out, you can use pl.from_pandas() instead of creating a dictionary:
pandas_df = df.toPandas()
pl_df = pl.from_pandas(pandas_df)
Also, as mentioned in @DataPsycho's answer, this may cause an out-of-memory exception for large datasets, because toPandas() will collect the data to the driver first. In that case, it is better to write to a csv or parquet file and then read it back (see the sketch at the end of this answer). But avoid repartition(1), because this will move all the data to a single executor.
The code I have provided is suitable for datasets that fit in your driver memory. If you have the option to increase the driver memory, you can do so by increasing the value of spark.driver.memory.
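For completeness, a minimal sketch of the write-and-read-back route mentioned above (the path is a placeholder; parquet is preferable to csv because it preserves types):

# Write from Spark without collecting to the driver
spark_df.write.mode("overwrite").parquet("/tmp/spark_out")

# Spark writes a directory of part files, so read them back with a glob
pl_df = pl.read_parquet("/tmp/spark_out/*.parquet")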
Share Improve this answer Follow
edited Aug 2, 2022 at 10:13 answered Aug 2, 2022 at 9:12
viggnah
1,699 1 3 12
You should never go to Polars via a Python dictionary. Polars has a pl.from_pandas function. That will save you a lot of heap allocations and ensure type correctness. – ritchie46 Aug 2, 2022 at 9:59
Yes, I thought about converting my data into a pandas dataframe first, but I don't think that would work with the amount of data I'm working with :( Hopefully, Spark will add Polars support soon. – s1nbad Aug 2, 2022 at 11:03
It would be good to know your use case. Heavy transformations should be done either with Spark or with Polars; you should not mix the two dataframes. Whatever Polars can do, Spark can do as well, so do all of your transformations with Spark, then write the result in csv or parquet format. Then read the transformed file with Polars and everything will run blazing fast. But if you are interested in plotting, read it directly into pandas and use matplotlib. So if you have a Spark dataframe, you can write it as csv:
(transformed_df
    .repartition(1)
    .write
    .option("header", True)
    .option("delimiter", ",")  # "," is the default anyway
    .csv("<your_path>")
)
Now read it back with Polars or pandas using read_csv. If the driver node of your Spark cluster has little memory, transformed_df.toPandas() might fail because there is not enough memory.
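For example, a minimal read-back sketch with Polars (keeping the placeholder path from the snippet above; repartition(1) means the output directory contains a single csv part file):

import polars as pl

pl_df = pl.read_csv("<your_path>/*.csv")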
I mostly work with Spark, but sometimes I have to create a pandas dataframe for some extra analysis/drawing of graphs. So I wanted to know if there is such a possibility while working with Polars :) – s1nbad Aug 2, 2022 at 10:56
You should keep using pandas for plotting unfortunately. – DataPsycho Aug 2, 2022 at 11:07