Python - How To Transform Spark Dataframe To Polars Dataframe - Stack Overflow
I have a Spark dataframe. I can easily transform it to a pandas dataframe using .toPandas. Is there something similar in Polars, as I need to get a Polars dataframe for further processing?
AFAIK from the doc, spark does not have polars support yet. – samkart Aug 2, 2022 at 9:07
Context
PySpark uses Arrow to convert to pandas, and Polars is an abstraction over Arrow memory. So we can hijack the API that Spark uses internally to create the Arrow data and use that to create the Polars DataFrame.
TLDR
Given a SparkSession spark we can write:

import pyarrow as pa
import polars as pl

# assumes an existing SparkSession named `spark`
data = [("James", [1, 2])]
spark_df = spark.createDataFrame(data=data, schema=["name", "properties"])

# _collect_as_arrow is the private method that Spark's own toPandas uses internally
df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))
print(df)
shape: (1, 2)
┌───────┬────────────┐
│ name ┆ properties │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═══════╪════════════╡
│ James ┆ [1, 2] │
└───────┴────────────┘
Serialization steps
This will actually be faster than the toPandas provided by Spark itself, because it saves an extra copy.
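As a rough sanity check, here is a minimal timing sketch (not from the answer; it assumes an existing SparkSession spark and Spark DataFrame spark_df, and the absolute numbers will vary with your data and cluster):

import time

import pyarrow as pa
import polars as pl

# Route 1: Spark's built-in toPandas (Arrow -> pandas, with an extra copy)
t0 = time.perf_counter()
pandas_df = spark_df.toPandas()
t_pandas = time.perf_counter() - t0

# Route 2: the direct Arrow -> Polars route from the TLDR (no pandas copy)
t0 = time.perf_counter()
polars_df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))
t_polars = time.perf_counter() - t0

print(f"toPandas: {t_pandas:.3f}s, Arrow -> Polars: {t_polars:.3f}s")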
Share Improve this answer Follow
edited Aug 8, 2022 at 5:10 answered Aug 2, 2022 at 10:07
ritchie46
10.4k 1 24 43
If you want to transform your Spark DataFrame using some Polars code (Spark -> Polars -> Spark), you can do this in a distributed and scalable way using mapInArrow:
from typing import Iterator

import pyarrow as pa
import polars as pl

# Convert each part of the Spark DataFrame into a Polars DataFrame and call `polars_transform` on it
def arrow_transform(iter: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    # Transform a single RecordBatch at a time so the data fits into memory
    # Increase spark.sql.execution.arrow.maxRecordsPerBatch if batches are too small
    for batch in iter:
        polars_df = pl.from_arrow(pa.Table.from_batches([batch]))
        polars_df_2 = polars_transform(polars_df)
        for b in polars_df_2.to_arrow().to_batches():
            yield b
# Map the Spark DataFrame to Arrow, then to Polars, run the `polars_transform` on it,
# and transform everything back to a Spark DataFrame, all distributed and scalable
spark_df_2 = spark_df.mapInArrow(arrow_transform, schema='id long, value double')
spark_df_2.show()
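Note that polars_transform is left undefined in this answer; as a minimal sketch, assuming the example 'id long, value double' schema (the doubling logic is purely illustrative), it could look like:

def polars_transform(df: pl.DataFrame) -> pl.DataFrame:
    # Any Polars logic works here, as long as the output schema matches
    # the schema string passed to mapInArrow ('id long, value double')
    return df.with_columns((pl.col("value") * 2.0).alias("value"))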
Share Improve this answer Follow answered Sep 23, 2022 at 9:49
EnricoM
321 3 3
You can't directly convert from Spark to Polars. But you can go from Spark to pandas, then create a dictionary out of the pandas data, and pass it to Polars like this:
import polars as pl

pandas_df = df.toPandas()
data = pandas_df.to_dict('list')
pl_df = pl.DataFrame(data)
As @ritchie46 pointed out, you can use pl.from_pandas() instead of creating a dictionary:
pandas_df = df.toPandas()
pl_df = pl.from_pandas(pandas_df)
Also, as mentioned in @DataPsycho's answer, this may cause an out-of-memory exception for large datasets, because toPandas() will collect the data to the driver first. In that case, it is better to write to a csv or parquet file and then read it back (see the sketch at the end of this answer). But avoid repartition(1), because this will move all the data to a single executor.
The code I have provided is suitable for datasets that fit in your driver memory. If you have the option to increase the driver memory, you can do so by increasing the value of spark.driver.memory.
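For completeness, a minimal sketch of the write-and-read-back route mentioned above (the path is a placeholder; parquet is preferable to csv because it preserves types):

# Write from Spark without collecting to the driver
spark_df.write.mode("overwrite").parquet("/tmp/spark_out")

# Spark writes a directory of part files, so read them back with a glob
pl_df = pl.read_parquet("/tmp/spark_out/*.parquet")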
Share Improve this answer Follow
edited Aug 2, 2022 at 10:13 answered Aug 2, 2022 at 9:12
viggnah
1,699 1 3 12
You should never go to Polars via a Python dictionary. Polars has a pl.from_pandas function. That will save you a lot of heap allocations and ensure type correctness. – ritchie46 Aug 2, 2022 at 9:59
Yes, I thought about converting my data into a pandas dataframe first, but I don't think that would work with the amount of data I'm working with :( Hopefully, Spark will add Polars support soon. – s1nbad Aug 2, 2022 at 11:03
It would be good to know your use case. Heavy transformations should be done either with Spark or with Polars; you should not mix the two dataframes. Whatever Polars can do, Spark can do as well, so do all of your transformations with Spark, then write the result in csv or parquet format. Then read the transformed file with Polars and everything will run blazing fast. But if you are interested in plotting, read it directly into pandas and use matplotlib. So if you have a Spark dataframe, you can write it as csv:
(transformed_df
    .repartition(1)
    .write
    .option("header", True)
    .option("delimiter", ",")  # "," is the default anyway
    .csv("<your_path>")
)
Now read it back with Polars or pandas using read_csv. If the driver node of your Spark cluster has little memory, transformed_df.toPandas() might fail because there is not enough memory.
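For example, a minimal read-back sketch with Polars (keeping the placeholder path from the snippet above; repartition(1) means the output directory contains a single csv part file):

import polars as pl

pl_df = pl.read_csv("<your_path>/*.csv")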
I mostly work with Spark, but sometimes I have to create a pandas dataframe for some extra analysis/drawing of graphs. So I wanted to know if there is such a possibility while working with Polars :) – s1nbad Aug 2, 2022 at 10:56
You should keep using pandas for plotting unfortunately. – DataPsycho Aug 2, 2022 at 11:07