Pyspark - SQL Module
pyspark.sql module
Important classes of Spark SQL and DataFrames:
pyspark.sql.SparkSession – Main entry point for DataFrame and SQL functionality.
pyspark.sql.DataFrame – A distributed collection of data grouped into named columns.
pyspark.sql.Column – A column expression in a DataFrame.
pyspark.sql.Row – A row of data in a DataFrame.
pyspark.sql.GroupedData – Aggregation methods, returned by DataFrame.groupBy().
pyspark.sql.DataFrameNaFunctions – Methods for handling missing data (null values).
pyspark.sql.DataFrameStatFunctions – Methods for statistics functionality.
pyspark.sql.functions – List of built-in functions available for DataFrame.
pyspark.sql.types – List of data types available.
pyspark.sql.Window – For working with window functions.
class pyspark.sql.SparkSession(sparkContext, jsparkSession=None) [source]
The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used to create DataFrames, register DataFrames as tables,
execute SQL over tables, cache tables, and read parquet files. To create a
SparkSession, use the following builder pattern:
builder
A class attribute having a Builder to construct SparkSession instances.
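A minimal sketch of the builder pattern (the master URL, application name, and
config key below are placeholders):
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder \
...     .master("local[4]") \
...     .appName("example-app") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()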
appName(name) [source]
Sets a name for the application, which will be shown in the Spark web UI.
Parameters:
name – an application name
config(key=None, value=None, conf=None) [source]
Sets a config option. Options set using this method are automatically propagated
to both SparkConf and SparkSession's own configuration.
Parameters:
key – a key name string for configuration property
value – a value for configuration property
conf – an instance of SparkConf
enableHiveSupport() [source]
Enables Hive support, including connectivity to a persistent Hive metastore,
support for Hive SerDes, and Hive user-defined functions.
getOrCreate() [source]
Gets an existing SparkSession or, if there is no existing one, creates a new
one based on the options set in this builder.
master(master) [source]
Sets the Spark master URL to connect to, such as “local” to run locally,
“local[4]” to run locally with 4 cores, or “spark://master:7077” to run on a
Spark standalone cluster.
Parameters:
master – a URL for the Spark master
property catalog
Interface through which the user may create, drop, alter or query underlying
databases, tables, functions, etc.
Returns:
Catalog
property conf
Runtime configuration interface for Spark.
This is the interface through which the user can get and set all Spark and
Hadoop configurations that are relevant to Spark SQL. When getting the value of
a config, this defaults to the value set in the underlying SparkContext, if any.
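A brief sketch of reading and writing a runtime option through this interface
(the option shown is only illustrative):
>>> spark.conf.set("spark.sql.shuffle.partitions", "50")
>>> spark.conf.get("spark.sql.shuffle.partitions")
'50'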
createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True) [source]
Creates a DataFrame from an RDD, a list or a pandas.DataFrame.
When schema is a list of column names, the type of each column will be inferred
from data.
When schema is None, it will try to infer the schema (column names and types)
from data, which should be an RDD of either Row, namedtuple, or dict.
Parameters:
data – an RDD of any kind of SQL data representation (e.g. row, tuple, int,
boolean, etc.), list, or pandas.DataFrame.
schema – a pyspark.sql.types.DataType or a datatype string or a list of
column names, default is None. The data type string format equals to
pyspark.sql.types.DataType.simpleString, except that top level struct
type can omit the struct<> and atomic types use typeName() as their
format, e.g. use byte instead of tinyint for
pyspark.sql.types.ByteType. We can also use int as a short name for
IntegerType.
samplingRatio – the sample ratio of rows used for inferring
verifySchema – verify data types of every row against schema.
Returns:
DataFrame
>>> spark.createDataFrame(df.toPandas()).collect()
[Row(name='Alice', age=1)]
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).collect()
[Row(0=1, 1=2)]
newSession() [source]
Returns a new SparkSession as new session, that has separate SQLConf,
registered temporary views and UDFs, but shared SparkContext and table
cache.
Parameters:
start – the start value
end – the end value (exclusive)
step – the incremental step (default: 1)
numPartitions – the number of partitions of the DataFrame
Returns:
DataFrame
>>> spark.range(3).collect()
[Row(id=0), Row(id=1), Row(id=2)]
property read
Returns a DataFrameReader that can be used to read data in as a DataFrame.
Returns:
DataFrameReader
property readStream
Returns a DataStreamReader that can be used to read data streams as a
streaming DataFrame.
Note: Evolving.
Returns:
DataStreamReader
property sparkContext
Returns the underlying SparkContext.
sql(sqlQuery) [source]
Returns a DataFrame representing the result of the given query.
Returns:
DataFrame
>>> df.createOrReplaceTempView("table1")
>>> df2 = spark.sql("SELECT field1 AS f1, field2 as f2 from table1")
>>> df2.collect()
[Row(f1=1, f2='row1'), Row(f1=2, f2='row2'), Row(f1=3, f2='row3')]
stop() [source]
Stop the underlying SparkContext.
property streams
Returns a StreamingQueryManager that allows managing all the
StreamingQuery instances active on this context.
Note: Evolving.
Returns:
StreamingQueryManager
table(tableName) [source]
Returns the specified table as a DataFrame.
Returns:
DataFrame
>>> df.createOrReplaceTempView("table1")
>>> df2 = spark.table("table1")
>>> sorted(df.collect()) == sorted(df2.collect())
True
property udf
Returns a UDFRegistration for UDF registration.
Returns:
UDFRegistration
property version
The version of Spark on which this application is running.
class pyspark.sql.SQLContext(sparkContext, sparkSession=None, jsqlContext=None) [source]
The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by SparkSession; the class is kept here for
backward compatibility.
Parameters:
sparkContext – The SparkContext backing this SQLContext.
sparkSession – The SparkSession around which this SQLContext wraps.
jsqlContext – An optional JVM Scala SQLContext. If set, we do not instantiate a
new SQLContext in the JVM; instead we make all calls to this object.
cacheTable(tableName) [source]
Caches the specified table in-memory.
clearCache() [source]
Removes all cached tables from the in-memory cache.
When schema is a list of column names, the type of each column will be inferred
from data.
When schema is None, it will try to infer the schema (column names and types)
from data, which should be an RDD of Row, or namedtuple, or dict.
Parameters:
data – an RDD of any kind of SQL data representation (e.g. Row, tuple, int,
boolean, etc.), or list, or pandas.DataFrame.
schema – a pyspark.sql.types.DataType or a datatype string or a list of
column names, default is None. The data type string format equals to
pyspark.sql.types.DataType.simpleString, except that top level struct
type can omit the struct<> and atomic types use typeName() as their
format, e.g. use byte instead of tinyint for
pyspark.sql.types.ByteType. We can also use int as a short name for
pyspark.sql.types.IntegerType.
samplingRatio – the sample ratio of rows used for inferring
verifySchema – verify data types of every row against schema.
Returns:
DataFrame
>>> sqlContext.createDataFrame(df.toPandas()).collect()
[Row(name='Alice', age=1)]
>>> sqlContext.createDataFrame(pandas.DataFrame([[1, 2]])).collect()
[Row(0=1, 1=2)]
dropTempTable(tableName) [source]
Remove the temporary table from catalog.
If the key is not set and defaultValue is set, return defaultValue. If the key is not
set and defaultValue is not set, return the system default value.
>>> sqlContext.getConf("spark.sql.shuffle.partitions")
'200'
>>> sqlContext.getConf("spark.sql.shuffle.partitions", u"10")
'10'
>>> sqlContext.setConf("spark.sql.shuffle.partitions", u"50")
>>> sqlContext.getConf("spark.sql.shuffle.partitions", u"10")
'50'
Parameters:
sc – SparkContext
newSession() [source]
Returns a new SQLContext as new session, that has separate SQLConf,
registered temporary views and UDFs, but shared SparkContext and table
cache.
Parameters:
start – the start value
end – the end value (exclusive)
step – the incremental step (default: 1)
numPartitions – the number of partitions of the DataFrame
Returns:
DataFrame
>>> sqlContext.range(3).collect()
[Row(id=0), Row(id=1), Row(id=2)]
property read
Returns a DataFrameReader that can be used to read data in as a DataFrame.
Returns:
DataFrameReader
property readStream
Returns a DataStreamReader that can be used to read data streams as a
streaming DataFrame.
Note: Evolving.
Returns:
DataStreamReader
Temporary tables exist only during the lifetime of this instance of SQLContext.
sql(sqlQuery) [source]
Returns a DataFrame representing the result of the given query.
Returns:
DataFrame
property streams
Returns a StreamingQueryManager that allows managing all the
StreamingQuery instances active on this context.
Note: Evolving.
table(tableName) [source]
Returns the specified table or view as a DataFrame.
Returns:
DataFrame
tableNames(dbName=None) [source]
Returns a list of names of tables in the database dbName.
Parameters:
dbName – string, name of the database to use. Default to the current database.
Returns:
list of table names, in string
tables(dbName=None) [source]
Returns a DataFrame containing names of tables in the given database.
Parameters:
dbName – string, name of the database to use.
Returns:
DataFrame
property udf
Returns a UDFRegistration for UDF registration.
Returns:
UDFRegistration
uncacheTable(tableName) [source]
Removes the specified table from the in-memory cache.
class pyspark.sql.UDFRegistration(sparkSession)
Wrapper for user-defined function registration, accessible through spark.udf.
register(name, f=None, returnType=None) [source]
Register a Python function (including lambda function) or a user-defined function
as a SQL function.
Parameters:
name – name of the user-defined function in SQL statements.
f – a Python function, or a user-defined function. The user-defined function
can be either row-at-a-time or vectorized. See
pyspark.sql.functions.udf() and
pyspark.sql.functions.pandas_udf().
returnType – the return type of the registered user-defined function. The
value can be either a pyspark.sql.types.DataType object or a DDL-
formatted type string.
Returns:
a user-defined function.
returnType can be optionally specified when f is a Python function but not when f
is a user-defined function. Please see below.
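A short sketch of registering a plain Python function for use in SQL; the
function name add_one is arbitrary and an active SparkSession named spark is
assumed:
>>> from pyspark.sql.types import IntegerType
>>> _ = spark.udf.register("add_one", lambda x: x + 1, IntegerType())
>>> spark.sql("SELECT add_one(41)").collect()
[Row(add_one(41)=42)]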
In addition to a name and the function itself, the return type can be optionally
specified. When the return type is not specified we would infer it via reflection.
Parameters:
name – name of the user-defined function
javaClassName – fully qualified name of java class
returnType – the return type of the registered Java function. The value can
be either a pyspark.sql.types.DataType object or a DDL-formatted type
string.
>>> spark.udf.registerJavaFunction(
... "javaStringLength2", "test.org.apache.spark.sql.JavaStringLength"
>>> spark.sql("SELECT javaStringLength2('test')").collect()
[Row(javaStringLength2(test)=4)]
>>> spark.udf.registerJavaFunction(
... "javaStringLength3", "test.org.apache.spark.sql.JavaStringLength"
>>> spark.sql("SELECT javaStringLength3('test')").collect()
[Row(javaStringLength3(test)=4)]
Parameters:
name – name of the user-defined aggregate function
javaClassName – fully qualified name of java class
class pyspark.sql.DataFrame(jdf, sql_ctx) [source]
A distributed collection of data grouped into named columns.
A DataFrame is equivalent to a relational table in Spark SQL, and can be created
using various functions in SparkSession:
people = spark.read.parquet("...")
agg(*exprs) [source]
Aggregate on the entire DataFrame without groups (shorthand for
df.groupBy.agg()).
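A small sketch, assuming the two-row df (Alice, age 2; Bob, age 5) used
throughout these examples:
>>> df.agg({"age": "max"}).collect()
[Row(max(age)=5)]
>>> from pyspark.sql import functions as F
>>> df.agg(F.min(df.age)).collect()
[Row(min(age)=2)]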
alias(alias) [source]
Returns a new DataFrame with an alias set.
Parameters:
alias – string, an alias name to be set for the DataFrame.
approxQuantile(col, probabilities, relativeError) [source]
Calculates the approximate quantiles of numerical columns of a DataFrame.
The result of this algorithm has the following deterministic bound: If the
DataFrame has N elements and if we request the quantile at probability p up to
error err, then the algorithm will return a sample x from the DataFrame so that
the exact rank of x is close to (p * N). More precisely,
floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
Note that null values will be ignored in numerical columns before calculation. For
columns only containing null values, an empty list is returned.
Parameters:
col – str, list. Can be a single column name, or a list of names for multiple
columns.
probabilities – a list of quantile probabilities Each number must belong to [0,
1]. For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
relativeError – The relative target precision to achieve (>= 0). If set to zero,
the exact quantiles are computed, which could be very expensive. Note that
values greater than 1 are accepted but give the same result as 1.
Returns:
the approximate quantiles at the given probabilities. If the input col is a string,
the output is a list of floats. If the input col is a list or tuple of strings, the output
is also a list, but each element in it is a list of floats, i.e., the output is a list of list
of floats.
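A brief sketch of both input forms on the running df (outputs omitted since
they depend on the data):
>>> quartiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.05)
>>> per_column = df.approxQuantile(["age"], [0.5], 0.05)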
cache() [source]
Persists the DataFrame with the default storage level (MEMORY_AND_DISK).
checkpoint(eager=True) [source]
Returns a checkpointed version of this Dataset. Checkpointing can be used to
truncate the logical plan of this DataFrame, which is especially useful in iterative
algorithms where the plan may grow exponentially. It will be saved to files inside
the checkpoint directory set with SparkContext.setCheckpointDir().
Parameters:
eager – Whether to checkpoint this DataFrame immediately
Note: Experimental
coalesce(numPartitions) [source]
Returns a new DataFrame that has exactly numPartitions partitions.
Parameters:
numPartitions – int, to specify the target number of partitions
>>> df.coalesce(1).rdd.getNumPartitions()
1
colRegex(colName) [source]
Selects column based on the column name specified as a regex and returns it as
Column.
Parameters:
colName – string, column name specified as a regex.
>>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1"
>>> df.select(df.colRegex("`(Col1)?+.+`")).show()
+----+
|Col2|
+----+
| 1|
| 2|
| 3|
+----+
collect() [source]
Returns all the records as a list of Row.
>>> df.collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
property columns
Returns all column names as a list.
>>> df.columns
['age', 'name']
Parameters:
col1 – The name of the first column
col2 – The name of the second column
method – The correlation method. Currently only supports “pearson”
count() [source]
Returns the number of rows in this DataFrame.
>>> df.count()
2
Parameters:
col1 – The name of the first column
col2 – The name of the second column
createGlobalTempView(name) [source]
Creates a global temporary view with this DataFrame.
The lifetime of this temporary view is tied to this Spark application. throws
TempTableAlreadyExistsException, if the view name already exists in the
catalog.
>>> df.createGlobalTempView("people")
>>> df2 = spark.sql("select * from global_temp.people")
>>> sorted(df.collect()) == sorted(df2.collect())
True
>>> df.createGlobalTempView("people")
Traceback (most recent call last):
...
AnalysisException: u"Temporary table 'people' already exists;"
>>> spark.catalog.dropGlobalTempView("people")
createOrReplaceGlobalTempView(name) [source]
Creates or replaces a global temporary view using the given name.
>>> df.createOrReplaceGlobalTempView("people")
>>> df2 = df.filter(df.age > 3)
>>> df2.createOrReplaceGlobalTempView("people")
>>> df3 = spark.sql("select * from global_temp.people")
>>> sorted(df3.collect()) == sorted(df2.collect())
True
>>> spark.catalog.dropGlobalTempView("people")
createOrReplaceTempView(name) [source]
Creates or replaces a local temporary view with this DataFrame.
The lifetime of this temporary table is tied to the SparkSession that was used to
create this DataFrame.
>>> df.createOrReplaceTempView("people")
>>> df2 = df.filter(df.age > 3)
>>> df2.createOrReplaceTempView("people")
>>> df3 = spark.sql("select * from people")
>>> sorted(df3.collect()) == sorted(df2.collect())
True
>>> spark.catalog.dropTempView("people")
createTempView(name) [source]
Creates a local temporary view with this DataFrame.
The lifetime of this temporary table is tied to the SparkSession that was used to
create this DataFrame. throws TempTableAlreadyExistsException, if the
view name already exists in the catalog.
>>> df.createTempView("people")
>>> df2 = spark.sql("select * from people")
>>> sorted(df.collect()) == sorted(df2.collect())
True
>>> df.createTempView("people")
Traceback (most recent call last):
...
AnalysisException: u"Temporary table 'people' already exists;"
>>> spark.catalog.dropTempView("people")
Parameters:
col1 – The name of the first column. Distinct items will make the first item of
each row.
col2 – The name of the second column. Distinct items will make the column
names of the DataFrame.
describe(*cols) [source]
Computes basic statistics for numeric and string columns.
This include count, mean, stddev, min, and max. If no columns are given, this
function computes statistics for all numerical or string columns.
Use summary for expanded statistics and control over which statistics to
compute.
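A short sketch on the running df (the stddev value matches the summary() output
shown further below):
>>> df.describe(['age']).show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|               3.5|
| stddev|2.1213203435596424|
|    min|                 2|
|    max|                 5|
+-------+------------------+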
distinct() [source]
Returns a new DataFrame containing the distinct rows in this DataFrame.
>>> df.distinct().count()
2
Parameters:
cols – a string name of the column to drop, or a Column to drop, or a list of
string name of the columns to drop.
>>> df.drop('age').collect()
[Row(name='Alice'), Row(name='Bob')]
>>> df.drop(df.age).collect()
[Row(name='Alice'), Row(name='Bob')]
dropDuplicates(subset=None) [source]
Return a new DataFrame with duplicate rows removed, optionally only
considering certain columns.
For a static batch DataFrame, it just drops duplicate rows. For a streaming
DataFrame, it will keep all data across triggers as intermediate state to drop
duplicates rows. You can use withWatermark() to limit how late the duplicate
data can be and system will accordingly limit the state. In addition, too late data
older than watermark will be dropped to avoid any possibility of duplicates.
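A brief sketch with a small throwaway DataFrame (the rows are purely
illustrative):
>>> from pyspark.sql import Row
>>> dedup_df = spark.createDataFrame([
...     Row(name='Alice', age=5, height=80),
...     Row(name='Alice', age=5, height=80),
...     Row(name='Alice', age=10, height=80)])
>>> dedup_df.dropDuplicates().count()
2
>>> dedup_df.dropDuplicates(['name', 'height']).count()
1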
drop_duplicates(subset=None)
drop_duplicates() is an alias for dropDuplicates().
Parameters:
how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a
row only if all its values are null.
thresh – int, default None. If specified, drop rows that have fewer than thresh
non-null values. This overwrites the how parameter.
subset – optional list of column names to consider.
>>> df4.na.drop().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10| 80|Alice|
+---+------+-----+
property dtypes
Returns all column names and their data types as a list.
>>> df.dtypes
[('age', 'int'), ('name', 'string')]
>>> df1.exceptAll(df2).show()
+---+---+
| C1| C2|
+---+---+
| a| 1|
| a| 1|
| a| 2|
| c| 4|
+---+---+
Parameters:
extended – boolean, default False. If False, prints only the physical plan.
mode –
specifies the expected output format of plans.
simple: Print only a physical plan.
extended: Print both logical and physical plans.
codegen: Print a physical plan and generated code if they are available.
cost: Print a logical plan and statistics if they are available.
formatted: Split explain output into two sections: a physical plan outline
and node details.
>>> df.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[age#0,name#1]
>>> df.explain(True)
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
...
== Physical Plan ==
...
>>> df.explain(mode="formatted")
== Physical Plan ==
* Scan ExistingRDD (1)
(1) Scan ExistingRDD [codegen id : 1]
Output: [age#0, name#1]
Parameters:
value – int, long, float, string, bool or dict. Value to replace null values with. If
the value is a dict, then subset is ignored and value must be a mapping from
column name (string) to replacement value. The replacement value must be
an int, long, float, boolean, or string.
subset – optional list of column names to consider. Columns specified in
subset that do not have matching data type are ignored. For example, if value
is a string, and subset contains a non-string column, then the non-string
column is simply ignored.
>>> df4.na.fill(50).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10| 80|Alice|
| 5| 50| Bob|
| 50| 50| Tom|
| 50| 50| null|
+---+------+-----+
>>> df5.na.fill(False).show()
+----+-------+-----+
| age| name| spy|
+----+-------+-----+
| 10| Alice|false|
| 5| Bob|false|
|null|Mallory| true|
+----+-------+-----+
filter(condition) [source]
Filters rows using the given condition.
Parameters:
condition – a Column of types.BooleanType or a string of SQL expression.
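A small sketch of both forms on the running df:
>>> df.filter(df.age > 3).collect()
[Row(age=5, name='Bob')]
>>> df.filter("age > 3").collect()
[Row(age=5, name='Bob')]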
first() [source]
Returns the first row as a Row.
>>> df.first()
Row(age=2, name='Alice')
foreach(f) [source]
Applies the f function to all Row of this DataFrame.
foreachPartition(f) [source]
Applies the f function to each partition of this DataFrame.
Parameters:
cols – Names of the columns to calculate frequent items for as a list or tuple
of strings.
support – The frequency with which to consider an item ‘frequent’. Default is
1%. The support must be greater than 1e-4.
groupBy(*cols) [source]
Groups the DataFrame using the specified columns, so we can run aggregation
on them. See GroupedData for all the available aggregate functions.
Parameters:
cols – list of columns to group by. Each element should be a column name
(string) or an expression (Column).
>>> df.groupBy().avg().collect()
[Row(avg(age)=3.5)]
>>> sorted(df.groupBy('name').agg({'age': 'mean'}).collect())
[Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]
>>> sorted(df.groupBy(df.name).avg().collect())
[Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]
>>> sorted(df.groupBy(['name', df.age]).count().collect())
[Row(name='Alice', age=2, count=1), Row(name='Bob', age=5, count=1)]
groupby(*cols)
groupby() is an alias for groupBy().
New in version 1.4.
Note: This method should only be used if the resulting array is expected to
be small, as all the data is loaded into the driver’s memory.
Parameters:
n – int, default 1. Number of rows to return.
Returns:
If n is greater than 1, return a list of Row. If n is 1, return a single Row.
>>> df.head()
Row(age=2, name='Alice')
>>> df.head(1)
[Row(age=2, name='Alice')]
Parameters:
name – A name of the hint.
parameters – Optional parameters.
Returns:
DataFrame
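A minimal sketch of passing a broadcast hint (df2 stands for any second
DataFrame that shares the name column; output omitted):
>>> df.join(df2.hint("broadcast"), "name").show()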
intersectAll(other) [source]
Return a new DataFrame containing rows in both this DataFrame and another
DataFrame while preserving duplicates.
isLocal() [source]
Returns True if the collect() and take() methods can be run locally (without
any Spark executors).
property isStreaming
Returns True if this Dataset contains one or more sources that continuously
return data as it arrives. A Dataset that reads data from a streaming source
must be executed as a StreamingQuery using the start() method in
DataStreamWriter. Methods that return a single answer, (e.g., count() or
collect()) will throw an AnalysisException when there is a streaming source
present.
Note: Evolving
join(other, on=None, how=None) [source]
Joins with another DataFrame, using the given join expression.
Parameters:
other – Right side of the join
on – a string for the join column name, a list of column names, a join
expression (Column), or a list of Columns. If on is a string or a list of strings
indicating the name of the join column(s), the column(s) must exist on both
sides, and this performs an equi-join.
how – str, default inner. Must be one of: inner, cross, outer, full,
fullouter, full_outer, left, leftouter, left_outer, right,
rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti
and left_anti.
The following performs a full outer join between df1 and df2.
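A minimal sketch of such a join, assuming a second DataFrame df2 with name and
height columns (output omitted):
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()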
limit(num) [source]
Limits the result count to the number specified.
>>> df.limit(1).collect()
[Row(age=2, name='Alice')]
>>> df.limit(0).collect()
[]
localCheckpoint(eager=True) [source]
Returns a locally checkpointed version of this Dataset. Checkpointing can be
used to truncate the logical plan of this DataFrame, which is especially useful in
iterative algorithms where the plan may grow exponentially. Local checkpoints
are stored in the executors using the caching subsystem and therefore they are
not reliable.
Parameters:
eager – Whether to checkpoint this DataFrame immediately
Note: Experimental
mapInPandas(udf) [source]
Maps an iterator of batches in the current DataFrame using a Pandas user-
defined function and returns the result as a DataFrame.
Parameters:
udf – A function object returned by pyspark.sql.functions.pandas_udf()
property na
Returns a DataFrameNaFunctions for handling missing values.
orderBy(*cols, **kwargs)
Returns a new DataFrame sorted by the specified column(s).
Parameters:
cols – list of Column or column names to sort by.
ascending – boolean or list of boolean (default True). Sort ascending vs.
descending. Specify list for multiple sort orders. If a list is specified, length of
the list must equal length of the cols.
>>> df.sort(df.age.desc()).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> df.sort("age", ascending=False).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> df.orderBy(df.age.desc()).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> from pyspark.sql.functions import *
>>> df.sort(asc("age")).collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
>>> df.orderBy(desc("age"), "name").collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> df.orderBy(["age", "name"], ascending=[0, 1]).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
printSchema() [source]
Prints out the schema in the tree format.
>>> df.printSchema()
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
Parameters:
weights – list of doubles as weights with which to split the DataFrame.
Weights will be normalized if they don’t sum up to 1.0.
seed – The seed for sampling.
>>> splits[1].count()
2
property rdd
Returns the content as an pyspark.RDD of Row.
Parameters:
numPartitions – can be an int to specify the target number of partitions or a
Column. If it is a Column, it will be used as the first partitioning column. If not
specified, the default number of partitions is used.
>>> df.repartition(10).rdd.getNumPartitions()
10
>>> data = df.union(df).repartition("age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
| 5| Bob|
| 5| Bob|
| 2|Alice|
| 2|Alice|
+---+-----+
>>> data = data.repartition(7, "age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
| 2|Alice|
| 5| Bob|
+---+-----+
>>> data.rdd.getNumPartitions()
7
>>> data = data.repartition("name", "age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
| 5| Bob|
| 5| Bob|
| 2|Alice|
| 2|Alice|
+---+-----+
Parameters:
numPartitions – can be an int to specify the target number of partitions or a
Column. If it is a Column, it will be used as the first partitioning column. If not
specified, the default number of partitions is used.
Note that due to performance reasons this method uses sampling to estimate the
ranges. Hence, the output may not be consistent, since sampling can return
different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition.
>>> df.repartitionByRange(2, "age").rdd.getNumPartitions()
2
>>> df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
>>> df.repartitionByRange(1, "age").rdd.getNumPartitions()
1
>>> data = df.repartitionByRange("age")
>>> df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
Parameters:
to_replace – bool, int, long, float, string, list or dict. Value to be replaced. If
the value is a dict, then value is ignored or can be omitted, and to_replace
must be a mapping between a value and a replacement.
value – bool, int, long, float, string, list or None. The replacement value must
be a bool, int, long, float, string or None. If value is a list, value should be of
the same length and type as to_replace. If value is a scalar and to_replace is
a sequence, then value is used as a replacement for each item in to_replace.
subset – optional list of column names to consider. Columns specified in
subset that do not have matching data type are ignored. For example, if value
is a string, and subset contains a non-string column, then the non-string
column is simply ignored.
rollup(*cols) [source]
Create a multi-dimensional rollup for the current DataFrame using the specified
columns, so we can run aggregation on them.
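A short sketch on the running df; rollup adds subtotal rows (shown as nulls) for
each grouping level:
>>> df.rollup("name", df.age).count().orderBy("name", "age").show()
+-----+----+-----+
| name| age|count|
+-----+----+-----+
| null|null|    2|
|Alice|null|    1|
|Alice|   2|    1|
|  Bob|null|    1|
|  Bob|   5|    1|
+-----+----+-----+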
Parameters:
withReplacement – Sample with replacement or not (default False).
fraction – Fraction of rows to generate, range [0.0, 1.0].
seed – Seed for sampling (default a random seed).
Note: This is not guaranteed to provide exactly the fraction specified of the
total count of the given DataFrame.
>>> df = spark.range(10)
>>> df.sample(0.5, 3).count()
7
>>> df.sample(fraction=0.5, seed=3).count()
7
>>> df.sample(withReplacement=True, fraction=0.5, seed=3).count()
1
>>> df.sample(1.0).count()
10
>>> df.sample(fraction=1.0).count()
10
>>> df.sample(False, fraction=1.0).count()
10
Parameters:
col – column that defines strata
fractions – sampling fraction for each stratum. If a stratum is not specified,
we treat its fraction as zero.
seed – random seed
Returns:
a new DataFrame that represents the stratified sample
property schema
Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
>>> df.schema
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))
select(*cols) [source]
Projects a set of expressions and returns a new DataFrame.
Parameters:
cols – list of column names (string) or expressions ( Column). If one of the
column names is ‘*’, that column is expanded to include all columns in the
current DataFrame.
>>> df.select('*').collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
>>> df.select('name', 'age').collect()
[Row(name='Alice', age=2), Row(name='Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name='Alice', age=12), Row(name='Bob', age=15)]
selectExpr(*expr) [source]
Projects a set of SQL expressions and returns a new DataFrame.
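A small sketch on the running df; the generated column names follow the
expressions:
>>> df.selectExpr("age * 2", "abs(age)").collect()
[Row((age * 2)=4, abs(age)=4), Row((age * 2)=10, abs(age)=5)]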
Parameters:
n – Number of rows to show.
truncate – If set to True, truncate strings longer than 20 chars by default. If
set to a number greater than one, truncates long strings to length truncate
and align cells right.
vertical – If set to True, print output rows vertically (one line per column
value).
>>> df
DataFrame[age: int, name: string]
>>> df.show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
>>> df.show(truncate=3)
+---+----+
|age|name|
+---+----+
| 2| Ali|
| 5| Bob|
+---+----+
>>> df.show(vertical=True)
-RECORD 0-----
age | 2
name | Alice
-RECORD 1-----
age | 5
name | Bob
Parameters:
cols – list of Column or column names to sort by.
ascending – boolean or list of boolean (default True). Sort ascending vs.
descending. Specify list for multiple sort orders. If a list is specified, length of
the list must equal length of the cols.
>>> df.sort(df.age.desc()).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> df.sort("age", ascending=False).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> df.orderBy(df.age.desc()).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> from pyspark.sql.functions import *
>>> df.sort(asc("age")).collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
>>> df.orderBy(desc("age"), "name").collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
>>> df.orderBy(["age", "name"], ascending=[0, 1]).collect()
[Row(age=5, name='Bob'), Row(age=2, name='Alice')]
Parameters:
cols – list of Column or column names to sort by.
ascending – boolean or list of boolean (default True). Sort ascending vs.
descending. Specify list for multiple sort orders. If a list is specified, length of
the list must equal length of the cols.
>>> df.sortWithinPartitions("age", ascending=False).show()
+---+-----+
|age| name|
+---+-----+
| 2|Alice|
| 5| Bob|
+---+-----+
property stat
Returns a DataFrameStatFunctions for statistic functions.
property storageLevel
Get the DataFrame’s current storage level.
>>> df.storageLevel
StorageLevel(False, False, False, False, 1)
>>> df.cache().storageLevel
StorageLevel(True, True, False, True, 1)
>>> df2.persist(StorageLevel.DISK_ONLY_2).storageLevel
StorageLevel(True, False, False, False, 2)
subtract(other) [source]
Return a new DataFrame containing rows in this DataFrame but not in another
DataFrame.
summary(*statistics) [source]
Computes specified statistics for numeric and string columns. Available statistics
are: count, mean, stddev, min, max, and arbitrary approximate percentiles
specified as a percentage (e.g., 75%).
If no statistics are given, this function computes count, mean, stddev, min,
approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
>>> df.summary().show()
+-------+------------------+-----+
|summary| age| name|
+-------+------------------+-----+
| count| 2| 2|
| mean| 3.5| null|
| stddev|2.1213203435596424| null|
| min| 2|Alice|
| 25%| 2| null|
| 50%| 2| null|
| 75%| 5| null|
| max| 5| Bob|
+-------+------------------+-----+
>>> df.summary("count", "min", "25%", "75%", "max").show()
+-------+---+-----+
|summary|age| name|
+-------+---+-----+
| count| 2| 2|
| min| 2|Alice|
| 25%| 2| null|
| 75%| 5| null|
| max| 5| Bob|
+-------+---+-----+
>>> df.take(2)
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
Parameters:
cols – list of new column names (string)
toJSON(use_unicode=True) [source]
Converts a DataFrame into an RDD of strings.
Each row is turned into a JSON document as one element in the returned RDD.
>>> df.toJSON().first()
'{"age":2,"name":"Alice"}'
toLocalIterator(prefetchPartitions=False) [source]
Returns an iterator that contains all of the rows in this DataFrame. The iterator
will consume as much memory as the largest partition in this DataFrame. With
prefetch it may consume up to the memory of the 2 largest partitions.
Parameters:
prefetchPartitions – If Spark should pre-fetch the next partition before it is
needed.
>>> list(df.toLocalIterator())
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
toPandas() [source]
Returns the contents of this DataFrame as a pandas.DataFrame. This is only
available if Pandas is installed and available.
Note: This method should only be used if the resulting pandas DataFrame is
expected to be small, as all the data is loaded into the driver's memory.
>>> df.toPandas()
age name
0 2 Alice
1 5 Bob
transform(func) [source]
Returns a new DataFrame. Concise syntax for chaining custom transformations.
Parameters:
func – a function that takes and returns a DataFrame.
union(other) [source]
Return a new DataFrame containing union of rows in this and another
DataFrame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does
deduplication of elements), use this function followed by distinct().
unionAll(other) [source]
Return a new DataFrame containing union of rows in this and another
DataFrame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does
deduplication of elements), use this function followed by distinct().
unionByName(other) [source]
Returns a new DataFrame containing union of rows in this and another
DataFrame.
This is different from both UNION ALL and UNION DISTINCT in SQL. To do a
SQL-style set union (that does deduplication of elements), use this function
followed by distinct().
The difference between this function and union() is that this function resolves
columns by name (not by position):
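A brief sketch with two throwaway DataFrames whose columns are in different
orders:
>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
>>> df1.unionByName(df2).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+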
where(condition)
where() is an alias for filter().
withColumn(colName, col) [source]
Returns a new DataFrame by adding a column or replacing the existing column that
has the same name. The column expression must be an expression over this
DataFrame; attempting to add a column from some other DataFrame will raise an error.
Parameters:
colName – string, name of the new column.
col – a Column expression for the new column.
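A small sketch on the running df:
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]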
withColumnRenamed(existing, new) [source]
Returns a new DataFrame by renaming an existing column. This is a no-op if the
schema doesn't contain the given column name.
Parameters:
existing – string, name of the existing column to rename.
new – string, new name of the column.
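A small sketch on the running df:
>>> df.withColumnRenamed('age', 'age2').collect()
[Row(age2=2, name='Alice'), Row(age2=5, name='Bob')]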
withWatermark(eventTime, delayThreshold) [source]
Defines an event time watermark for this DataFrame.
Parameters:
eventTime – the name of the column that contains the event time of the row.
delayThreshold – the minimum delay to wait for data to arrive late, relative to
the latest record that has been processed, in the form of an interval (e.g. “1
minute” or “5 hours”).
Note: Evolving
property write
Interface for saving the content of the non-streaming DataFrame out into external
storage.
Returns:
DataFrameWriter
property writeStream
Interface for saving the content of the streaming DataFrame out into external
storage.
Note: Evolving.
Returns:
DataStreamWriter
class pyspark.sql.GroupedData(jgd, df) [source]
A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy().
agg(*exprs) [source]
Compute aggregates and returns the result as a DataFrame.
If exprs is a single dict mapping from string to string, then the key is the
column to perform aggregation on, and the value is the aggregate function.
Parameters:
exprs – a dict mapping from column name (string) to aggregate functions
(string), or a list of Column.
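A short sketch on the running df, grouped by name:
>>> gdf = df.groupBy(df.name)
>>> sorted(gdf.agg({"*": "count"}).collect())
[Row(name='Alice', count(1)=1), Row(name='Bob', count(1)=1)]
>>> from pyspark.sql import functions as F
>>> sorted(gdf.agg(F.min(df.age)).collect())
[Row(name='Alice', min(age)=2), Row(name='Bob', min(age)=5)]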
apply(udf) [source]
Maps each group of the current DataFrame using a pandas udf and returns the
result as a DataFrame.
The returned pandas.DataFrame can be of arbitrary length and its schema must
match the returnType of the pandas udf.
Note: This function requires a full shuffle. All the data of a group will be
loaded into memory, so the user should be aware of the potential OOM risk if
data is skewed and certain groups are too large to fit in memory.
Parameters:
udf – a grouped map user-defined function returned by
pyspark.sql.functions.pandas_udf().
avg(*cols) [source]
Computes average values for each numeric column for each group.
Parameters:
cols – list of column names (string). Non-numeric columns are ignored.
>>> df.groupBy().avg('age').collect()
[Row(avg(age)=3.5)]
>>> df3.groupBy().avg('age', 'height').collect()
[Row(avg(age)=3.5, avg(height)=82.5)]
cogroup(other) [source]
Cogroups this group with another group so that we can run cogrouped
operations.
count() [source]
Counts the number of records for each group.
>>> sorted(df.groupBy(df.age).count().collect())
[Row(age=2, count=1), Row(age=5, count=1)]
max(*cols) [source]
Computes the max value for each numeric column for each group.
>>> df.groupBy().max('age').collect()
[Row(max(age)=5)]
>>> df3.groupBy().max('age', 'height').collect()
[Row(max(age)=5, max(height)=85)]
Parameters:
cols – list of column names (string). Non-numeric columns are ignored.
>>> df.groupBy().mean('age').collect()
[Row(avg(age)=3.5)]
>>> df3.groupBy().mean('age', 'height').collect()
[Row(avg(age)=3.5, avg(height)=82.5)]
min(*cols) [source]
Computes the min value for each numeric column for each group.
Parameters:
cols – list of column names (string). Non-numeric columns are ignored.
>>> df.groupBy().min('age').collect()
[Row(min(age)=2)]
>>> df3.groupBy().min('age', 'height').collect()
[Row(min(age)=2, min(height)=80)]
Parameters:
pivot_col – Name of the column to pivot.
values – List of values that will be translated to columns in the output
DataFrame.
# Compute the sum of earnings for each year by course with each course as a
separate column
>>> df4.groupBy("year").pivot("course").sum("earnings").collect()
[Row(year=2012, Java=20000, dotNET=15000), Row(year=2013, Java=30000, dotNET=48000)]
>>> df5.groupBy("sales.year").pivot("sales.course").sum("sales.earnings").collect()
[Row(year=2012, Java=20000, dotNET=15000), Row(year=2013, Java=30000, dotNET=48000)]
sum(*cols) [source]
Compute the sum for each numeric column for each group.
Parameters:
cols – list of column names (string). Non-numeric columns are ignored.
>>> df.groupBy().sum('age').collect()
[Row(sum(age)=7)]
>>> df3.groupBy().sum('age', 'height').collect()
[Row(sum(age)=7, sum(height)=165)]
class pyspark.sql.Column(jc) [source]
A column in a DataFrame. Column instances can be created by selecting a column
out of a DataFrame:
df.colName
df["colName"]
Parameters:
alias – strings of desired column names (collects all positional arguments
passed)
metadata – a dict of information to be stored in the metadata attribute of the
corresponding StructField (optional, keyword only argument)
>>> df.select(df.age.alias("age2")).collect()
[Row(age2=2), Row(age2=5)]
>>> df.select(df.age.alias("age3", metadata={'max': 99})).schema['age3'
99
asc()
Returns a sort expression based on ascending order of the column.
asc_nulls_first()
Returns a sort expression based on ascending order of the column, and null
values return before non-null values.
asc_nulls_last()
Returns a sort expression based on ascending order of the column, and null
values appear after non-null values.
astype(dataType)
astype() is an alias for cast().
bitwiseAND(other)
Compute bitwise AND of this expression with another expression.
Parameters:
other – a value or Column to calculate bitwise and(&) against this Column.
bitwiseOR(other)
Compute bitwise OR of this expression with another expression.
Parameters:
other – a value or Column to calculate bitwise or(|) against this Column.
bitwiseXOR(other)
Compute bitwise XOR of this expression with another expression.
Parameters:
other – a value or Column to calculate bitwise xor(^) against this Column.
contains(other)
Contains the other element. Returns a boolean Column based on a string match.
Parameters:
other – string in line
>>> df.filter(df.name.contains('o')).collect()
[Row(age=5, name='Bob')]
desc()
Returns a sort expression based on the descending order of the column.
desc_nulls_first()
Returns a sort expression based on the descending order of the column, and null
values appear before non-null values.
desc_nulls_last()
Returns a sort expression based on the descending order of the column, and null
values appear after non-null values.
endswith(other)
String ends with. Returns a boolean Column based on a string match.
Parameters:
other – string at end of line (do not use a regex $)
>>> df.filter(df.name.endswith('ice')).collect()
[Row(age=2, name='Alice')]
>>> df.filter(df.name.endswith('ice$')).collect()
[]
eqNullSafe(other)
Equality test that is safe for null values.
Parameters:
other – a value or Column
getField(name) [source]
An expression that gets a field by name in a StructField.
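A minimal sketch with a throwaway DataFrame holding a single struct column
(output omitted):
>>> from pyspark.sql import Row
>>> struct_df = spark.createDataFrame([Row(r=Row(a=1, b="b"))])
>>> struct_df.select(struct_df.r.getField("b")).show()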
getItem(key) [source]
An expression that gets an item at position ordinal out of a list, or gets an item
by key out of a dict.
>>> df = spark.createDataFrame([([1, 2], {"key": "value"})], ["l", "d"])
>>> df.select(df.l.getItem(0), df.d.getItem("key")).show()
+----+------+
|l[0]|d[key]|
+----+------+
| 1| value|
+----+------+
Changed in version 3.0: If key is a Column object, the indexing operator should
be used instead. For example, map_col.getItem(col('id')) should be replaced with
map_col[col('id')].
isNotNull()
True if the current expression is NOT null.
isNull()
True if the current expression is null.
like(other)
SQL like expression. Returns a boolean Column based on a SQL LIKE match.
Parameters:
other – a SQL LIKE pattern
>>> df.filter(df.name.like('Al%')).collect()
[Row(age=2, name='Alice')]
otherwise(value)
Evaluates a list of conditions and returns one of multiple possible result
expressions. If Column.otherwise() is not invoked, None is returned for
unmatched conditions.
Parameters:
value – a literal value, or a Column expression.
over(window)
Define a windowing column.
Parameters:
window – a WindowSpec
Returns:
a Column
rlike(other)
SQL RLIKE expression (LIKE with Regex). Returns a boolean Column based on
a regex match.
Parameters:
other – an extended regex expression
>>> df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name='Alice')]
startswith(other)
String starts with. Returns a boolean Column based on a string match.
Parameters:
other – string at start of line (do not use a regex ^)
>>> df.filter(df.name.startswith('Al')).collect()
[Row(age=2, name='Alice')]
>>> df.filter(df.name.startswith('^Al')).collect()
[]
Parameters:
startPos – start position (int or Column)
length – length of the substring (int or Column)
>>> df.select(df.name.substr(1, 3).alias("col")).collect()
[Row(col='Ali'), Row(col='Bob')]
when(condition, value)
Evaluates a list of conditions and returns one of multiple possible result
expressions. If Column.otherwise() is not invoked, None is returned for
unmatched conditions.
Parameters:
condition – a boolean Column expression.
value – a literal value, or a Column expression.
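A small sketch of the when/otherwise pattern on the running df; the alias is
only for readability:
>>> from pyspark.sql import functions as F
>>> df.select(df.name, F.when(df.age > 4, 1).otherwise(0).alias("is_older")).collect()
[Row(name='Alice', is_older=0), Row(name='Bob', is_older=1)]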
class pyspark.sql.Catalog(sparkSession)
User-facing catalog API, accessible through SparkSession.catalog.
cacheTable(tableName) [source]
Caches the specified table in-memory.
clearCache() [source]
Removes all cached tables from the in-memory cache.
The data source is specified by the source and a set of options. If source is
not specified, the default data source configured by
spark.sql.sources.default will be used. When path is specified, an external
table is created from the data at the given path. Otherwise a managed table is
created.
Returns:
DataFrame
currentDatabase() [source]
Returns the current default database in this session.
dropTempView(viewName) [source]
Drops the local temporary view with the given view name in the catalog. If the
view has been cached before, then it will also be uncached. Returns true if this
view is dropped successfully, false otherwise.
Note that the return type of this method was None in Spark 2.0, but changed to
Boolean in Spark 2.1.
isCached(tableName) [source]
Returns true if the table is currently cached in-memory.
Note: the order of arguments here is different from that of its JVM counterpart
because Python does not support method overloading.
listDatabases() [source]
Returns a list of databases available across all sessions.
listFunctions(dbName=None) [source]
Returns a list of functions registered in the specified database.
listTables(dbName=None) [source]
Returns a list of tables/views in the specified database.
If no database is specified, the current database is used. This includes all
temporary views.
refreshByPath(path) [source]
Invalidates and refreshes all the cached data (and the associated metadata) for
any DataFrame that contains the given data source path.
refreshTable(tableName) [source]
Invalidates and refreshes all the cached data and metadata of the given table.
setCurrentDatabase(dbName) [source]
Sets the current default database in this session.
uncacheTable(tableName) [source]
Removes the specified table from the in-memory cache.
class pyspark.sql.Row
A row in DataFrame. The fields in it can be accessed like attributes (row.key) or
like dictionary values (row[key]).
Row can be used to create a row object by using named arguments; the fields will
be sorted by name. It is not allowed to omit a named argument to represent that
the value is None or missing; it should be explicitly set to None in that case.
Row also can be used to create another Row like class, then it could be used to
create Row objects, such as
>>> Person = Row("name", "age")
>>> Person
<Row('name', 'age')>
>>> 'name' in Person
True
>>> 'wrong_key' in Person
False
>>> Person("Alice", 11)
Row(name='Alice', age=11)
This form can also be used to create rows as tuple values, i.e. with unnamed fields.
Beware that such Row objects have different equality semantics:
asDict(recursive=False) [source]
Return as a dict.
Parameters:
recursive – turns the nested Row as dict (default: False).
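A brief sketch; comparing against plain dicts keeps the example independent of
field order:
>>> Row(name="Alice", age=11).asDict() == {'name': 'Alice', 'age': 11}
True
>>> row = Row(key=1, value=Row(name='a', age=2))
>>> row.asDict(recursive=True) == {'key': 1, 'value': {'name': 'a', 'age': 2}}
True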
Parameters:
how – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a
row only if all its values are null.
thresh – int, default None. If specified, drop rows that have fewer than thresh
non-null values. This overwrites the how parameter.
subset – optional list of column names to consider.
>>> df4.na.drop().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10| 80|Alice|
+---+------+-----+
Parameters:
value – int, long, float, string, bool or dict. Value to replace null values with. If
the value is a dict, then subset is ignored and value must be a mapping from
column name (string) to replacement value. The replacement value must be
an int, long, float, boolean, or string.
subset – optional list of column names to consider. Columns specified in
subset that do not have matching data type are ignored. For example, if value
is a string, and subset contains a non-string column, then the non-string
column is simply ignored.
>>> df4.na.fill(50).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10| 80|Alice|
| 5| 50| Bob|
| 50| 50| Tom|
| 50| 50| null|
+---+------+-----+
>>> df5.na.fill(False).show()
+----+-------+-----+
| age| name| spy|
+----+-------+-----+
| 10| Alice|false|
| 5| Bob|false|
|null|Mallory| true|
+----+-------+-----+
Parameters:
to_replace – bool, int, long, float, string, list or dict. Value to be replaced. If
the value is a dict, then value is ignored or can be omitted, and to_replace
must be a mapping between a value and a replacement.
value – bool, int, long, float, string, list or None. The replacement value must
be a bool, int, long, float, string or None. If value is a list, value should be of
the same length and type as to_replace. If value is a scalar and to_replace is
a sequence, then value is used as a replacement for each item in to_replace.
subset – optional list of column names to consider. Columns specified in
subset that do not have matching data type are ignored. For example, if value
is a string, and subset contains a non-string column, then the non-string
column is simply ignored.
>>> df4.na.replace(10, 20).show()
+----+------+-----+
| age|height| name|
+----+------+-----+
| 20| 80|Alice|
| 5| null| Bob|
|null| null| Tom|
|null| null| null|
+----+------+-----+
The result of this algorithm has the following deterministic bound: If the
DataFrame has N elements and if we request the quantile at probability p up to
error err, then the algorithm will return a sample x from the DataFrame so that
the exact rank of x is close to (p * N). More precisely,
floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
Note that null values will be ignored in numerical columns before calculation. For
columns only containing null values, an empty list is returned.
Parameters:
col – str, list. Can be a single column name, or a list of names for multiple
columns.
probabilities – a list of quantile probabilities Each number must belong to [0,
1]. For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
relativeError – The relative target precision to achieve (>= 0). If set to zero,
the exact quantiles are computed, which could be very expensive. Note that
values greater than 1 are accepted but give the same result as 1.
Returns:
the approximate quantiles at the given probabilities. If the input col is a string,
the output is a list of floats. If the input col is a list or tuple of strings, the output
is also a list, but each element in it is a list of floats, i.e., the output is a list of list
of floats.
Parameters:
col1 – The name of the first column
col2 – The name of the second column
method – The correlation method. Currently only supports “pearson”
Parameters:
col1 – The name of the first column
col2 – The name of the second column
Parameters:
col1 – The name of the first column. Distinct items will make the first item of
each row.
col2 – The name of the second column. Distinct items will make the column
names of the DataFrame.
Parameters:
col – column that defines strata
fractions – sampling fraction for each stratum. If a stratum is not specified,
we treat its fraction as zero.
seed – random seed
Returns:
a new DataFrame that represents the stratified sample
class pyspark.sql.Window
Utility functions for defining window in DataFrames.
For example:
>>> # ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
>>> window = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
>>> # PARTITION BY country ORDER BY date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING
>>> window = Window.orderBy("date").partitionBy("country").rangeBetween(-3, 3)
currentRow = 0
Both start and end are relative from the current row. For example, “0” means
“current row”, while “-1” means one off before the current row, and “5” means the
five off after the current row.
Parameters:
start – boundary start, inclusive. The frame is unbounded if this is
Window.unboundedPreceding, or any value less than or equal to max(-
sys.maxsize, -9223372036854775808).
end – boundary end, inclusive. The frame is unbounded if this is
Window.unboundedFollowing, or any value greater than or equal to
min(sys.maxsize, 9223372036854775807).
Both start and end are relative positions from the current row. For example, “0”
means “current row”, while “-1” means the row before the current row, and “5”
means the fifth row after the current row.
A row-based boundary is based on the position of the row within the partition. An
offset indicates the number of rows above or below the current row at which the
frame for the current row starts or ends. For instance, given a row-based sliding
frame with a lower bound offset of -1 and an upper bound offset of +2, the frame
for the row with index 5 would range from index 4 to index 7.
Parameters:
start – boundary start, inclusive. The frame is unbounded if this is
Window.unboundedPreceding, or any value less than or equal to -
9223372036854775808.
end – boundary end, inclusive. The frame is unbounded if this is
Window.unboundedFollowing, or any value greater than or equal to
9223372036854775807.
unboundedFollowing = 9223372036854775807
unboundedPreceding = -9223372036854775808
orderBy(*cols) [source]
Defines the ordering columns in a WindowSpec.
Parameters:
cols – names of columns or expressions
partitionBy(*cols) [source]
Defines the partitioning columns in a WindowSpec.
Parameters:
cols – names of columns or expressions
Both start and end are relative from the current row. For example, “0” means
“current row”, while “-1” means one off before the current row, and “5” means the
five off after the current row.
We recommend users use Window.unboundedPreceding,
Window.unboundedFollowing, and Window.currentRow to specify special
boundary values, rather than using integral values directly.
Parameters:
start – boundary start, inclusive. The frame is unbounded if this is
Window.unboundedPreceding, or any value less than or equal to max(-
sys.maxsize, -9223372036854775808).
end – boundary end, inclusive. The frame is unbounded if this is
Window.unboundedFollowing, or any value greater than or equal to
min(sys.maxsize, 9223372036854775807).
Both start and end are relative positions from the current row. For example, “0”
means “current row”, while “-1” means the row before the current row, and “5”
means the fifth row after the current row.
Parameters:
start – boundary start, inclusive. The frame is unbounded if this is
Window.unboundedPreceding, or any value less than or equal to max(-
sys.maxsize, -9223372036854775808).
end – boundary end, inclusive. The frame is unbounded if this is
Window.unboundedFollowing, or any value greater than or equal to
min(sys.maxsize, 9223372036854775807).
This function will go through the input once to determine the input schema if
inferSchema is enabled. To avoid going through the entire data once, disable
inferSchema option or specify the schema explicitly using schema.
Parameters:
path – string, or list of strings, for input path(s), or RDD of Strings storing
CSV rows.
schema – an optional pyspark.sql.types.StructType for the input
schema or a DDL-formatted string (For example col0 INT, col1 DOUBLE).
sep – sets a separator (one or more characters) for each field and value. If
None is set, it uses the default value, ,.
encoding – decodes the CSV files by the given encoding type. If None is set,
it uses the default value, UTF-8.
quote – sets a single character used for escaping quoted values where the
separator can be part of the value. If None is set, it uses the default value, ".
If you would like to turn off quotations, you need to set an empty string.
escape – sets a single character used for escaping quotes inside an already
quoted value. If None is set, it uses the default value, \.
comment – sets a single character used for skipping lines beginning with this
character. By default (None), it is disabled.
header – uses the first line as names of columns. If None is set, it uses the
default value, false.
inferSchema – infers the input schema automatically from data. It requires
one extra pass over the data. If None is set, it uses the default value, false.
enforceSchema – If it is set to true, the specified or inferred schema will be
forcibly applied to datasource files, and headers in CSV files will be ignored. If
the option is set to false, the schema will be validated against all headers in
CSV files or the first header in RDD if the header option is set to true. Field
names in the schema and column names in CSV headers are checked by
their positions taking into account spark.sql.caseSensitive. If None is
set, true is used by default. Though the default value is true, it is
recommended to disable the enforceSchema option to avoid incorrect
results.
ignoreLeadingWhiteSpace – A flag indicating whether or not leading
whitespaces from values being read should be skipped. If None is set, it uses
the default value, false.
ignoreTrailingWhiteSpace – A flag indicating whether or not trailing
whitespaces from values being read should be skipped. If None is set, it uses
the default value, false.
nullValue – sets the string representation of a null value. If None is set, it
uses the default value, empty string. Since 2.0.1, this nullValue param
applies to all supported types including the string type.
nanValue – sets the string representation of a non-number value. If None is
set, it uses the default value, NaN.
positiveInf – sets the string representation of a positive infinity value. If None
is set, it uses the default value, Inf.
negativeInf – sets the string representation of a negative infinity value. If
None is set, it uses the default value, Inf.
dateFormat – sets the string that indicates a date format. Custom date
formats follow the formats at java.time.format.DateTimeFormatter. This
applies to date type. If None is set, it uses the default value, uuuu-MM-dd.
timestampFormat – sets the string that indicates a timestamp format.
Custom date formats follow the formats at
java.time.format.DateTimeFormatter. This applies to timestamp type. If
None is set, it uses the default value, uuuu-MM-dd'T'HH:mm:ss.SSSXXX.
maxColumns – defines a hard limit of how many columns a record can
have. If None is set, it uses the default value, 20480.
maxCharsPerColumn – defines the maximum number of characters allowed
for any given value being read. If None is set, it uses the default value, -1
meaning unlimited length.
maxMalformedLogPerPartition – this parameter is no longer used since
Spark 2.2.0. If specified, it is ignored.
mode –
allows a mode for dealing with corrupt records during parsing. If None is
set, it uses the default value, PERMISSIVE. Note that Spark tries to parse
only required columns in CSV under column pruning. Therefore, corrupt
records can be different based on required set of fields. This behavior
can be controlled by
spark.sql.csv.parser.columnPruning.enabled (enabled by
default).
PERMISSIVE: when it meets a corrupted record, puts the malformed string
into a field configured by columnNameOfCorruptRecord, and sets
malformed fields to null. To keep corrupt records, a user can set a string
type field named columnNameOfCorruptRecord in a user-defined
schema. If a schema does not have the field, it drops corrupt records
during parsing. A record with less/more tokens than schema is not a
corrupted record to CSV. When it meets a record having fewer tokens than
the length of the schema, sets null to extra fields. When the record has
more tokens than the length of the schema, it drops extra tokens.
DROPMALFORMED: ignores the whole corrupted records.
FAILFAST: throws an exception when it meets corrupted records.
columnNameOfCorruptRecord – allows renaming the new field having
malformed string created by PERMISSIVE mode. This overrides
spark.sql.columnNameOfCorruptRecord. If None is set, it uses the value
specified in spark.sql.columnNameOfCorruptRecord.
multiLine – parse records, which may span multiple lines. If None is set, it
uses the default value, false.
charToEscapeQuoteEscaping – sets a single character used for escaping
the escape for the quote character. If None is set, the default value is escape
character when escape and quote characters are different, \0 otherwise.
samplingRatio – defines fraction of rows used for schema inferring. If None
is set, it uses the default value, 1.0.
emptyValue – sets the string representation of an empty value. If None is
set, it uses the default value, empty string.
locale – sets a locale as language tag in IETF BCP 47 format. If None is set,
it uses the default value, en-US. For instance, locale is used while parsing
dates and timestamps.
lineSep – defines the line separator that should be used for parsing. If None
is set, it covers all \r, \r\n and \n. Maximum length is 1 character.
recursiveFileLookup – recursively scan a directory for files. Using this
option disables partition discovery.
>>> df = spark.read.csv('python/test_support/sql/ages.csv')
>>> df.dtypes
[('_c0', 'string'), ('_c1', 'string')]
>>> rdd = sc.textFile('python/test_support/sql/ages.csv')
>>> df2 = spark.read.csv(rdd)
>>> df2.dtypes
[('_c0', 'string'), ('_c1', 'string')]
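The options documented above can also be passed as keyword arguments to csv(); a
minimal sketch (the option values are illustrative, not defaults):
>>> df3 = spark.read.csv('python/test_support/sql/ages.csv',
...                      sep=',', header=False, inferSchema=True,
...                      mode='DROPMALFORMED')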
format(source) [source]
Specifies the input data source format.
Parameters:
source – string, name of the data source, e.g. ‘json’, ‘parquet’.
>>> df = spark.read.format('json').load('python/test_support/sql/people.json')
>>> df.dtypes
[('age', 'bigint'), ('name', 'string')]
Parameters:
url – a JDBC URL of the form jdbc:subprotocol:subname
table – the name of the table
column – the name of a column of numeric, date, or timestamp type that will
be used for partitioning; if this parameter is specified, then numPartitions,
lowerBound (inclusive), and upperBound (exclusive) will form partition
strides for generated WHERE clause expressions used to split the column
column evenly
lowerBound – the minimum value of column used to decide partition stride
upperBound – the maximum value of column used to decide partition stride
numPartitions – the number of partitions
predicates – a list of expressions suitable for inclusion in WHERE clauses;
each one defines one partition of the DataFrame
properties – a dictionary of JDBC database connection arguments. Normally
at least properties “user” and “password” with their corresponding values. For
example { ‘user’ : ‘SYSTEM’, ‘password’ : ‘mypassword’ }
Returns:
a DataFrame
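A hedged sketch of a partitioned JDBC read; the URL, table name, bounds and
credentials are hypothetical, and the matching JDBC driver must be on the classpath:
>>> df = spark.read.jdbc(  # doctest: +SKIP
...     url='jdbc:postgresql://dbhost:5432/sales', table='public.orders',
...     column='order_id', lowerBound=1, upperBound=100000, numPartitions=8,
...     properties={'user': 'SYSTEM', 'password': 'mypassword'})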
If the schema parameter is not specified, this function goes through the input
once to determine the input schema.
Parameters:
path – string represents path to the JSON dataset, or a list of paths, or RDD
of Strings storing JSON objects.
schema – an optional pyspark.sql.types.StructType for the input
schema or a DDL-formatted string (For example col0 INT, col1 DOUBLE).
primitivesAsString – infers all primitive values as a string type. If None is
set, it uses the default value, false.
prefersDecimal – infers all floating-point values as a decimal type. If the
values do not fit in decimal, then it infers them as doubles. If None is set, it
uses the default value, false.
allowComments – ignores Java/C++ style comment in JSON records. If
None is set, it uses the default value, false.
allowUnquotedFieldNames – allows unquoted JSON field names. If None is
set, it uses the default value, false.
allowSingleQuotes – allows single quotes in addition to double quotes. If
None is set, it uses the default value, true.
allowNumericLeadingZero – allows leading zeros in numbers (e.g. 00012).
If None is set, it uses the default value, false.
allowBackslashEscapingAnyCharacter – allows accepting quoting of all
character using backslash quoting mechanism. If None is set, it uses the
default value, false.
mode –
allows a mode for dealing with corrupt records during parsing. If None is
set, it uses the default value, PERMISSIVE.
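A minimal sketch of reading JSON with an explicit DDL schema, which skips the
inference pass (the path is the same test-support file used in the other JSON
examples):
>>> df = spark.read.json('python/test_support/sql/people.json',
...                      schema='age BIGINT, name STRING')
>>> df.dtypes
[('age', 'bigint'), ('name', 'string')]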
Parameters:
path – optional string or a list of string for file-system backed data sources.
format – optional string for format of the data source. Default to ‘parquet’.
schema – optional pyspark.sql.types.StructType for the input schema
or a DDL-formatted string (For example col0 INT, col1 DOUBLE).
options – all other string options
>>> df = spark.read.format("parquet").load('python/test_support/sql/parquet_partitioned',
... opt1=True, opt2=1, opt3='str')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
>>> df = spark.read.format('json').load(['python/test_support/sql/people.json',
... 'python/test_support/sql/people1.json'])
>>> df.dtypes
[('age', 'bigint'), ('aka', 'string'), ('name', 'string')]
options(**options) [source]
Adds input options for the underlying data source.
Parameters:
mergeSchema – sets whether we should merge schemas collected from all
ORC part-files. This will override spark.sql.orc.mergeSchema. The default
value is specified in spark.sql.orc.mergeSchema.
recursiveFileLookup – recursively scan a directory for files. Using this
option disables partition discovery.
>>> df = spark.read.orc('python/test_support/sql/orc_partitioned')
>>> df.dtypes
[('a', 'bigint'), ('b', 'int'), ('c', 'int')]
Parameters:
mergeSchema – sets whether we should merge schemas collected from all
Parquet part-files. This will override spark.sql.parquet.mergeSchema. The
default value is specified in spark.sql.parquet.mergeSchema.
recursiveFileLookup – recursively scan a directory for files. Using this
option disables partition discovery.
>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
schema(schema) [source]
Specifies the input schema.
Some data sources (e.g. JSON) can infer the input schema automatically from
data. By specifying the schema here, the underlying data source can skip the
schema inference step, and thus speed up data loading.
Parameters:
schema – a pyspark.sql.types.StructType object or a DDL-formatted
string (For example col0 INT, col1 DOUBLE).
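For example, the schema can be set once on the reader and reused for a later load
(the load path here is hypothetical):
>>> reader = spark.read.schema("col0 INT, col1 DOUBLE")
>>> df = reader.format('csv').load('/tmp/data.csv')  # doctest: +SKIP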
table(tableName) [source]
Returns the specified table as a DataFrame.
Parameters:
tableName – string, name of the table.
>>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
>>> df.createOrReplaceTempView('tmpTable')
>>> spark.read.table('tmpTable').dtypes
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
By default, each line in the text file is a new row in the resulting DataFrame.
Parameters:
paths – string, or list of strings, for input path(s).
wholetext – if true, read each file from input path(s) as a single row.
lineSep – defines the line separator that should be used for parsing. If None
is set, it covers all \r , \r\n and \n .
recursiveFileLookup – recursively scan a directory for files. Using this
option disables partition discovery.
>>> df = spark.read.text('python/test_support/sql/text-test.txt')
>>> df.collect()
[Row(value='hello'), Row(value='this')]
>>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True)
>>> df.collect()
[Row(value='hello\nthis')]
Parameters:
numBuckets – the number of buckets to save
col – a name of a column, or a list of names.
cols – additional names (optional). If col is a list it should be empty.
>>> (df.write.format('parquet')
... .bucketBy(100, 'year', 'month')
... .mode("overwrite")
... .saveAsTable('bucketed_table'))
Parameters:
path – the path in any Hadoop supported file system
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error or errorifexists (default case): Throw an exception if data
already exists.
format(source) [source]
Specifies the underlying output data source.
Parameters:
source – string, name of the data source, e.g. ‘json’, ‘parquet’.
It requires that the schema of the DataFrame is the same as the schema of
the table.
Parameters:
url – a JDBC URL of the form jdbc:subprotocol:subname
table – Name of the table in the external database.
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error or errorifexists (default case): Throw an exception if data
already exists.
properties – a dictionary of JDBC database connection arguments. Normally
at least properties “user” and “password” with their corresponding values. For
example { ‘user’ : ‘SYSTEM’, ‘password’ : ‘mypassword’ }
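A hedged sketch of a JDBC write; the URL, table name and credentials are
hypothetical, and the matching JDBC driver must be on the classpath:
>>> df.write.jdbc('jdbc:postgresql://dbhost:5432/sales', 'public.orders_copy',
...               mode='append',
...               properties={'user': 'SYSTEM', 'password': 'mypassword'})  # doctest: +SKIP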
Parameters:
path – the path in any Hadoop supported file system
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error or errorifexists (default case): Throw an exception if data
already exists.
compression – compression codec to use when saving to file. This can be
one of the known case-insensitive shorten names (none, bzip2, gzip, lz4,
snappy and deflate).
dateFormat – sets the string that indicates a date format. Custom date
formats follow the formats at java.time.format.DateTimeFormatter. This
applies to date type. If None is set, it uses the default value, uuuu-MM-dd.
timestampFormat – sets the string that indicates a timestamp format.
Custom date formats follow the formats at
java.time.format.DateTimeFormatter. This applies to timestamp type. If
None is set, it uses the default value, uuuu-MM-dd'T'HH:mm:ss.SSSXXX.
encoding – specifies encoding (charset) of saved json files. If None is set,
the default UTF-8 charset will be used.
lineSep – defines the line separator that should be used for writing. If None is
set, it uses the default value, \n .
ignoreNullFields – Whether to ignore null fields when generating JSON
objects. If None is set, it uses the default value, true.
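A minimal sketch combining a few of these options (the temporary output directory is
only for illustration):
>>> df.write.json(os.path.join(tempfile.mkdtemp(), 'data'),
...               mode='overwrite', compression='gzip')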
Options include:
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
error or errorifexists: Throw an exception if data already exists.
ignore: Silently ignore this operation if data already exists.
>>> df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 'data'))
options(**options) [source]
Adds output options for the underlying data source.
Parameters:
path – the path in any Hadoop supported file system
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error or errorifexists (default case): Throw an exception if data
already exists.
partitionBy – names of partitioning columns
compression – compression codec to use when saving to file. This can be
one of the known case-insensitive shorten names (none, snappy, zlib, and
lzo). This will override orc.compress and
spark.sql.orc.compression.codec. If None is set, it uses the value
specified in spark.sql.orc.compression.codec.
Parameters:
path – the path in any Hadoop supported file system
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error or errorifexists (default case): Throw an exception if data
already exists.
partitionBy – names of partitioning columns
compression – compression codec to use when saving to file. This can be
one of the known case-insensitive shorten names (none, uncompressed,
snappy, gzip, lzo, brotli, lz4, and zstd). This will override
spark.sql.parquet.compression.codec. If None is set, it uses the value
specified in spark.sql.parquet.compression.codec.
partitionBy(*cols) [source]
Partitions the output by the given columns on the file system.
If specified, the output is laid out on the file system similar to Hive’s partitioning
scheme.
Parameters:
cols – name of columns
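For example, assuming df contains year and month columns as in the parquet
examples above, each distinct (year, month) pair becomes its own output directory;
a sketch:
>>> df.write.partitionBy('year', 'month').parquet(os.path.join(tempfile.mkdtemp(), 'data'))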
The data source is specified by the format and a set of options . If format is
not specified, the default data source configured by
spark.sql.sources.default will be used.
Parameters:
path – the path in a Hadoop supported file system
format – the format used to save
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error or errorifexists (default case): Throw an exception if data
already exists.
partitionBy – names of partitioning columns
options – all other string options
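A minimal usage sketch, mirroring the temporary-directory pattern used in the other
writer examples:
>>> df.write.mode('append').save(os.path.join(tempfile.mkdtemp(), 'data'))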
Parameters:
name – the table name
format – the format used to save
mode – one of append, overwrite, error, errorifexists, ignore (default: error)
partitionBy – names of partitioning columns
options – all other string options
Parameters:
col – a name of a column, or a list of names.
cols – additional names (optional). If col is a list it should be empty.
>>> (df.write.format('parquet')
... .bucketBy(100, 'year', 'month')
... .sortBy('day')
... .mode("overwrite")
... .saveAsTable('sorted_bucketed_table'))
Parameters:
path – the path in any Hadoop supported file system
compression – compression codec to use when saving to file. This can be
one of the known case-insensitive shorten names (none, bzip2, gzip, lz4,
snappy and deflate).
lineSep – defines the line separator that should be used for writing. If None is
set, it uses the default value, \n .
The DataFrame must have only one column that is of string type. Each row
becomes a new line in the output file.
Note: Experimental
apply(udf) [source]
Applies a function to each cogroup using a pandas udf and returns the result as
a DataFrame.
The user-defined function should take two pandas.DataFrame and return another
pandas.DataFrame. For each side of the cogroup, all columns are passed
together as a pandas.DataFrame to the user-function and the returned
pandas.DataFrame are combined as a DataFrame.
The returned pandas.DataFrame can be of arbitrary length and its schema must
match the returnType of the pandas udf.
Note: This function requires a full shuffle. All the data of a cogroup will be
loaded into memory, so the user should be aware of the potential OOM risk if
data is skewed and certain groups are too large to fit in memory.
Note: Experimental
Parameters:
udf – a cogrouped map user-defined function returned by
pyspark.sql.functions.pandas_udf().
Alternatively, the user can define a function that takes three arguments. In this
case, the grouping key(s) will be passed as the first argument and the data will be
passed as the second and third arguments. The grouping key(s) will be passed
as a tuple of numpy data types, e.g., numpy.int32 and numpy.float64. The data
will still be passed in as two pandas.DataFrame containing all columns from the
original Spark DataFrames.
pyspark.sql.types module
class pyspark.sql.types.DataType [source]
Base class for data types.
fromInternal(obj) [source]
Converts an internal SQL object into a native Python object.
json () [source]
jsonValue () [source]
needConversion() [source]
Does this type need conversion between Python objects and internal SQL
objects?
simpleString() [source]
toInternal(obj) [source]
Converts a Python object into an internal SQL object.
The data type representing None, used for the types that cannot be inferred.
EPOCH_ORDINAL = 719163
fromInternal(v) [source]
Converts an internal SQL object into a native Python object.
needConversion() [source]
Does this type need conversion between Python objects and internal SQL
objects?
toInternal(d) [source]
Converts a Python object into an internal SQL object.
fromInternal(ts) [source]
Converts an internal SQL object into a native Python object.
needConversion() [source]
Does this type need conversion between Python objects and internal SQL
objects?
toInternal(dt) [source]
Converts a Python object into an internal SQL object.
class pyspark.sql.types.DecimalType(precision=10, scale=0) [source]
Decimal (decimal.Decimal) data type.
The DecimalType must have fixed precision (the maximum total number of digits) and
scale (the number of digits on the right of the dot). For example, (5, 2) can support
values in the range [-999.99, 999.99].
The precision can be up to 38; the scale must be less than or equal to the precision.
When creating a DecimalType, the default precision and scale is (10, 0). When inferring
a schema from decimal.Decimal objects, it will be DecimalType(38, 18).
Parameters:
precision – the maximum total number of digits (default: 10)
scale – the number of digits on right side of dot. (default: 0)
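A short sketch of DecimalType inside a schema; simpleString() shows the resulting
SQL type:
>>> from pyspark.sql.types import StructType, StructField, DecimalType
>>> schema = StructType([StructField("price", DecimalType(5, 2), True)])
>>> schema.simpleString()
'struct<price:decimal(5,2)>'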
jsonValue () [source]
simpleString() [source]
simpleString() [source]
simpleString() [source]
simpleString() [source]
simpleString() [source]
Parameters:
elementType – DataType of each element in the array.
containsNull – boolean, whether the array can contain null (None) values.
fromInternal(obj) [source]
Converts an internal SQL object into a native Python object.
jsonValue () [source]
needConversion() [source]
Does this type need conversion between Python objects and internal SQL
objects?
simpleString() [source]
toInternal(obj) [source]
Converts a Python object into an internal SQL object.
Parameters:
keyType – DataType of the keys in the map.
valueType – DataType of the values in the map.
valueContainsNull – indicates whether values can contain null (None) values.
fromInternal(obj) [source]
Converts an internal SQL object into a native Python object.
jsonValue () [source]
needConversion() [source]
Does this type need conversion between Python objects and internal SQL
objects?
simpleString() [source]
toInternal(obj) [source]
Converts a Python object into an internal SQL object.
Parameters:
name – string, name of the field.
dataType – DataType of the field.
nullable – boolean, whether the field can be null (None) or not.
metadata – a dict from string to simple type that can be serialized to JSON
automatically
fromInternal(obj) [source]
Converts an internal SQL object into a native Python object.
jsonValue () [source]
needConversion() [source]
Does this type need conversion between Python objects and internal SQL
objects?
simpleString() [source]
toInternal(obj) [source]
Converts a Python object into an internal SQL object.
typeName() [source]
Parameters:
field – Either the name of the field or a StructField object
data_type – If present, the DataType of the StructField to create
nullable – Whether the field to add should be nullable (default True)
metadata – Any additional metadata (default None)
Returns:
a new updated StructType
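A brief sketch of building a schema incrementally; both a DataType instance and a
type-name string are accepted for data_type:
>>> from pyspark.sql.types import StructType, StringType
>>> struct1 = StructType().add("f1", StringType(), True).add("f2", "string", True)
>>> struct1.fieldNames()
['f1', 'f2']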
fieldNames() [source]
Returns all field names in a list.
fromInternal(obj) [source]
Converts an internal SQL object into a native Python object.
jsonValue () [source]
needConversion() [source]
Does this type need conversion between Python objects and internal SQL
objects?
This is used to avoid the unnecessary conversion for
ArrayType/MapType/StructType.
simpleString() [source]
toInternal(obj) [source]
Converts a Python object into an internal SQL object.
pyspark.sql.functions module
A collection of built-in functions
COGROUPED_MAP = 206
GROUPED_AGG = 202
GROUPED_MAP = 201
MAP_ITER = 205
SCALAR = 200
SCALAR_ITER = 204
pyspark.sql.functions. abs(col)
Computes the absolute value.
Returns:
inverse cosine of col, as if computed by java.lang.Math.acos()
Parameters:
rsd – maximum estimation error allowed (default = 0.05). For rsd < 0.01, it is more
efficient to use countDistinct()
>>> df.agg(approx_count_distinct(df.age).alias('distinct_ages')).collect()
[Row(distinct_ages=2)]
Parameters:
col – name of column containing array
value – value or column to check for in array
Parameters:
col1 – name of column containing array
col2 – name of column containing array
Parameters:
col1 – name of column containing array
col2 – name of column containing array
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])
>>> df.select(array_intersect(df.c1, df.c2)).collect()
[Row(array_intersect(c1, c2)=['a', 'c'])]
Parameters:
col – name of column or expression
Parameters:
col – name of column or expression
Note: The position is not zero based, but 1 based index. Returns 0 if the given
value could not be found in the array.
Parameters:
col – name of column containing array
element – element to be removed from the array
Parameters:
col – name of column or expression
Parameters:
col1 – name of column containing array
col2 – name of column containing array
pyspark.sql.functions. asc(col)
Returns a sort expression based on the ascending order of the given column name.
pyspark.sql.functions. asc_nulls_first(col)
Returns a sort expression based on the ascending order of the given column name,
and null values return before non-null values.
pyspark.sql.functions. asc_nulls_last(col)
Returns a sort expression based on the ascending order of the given column name,
and null values appear after non-null values.
pyspark.sql.functions. ascii(col)
Computes the numeric value of the first character of the string column.
Returns:
inverse sine of col, as if computed by java.lang.Math.asin()
Returns:
inverse tangent of col, as if computed by java.lang.Math.atan()
Parameters:
col1 – coordinate on y-axis
col2 – coordinate on x-axis
Returns:
the theta component of the point (r, theta) in polar coordinates that corresponds to
the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2()
pyspark.sql.functions. avg(col)
Aggregate function: returns the average of the values in a group.
pyspark.sql.functions. base64(col)
Computes the BASE64 encoding of a binary column and returns it as a string column.
pyspark.sql.functions. basestring
alias of builtins.str
>>> df.select(bin(df.age).alias('c')).collect()
[Row(c='10'), Row(c='101')]
pyspark.sql.functions. bitwiseNOT(col)
Computes bitwise not.
pyspark.sql.functions. col(col)
Returns a Column based on the given column name.
pyspark.sql.functions. collect_list(col)
Aggregate function: returns a list of objects with duplicates.
pyspark.sql.functions. collect_set(col)
Aggregate function: returns a set of objects with duplicate elements eliminated.
pyspark.sql.functions. column(col)
Returns a Column based on the given column name.
>>> a = range(20)
>>> b = [2 * x for x in range(20)]
>>> df = spark.createDataFrame(zip(a, b), ["a", "b"])
>>> df.agg(corr("a", "b").alias('c')).collect()
[Row(c=1.0)]
pyspark.sql.functions. cos(col)
Parameters:
col – angle in radians
Returns:
cosine of the angle, as if computed by java.lang.Math.cos().
Parameters:
col – hyperbolic angle
Returns:
hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh()
pyspark.sql.functions. count(col)
Aggregate function: returns the number of items in a group.
>>> a = [1] * 10
>>> b = [1] * 10
>>> df = spark.createDataFrame(zip(a, b), ["a", "b"])
>>> df.agg(covar_samp("a", "b").alias('c')).collect()
[Row(c=0.0)]
Parameters:
cols – list of column names (string) or list of Column expressions that are grouped
as key-value pairs, e.g. (key1, value1, key2, value2, …).
pyspark.sql.functions. cume_dist ()
Window function: returns the cumulative distribution of values within a window
partition, i.e. the fraction of rows that are below the current row.
A pattern could be for instance dd.MM.yyyy and could return a string like ‘18.03.1993’.
All pattern letters of the Java class java.time.format.DateTimeFormatter can be used.
Note: Whenever possible, use specialized functions like year, as they benefit
from a specialized implementation.
Parameters:
format – ‘year’, ‘yyyy’, ‘yy’, ‘month’, ‘mon’, ‘mm’, ‘day’, ‘dd’, ‘hour’, ‘minute’, ‘second’,
‘week’, ‘quarter’
pyspark.sql.functions. degrees(col)
Converts an angle measured in radians to an approximately equivalent angle
measured in degrees. :param col: angle in radians :return: angle in degrees, as if
computed by java.lang.Math.toDegrees()
pyspark.sql.functions. dense_rank()
Window function: returns the rank of rows within a window partition, without any gaps.
The difference between rank and dense_rank is that dense_rank leaves no gaps in
ranking sequence when there are ties. That is, if you were ranking a competition
using dense_rank and had three people tie for second place, you would say that all
three were in second place and that the next person came in third. Rank, by
contrast, would give sequential numbers, so the person that came in third place
(after the ties) would register as coming in fifth.
pyspark.sql.functions. desc_nulls_first(col)
Returns a sort expression based on the descending order of the given column name,
and null values appear before non-null values.
pyspark.sql.functions. desc_nulls_last(col)
Returns a sort expression based on the descending order of the given column name,
and null values appear after non-null values
Parameters:
col – name of column containing array or map
extraction – index to check for in array or key to check for in map
Note: The position is not zero based, but 1 based index.
pyspark.sql.functions. exp(col)
Computes the exponential of the given value.
>>> df = spark.createDataFrame(
... [(1, ["foo", "bar"], {"x": 1.0}), (2, [], {}), (3, None, None)],
... ("id", "an_array", "a_map")
... )
>>> df.select("id", "an_array", explode_outer("a_map")).show()
+---+----------+----+-----+
| id| an_array| key|value|
+---+----------+----+-----+
| 1|[foo, bar]| x| 1.0|
| 2| []|null| null|
| 3| null|null| null|
+---+----------+----+-----+
>>> df.select("id", "a_map", explode_outer("an_array")).show()
+---+----------+----+
| id| a_map| col|
+---+----------+----+
| 1|[x -> 1.0]| foo|
| 1|[x -> 1.0]| bar|
| 2| []|null|
| 3| null|null|
+---+----------+----+
pyspark.sql.functions. expm1(col)
Computes the exponential of the given value minus one.
>>> df.select(expr("length(name)")).collect()
[Row(length(name)=5), Row(length(name)=3)]
The function by default returns the first value it sees. It will return the first non-null
value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
Parameters:
col – name of column or expression
pyspark.sql.functions. floor(col)
Computes the floor of the given value.
Parameters:
col – the column name of the numeric value to be formatted
d – the N decimal places
Parameters:
format – string that can contain embedded format tags and used as result
column’s value
cols – list of column names (string) or list of Column expressions to be used in
formatting
Parameters:
col – string column in CSV format
schema – a string with schema in DDL format to use when parsing the CSV
column.
options – options to control parsing. accepts the same options as the CSV
datasource
Parameters:
col – string column in json format
schema – a StructType or ArrayType of StructType to use when parsing the json
column.
options – options to control parsing. accepts the same options as the json
datasource
Note: Since Spark 2.3, the DDL-formatted string or a JSON format string is also
supported for schema.
This function may return a confusing result if the input is a string with a timezone, e.g.
‘2018-03-13T06:18:23+00:00’. The reason is that Spark first casts the string to a
timestamp according to the timezone in the string, and then displays the result by
converting the timestamp to a string according to the session local timezone.
Parameters:
timestamp – the column that contains timestamps
tz – a string that has the ID of timezone, e.g. “GMT”, “America/Los_Angeles”, etc
Parameters:
col – string column in json format
path – path to the json object to extract
>>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
>>> df = spark.createDataFrame(data, ("key", "jstring"))
>>> df.select(df.key, get_json_object(df.jstring, '$.f1').alias("c0"), \
... get_json_object(df.jstring, '$.f2').alias("c1") ).collect()
[Row(key='1', c0='value1', c1='value2'), Row(key='2', c0='value12', c1=None)]
Note: The list of columns should match with grouping columns exactly, or empty
(means all the grouping columns).
>>> df.cube("name").agg(grouping_id(), sum("age")).orderBy("name").show()
+-----+-------------+--------+
| name|grouping_id()|sum(age)|
+-----+-------------+--------+
| null| 1| 7|
|Alice| 0| 2|
| Bob| 0| 5|
+-----+-------------+--------+
Note: The position is not zero based, but 1 based index. Returns 0 if substr could
not be found in str.
Parameters:
col – string column in json format
fields – list of fields to extract
>>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
>>> df = spark.createDataFrame(data, ("key", "jstring"))
>>> df.select(df.key, json_tuple(df.jstring, 'f1', 'f2')).collect()
[Row(key='1', c0='value1', c1='value2'), Row(key='2', c0='value12', c1=None)]
pyspark.sql.functions. kurtosis(col)
Aggregate function: returns the kurtosis of the values in a group.
Parameters:
col – name of column or expression
offset – number of rows to extend
default – default value
The function by default returns the last value it sees. It will return the last non-null
value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
Parameters:
col – name of column or expression
offset – number of rows to extend
default – default value
pyspark.sql.functions. lit(col)
Creates a Column of literal value.
>>> df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1)
[Row(height=5, spark_user=True)]
Note: The position is not zero based, but 1 based index. Returns 0 if substr could
not be found in str.
Parameters:
substr – a string
str – a Column of pyspark.sql.types.StringType
pos – start position (zero based)
If there is only one argument, then this takes the natural logarithm of the argument.
pyspark.sql.functions. log10(col)
Computes the logarithm of the given value in Base 10.
pyspark.sql.functions. log1p(col)
Computes the natural logarithm of the given value plus one.
pyspark.sql.functions. lower(col)
Converts a string expression to lower case.
pyspark.sql.functions. ltrim(col)
Trim the spaces from left end for the specified string value.
Parameters:
cols – list of column names (string) or list of Column expressions
Parameters:
col – name of column or expression
Parameters:
col1 – name of column containing a set of keys. All elements should not be null
col2 – name of column containing a set of values
Parameters:
col – name of column or expression
Parameters:
col – name of column or expression
pyspark.sql.functions. max(col)
Aggregate function: returns the maximum value of the expression in a group.
pyspark.sql.functions. min(col)
Aggregate function: returns the minimum value of the expression in a group.
As an example, consider a DataFrame with two partitions, each with 3 records. This
expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33),
8589934593, 8589934594.
Parameters:
n – an integer
Parameters:
f – user-defined function. A python function if used as a standalone function
returnType – the return type of the user-defined function. The value can be either
a pyspark.sql.types.DataType object or a DDL-formatted type string.
functionType – an enum value in pyspark.sql.functions.PandasUDFType.
Default: SCALAR.
1. SCALAR
Note: The length of pandas.Series within a scalar UDF is not that of the
whole input column, but is the length of an internal batch used for each call
to the function. Therefore, this can be used, for example, to ensure the
length of each returned pandas.Series, and can not be used as the column
length.
2. SCALAR_ITER
A scalar iterator UDF is semantically the same as the scalar Pandas UDF
above except that the wrapped Python function takes an iterator of batches as
input instead of a single batch and, instead of returning a single output batch, it
yields output batches or explicitly returns a generator or an iterator of output
batches. It is useful when the UDF execution requires initializing some state,
e.g., loading a machine learning model file to apply inference to every input
batch.
Note: It is not guaranteed that one invocation of a scalar iterator UDF will
process all batches from one partition, although it is currently implemented
this way. Your code shall not rely on this behavior because it might change
in the future for further optimization, e.g., one invocation processes multiple
partitions.
When the UDF is called with a single column that is not StructType, the input to
the underlying function is an iterator of pd.Series.
>>> @pandas_udf("long", PandasUDFType.SCALAR_ITER)
... def plus_one(batch_iter):
... for x in batch_iter:
... yield x + 1
...
>>> df.select(plus_one(col("x"))).show()
+-----------+
|plus_one(x)|
+-----------+
| 2|
| 3|
| 4|
+-----------+
When the UDF is called with more than one columns, the input to the
underlying function is an iterator of pd.Series tuple.
When the UDF is called with a single column that is StructType, the input to the
underlying function is an iterator of pd.DataFrame.
In the UDF, you can initialize some states before processing batches, wrap
your code with try … finally … or use context managers to ensure the release
of resources at the end or in case of early termination.
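A hedged sketch of that pattern; load_model and release_resources are hypothetical
helpers standing in for whatever state the UDF needs:
>>> @pandas_udf("double", PandasUDFType.SCALAR_ITER)  # doctest: +SKIP
... def predict(batch_iter):
...     model = load_model()            # hypothetical one-time setup
...     try:
...         for features in batch_iter:
...             yield features * model  # hypothetical per-batch scoring
...     finally:
...         release_resources(model)    # hypothetical cleanup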
Alternatively, the user can define a function that takes two arguments. In this
case, the grouping key(s) will be passed as the first argument and the data will
be passed as the second argument. The grouping key(s) will be passed as a
tuple of numpy data types, e.g., numpy.int32 and numpy.float64. The data will
still be passed in as a pandas.DataFrame containing all columns from the
original Spark DataFrame. This is useful when the user does not want to
hardcode grouping key(s) in the function.
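A hedged sketch of the two-argument form; the schema, column names and
DataFrame are illustrative, and pd refers to pandas imported as pd:
>>> @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)  # doctest: +SKIP
... def mean_udf(key, pdf):
...     # key is a tuple of numpy values identifying the group
...     return pd.DataFrame([key + (pdf.v.mean(),)])
>>> df.groupby('id').apply(mean_udf).show()  # doctest: +SKIP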
4. GROUPED_AGG
Note: For performance reasons, the input series to window functions are
not copied. Therefore, mutating the input series is not allowed and will cause
incorrect results. For the same reason, users should also not rely on the
index of the input series.
5. MAP_ITER
Map iterator Pandas UDFs are used to transform data with an iterator of
batches. It can be used with pyspark.sql.DataFrame.mapInPandas().
It can return the output of arbitrary length in contrast to the scalar Pandas UDF.
It maps an iterator of batches in the current DataFrame using a Pandas user-
defined function and returns the result as a DataFrame.
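A heavily hedged sketch, assuming the preview-era API described on this page
where a MAP_ITER pandas_udf is passed to DataFrame.mapInPandas (the data and
filter are illustrative):
>>> df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))  # doctest: +SKIP
>>> @pandas_udf(df.schema, PandasUDFType.MAP_ITER)  # doctest: +SKIP
... def filter_func(batch_iter):
...     for pdf in batch_iter:
...         yield pdf[pdf.id == 1]
>>> df.mapInPandas(filter_func).show()  # doctest: +SKIP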
6. COGROUPED_MAP
Alternatively, the user can define a function that takes three arguments. In this
case, the grouping key(s) will be passed as the first argument and the data will
be passed as the second and third arguments. The grouping key(s) will be
passed as a tuple of numpy data types, e.g., numpy.int32 and numpy.float64.
The data will still be passed in as two pandas.DataFrame containing all
columns from the original Spark DataFrames.
>>> @pandas_udf("time int, id int, v1 double, v2 string",
...             PandasUDFType.COGROUPED_MAP)  # doctest: +SKIP
... def asof_join(k, l, r):
...     if k == (1,):
...         return pd.merge_asof(l, r, on="time", by="id")
...     else:
...         return pd.DataFrame(columns=['time', 'id', 'v1', 'v2'])
>>> df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join).show()  # doctest: +SKIP
+--------+---+---+---+
|    time| id| v1| v2|
+--------+---+---+---+
|20000101|  1|1.0|  x|
|20000102|  1|3.0|  x|
+--------+---+---+---+
Note: The user-defined functions are considered deterministic by default. Due to
optimization, duplicate invocations may be eliminated or the function may even be
invoked more times than it is present in the query. If your function is not
deterministic, call asNondeterministic on the user defined function. E.g.:
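A sketch of what such a call can look like (the UDF body is illustrative):
>>> @pandas_udf('double', PandasUDFType.SCALAR)  # doctest: +SKIP
... def random(v):
...     import numpy as np
...     import pandas as pd
...     return pd.Series(np.random.randn(len(v)))
>>> random = random.asNondeterministic()  # doctest: +SKIP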
Note: The user-defined functions do not take keyword arguments on the calling
side.
Note: The data type of returned pandas.Series from the user-defined functions
should be matched with defined returnType (see types.to_arrow_type() and
types.from_arrow_type()). When there is mismatch between them, Spark might
do conversion on returned data. The conversion is not guaranteed to be correct
and results should be checked for accuracy by users.
pyspark.sql.functions. percent_rank()
Window function: returns the relative rank (i.e. percentile) of rows within a window
partition.
>>> eDF.select(posexplode(eDF.mapfield)).show()
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| b|
+---+---+-----+
pyspark.sql.functions. radians(col)
Converts an angle measured in degrees to an approximately equivalent angle
measured in radians. :param col: angle in degrees :return: angle in radians, as if
computed by java.lang.Math.toRadians()
pyspark.sql.functions. rank ()
Window function: returns the rank of rows within a window partition.
The difference between rank and dense_rank is that dense_rank leaves no gaps in
ranking sequence when there are ties. That is, if you were ranking a competition
using dense_rank and had three people tie for second place, you would say that all
three were in second place and that the next person came in third. Rank, by
contrast, would give sequential numbers, so the person that came in third place
(after the ties) would register as coming in fifth.
Parameters:
col – name of column or expression
pyspark.sql.functions. row_number()
Window function: returns a sequential number starting at 1 within a window partition.
pyspark.sql.functions. rtrim(col)
Trim the spaces from right end for the specified string value.
Parameters:
col – a CSV string or a string literal containing a CSV string.
options – options to control parsing. accepts the same options as the CSV
datasource
>>> df = spark.range(1)
>>> df.select(schema_of_csv(lit('1|a'), {'sep':'|'}).alias("csv")).collect()
[Row(csv='struct<_c0:int,_c1:string>')]
>>> df.select(schema_of_csv('1|a', {'sep':'|'}).alias("csv")).collect()
[Row(csv='struct<_c0:int,_c1:string>')]
Parameters:
json – a JSON string or a string literal containing a JSON string.
options – options to control parsing. accepts the same options as the JSON
datasource
Parameters:
col – name of column or expression
pyspark.sql.functions. signum(col)
Computes the signum of the given value.
pyspark.sql.functions. sin(col)
Parameters:
col – angle in radians
Returns:
sine of the angle, as if computed by java.lang.Math.sin()
Parameters:
col – hyperbolic angle
Returns:
hyperbolic sine of the given value, as if computed by java.lang.Math.sinh()
Parameters:
col – name of column or expression
pyspark.sql.functions. skewness(col)
Aggregate function: returns the skewness of the values in a group.
Parameters:
x – the array to be sliced
start – the starting index
length – the length of the slice
Parameters:
col – name of column or expression
>>> df.repartition(1).select(spark_partition_id().alias("pid")).collect()
[Row(pid=0), Row(pid=0)]
limit > 0: The resulting array’s length will not be more than limit, and the
resulting array’s last entry will contain all input beyond the last matched
pattern.
limit <= 0: pattern will be applied as many times as possible, and the
resulting array can be of any size.
Changed in version 3.0: split now takes an optional limit field. If not provided, default
limit value is -1.
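A short sketch of the limit behaviour (pattern and input string are illustrative):
>>> df = spark.createDataFrame([('oneAtwoBthreeC',)], ['s'])
>>> df.select(split(df.s, '[ABC]', 2).alias('s')).collect()
[Row(s=['one', 'twoBthreeC'])]
>>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect()
[Row(s=['one', 'two', 'three', ''])]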
pyspark.sql.functions. stddev(col)
Aggregate function: alias for stddev_samp.
pyspark.sql.functions. stddev_pop(col)
Aggregate function: returns population standard deviation of the expression in a
group.
pyspark.sql.functions. stddev_samp(col)
Aggregate function: returns the unbiased sample standard deviation of the expression
in a group.
Parameters:
cols – list of column names (string) or list of Column expressions
pyspark.sql.functions. sum(col)
Aggregate function: returns the sum of all values in the expression.
pyspark.sql.functions. sumDistinct(col)
Aggregate function: returns the sum of distinct values in the expression.
pyspark.sql.functions. tan(col)
Parameters:
col – angle in radians
Returns:
tangent of the given value, as if computed by java.lang.Math.tan()
Parameters:
col – hyperbolic angle
Returns:
hyperbolic tangent of the given value, as if computed by java.lang.Math.tanh()
Parameters:
col – name of column containing a struct.
options – options to control converting. accepts the same options as the CSV
datasource.
Parameters:
col – name of column containing a struct, an array or a map.
options – options to control converting. accepts the same options as the JSON
datasource. Additionally the function supports the pretty option which enables
pretty JSON generation.
This function may return a confusing result if the input is a string with a timezone, e.g.
‘2018-03-13T06:18:23+00:00’. The reason is that Spark first casts the string to a
timestamp according to the timezone in the string, and then displays the result by
converting the timestamp to a string according to the session local timezone.
Parameters:
timestamp – the column that contains timestamps
tz – a string that has the ID of timezone, e.g. “GMT”, “America/Los_Angeles”, etc
Parameters:
format – ‘year’, ‘yyyy’, ‘yy’ or ‘month’, ‘mon’, ‘mm’
Note: The user-defined functions do not take keyword arguments on the calling
side.
Parameters:
f – python function if used as a standalone function
returnType – the return type of the user-defined function. The value can be either
a pyspark.sql.types.DataType object or a DDL-formatted type string.
pyspark.sql.functions. unbase64(col)
Decodes a BASE64 encoded string column and returns it as a binary column.
pyspark.sql.functions. upper(col)
Converts a string expression to upper case.
pyspark.sql.functions. var_pop(col)
Aggregate function: returns the population variance of the values in a group.
pyspark.sql.functions. var_samp(col)
Aggregate function: returns the unbiased sample variance of the values in a group.
pyspark.sql.functions. variance(col)
Aggregate function: alias for var_samp.
Durations are provided as strings, e.g. ‘1 second’, ‘1 day 12 hours’, ‘2 minutes’. Valid
interval strings are ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘millisecond’, ‘microsecond’.
If the slideDuration is not provided, the windows will be tumbling windows.
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to
start window intervals. For example, in order to have hourly tumbling windows that
start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15… provide startTime as
15 minutes.
The output column will be a struct called ‘window’ by default with the nested columns
‘start’ and ‘end’, where ‘start’ and ‘end’ will be of
pyspark.sql.types.TimestampType.
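For example, a 5-second tumbling window (the timestamp and value are illustrative):
>>> df = spark.createDataFrame([("2016-03-11 09:00:07", 1)]).toDF("date", "val")
>>> w = df.groupBy(window("date", "5 seconds")).agg(sum("val").alias("sum"))
>>> w.select(w.window.start.cast("string").alias("start"),
...          w.window.end.cast("string").alias("end"), "sum").collect()
[Row(start='2016-03-11 09:00:05', end='2016-03-11 09:00:10', sum=1)]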
Note: Since Spark 2.4, Avro support is built in but packaged as an external data source
module. Please deploy the application as described in the deployment section of the
“Apache Avro Data Source Guide”.
Parameters:
data – the binary column.
jsonFormatSchema – the avro schema in JSON string format.
options – options to control how the Avro record is parsed.
Note: Since Spark 2.4, Avro support is built in but packaged as an external data source
module. Please deploy the application as described in the deployment section of the
“Apache Avro Data Source Guide”.
Parameters:
data – the data column.
jsonFormatSchema – user-specified output avro schema in JSON string format.
pyspark.sql.streaming module
class pyspark.sql.streaming. StreamingQuery(jsq) [source]
A handle to a query that is executing continuously in the background as new data
arrives. All these methods are thread-safe.
Note: Evolving
New in version 2.0.
awaitTermination(timeout=None) [source]
Waits for the termination of this query, either by query.stop() or by an
exception. If the query has terminated with an exception, then the exception will
be thrown. If timeout is set, it returns whether the query has terminated or not
within the timeout seconds.
If the query has terminated, then all subsequent calls to this method will either
return immediately (if the query was terminated by stop()), or throw the
exception immediately (if the query has terminated with exception).
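A brief sketch; the sink and query name are illustrative, and with a timeout the call
returns a boolean instead of blocking indefinitely:
>>> sq = sdf.writeStream.format('memory').queryName('await_demo').start()  # doctest: +SKIP
>>> sq.awaitTermination(5)  # doctest: +SKIP
False
>>> sq.stop()  # doctest: +SKIP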
exception () [source]
Returns:
the StreamingQueryException if the query was terminated by an exception, or
None.
explain(extended=False) [source]
Prints the (logical and physical) plans to the console for debugging purpose.
Parameters:
extended – boolean, default False. If False, prints only the physical plan.
>>> sq = sdf.writeStream.format('memory').queryName('query_explain').start()
>>> sq.processAllAvailable() # Wait a bit to generate the runtime plans.
>>> sq.explain()
== Physical Plan ==
...
>>> sq.explain(True)
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
...
== Physical Plan ==
...
>>> sq.stop()
property id
Returns the unique id of this query that persists across restarts from checkpoint
data. That is, this id is generated when a query is started for the first time, and
will be the same every time it is restarted from checkpoint data. There can only
be one query with the same id active in a Spark cluster. Also see, runId.
property isActive
Whether this streaming query is currently active or not.
property lastProgress
Returns the most recent StreamingQueryProgress update of this streaming
query, or None if there were no progress updates.
Returns:
a map
processAllAvailable() [source]
Blocks until all available data in the source has been processed and committed
to the sink. This method is intended for testing.
Note: In the case of continually arriving data, this method may block forever.
Additionally, this method is only guaranteed to block until data that has been
synchronously appended to a stream source prior to invocation has been
processed (i.e. getOffset must immediately reflect the addition).
property recentProgress
Returns an array of the most recent StreamingQueryProgress updates for this
query. The number of progress updates retained for each stream is configured
by Spark session configuration
spark.sql.streaming.numRecentProgressUpdates.
property runId
Returns the unique id of this query that does not persist across restarts. That is,
every query that is started (or restarted from checkpoint) will have a different
runId.
property status
Returns the current status of the query.
stop () [source]
Stop this streaming query.
Note: Evolving
property active
Returns a list of active queries associated with this SQLContext
>>> sq = sdf.writeStream.format('memory').queryName('this_query').start()
>>> sqm = spark.streams
>>> # get the list of active streaming queries
>>> [q.name for q in sqm.active]
['this_query']
>>> sq.stop()
awaitAnyTermination(timeout=None) [source]
Wait until any of the queries on the associated SQLContext has terminated since
the creation of the context, or since resetTerminated() was called. If any
query was terminated with an exception, then the exception will be thrown. If
timeout is set, it returns whether the query has terminated or not within the
timeout seconds.
get(id) [source]
Returns an active query from this SQLContext or throws exception if an active
query with this name doesn’t exist.
>>> sq = sdf.writeStream.format('memory').queryName('this_query').start()
>>> sq.name
'this_query'
>>> sq = spark.streams.get(sq.id)
>>> sq.isActive
True
>>> sq = sqlContext.streams.get(sq.id)
>>> sq.isActive
True
>>> sq.stop()
resetTerminated() [source]
Forget about past terminated queries so that awaitAnyTermination() can be
used again to wait for new terminations.
>>> spark.streams.resetTerminated()
Note: Evolving.
This function will go through the input once to determine the input schema if
inferSchema is enabled. To avoid going through the entire data once, disable
inferSchema option or specify the schema explicitly using schema.
Note: Evolving.
Parameters:
path – string, or list of strings, for input path(s).
schema – an optional pyspark.sql.types.StructType for the input
schema or a DDL-formatted string (For example col0 INT, col1 DOUBLE).
sep – sets a separator (one or more characters) for each field and value. If
None is set, it uses the default value, ,.
encoding – decodes the CSV files by the given encoding type. If None is set,
it uses the default value, UTF-8.
quote – sets a single character used for escaping quoted values where the
separator can be part of the value. If None is set, it uses the default value, ".
If you would like to turn off quotations, you need to set an empty string.
escape – sets a single character used for escaping quotes inside an already
quoted value. If None is set, it uses the default value, \.
comment – sets a single character used for skipping lines beginning with this
character. By default (None), it is disabled.
header – uses the first line as names of columns. If None is set, it uses the
default value, false.
inferSchema – infers the input schema automatically from data. It requires
one extra pass over the data. If None is set, it uses the default value, false.
enforceSchema – If it is set to true, the specified or inferred schema will be
forcibly applied to datasource files, and headers in CSV files will be ignored. If
the option is set to false, the schema will be validated against all headers in
CSV files or the first header in RDD if the header option is set to true. Field
names in the schema and column names in CSV headers are checked by
their positions taking into account spark.sql.caseSensitive. If None is
set, true is used by default. Though the default value is true, it is
recommended to disable the enforceSchema option to avoid incorrect
results.
ignoreLeadingWhiteSpace – a flag indicating whether or not leading
whitespaces from values being read should be skipped. If None is set, it uses
the default value, false.
ignoreTrailingWhiteSpace – a flag indicating whether or not trailing
whitespaces from values being read should be skipped. If None is set, it uses
the default value, false.
nullValue – sets the string representation of a null value. If None is set, it
uses the default value, empty string. Since 2.0.1, this nullValue param
applies to all supported types including the string type.
nanValue – sets the string representation of a non-number value. If None is
set, it uses the default value, NaN.
positiveInf – sets the string representation of a positive infinity value. If None
is set, it uses the default value, Inf.
negativeInf – sets the string representation of a negative infinity value. If
None is set, it uses the default value, Inf.
dateFormat – sets the string that indicates a date format. Custom date
formats follow the formats at java.time.format.DateTimeFormatter. This
applies to date type. If None is set, it uses the default value, uuuu-MM-dd.
timestampFormat – sets the string that indicates a timestamp format.
Custom date formats follow the formats at
java.time.format.DateTimeFormatter. This applies to timestamp type. If
None is set, it uses the default value, uuuu-MM-dd'T'HH:mm:ss.SSSXXX.
maxColumns – defines a hard limit of how many columns a record can
have. If None is set, it uses the default value, 20480.
maxCharsPerColumn – defines the maximum number of characters allowed
for any given value being read. If None is set, it uses the default value, -1
meaning unlimited length.
maxMalformedLogPerPartition – this parameter is no longer used since
Spark 2.2.0. If specified, it is ignored.
mode –
allows a mode for dealing with corrupt records during parsing. If None is
set, it uses the default value, PERMISSIVE.
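A minimal sketch, assuming sdf_schema is a previously defined StructType as in the
other streaming examples:
>>> csv_sdf = spark.readStream.csv(tempfile.mkdtemp(), schema=sdf_schema)
>>> csv_sdf.isStreaming
True
>>> csv_sdf.schema == sdf_schema
True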
format(source) [source]
Specifies the input data source format.
Note: Evolving.
Parameters:
source – string, name of the data source, e.g. ‘json’, ‘parquet’.
>>> s = spark.readStream.format("text")
If the schema parameter is not specified, this function goes through the input
once to determine the input schema.
Note: Evolving.
Parameters:
path – string represents path to the JSON dataset, or RDD of Strings storing
JSON objects.
schema – an optional pyspark.sql.types.StructType for the input
schema or a DDL-formatted string (For example col0 INT, col1 DOUBLE).
primitivesAsString – infers all primitive values as a string type. If None is
set, it uses the default value, false.
prefersDecimal – infers all floating-point values as a decimal type. If the
values do not fit in decimal, then it infers them as doubles. If None is set, it
uses the default value, false.
allowComments – ignores Java/C++ style comment in JSON records. If
None is set, it uses the default value, false.
allowUnquotedFieldNames – allows unquoted JSON field names. If None is
set, it uses the default value, false.
allowSingleQuotes – allows single quotes in addition to double quotes. If
None is set, it uses the default value, true.
allowNumericLeadingZero – allows leading zeros in numbers (e.g. 00012).
If None is set, it uses the default value, false.
allowBackslashEscapingAnyCharacter – allows accepting quoting of all
character using backslash quoting mechanism. If None is set, it uses the
default value, false.
mode –
allows a mode for dealing with corrupt records during parsing. If None is
set, it uses the default value, PERMISSIVE.
Note: Evolving.
Parameters:
path – optional string for file-system backed data sources.
format – optional string for format of the data source. Default to ‘parquet’.
schema – optional pyspark.sql.types.StructType for the input schema
or a DDL-formatted string (For example col0 INT, col1 DOUBLE).
options – all other string options
Note: Evolving.
>>> s = spark.readStream.option("x", 1)
options(**options) [source]
Adds input options for the underlying data source.
Note: Evolving.
Note: Evolving.
Parameters:
mergeSchema – sets whether we should merge schemas collected from all
ORC part-files. This will override spark.sql.orc.mergeSchema. The default
value is specified in spark.sql.orc.mergeSchema.
recursiveFileLookup – recursively scan a directory for files. Using this
option disables partition discovery.
Note: Evolving.
Parameters:
mergeSchema – sets whether we should merge schemas collected from all
Parquet part-files. This will override spark.sql.parquet.mergeSchema. The
default value is specified in spark.sql.parquet.mergeSchema.
recursiveFileLookup – recursively scan a directory for files. Using this
option disables partition discovery.
schema(schema) [source]
Specifies the input schema.
Some data sources (e.g. JSON) can infer the input schema automatically from
data. By specifying the schema here, the underlying data source can skip the
schema inference step, and thus speed up data loading.
Note: Evolving.
Parameters:
schema – a pyspark.sql.types.StructType object or a DDL-formatted
string (For example col0 INT, col1 DOUBLE).
>>> s = spark.readStream.schema(sdf_schema)
>>> s = spark.readStream.schema("col0 INT, col1 DOUBLE")
By default, each line in the text file is a new row in the resulting DataFrame.
Note: Evolving.
Parameters:
paths – string, or list of strings, for input path(s).
wholetext – if true, read each file from input path(s) as a single row.
lineSep – defines the line separator that should be used for parsing. If None
is set, it covers all \r , \r\n and \n .
recursiveFileLookup – recursively scan a directory for files. Using this
option disables partition discovery.
Note: Evolving.
foreach(f) [source]
Sets the output of the streaming query to be processed using the provided writer
f. This is often used to write the output of a streaming query to arbitrary storage
systems. The processing logic can be specified in two ways.
1. A function that takes a row as input.
2. An object with a process method and optional open and close methods.
The object can have the following methods.
Note: Evolving.
foreachBatch(func) [source]
Sets the output of the streaming query to be processed using the provided
function. This is supported only in the micro-batch execution mode (that is,
when the trigger is not continuous). In every micro-batch, the provided function
will be called with (i) the output rows as a DataFrame and (ii) the batch identifier.
The batchId can be used to deduplicate and transactionally write the output
(that is, the provided Dataset) to external systems. The output DataFrame is
guaranteed to be exactly the same for the same batchId (assuming all
operations are deterministic in the query).
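A hedged sketch; the batch-handling logic and output path are hypothetical, and sdf
is the streaming DataFrame used in the other examples:
>>> def write_batch(batch_df, batch_id):
...     batch_df.write.mode('append').parquet('/tmp/stream_out')  # hypothetical sink
...
>>> query = sdf.writeStream.foreachBatch(write_batch).start()  # doctest: +SKIP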
Note: Evolving.
format(source) [source]
Specifies the underlying output data source.
Note: Evolving.
Parameters:
source – string, name of the data source, which for now can be ‘parquet’.
>>> writer = sdf.writeStream.format('json')
Note: Evolving.
options(**options) [source]
Adds output options for the underlying data source.
Note: Evolving.
outputMode(outputMode) [source]
Specifies how data of a streaming DataFrame/Dataset is written to a streaming
sink.
Options include:
Note: Evolving.
partitionBy(*cols) [source]
Partitions the output by the given columns on the file system.
If specified, the output is laid out on the file system similar to Hive’s partitioning
scheme.
Note: Evolving.
Parameters:
cols – name of columns
Note: Evolving.
Parameters:
queryName – unique name for the query
The data source is specified by the format and a set of options . If format is
not specified, the default data source configured by
spark.sql.sources.default will be used.
Note: Evolving.
Parameters:
path – the path in a Hadoop supported file system
format – the format used to save
outputMode –
Note: Evolving.
Parameters:
processingTime – a processing time interval as a string, e.g. ‘5 seconds’, ‘1
minute’. Set a trigger that runs a microbatch query periodically based on the
processing time. Only one trigger can be set.
once – if set to True, set a trigger that processes only one batch of data in a
streaming query then terminates the query. Only one trigger can be set.
continuous – a time interval as a string, e.g. ‘5 seconds’, ‘1 minute’. Set a
trigger that runs a continuous query with a given checkpoint interval. Only
one trigger can be set.
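A few illustrative calls, one per trigger type (sdf is the streaming DataFrame used in
the other examples):
>>> writer = sdf.writeStream.trigger(processingTime='5 seconds')
>>> writer = sdf.writeStream.trigger(once=True)
>>> writer = sdf.writeStream.trigger(continuous='5 seconds')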