Methods & Functions in Databricks

The document provides a comprehensive overview of various column methods and functions used in data manipulation, particularly in PySpark. It includes methods for sorting, casting, and manipulating data types, as well as mathematical, datetime, and collection functions. Additionally, it describes operations for handling arrays and maps, such as filtering, transforming, and aggregating data.


Column Methods

Column.__getattr__(item) - An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
Column.__getitem__(k) - An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
Column.alias(*alias, **kwargs) - Returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
Column.asc() - Returns a sort expression based on the ascending order of the column.
Column.asc_nulls_first() - Returns a sort expression based on the ascending order of the column, and null values return before non-null values.
Column.asc_nulls_last() - Returns a sort expression based on the ascending order of the column, and null values appear after non-null values.
Column.astype(dataType) - astype() is an alias for cast().
Column.between(lowerBound, upperBound) - True if the current column is between the lower bound and upper bound, inclusive.
Column.bitwiseAND(other) - Compute bitwise AND of this expression with another expression.
Column.bitwiseOR(other) - Compute bitwise OR of this expression with another expression.
Column.bitwiseXOR(other) - Compute bitwise XOR of this expression with another expression.
Column.cast(dataType) - Casts the column into type dataType.
Column.contains(other) - Contains the other element.
Column.desc() - Returns a sort expression based on the descending order of the column.
Column.desc_nulls_first() - Returns a sort expression based on the descending order of the column, and null values appear before non-null values.
Column.desc_nulls_last() - Returns a sort expression based on the descending order of the column, and null values appear after non-null values.
Column.dropFields(*fieldNames) - An expression that drops fields in StructType by name.
Column.endswith(other) - String ends with.
Column.eqNullSafe(other) - Equality test that is safe for null values.
Column.getField(name) - An expression that gets a field by name in a StructType.
Column.getItem(key) - An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
Column.ilike(other) - SQL ILIKE expression (case insensitive LIKE).
Column.isNotNull() - True if the current expression is NOT null.
Column.isNull() - True if the current expression is null.
Column.isin(*cols) - A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
Column.like(other) - SQL like expression.
Column.name(*alias, **kwargs) - name() is an alias for alias().
Column.otherwise(value) - Evaluates a list of conditions and returns one of multiple possible result expressions.
Column.over(window) - Define a windowing column.
Column.rlike(other) - SQL RLIKE expression (LIKE with Regex).
Column.startswith(other) - String starts with.
Column.substr(startPos, length) - Return a Column which is a substring of the column.
Column.when(condition, value) - Evaluates a list of conditions and returns one of multiple possible result expressions.
Column.withField(fieldName, col) - An expression that adds/replaces a field in StructType by name.
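
For example, a minimal sketch combining several of these methods (the sample data and column names are hypothetical; assumes an active SparkSession named spark):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Alice", 34), ("Bob", None)], ["name", "age"])

    result = df.select(
        F.col("name").alias("person"),                      # alias()
        F.col("age").cast("double").alias("age_dbl"),       # cast()
        F.col("age").between(18, 65).alias("working_age"),  # between()
        F.when(F.col("age").isNull(), -1)                   # isNull() + when()/otherwise()
         .otherwise(F.col("age"))
         .alias("age_filled"),
    ).orderBy(F.col("person").asc_nulls_last())             # asc_nulls_last()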

Functions
Normal Functions
col(col) - Returns a Column based on the given column name.
column(col) - Returns a Column based on the given column name.
lit(col) - Creates a Column of literal value.
broadcast(df) - Marks a DataFrame as small enough for use in broadcast joins.
coalesce(*cols) - Returns the first column that is not null.
input_file_name() - Creates a string column for the file name of the current Spark task.
isnan(col) - An expression that returns true if the column is NaN.
isnull(col) - An expression that returns true if the column is null.
monotonically_increasing_id() - A column that generates monotonically increasing 64-bit integers.
nanvl(col1, col2) - Returns col1 if it is not NaN, or col2 if col1 is NaN.
rand([seed]) - Generates a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
randn([seed]) - Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
spark_partition_id() - A column for partition ID.
when(condition, value) - Evaluates a list of conditions and returns one of multiple possible result expressions.
bitwise_not(col) - Computes bitwise not.
bitwiseNOT(col) - Computes bitwise not.
expr(str) - Parses the expression string into the column that it represents.
greatest(*cols) - Returns the greatest value of the list of column names, skipping null values.
least(*cols) - Returns the least value of the list of column names, skipping null values.
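
A short sketch of a few of these in combination (hypothetical data; assumes an active SparkSession named spark):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([(1.0, None), (None, 2.0)], ["a", "b"])

    out = df.select(
        F.coalesce("a", "b").alias("first_non_null"),     # first non-null of a, b
        F.lit(42).alias("constant"),                      # literal column
        F.greatest("a", "b").alias("largest"),            # skips nulls
        F.expr("a + b").alias("from_sql"),                # parsed SQL expression
        F.monotonically_increasing_id().alias("row_id"),  # unique but not consecutive
    )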

Math Functions
sqrt(col) - Computes the square root of the specified float value.
abs(col) - Computes the absolute value.
acos(col) - Computes inverse cosine of the input column.
acosh(col) - Computes inverse hyperbolic cosine of the input column.
asin(col) - Computes inverse sine of the input column.
asinh(col) - Computes inverse hyperbolic sine of the input column.
atan(col) - Computes inverse tangent of the input column.
atanh(col) - Computes inverse hyperbolic tangent of the input column.
atan2(col1, col2) - Returns the angle in radians between the positive x-axis and the point (col2, col1).
bin(col) - Returns the string representation of the binary value of the given column.
cbrt(col) - Computes the cube-root of the given value.
ceil(col) - Computes the ceiling of the given value.
conv(col, fromBase, toBase) - Convert a number in a string column from one base to another.
cos(col) - Computes cosine of the input column.
cosh(col) - Computes hyperbolic cosine of the input column.
cot(col) - Computes cotangent of the input column.
csc(col) - Computes cosecant of the input column.
exp(col) - Computes the exponential of the given value.
expm1(col) - Computes the exponential of the given value minus one.
factorial(col) - Computes the factorial of the given value.
floor(col) - Computes the floor of the given value.
hex(col) - Computes hex value of the given column, which could be pyspark.sql.types.StringType, pyspark.sql.types.BinaryType, pyspark.sql.types.IntegerType or pyspark.sql.types.LongType.
unhex(col) - Inverse of hex.
hypot(col1, col2) - Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
log(arg1[, arg2]) - Returns the first argument-based logarithm of the second argument; with a single argument, returns the natural logarithm.
log10(col) - Computes the logarithm of the given value in base 10.
log1p(col) - Computes the natural logarithm of the "given value plus one".
log2(col) - Returns the base-2 logarithm of the argument.
pmod(dividend, divisor) - Returns the positive value of dividend mod divisor.
pow(col1, col2) - Returns the value of the first argument raised to the power of the second argument.
rint(col) - Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
round(col[, scale]) - Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.
bround(col[, scale]) - Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
sec(col) - Computes secant of the input column.
shiftleft(col, numBits) - Shift the given value numBits left.
shiftright(col, numBits) - (Signed) shift the given value numBits right.
shiftrightunsigned(col, numBits) - Unsigned shift the given value numBits right.
signum(col) - Computes the signum of the given value.
sin(col) - Computes sine of the input column.
sinh(col) - Computes hyperbolic sine of the input column.
tan(col) - Computes tangent of the input column.
tanh(col) - Computes hyperbolic tangent of the input column.
toDegrees(col) - Deprecated alias for degrees().
degrees(col) - Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
toRadians(col) - Deprecated alias for radians().
radians(col) - Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
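
For instance (hypothetical data; assumes an active SparkSession named spark):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([(2.0,), (9.0,)], ["x"])

    out = df.select(
        F.sqrt("x").alias("root"),                  # square root
        F.pow("x", 3).alias("cubed"),               # x to the 3rd power
        F.round(F.log("x"), 2).alias("ln_2dp"),     # natural log, 2 decimals
        F.degrees(F.atan("x")).alias("angle_deg"),  # radians -> degrees
    )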

Datetime Functions
add_months(start, months) - Returns the date that is months months after start.
current_date() - Returns the current date at the start of query evaluation as a DateType column.
current_timestamp() - Returns the current timestamp at the start of query evaluation as a TimestampType column.
date_add(start, days) - Returns the date that is days days after start.
date_format(date, format) - Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
date_sub(start, days) - Returns the date that is days days before start.
date_trunc(format, timestamp) - Returns timestamp truncated to the unit specified by the format.
datediff(end, start) - Returns the number of days from start to end.
dayofmonth(col) - Extract the day of the month of a given date/timestamp as integer.
dayofweek(col) - Extract the day of the week of a given date/timestamp as integer.
dayofyear(col) - Extract the day of the year of a given date/timestamp as integer.
second(col) - Extract the seconds of a given date as integer.
weekofyear(col) - Extract the week number of a given date as integer.
year(col) - Extract the year of a given date/timestamp as integer.
quarter(col) - Extract the quarter of a given date/timestamp as integer.
month(col) - Extract the month of a given date/timestamp as integer.
last_day(date) - Returns the last day of the month which the given date belongs to.
localtimestamp() - Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column.
minute(col) - Extract the minutes of a given timestamp as integer.
months_between(date1, date2[, roundOff]) - Returns number of months between dates date1 and date2.
next_day(date, dayOfWeek) - Returns the first date later than the value of the date column that falls on the given day of the week.
hour(col) - Extract the hours of a given timestamp as integer.
make_date(year, month, day) - Returns a column with a date built from the year, month and day columns.
from_unixtime(timestamp[, format]) - Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp([timestamp, format]) - Converts a time string with the given pattern ('yyyy-MM-dd HH:mm:ss' by default) to a Unix timestamp (in seconds), using the default timezone and the default locale; returns null if it fails.
to_timestamp(col[, format]) - Converts a Column into pyspark.sql.types.TimestampType using the optionally specified format.
to_date(col[, format]) - Converts a Column into pyspark.sql.types.DateType using the optionally specified format.
trunc(date, format) - Returns date truncated to the unit specified by the format.
from_utc_timestamp(timestamp, tz) - This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
to_utc_timestamp(timestamp, tz) - This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
window(timeColumn, windowDuration[, …]) - Bucketize rows into one or more time windows given a timestamp specifying column.
session_window(timeColumn, gapDuration) - Generates session window given a timestamp specifying column.
timestamp_seconds(col) - Converts the number of seconds from the Unix epoch (1970-01-01T00:00:00Z) to a timestamp.
window_time(windowColumn) - Computes the event time from a window column.
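
A small sketch of typical date handling (hypothetical data; assumes an active SparkSession named spark):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("2024-01-15",)], ["raw"])

    out = (
        df.select(F.to_date("raw", "yyyy-MM-dd").alias("d"))
          .select(
              F.date_add("d", 30).alias("plus_30_days"),       # date arithmetic
              F.date_format("d", "MMMM yyyy").alias("label"),  # "January 2024"
              F.datediff(F.current_date(), "d").alias("age_days"),
              F.year("d").alias("yr"),
          )
    )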

Collection Functions
array(*cols) - Creates a new array column.
array_contains(col, value) - Collection function: returns null if the array is null, true if the array contains the given value, and false otherwise.
arrays_overlap(a1, a2) - Collection function: returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
array_join(col, delimiter[, null_replacement]) - Concatenates the elements of column using the delimiter.
create_map(*cols) - Creates a new map column.
slice(x, start, length) - Collection function: returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.
concat(*cols) - Concatenates multiple input columns together into a single column.
array_position(col, value) - Collection function: locates the position of the first occurrence of the given value in the given array.
element_at(col, extraction) - Collection function: returns element of array at given index in extraction if col is array.
array_append(col, value) - Collection function: returns an array of the elements in col with value appended at the end.
array_sort(col[, comparator]) - Collection function: sorts the input array in ascending order.
array_insert(arr, pos, value) - Collection function: adds an item into a given array at a specified array index.
array_remove(col, element) - Collection function: removes all elements that equal element from the given array.
array_distinct(col) - Collection function: removes duplicate values from the array.
array_intersect(col1, col2) - Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates.
array_union(col1, col2) - Collection function: returns an array of the elements in the union of col1 and col2, without duplicates.
array_except(col1, col2) - Collection function: returns an array of the elements in col1 but not in col2, without duplicates.
array_compact(col) - Collection function: removes null values from the array.
transform(col, f) - Returns an array of elements after applying a transformation to each element in the input array.
exists(col, f) - Returns whether a predicate holds for one or more elements in the array.
forall(col, f) - Returns whether a predicate holds for every element in the array.
filter(col, f) - Returns an array of elements for which a predicate holds in a given array.
aggregate(col, initialValue, merge[, finish]) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.
zip_with(left, right, f) - Merge two given arrays, element-wise, into a single array using a function.
transform_keys(col, f) - Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs.
transform_values(col, f) - Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs.
map_filter(col, f) - Returns a map whose key-value pairs satisfy a predicate.
map_from_arrays(col1, col2) - Creates a new map from two arrays.
map_zip_with(col1, col2, f) - Merge two given maps, key-wise, into a single map using a function.
explode(col) - Returns a new row for each element in the given array or map.
explode_outer(col) - Returns a new row for each element in the given array or map; unlike explode, produces a single row with nulls when the array or map is null or empty.
posexplode(col) - Returns a new row for each element with position in the given array or map.
posexplode_outer(col) - Returns a new row for each element with position in the given array or map; unlike posexplode, produces a single row with nulls when the array or map is null or empty.
inline(col) - Explodes an array of structs into a table.
inline_outer(col) - Explodes an array of structs into a table; unlike inline, produces a single row with nulls when the array is null or empty.
get(col, index) - Collection function: returns element of array at given (0-based) index.
get_json_object(col, path) - Extracts a json object from a json string based on the json path specified, and returns a json string of the extracted json object.
json_tuple(col, *fields) - Creates a new row for a json column according to the given field names.
from_json(col, schema[, options]) - Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema.
schema_of_json(json[, options]) - Parses a JSON string and infers its schema in DDL format.
to_json(col[, options]) - Converts a column containing a StructType, ArrayType or a MapType into a JSON string.
size(col) - Collection function: returns the length of the array or map stored in the column.
struct(*cols) - Creates a new struct column.
sort_array(col[, asc]) - Collection function: sorts the input array in ascending or descending order according to the natural ordering of the array elements.
array_max(col) - Collection function: returns the maximum value of the array.
array_min(col) - Collection function: returns the minimum value of the array.
shuffle(col) - Collection function: generates a random permutation of the given array.
reverse(col) - Collection function: returns a reversed string or an array with reverse order of elements.
flatten(col) - Collection function: creates a single array from an array of arrays.
sequence(start, stop[, step]) - Generate a sequence of integers from start to stop, incrementing by step.
array_repeat(col, count) - Collection function: creates an array containing a column repeated count times.
map_contains_key(col, value) - Returns true if the map contains the key.
map_keys(col) - Collection function: returns an unordered array containing the keys of the map.
map_values(col) - Collection function: returns an unordered array containing the values of the map.
map_entries(col) - Collection function: returns an unordered array of all entries in the given map.
map_from_entries(col) - Collection function: converts an array of entries (key value struct types) to a map of values.
arrays_zip(*cols) - Collection function: returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
map_concat(*cols) - Returns the union of all the given maps.
from_csv(col, schema[, options]) - Parses a column containing a CSV string to a row with the specified schema.
schema_of_csv(csv[, options]) - Parses a CSV string and infers its schema in DDL format.
to_csv(col[, options]) - Converts a column containing a StructType into a CSV string.
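
A minimal sketch of the array and higher-order functions (hypothetical data; assumes an active SparkSession named spark and Spark 3.1+ for the lambda-based functions):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([([1, 2, 2, 3],)], ["nums"])

    out = df.select(
        F.array_distinct("nums").alias("uniq"),              # [1, 2, 3]
        F.transform("nums", lambda x: x * 10).alias("x10"),  # [10, 20, 20, 30]
        F.filter("nums", lambda x: x > 1).alias("gt1"),      # [2, 2, 3]
        F.aggregate("nums", F.lit(0).cast("long"),
                    lambda acc, x: acc + x).alias("total"),  # 8
        F.explode("nums").alias("num"),                      # one row per element
    )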

Partition Transformation Functions
years(col) - Partition transform function: A transform for timestamps and dates to partition data into years.
months(col) - Partition transform function: A transform for timestamps and dates to partition data into months.
days(col) - Partition transform function: A transform for timestamps and dates to partition data into days.
hours(col) - Partition transform function: A transform for timestamps to partition data into hours.
bucket(numBuckets, col) - Partition transform function: A transform for any type that partitions by a hash of the input column.
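
These transforms are intended for use with the DataFrameWriterV2 API. A sketch (the table name and column are hypothetical, and the target catalog must support v2 writes, e.g. Delta or Iceberg):

    from pyspark.sql import functions as F

    # Partition the (hypothetical) events table by the year of its timestamp.
    (df.writeTo("main.default.events_by_year")
       .partitionedBy(F.years(F.col("event_ts")))
       .create())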

Aggregate Functions
approxCountDistinct(col[, rsd]) - Deprecated alias for approx_count_distinct().
approx_count_distinct(col[, rsd]) - Aggregate function: returns a new Column for approximate distinct count of column col.
avg(col) - Aggregate function: returns the average of the values in a group.
collect_list(col) - Aggregate function: returns a list of objects with duplicates.
collect_set(col) - Aggregate function: returns a set of objects with duplicate elements eliminated.
corr(col1, col2) - Returns a new Column for the Pearson Correlation Coefficient for col1 and col2.
count(col) - Aggregate function: returns the number of items in a group.
count_distinct(col, *cols) - Returns a new Column for distinct count of col or cols.
countDistinct(col, *cols) - Returns a new Column for distinct count of col or cols.
covar_pop(col1, col2) - Returns a new Column for the population covariance of col1 and col2.
covar_samp(col1, col2) - Returns a new Column for the sample covariance of col1 and col2.
first(col[, ignorenulls]) - Aggregate function: returns the first value in a group.
grouping(col) - Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not; returns 1 for aggregated or 0 for not aggregated in the result set.
grouping_id(*cols) - Aggregate function: returns the level of grouping, equal to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn).
kurtosis(col) - Aggregate function: returns the kurtosis of the values in a group.
last(col[, ignorenulls]) - Aggregate function: returns the last value in a group.
max(col) - Aggregate function: returns the maximum value of the expression in a group.
max_by(col, ord) - Returns the value associated with the maximum value of ord.
mean(col) - Aggregate function: returns the average of the values in a group.
median(col) - Returns the median of the values in a group.
min(col) - Aggregate function: returns the minimum value of the expression in a group.
min_by(col, ord) - Returns the value associated with the minimum value of ord.
mode(col) - Returns the most frequent value in a group.
percentile_approx(col, percentage[, accuracy]) - Returns the approximate percentile of the numeric column col, which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.
product(col) - Aggregate function: returns the product of the values in a group.
skewness(col) - Aggregate function: returns the skewness of the values in a group.
stddev(col) - Aggregate function: alias for stddev_samp.
stddev_pop(col) - Aggregate function: returns population standard deviation of the expression in a group.
stddev_samp(col) - Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
sum(col) - Aggregate function: returns the sum of all values in the expression.
sum_distinct(col) - Aggregate function: returns the sum of distinct values in the expression.
sumDistinct(col) - Aggregate function: returns the sum of distinct values in the expression.
var_pop(col) - Aggregate function: returns the population variance of the values in a group.
var_samp(col) - Aggregate function: returns the unbiased sample variance of the values in a group.
variance(col) - Aggregate function: alias for var_samp.
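
A compact groupBy/agg sketch (hypothetical data; assumes an active SparkSession named spark; max_by requires Spark 3.3+):

    from pyspark.sql import functions as F

    df = spark.createDataFrame(
        [("a", 1, 10), ("a", 3, 20), ("b", 5, 30)], ["key", "val", "ts"]
    )

    out = df.groupBy("key").agg(
        F.count("val").alias("n"),
        F.avg("val").alias("mean"),
        F.max_by("val", "ts").alias("latest_val"),  # val from the row with max ts
        F.collect_list("val").alias("all_vals"),
    )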

Window Functions
cume_dist() - Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.
dense_rank() - Window function: returns the rank of rows within a window partition, without any gaps.
lag(col[, offset, default]) - Window function: returns the value that is offset rows before the current row, and default if there is less than offset rows before the current row.
lead(col[, offset, default]) - Window function: returns the value that is offset rows after the current row, and default if there is less than offset rows after the current row.
nth_value(col, offset[, ignoreNulls]) - Window function: returns the value that is the offset-th row of the window frame (counting from 1), and null if the size of the window frame is less than offset rows.
ntile(n) - Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition.
percent_rank() - Window function: returns the relative rank (i.e. percentile) of rows within a window partition.
rank() - Window function: returns the rank of rows within a window partition.
row_number() - Window function: returns a sequential number starting at 1 within a window partition.
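
Window functions are applied with Column.over() and a window spec. A minimal sketch (hypothetical data; assumes an active SparkSession named spark):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    df = spark.createDataFrame(
        [("a", 1, 10), ("a", 3, 20), ("b", 5, 30)], ["key", "val", "ts"]
    )
    w = Window.partitionBy("key").orderBy("ts")

    out = df.select(
        "key", "val",
        F.row_number().over(w).alias("rn"),     # 1, 2, ... within each key
        F.lag("val", 1).over(w).alias("prev"),  # previous val, null on first row
    )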

Sort Functions
asc(col) - Returns a sort expression based on the ascending order of the given column name.
asc_nulls_first(col) - Returns a sort expression based on the ascending order of the given column name, and null values return before non-null values.
asc_nulls_last(col) - Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
desc(col) - Returns a sort expression based on the descending order of the given column name.
desc_nulls_first(col) - Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.
desc_nulls_last(col) - Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.
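
For example (hypothetical data; assumes an active SparkSession named spark):

    from pyspark.sql import functions as F

    df = spark.createDataFrame(
        [("Bob", 42), ("Alice", None), ("Cara", 7)], ["name", "age"]
    )

    # Ascending by age with nulls last, ties broken by descending name:
    out = df.orderBy(F.asc_nulls_last("age"), F.desc("name"))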

String Functions
ascii(col) - Computes the numeric value of the first character of the string column.
base64(col) - Computes the BASE64 encoding of a binary column and returns it as a string column.
bit_length(col) - Calculates the bit length for the specified string column.
concat_ws(sep, *cols) - Concatenates multiple input string columns together into a single string column, using the given separator.
decode(col, charset) - Computes the first argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
encode(col, charset) - Computes the first argument into a binary from a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').
format_number(col, d) - Formats the number X to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string.
format_string(format, *cols) - Formats the arguments in printf-style and returns the result as a string column.
initcap(col) - Translate the first letter of each word to upper case in the sentence.
instr(str, substr) - Locate the position of the first occurrence of substr column in the given string.
length(col) - Computes the character length of string data or number of bytes of binary data.
lower(col) - Converts a string expression to lower case.
levenshtein(left, right) - Computes the Levenshtein distance of the two given strings.
locate(substr, str[, pos]) - Locate the position of the first occurrence of substr in a string column, after position pos.
lpad(col, len, pad) - Left-pad the string column to width len with pad.
ltrim(col) - Trim the spaces from the left end for the specified string value.
octet_length(col) - Calculates the byte length for the specified string column.
regexp_extract(str, pattern, idx) - Extract a specific group matched by a Java regex, from the specified string column.
regexp_replace(string, pattern, replacement) - Replace all substrings of the specified string value that match regexp with replacement.
unbase64(col) - Decodes a BASE64 encoded string column and returns it as a binary column.
rpad(col, len, pad) - Right-pad the string column to width len with pad.
repeat(col, n) - Repeats a string column n times, and returns it as a new string column.
rtrim(col) - Trim the spaces from the right end for the specified string value.
soundex(col) - Returns the SoundEx encoding for a string.
split(str, pattern[, limit]) - Splits str around matches of the given pattern.
substring(str, pos, len) - Substring starts at pos and is of length len when str is String type, or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
substring_index(str, delim, count) - Returns the substring from string str before count occurrences of the delimiter delim.
overlay(src, replace, pos[, len]) - Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes.
sentences(string[, language, country]) - Splits a string into arrays of sentences, where each sentence is an array of words.
translate(srcCol, matching, replace) - Translates any character in srcCol that appears in matching to the corresponding character in replace.
trim(col) - Trim the spaces from both ends for the specified string column.
upper(col) - Converts a string expression to uppercase.
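
A short string-cleaning sketch (hypothetical data; assumes an active SparkSession named spark):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("  John Smith  ",)], ["raw"])

    out = (
        df.select(F.trim("raw").alias("clean"))
          .select(
              F.upper("clean").alias("upper_case"),
              F.split("clean", " ").alias("words"),
              F.regexp_replace("clean", "Smith$", "Doe").alias("renamed"),
              F.concat_ws("-", F.split("clean", " ")).alias("slug"),  # "John-Smith"
          )
    )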

UDF
call_udf(udfName, *cols) - Call a user-defined function.
pandas_udf([f, returnType, functionType]) - Creates a pandas user defined function (a.k.a. vectorized user defined function).
udf([f, returnType]) - Creates a user defined function (UDF).
unwrap_udt(col) - Unwrap UDT data type column into its underlying type.
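
A minimal udf() sketch (the function and data are hypothetical; plain Python UDFs run row-at-a-time, so prefer a built-in function when one exists):

    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    @F.udf(returnType=IntegerType())
    def vowel_count(s):
        # Runs as plain Python on each row; guard against nulls.
        return sum(1 for ch in (s or "") if ch.lower() in "aeiou")

    df = spark.createDataFrame([("Databricks",)], ["word"])
    out = df.select(vowel_count("word").alias("vowels"))  # 3
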
Misc Functions
md5(col) - Calculates the MD5 digest and returns the value as a 32 character hex string.
sha1(col) - Returns the hex string result of SHA-1.
sha2(col, numBits) - Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
crc32(col) - Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.
hash(*cols) - Calculates the hash code of given columns, and returns the result as an int column.
xxhash64(*cols) - Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column.
assert_true(col[, errMsg]) - Returns null if the input column is true; throws an exception with the provided error message otherwise.
raise_error(errMsg) - Throws an exception with the provided error message.
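
For example, hashing a column for pseudonymization or bucketing (hypothetical data; assumes an active SparkSession named spark):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("alice@example.com",)], ["email"])

    out = df.select(
        F.sha2("email", 256).alias("email_sha256"),  # 64-char hex digest
        F.md5("email").alias("email_md5"),           # 32-char hex digest
        F.hash("email").alias("bucket_key"),         # 32-bit int hash
    )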

DataFrame
DataFrame.__getattr__(name) - Returns the Column denoted by name.
DataFrame.__getitem__(item) - Returns the column as a Column.
DataFrame.agg(*exprs) - Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
DataFrame.alias(alias) - Returns a new DataFrame with an alias set.
DataFrame.approxQuantile(col, probabilities, …) - Calculates the approximate quantiles of numerical columns of a DataFrame.
DataFrame.cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK).
DataFrame.checkpoint([eager]) - Returns a checkpointed version of this DataFrame.
DataFrame.coalesce(numPartitions) - Returns a new DataFrame that has exactly numPartitions partitions.
DataFrame.colRegex(colName) - Selects column based on the column name specified as a regex and returns it as Column.
DataFrame.collect() - Returns all the records as a list of Row.
DataFrame.columns - Retrieves the names of all columns in the DataFrame as a list.
DataFrame.corr(col1, col2[, method]) - Calculates the correlation of two columns of a DataFrame as a double value.
DataFrame.count() - Returns the number of rows in this DataFrame.
DataFrame.cov(col1, col2) - Calculate the sample covariance for the given columns, specified by their names, as a double value.
DataFrame.createGlobalTempView(name) - Creates a global temporary view with this DataFrame.
DataFrame.createOrReplaceGlobalTempView(name) - Creates or replaces a global temporary view using the given name.
DataFrame.createOrReplaceTempView(name) - Creates or replaces a local temporary view with this DataFrame.
DataFrame.createTempView(name) - Creates a local temporary view with this DataFrame.
DataFrame.crossJoin(other) - Returns the cartesian product with another DataFrame.
DataFrame.crosstab(col1, col2) - Computes a pair-wise frequency table of the given columns.
DataFrame.cube(*cols) - Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.
DataFrame.describe(*cols) - Computes basic statistics for numeric and string columns.
DataFrame.distinct() - Returns a new DataFrame containing the distinct rows in this DataFrame.
DataFrame.drop(*cols) - Returns a new DataFrame without specified columns.
DataFrame.dropDuplicates([subset]) - Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
DataFrame.dropDuplicatesWithinWatermark([subset]) - Return a new DataFrame with duplicate rows removed within the watermark, optionally only considering certain columns.
DataFrame.drop_duplicates([subset]) - drop_duplicates() is an alias for dropDuplicates().
DataFrame.dropna([how, thresh, subset]) - Returns a new DataFrame omitting rows with null values.
DataFrame.dtypes - Returns all column names and their data types as a list.
DataFrame.exceptAll(other) - Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.
DataFrame.explain([extended, mode]) - Prints the (logical and physical) plans to the console for debugging purposes.
DataFrame.fillna(value[, subset]) - Replace null values, alias for na.fill().
DataFrame.filter(condition) - Filters rows using the given condition.
DataFrame.first() - Returns the first row as a Row.
DataFrame.foreach(f) - Applies the f function to all Rows of this DataFrame.
DataFrame.foreachPartition(f) - Applies the f function to each partition of this DataFrame.
DataFrame.freqItems(cols[, support]) - Finding frequent items for columns, possibly with false positives.
DataFrame.groupBy(*cols) - Groups the DataFrame using the specified columns, so we can run aggregation on them.
DataFrame.head([n]) - Returns the first n rows.
DataFrame.hint(name, *parameters) - Specifies some hint on the current DataFrame.
DataFrame.inputFiles() - Returns a best-effort snapshot of the files that compose this DataFrame.
DataFrame.intersect(other) - Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.
DataFrame.intersectAll(other) - Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.
DataFrame.isEmpty() - Checks if the DataFrame is empty and returns a boolean value.
DataFrame.isLocal() - Returns True if the collect() and take() methods can be run locally (without any Spark executors).
DataFrame.isStreaming - Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.
DataFrame.join(other[, on, how]) - Joins with another DataFrame, using the given join expression.
DataFrame.limit(num) - Limits the result count to the number specified.
DataFrame.localCheckpoint([eager]) - Returns a locally checkpointed version of this DataFrame.
DataFrame.mapInPandas(func, schema[, barrier]) - Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
DataFrame.mapInArrow(func, schema[, barrier]) - Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow RecordBatch, and returns the result as a DataFrame.
DataFrame.melt(ids, values, …) - Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
DataFrame.na - Returns a DataFrameNaFunctions for handling missing values.
DataFrame.observe(observation, *exprs) - Define (named) metrics to observe on the DataFrame.
DataFrame.offset(num) - Returns a new DataFrame by skipping the first num rows.
DataFrame.orderBy(*cols, **kwargs) - Returns a new DataFrame sorted by the specified column(s).
DataFrame.persist([storageLevel]) - Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.
DataFrame.printSchema([level]) - Prints out the schema in the tree format.
DataFrame.randomSplit(weights[, seed]) - Randomly splits this DataFrame with the provided weights.
DataFrame.rdd - Returns the content as a pyspark.RDD of Row.
DataFrame.registerTempTable(name) - Registers this DataFrame as a temporary table using the given name.
DataFrame.repartition(numPartitions, *cols) - Returns a new DataFrame partitioned by the given partitioning expressions.
DataFrame.repartitionByRange(numPartitions, …) - Returns a new DataFrame partitioned by the given partitioning expressions.
DataFrame.replace(to_replace[, value, subset]) - Returns a new DataFrame replacing a value with another value.
DataFrame.rollup(*cols) - Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
DataFrame.sameSemantics(other) - Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results.
DataFrame.sample([withReplacement, …]) - Returns a sampled subset of this DataFrame.
DataFrame.sampleBy(col, fractions[, seed]) - Returns a stratified sample without replacement based on the fraction given on each stratum.
DataFrame.schema - Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
DataFrame.select(*cols) - Projects a set of expressions and returns a new DataFrame.
DataFrame.selectExpr(*expr) - Projects a set of SQL expressions and returns a new DataFrame.
DataFrame.semanticHash() - Returns a hash code of the logical query plan against this DataFrame.
DataFrame.show([n, truncate, vertical]) - Prints the first n rows to the console.
DataFrame.sort(*cols, **kwargs) - Returns a new DataFrame sorted by the specified column(s).
DataFrame.sortWithinPartitions(*cols, **kwargs) - Returns a new DataFrame with each partition sorted by the specified column(s).
DataFrame.sparkSession - Returns the Spark session that created this DataFrame.
DataFrame.stat - Returns a DataFrameStatFunctions for statistic functions.
DataFrame.storageLevel - Get the DataFrame's current storage level.
DataFrame.subtract(other) - Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.
DataFrame.summary(*statistics) - Computes specified statistics for numeric and string columns.
DataFrame.tail(num) - Returns the last num rows as a list of Row.
DataFrame.take(num) - Returns the first num rows as a list of Row.
DataFrame.to(schema) - Returns a new DataFrame where each row is reconciled to match the specified schema.
DataFrame.toDF(*cols) - Returns a new DataFrame with the new specified column names.
DataFrame.toJSON([use_unicode]) - Converts a DataFrame into a RDD of string.
DataFrame.toLocalIterator([prefetchPartitions]) - Returns an iterator that contains all of the rows in this DataFrame.
DataFrame.toPandas() - Returns the contents of this DataFrame as a pandas.DataFrame.
DataFrame.to_pandas_on_spark([index_col]) - Converts the existing DataFrame into a pandas-on-Spark DataFrame (deprecated in favor of pandas_api()).
DataFrame.transform(func, *args, **kwargs) - Returns a new DataFrame; concise syntax for chaining custom transformations.
DataFrame.union(other) - Return a new DataFrame containing the union of rows in this and another DataFrame.
DataFrame.unionAll(other) - Return a new DataFrame containing the union of rows in this and another DataFrame.
DataFrame.unionByName(other[, …]) - Returns a new DataFrame containing a union of rows in this and another DataFrame.
DataFrame.unpersist([blocking]) - Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.
DataFrame.unpivot(ids, values, …) - Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
DataFrame.where(condition) - where() is an alias for filter().
DataFrame.withColumn(colName, col) - Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
DataFrame.withColumns(*colsMap) - Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names.
DataFrame.withColumnRenamed(existing, new) - Returns a new DataFrame by renaming an existing column.
DataFrame.withColumnsRenamed(colsMap) - Returns a new DataFrame by renaming multiple columns.
DataFrame.withMetadata(columnName, metadata) - Returns a new DataFrame by updating an existing column with metadata.
DataFrame.withWatermark(eventTime, …) - Defines an event time watermark for this DataFrame.
DataFrame.write - Interface for saving the content of the non-streaming DataFrame out into external storage.
DataFrame.writeStream - Interface for saving the content of the streaming DataFrame out into external storage.
DataFrame.writeTo(table) - Create a write configuration builder for v2 sources.
DataFrame.pandas_api([index_col]) - Converts the existing DataFrame into a pandas-on-Spark DataFrame.
DataFrameNaFunctions.drop([how, thresh, subset]) - Returns a new DataFrame omitting rows with null values.
DataFrameNaFunctions.fill(value[, subset]) - Replace null values, alias for na.fill().
DataFrameNaFunctions.replace(to_replace[, …]) - Returns a new DataFrame replacing a value with another value.
DataFrameStatFunctions.approxQuantile(col, …) - Calculates the approximate quantiles of numerical columns of a DataFrame.
DataFrameStatFunctions.corr(col1, col2[, method]) - Calculates the correlation of two columns of a DataFrame as a double value.
DataFrameStatFunctions.cov(col1, col2) - Calculate the sample covariance for the given columns, specified by their names, as a double value.
DataFrameStatFunctions.crosstab(col1, col2) - Computes a pair-wise frequency table of the given columns.
DataFrameStatFunctions.freqItems(cols[, support]) - Finding frequent items for columns, possibly with false positives.
DataFrameStatFunctions.sampleBy(col, fractions) - Returns a stratified sample without replacement based on the fraction given on each stratum.
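
A closing sketch chaining several DataFrame methods (hypothetical data; assumes an active SparkSession named spark):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", None)], ["key", "val"])

    out = (
        df.na.fill({"val": 0})                      # DataFrameNaFunctions.fill
          .withColumn("doubled", F.col("val") * 2)  # add a derived column
          .where(F.col("key") == "a")               # where() is an alias for filter()
          .groupBy("key")
          .agg(F.sum("doubled").alias("total"))
          .orderBy("key")
    )
    out.show()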
