Methods & Functions in Databricks
Functions
Normal Functions
col(col) Returns a Column based on the given column name.
input_file_name() Creates a string column for the file name of the current
Spark task.
Math Functions
sqrt(col) Computes the square root of the specified float value.
bin(col) Returns the string representation of the binary value of the given
column.
pow(col1, col2) Returns the value of the first argument raised to the power of the
second argument.
rint(col) Returns the double value that is closest in value to the argument
and is equal to a mathematical integer.
round(col[, scale]) Round the given value to scale decimal places using HALF_UP
rounding mode if scale >= 0 or at integral part when scale < 0.
bround(col[, scale]) Round the given value to scale decimal places using HALF_EVEN
rounding mode if scale >= 0 or at integral part when scale < 0.
Datetime Functions
add_months(start, months) Returns the date that is months months after
start.
date_add(start, days) Returns the date that is days days after start.
date_sub(start, days) Returns the date that is days days before start.
date_trunc(format, timestamp) Returns timestamp truncated to the unit
specified by the format.
next_day(date, dayOfWeek) Returns the first date later than the value of the
date column that falls on the given day of the
week (e.g. 'Mon', 'Sunday').
make_date(year, month, day) Returns a column with a date built from the year,
month and day columns.
Collection Functions
array(*cols) Creates a new array column.
Partition Transformation Functions
years(col) Partition transform function: A transform for timestamps and
dates to partition data into years.
Aggregate Functions
approxCountDistinct(col[, rsd]) Aggregate function: returns a new Column for
the approximate distinct count of col; deprecated
alias of approx_count_distinct.
Window Functions
cume_dist() Window function: returns the cumulative distribution of
values within a window partition, i.e. the fraction of rows
that are below the current row.
lag(col[, offset, default]) Window function: returns the value that is offset rows
before the current row, and default if there is less than
offset rows before the current row.
lead(col[, offset, default]) Window function: returns the value that is offset rows
after the current row, and default if there is less than
offset rows after the current row.
nth_value(col, offset[, ignoreNulls]) Window function: returns the value that is the
offset-th row of the window frame (counting
from 1), and null if the size of the window
frame is less than offset rows.
Sort Functions
asc(col) Returns a sort expression based on the ascending order of the
given column name.
String Functions
ascii(col) Computes the numeric value of the first
character of the string column.
base64(col) Computes the BASE64 encoding of a binary
column and returns it as a string column.
lpad(col, len, pad) Left-pad the string column to width len with
pad.
ltrim(col) Trim the spaces from left end for the specified
string value.
rpad(col, len, pad) Right-pad the string column to width len with
pad.
rtrim(col) Trim the spaces from right end for the specified
string value.
split(str, pattern[, limit]) Splits str around matches of the given pattern.
substring(str, pos, len) Substring starts at pos and is of length len
when str is String type or returns the slice of
byte array that starts at pos in byte and is of
length len when str is Binary type.
substring_index(str, delim, count) Returns the substring from string str before
count occurrences of the delimiter delim.
overlay(src, replace, pos[, len]) Overlay the specified portion of src with
replace, starting from byte position pos of src
and proceeding for len bytes.
UDF
call_udf(udfName, *cols) Call a user-defined function.
Hash Functions
sha2(col, numBits) Returns the hex string result of SHA-2 family of hash
functions (SHA-224, SHA-256, SHA-384, and SHA-512).
hash(*cols) Calculates the hash code of given columns, and returns the
result as an int column.
xxhash64(*cols) Calculates the hash code of given columns using the 64-bit
variant of the xxHash algorithm, and returns the result as a
long column.
DataFrame
DataFrame.__getattr__(name) Returns the Column denoted by name.
DataFrame.to_pandas_on_spark([index_col]) Converts the existing DataFrame into a
pandas-on-Spark DataFrame.