Apache Spark Built-in Functions
Array Operations
• array_except(): Returns an array of the elements that are in the first array but not in
the second array, without duplicates.
• posexplode(): Like explode(), but includes the position of the element in the array.
Conditional and Null-Handling Operations
• otherwise(): Specifies the value to return when none of the preceding when() conditions
is met.
• ifnull(): Returns the second value if the first is null, otherwise returns the first.
• nvl2(): Returns the second value if the first is not null; otherwise, it returns the third
value.
• nullif(): Returns null if both values are equal, otherwise returns the first value.
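
The array and null-handling entries above can be sketched in plain Python. This is only a model of the described semantics, not Spark code: the helper names mirror the Spark functions for readability, None stands in for SQL NULL, and when_otherwise is a hypothetical single-row stand-in for a when().otherwise() chain. Preserving input order in array_except is an assumption of the model.

```python
# Plain-Python models of the semantics described above; these are not Spark APIs.
# None stands in for SQL NULL.


def array_except(first, second):
    """Elements of `first` that are absent from `second`, without duplicates."""
    excluded = set(second)
    result = []
    for item in first:
        if item not in excluded and item not in result:
            result.append(item)
    return result


def posexplode(values):
    """One (position, value) pair per element; positions start at 0."""
    return list(enumerate(values))


def when_otherwise(condition, value, default):
    """Single-row model of when(condition, value).otherwise(default)."""
    return value if condition else default


def ifnull(first, second):
    """`second` when `first` is null, otherwise `first`."""
    return second if first is None else first


def nvl2(first, second, third):
    """`second` when `first` is not null, otherwise `third`."""
    return second if first is not None else third


def nullif(first, second):
    """Null when both values are equal, otherwise `first`."""
    return None if first == second else first
```

In Spark itself these operate on whole columns rather than single Python values; the model only shows what each row's result would be.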
Map Operations
• map_entries(): Converts a map into an array of structs with key and value fields.
• element_at(): Returns the value associated with the given key in the map.
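
As a rough illustration of the two map functions, here is a plain-Python model that uses a dict in place of a Spark map column. It is a sketch of the semantics only. Returning None for a missing key is an assumption of the model; the behavior of Spark for missing keys can depend on version and ANSI settings.

```python
# Plain-Python model; a dict stands in for a Spark map column.


def map_entries(mapping):
    """Turn a map into a list of records with `key` and `value` fields."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def element_at(mapping, key):
    """Value stored under `key`; None when the key is absent (model assumption)."""
    return mapping.get(key)


prices = {"apple": 3, "pear": 4}   # hypothetical sample data
entries = map_entries(prices)      # one record per map entry
apple_price = element_at(prices, "apple")
```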
Math Operations
• ceil(): Returns the smallest integer greater than or equal to the value.
• floor(): Returns the largest integer less than or equal to the value.
Date and Time Operations
• year(), month(), dayofmonth(): Extract the year, month, and day of the month from a date.
• hour(), minute(), second(): Extract the hour, minute, and second from a timestamp.
• last_day(): Returns the last day of the month for a given date.
• next_day(): Returns the first date after a given date that falls on the specified day of
the week.
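
The rounding and date entries above can be modeled with the Python standard library. This is a sketch of the semantics rather than Spark code: last_day and next_day below are plain functions named after their Spark counterparts, and accepting either a three-letter or a full day name is an assumption of the model.

```python
import calendar
import math
from datetime import date, timedelta

WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def last_day(d):
    """Last day of the month that contains `d`."""
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=days_in_month)


def next_day(d, day_of_week):
    """First date strictly after `d` that falls on `day_of_week`."""
    target = WEEKDAYS.index(day_of_week[:3].lower())
    # Always move forward by 1 to 7 days, never 0.
    days_ahead = (target - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=days_ahead)


sample = date(2024, 2, 10)                        # a Saturday in a leap year
parts = (sample.year, sample.month, sample.day)   # year(), month(), dayofmonth()
rounded = (math.ceil(2.1), math.floor(2.9))       # ceil(), floor()
month_end = last_day(sample)
following_monday = next_day(sample, "Mon")
```

Note that next_day never returns the input date itself: asking for the next Monday on a Monday yields the date one week later.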
Aggregate Functions
These more advanced operations do not fit directly into the other categories but are useful
for subtotals and for reshaping data.
• rollup(): Similar to cube(), but aggregates only along the hierarchy of the listed
columns, from left to right, rather than over every combination (useful for subtotal
calculations).
• pivot(): Pivots grouped data by turning the distinct values of one column into separate
columns.
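
To make the grouping levels concrete, the following plain-Python sketch models a rollup and a pivot with a sum aggregate. It is not Spark code: rollup_sum, pivot_sum, and the sales_rows sample data are hypothetical names used for illustration. Rolled-up key columns are shown as None, and the pivot model simply omits empty cells.

```python
from collections import defaultdict


def rollup_sum(rows, keys, value):
    """Sum `value` per rollup level: all keys, each shorter prefix, then the grand total."""
    totals = defaultdict(int)
    for row in rows:
        for level in range(len(keys), -1, -1):
            prefix = tuple(row[k] for k in keys[:level])
            # Keys dropped at this level are padded with None.
            group = prefix + (None,) * (len(keys) - level)
            totals[group] += row[value]
    return dict(totals)


def pivot_sum(rows, index, columns, value):
    """One entry per `index` value, with one summed cell per distinct `columns` value."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[index]][row[columns]] += row[value]
    return {key: dict(cells) for key, cells in table.items()}


sales_rows = [  # hypothetical sample data
    {"region": "EU", "country": "DE", "sales": 10},
    {"region": "EU", "country": "FR", "sales": 5},
    {"region": "US", "country": "US", "sales": 7},
]
subtotals = rollup_sum(sales_rows, ["region", "country"], "sales")
wide = pivot_sum(sales_rows, "region", "country", "sales")
```

With two key columns the rollup produces three levels (region and country, region only, grand total), whereas a cube would also include a country-only level.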
Hash Functions
Functions that generate hash values, often used for unique identifiers or partitioning.
• sha2(): Returns the hex string of a SHA-2 family hash (SHA-224, SHA-256, SHA-384, or
SHA-512), selected by the requested bit length.
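
The same digests can be reproduced with Python's hashlib, which is handy for checking a value outside Spark. This is a sketch and not Spark code: the sha2 helper below is a plain function named after the Spark one, it assumes UTF-8 input text, and raising ValueError for an unsupported bit length is a choice of this model. Treating a bit length of 0 as 256 follows the Spark documentation.

```python
import hashlib

_ALGORITHMS = {
    224: hashlib.sha224,
    256: hashlib.sha256,
    384: hashlib.sha384,
    512: hashlib.sha512,
}


def sha2(text, num_bits):
    """Hex digest of `text` using the SHA-2 variant with `num_bits` output bits."""
    if num_bits == 0:
        num_bits = 256  # 0 is documented as equivalent to 256
    if num_bits not in _ALGORITHMS:
        raise ValueError(f"unsupported bit length: {num_bits}")
    return _ALGORITHMS[num_bits](text.encode("utf-8")).hexdigest()


digest = sha2("abc", 256)
```

The hex string has num_bits / 4 characters, so a SHA-256 digest is 64 characters long.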
Window Functions
Functions that operate over a window of rows, where the partitioning and ordering of the
window are defined by a Window specification.
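• row_number(): Assigns a sequential row number, starting at 1, to each row within a window
partition.
• lead(): Returns the value from a row a given offset after the current row in the window
(the next row by default).
• lag(): Returns the value from a row a given offset before the current row in the window
(the previous row by default).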
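
The three window functions above can be modeled in plain Python by sorting rows, splitting them into partitions, and looking one row ahead or behind. This is a sketch of the semantics with the default offset of 1, not Spark code; add_window_columns and the daily_rows sample data are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter


def add_window_columns(rows, partition_by, order_by, value):
    """Add row_number, lead, and lag of `value` within each partition."""
    ordered = sorted(rows, key=itemgetter(partition_by, order_by))
    result = []
    for _, group in groupby(ordered, key=itemgetter(partition_by)):
        part = list(group)
        for i, row in enumerate(part):
            result.append({
                **row,
                "row_number": i + 1,  # numbering restarts at 1 in every partition
                "lead": part[i + 1][value] if i + 1 < len(part) else None,
                "lag": part[i - 1][value] if i > 0 else None,
            })
    return result


daily_rows = [  # hypothetical sample data
    {"dept": "a", "day": 2, "amount": 20},
    {"dept": "b", "day": 1, "amount": 5},
    {"dept": "a", "day": 1, "amount": 10},
]
windowed = add_window_columns(daily_rows, "dept", "day", "amount")
```

At the edges of a partition there is no neighboring row, so the model yields None for lead on the last row and for lag on the first row.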
Miscellaneous Functions
Other useful functions that don't fit neatly into the above categories.