Apache Spark Builtin Functions

Uploaded by

Tuan Anh Tran

Spark built-in functions

Documentation: Functions — PySpark 3.5.3 documentation

Array Operations

Functions designed to work with array columns.

• arrays_zip(): Merges several arrays into a single array of structs, where the N-th struct holds the N-th value of each input array.

• array(): Creates a new array column.

• array_contains(): Returns true if the array contains a given value.

• array_distinct(): Removes duplicate values from the array.

• array_except(): Returns an array of the elements in the first array but not in the
second array.

• array_intersect(): Returns an array of the elements in both arrays.

• array_join(): Concatenates the elements of an array using a delimiter.

• array_max(): Returns the maximum value in the array.

• array_min(): Returns the minimum value in the array.

• array_position(): Returns the 1-based position of the first occurrence of an element in the array, or 0 if the element is not found.

• array_remove(): Removes all occurrences of a given value from the array.

• array_repeat(): Returns a new array with repeated elements.

• array_sort(): Sorts the array in ascending order.

• array_union(): Returns an array of the elements that appear in either array, without duplicates.

• explode(): Creates a new row for each element in the array.

• posexplode(): Like explode(), but includes the position of the element in the array.

• flatten(): Flattens an array of arrays into a single array.

• reverse(): Reverses the order of the elements in the array.

• size(): Returns the number of elements in an array or map.

• slice(): Returns a sub-array that starts at a given 1-based position and has a given length.

Conditional Functions

These functions are used to apply conditional logic within DataFrames.

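• when(condition, value): Similar to SQL’s CASE WHEN; returns the value when the condition is true. Calls can be chained to test several conditions in order.

• otherwise(value): A Column method chained after when(); specifies the value to return when none of the when() conditions is met. Without it, unmatched rows get null.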

• ifnull(): Returns the second value if the first is null, otherwise returns the first.

• nvl(): An alias of ifnull().

• nvl2(): Returns the second value if the first is not null; otherwise, it returns the third
value.

• nullif(): Returns null if both values are equal, otherwise returns the first value.

Map Operations

Functions that operate on map columns.

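• map(): Creates a new map column from key and value expressions. In PySpark the function is named create_map(); map() is the Spark SQL name.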

• map_concat(): Concatenates multiple maps into one.

• map_entries(): Converts a map into an array of structs with key and value fields.

• map_from_arrays(): Creates a map from two arrays (keys and values).

• map_keys(): Returns an array of the keys in the map.

• map_values(): Returns an array of the values in the map.

• element_at(): Returns the value associated with the given key in a map, or the element at a given 1-based index in an array.

String Operations

Functions for manipulating and working with string columns.

• concat(): Concatenates multiple columns or strings.

• concat_ws(): Concatenates multiple columns or strings with a given separator.

• instr(): Returns the 1-based position of the first occurrence of a substring, or 0 if the substring is not found.

• length(): Returns the length of a string.

• lower(): Converts a string to lowercase.

• upper(): Converts a string to uppercase.

• regexp_extract(): Extracts a substring using a regular expression.

• regexp_replace(): Replaces substrings that match a regular expression.

• split(): Splits a string into an array around matches of a delimiter pattern (a regular expression).

• substring(): Extracts a substring of a given length, starting at a given 1-based position.

• replace(): Replaces all occurrences of a substring with another substring.

• translate(): Replaces characters in a string with other characters.

• trim(): Trims the spaces from both ends of a string.

• ltrim(): Trims spaces from the left side of a string.

• rtrim(): Trims spaces from the right side of a string.

• initcap(): Capitalizes the first letter of each word.

• soundex(): Returns the Soundex code for a string.

• levenshtein(): Returns the Levenshtein distance between two strings.

Math Operations

Functions for performing mathematical operations on numeric columns.

• abs(): Returns the absolute value.

• ceil(): Returns the smallest integer greater than or equal to the value.

• floor(): Returns the largest integer less than or equal to the value.

• round(): Rounds a number to the nearest integer or specified decimal places.

• sqrt(): Returns the square root.

• log(): Returns the natural logarithm when called with one argument; with two arguments, the first is the base.

• log10(): Returns the base 10 logarithm.

• exp(): Returns the exponential value of a number.

• sin(), cos(), tan(): Trigonometric sine, cosine, and tangent.

• asin(), acos(), atan(): Inverse trigonometric functions.

• signum(): Returns the sign of a number (-1, 0, or 1).

• pow(): Raises a number to a given power.

• greatest(): Returns the greatest value among the arguments.

• least(): Returns the least value among the arguments.

• rand(): Generates a random number uniformly distributed in [0, 1).

• randn(): Generates a random number from the standard normal distribution.

• pi(): Returns the value of Pi.

• degrees(): Converts radians to degrees.

• radians(): Converts degrees to radians.

Date and Time Operations

Functions for working with date and timestamp columns.

• current_date(): Returns the current date.

• current_timestamp(): Returns the current timestamp.

• date_add(): Adds a specified number of days to a date.

• date_sub(): Subtracts a specified number of days from a date.

• datediff(): Returns the number of days from a start date to an end date; the end date is the first argument.

• add_months(): Adds a specified number of months to a date.

• months_between(): Returns the number of months between two dates.

• year(), month(), dayofmonth(): Extracts the year, month, day from a date.

• hour(), minute(), second(): Extracts the hour, minute, second from a timestamp.

• to_date(): Converts a string to a date.

• to_timestamp(): Converts a string to a timestamp.

• from_unixtime(): Converts Unix time (seconds since the epoch) to a string representing the timestamp in a given format.

• unix_timestamp(): Converts a timestamp or time string to Unix time in seconds.

• date_format(): Formats a date or timestamp as a string.

• last_day(): Returns the last day of the month for a given date.

• next_day(): Returns the first date after a given date that falls on the specified day of
the week.

Aggregate Functions

Functions that aggregate data across rows.

• count(): Returns the count of rows.

• countDistinct(): Returns the count of distinct values (also available as count_distinct()).

• sum(): Returns the sum of values.

• avg(): Returns the average of values.

• max(): Returns the maximum value.

• min(): Returns the minimum value.

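• stddev(): Returns the sample standard deviation.

• variance(): Returns the sample variance.

• first(): Returns the first value in a group; the result depends on row order, which is not guaranteed after a shuffle.

• last(): Returns the last value in a group, with the same caveat about row order.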

• collect_list(): Returns a list of all values.

• collect_set(): Returns a set of all distinct values.

Advanced DataFrame Operations

These are some of the more advanced functions that do not fit directly into other categories
but are useful for certain types of data manipulation.

• broadcast(): Marks a DataFrame as small enough for broadcasting during join operations.

• approx_count_distinct(): Returns the approximate count of distinct items using the HyperLogLog++ algorithm.

• cube(): A DataFrame method (df.cube(...)) that computes aggregations for every combination of the given grouping columns.

• rollup(): A DataFrame method similar to cube(), but it produces hierarchical subtotals (useful for subtotal and grand-total calculations).

• grouping(): Indicates whether a grouping column is aggregated (1) or not (0) in a result row when using cube or rollup.

• pivot(): A method on grouped data (df.groupBy(...).pivot(...)) that turns the distinct values of one column into multiple columns.

• to_json(): Converts a struct (or array of structs) to a JSON string.

• from_json(): Parses a JSON string into a struct, map, or array, using a schema that you supply.

• schema_of_json(): Infers the schema of a JSON string.

• schema_of_csv(): Infers the schema of a CSV string.

• to_csv(): Converts a struct into a CSV string.

Hashing Functions

Functions that generate hash values, often used for unique identifiers or partitioning.

• hash(): Returns a 32-bit integer hash (Murmur3) of the given columns.

• md5(): Calculates the MD5 digest of a string as a 32-character hexadecimal string.

• sha1(): Calculates the SHA-1 digest of a string as a 40-character hexadecimal string.

• sha2(): Returns the hexadecimal digest of a SHA-2 family hash function (SHA-224, SHA-256, SHA-384, or SHA-512), selected by a bit-length argument.

• crc32(): Computes a cyclic redundancy check (CRC32) of a string.

• xxhash64(): Computes a 64-bit hash using the xxHash algorithm.

Window Functions

Functions that operate over a window of rows (often used in conjunction with Window
specifications).

• row_number(): Assigns a sequential number, starting at 1, to each row within a window partition.

• rank(): Returns the rank of rows within a window partition; tied rows share a rank and the following ranks are skipped.

• dense_rank(): Returns the rank of rows within a window partition without gaps after ties.

• ntile(): Divides rows into a specified number of roughly equal groups.

• lead(): Returns the value from a following row in the window (by default the next row).

• lag(): Returns the value from a preceding row in the window (by default the previous row).

• cume_dist(): Returns the cumulative distribution of values within a window partition.

• percent_rank(): Returns the relative rank of a row within its window partition, as a value between 0 and 1.

Null Handling

Functions for handling null values.

• isnull(): Returns true if the column is null.

• isnan(): Returns true if the column contains NaN (Not a Number).

• coalesce(): Returns the first non-null value.

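• na.fill(): A DataFrame method (df.na.fill(...)) that replaces null values with a specified value.

• na.drop(): A DataFrame method (df.na.drop(...)) that drops rows containing null values.

• na.replace(): A DataFrame method (df.na.replace(...)) that replaces specific values in a column with other values.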

Miscellaneous Functions

Other useful functions that don't fit neatly into the above categories.

• lit(): Creates a column of a literal value.

• col(): Returns a column based on a string name.

• when(): A conditional expression (similar to SQL CASE WHEN).

• expr(): Parses the expression string into a column.

• monotonically_increasing_id(): Returns a column that generates unique, monotonically increasing 64-bit integers; the values are not guaranteed to be consecutive.

• input_file_name(): Returns the name of the file being read.

• struct(): Creates a struct column from multiple columns.
