Outer join Spark dataframe with non-identical join column
Last Updated: 24 Apr, 2025
In PySpark, data frames are one of the most important data structures used for data processing and manipulation. The outer join operation in PySpark data frames is an important operation for combining data from multiple sources. However, the join columns in the two DataFrames may not be identical: they may have different names or contain values that do not match, which results in missing (null) values after the join.
In this article, we will discuss how to perform an outer join on two PySpark DataFrames with non-identical join columns and then merge the join columns.
Syntax of join() function
Syntax: DataFrame.join(other, on=None, how=None)
Parameters:
- other: DataFrame. Right side of the join
- on: str, list or Column, optional. A column name, a list of column names, or a join expression (Column).
- how: str, optional. Default is inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.
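As a quick illustration of these parameters, the on argument can be passed either as a column name or as a Column expression. The DataFrames and column names below are made up for this snippet only and are not part of the example data used later in the article.
Python3
from pyspark.sql import SparkSession

# Create a SparkSession for this small illustration
spark = SparkSession.builder.appName("JoinSyntax").getOrCreate()

left_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "left_val"])
right_df = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "right_val"])

# on as a column name: the result keeps a single "id" column
left_df.join(right_df, on="id", how="outer").show()

# on as a Column expression: the result keeps both "id" columns
left_df.join(right_df, on=left_df.id == right_df.id, how="outer").show()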
Dataframes Used for Outer Join and Merge Join Columns in PySpark
To illustrate the concept of outer join and merging join columns in PySpark data frames, we will create two sample data frames with non-identical join columns. The join column is named "Name" in the first data frame; in the second data frame it is named "Name" for the join() example and "people" for the merge example later in the article. In addition, the values in the join columns are not identical.
Outer Join using the Join function
To perform an outer join on the two DataFrames, we will use the join() function in PySpark. The join() function takes the other DataFrame and the join condition (a column name or a Column expression) as arguments. The outer join returns all the rows from both DataFrames, matching them where possible. For non-matching rows, the corresponding columns contain null values.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# Create a SparkSession
spark = SparkSession.builder.appName("OuterJoin").getOrCreate()
# Create Dataframe 1
df1 = spark.createDataFrame([
    ("Alice", 22, "Female"),
    ("Bob", 35, "Male"),
    ("Jack", 28, "Male"),
    ("Jill", 30, "Female")
], ["Name", "Age", "Gender"])
df1.show()
# Create Dataframe 2
df2 = spark.createDataFrame([
    ("Alice", "Chicago", "IL"),
    ("Bob", "Boston", "MA"),
    ("Charlie", "Houston", "TX"),
    ("David", "Austin", "TX")
], ["Name", "City", "State"])
df2.show()
# Outer Join operation
# df3 = df1.join(df2, "Name", how='outer')
df3 = df1.join(df2, df1.Name == df2.Name, how='outer')
df3.show()
Output:
Here, we can see that df1 and df2 are outer joined; wherever a row has no match, null is filled in. All the columns from both DataFrames are present in the final DataFrame, so the result contains two Name columns.
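If you keep both Name columns as in the join above, one way to merge them into a single column afterwards is to coalesce them. This is a sketch that assumes df1 and df2 from the code above; coalesce() picks the first non-null value.
Python3
from pyspark.sql.functions import coalesce

# Merge the two Name columns produced by the outer join into one column
df3_merged = df1.join(df2, df1.Name == df2.Name, how='outer') \
    .select(coalesce(df1.Name, df2.Name).alias("Name"),
            df1.Age, df1.Gender, df2.City, df2.State)
df3_merged.show()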
Now, to improve readability, we will keep only one Name column by making a small change: passing the column name as a string instead of a join expression. If you want to keep both columns, use the method above.
Python3
# Outer Join operation
df3 = df1.join(df2, "Name", how='outer')
df3.show()
Output:

Outer Join Using Merge
The merge method is not available in PySpark, but it is available in Pandas. If you are working with small data sets and want to use the merge method, you can convert your PySpark data frames to Pandas data frames, merge them with merge(), and then convert the resulting Pandas data frame back to a PySpark data frame.
Here, we first create two PySpark data frames, df1 and df2. We then convert them to Pandas data frames using the toPandas() method. Next, we use the Pandas merge() method to merge pdf1 and pdf2 on the join columns "Name" and "people", passing how="outer" to perform an outer join. Finally, the resulting Pandas data frame pdf3 can be converted back to a PySpark data frame using the createDataFrame() method.
Python3
from pyspark.sql import SparkSession
import pandas as pd
# Create a SparkSession
spark = SparkSession.builder.appName("OuterJoin").getOrCreate()
# Create Dataframe 1
df1 = spark.createDataFrame([
    ("Alice", 22, "Female"),
    ("Bob", 35, "Male"),
    ("Jack", 28, "Male"),
    ("Jill", 30, "Female")
], ["Name", "Age", "Gender"])
df1.show()
# Create Dataframe 2
df2 = spark.createDataFrame([
    ("Alice", "Chicago", "IL"),
    ("Bob", "Boston", "MA"),
    ("Charlie", "Houston", "TX"),
    ("David", "Austin", "TX")
], ["people", "City", "State"])
df2.show()
# Convert PySpark data frames to Pandas data frames
pdf1 = df1.toPandas()
pdf2 = df2.toPandas()
# Merge Pandas data frames using the merge method
pdf3 = pd.merge(pdf1, pdf2, how="outer", left_on="Name", right_on="people")
print(pdf3)
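# One possible way (a sketch, not part of the original listing) to convert the
# merged Pandas data frame back to a PySpark data frame. NaN values produced by
# the outer merge are replaced with None first so that Spark can infer
# consistent column types.
pdf3 = pdf3.astype(object).where(pdf3.notnull(), None)
df3 = spark.createDataFrame(pdf3)
df3.show()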
Output:
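Converting to Pandas is only practical for small data sets. The same result can also be produced while staying in PySpark, by joining on the differently named columns and then merging them with coalesce(). A minimal sketch, assuming df1 and df2 as defined in the code above:
Python3
from pyspark.sql.functions import coalesce

# Outer join on the non-identical join columns "Name" and "people",
# then merge them into a single "Name" column
df3 = df1.join(df2, df1.Name == df2.people, how="outer") \
    .select(coalesce(df1.Name, df2.people).alias("Name"),
            df1.Age, df1.Gender, df2.City, df2.State)
df3.show()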