Combining Data in Pandas With merge(), .join(), and concat() – Real Python

by Kyle Stratis · data-science · intermediate

Table of Contents
Pandas merge(): Combining Data on Common Columns or Indices
How to merge()
Examples
Pandas .join(): Combining Data on a Column or Index
How to .join()
Examples
Pandas concat(): Combining Data Across Rows or Columns
How to Add to a DataFrame With append()
Examples
Conclusion

Pandas’ Series and DataFrame objects are powerful tools for exploring and analyzing data.
Part of their power comes from a multifaceted approach to combining separate datasets.
With Pandas, you can merge, join, and concatenate your datasets, allowing you to unify
and better understand your data as you analyze it.

In this tutorial, you’ll learn how and when to combine your data in Pandas with:

merge() for combining data on common columns or indices

.join() for combining data on a key column or an index

concat() for combining DataFrames across rows or columns

If you have some experience using DataFrame and Series objects in Pandas and you’re
ready to learn how to combine them, then this tutorial will help you do exactly that. If you
want a quick refresher on DataFrames before proceeding, then Pandas DataFrames 101 will
get you caught up in no time.

You can follow along with the examples in this tutorial using the interactive Jupyter
Notebook and data files available at the link below:

Download the notebook and data set: Click here to get the Jupyter Notebook and
CSV data set you’ll use to learn about Pandas merge(), .join(), and concat() in this
tutorial.

Note: The techniques you’ll learn about below will generally work for both DataFrame
and Series objects. But for simplicity and conciseness, the examples will use the term
dataset to refer to objects that can be either DataFrames or Series.

Pandas merge(): Combining Data on Common Columns or Indices
The first technique you’ll learn is merge(). You can use merge() any time you want to do
database-like join operations. It’s the most flexible of the three operations you’ll learn.

When you want to combine data objects based on one or more keys in a similar way to a
relational database, merge() is the tool you need. More specifically, merge() is most useful
when you want to combine rows that share data.

You can achieve both many-to-one and many-to-many joins with merge(). In a many-to-
one join, one of your datasets will have many rows in the merge column that repeat the
same values (such as 1, 1, 3, 5, 5), while the merge column in the other dataset will not have
repeat values (such as 1, 3, 5).

As you might have guessed, in a many-to-many join, both of your merge columns will have
repeat values. These merges are more complex and result in the Cartesian product of the
joined rows.

This means that, after the merge, you’ll have every combination of rows that share the same
value in the key column. You’ll see this in action in the examples below.
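Before getting to the real datasets, the difference between these two kinds of joins is easy to see with a pair of tiny, invented DataFrames (the names and values here are purely illustrative):

```python
import pandas as pd

# Many-to-one: "key" repeats on the left but is unique on the right.
left = pd.DataFrame({"key": [1, 1, 3, 5, 5], "left_val": ["a", "b", "c", "d", "e"]})
right = pd.DataFrame({"key": [1, 3, 5], "right_val": ["x", "y", "z"]})

many_to_one = pd.merge(left, right, on="key")
print(len(many_to_one))  # 5: each left row matches exactly one right row

# Many-to-many: "key" repeats on both sides, so matching rows are
# combined as a Cartesian product within each key value.
right_dup = pd.DataFrame({"key": [1, 1, 3], "right_val": ["x", "y", "z"]})
many_to_many = pd.merge(left, right_dup, on="key")
print(len(many_to_many))  # 5: two left 1-rows x two right 1-rows, plus one 3-row
```

Note that in the many-to-many case, the left rows with key 5 vanish entirely, since `right_dup` has no matching key and the default merge is an inner join.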

What makes merge() so flexible is the sheer number of options for defining the behavior of
your merge. While the list can seem daunting, with practice you’ll be able to expertly merge
datasets of all kinds.

When you use merge(), you’ll provide two required arguments:

1. The left DataFrame


2. The right DataFrame

After that, you can provide a number of optional arguments to define how your datasets are
merged:

how: This defines what kind of merge to make. It defaults to 'inner', but other
possible options include 'outer', 'left', and 'right'.

on: Use this to tell merge() which columns or indices (also called key columns or key
indices) you want to join on. This is optional. If it isn’t specified, and left_index and
right_index (covered below) are False, then columns from the two DataFrames that
share names will be used as join keys. If you use on, then the column or index you
specify must be present in both objects.

left_on and right_on: Use either of these to specify a column or index that is present
only in the left or right objects that you are merging. Both default to None.

left_index and right_index: Set these to True to use the index of the left or right
objects to be merged. Both default to False.

suffixes: This is a tuple of strings to append to identical column names that are not
merge keys. This allows you to keep track of the origins of columns with the same
name.

These are some of the most important parameters to pass to merge(). For the full list, see
the Pandas documentation.
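As a quick sketch of left_on and right_on, here are two invented DataFrames whose key columns have different names (the frames and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical frames: the key column is named differently on each side.
staff = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Ana", "Ben", "Cal"]})
hours = pd.DataFrame({"employee": [2, 3, 4], "hours": [38, 40, 35]})

# left_on/right_on pair up columns that don't share a name.
merged = pd.merge(staff, hours, left_on="emp_id", right_on="employee")
print(merged)
# Both key columns survive in the result, since Pandas doesn't
# assume differently named keys are interchangeable.
```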

Note: In this tutorial, you’ll see that examples always specify which column(s) to join
on with on. This is the safest way to merge your data because you and anyone reading
your code will know exactly what to expect when merge() is called. If you do not
specify the merge column(s) with on, then Pandas will use any columns with the same
name as the merge keys.

How to merge()
Before getting into the details of how to use merge(), you should first understand the
various forms of joins:

inner

outer

left

right

Note: Even though you’re learning about merging, you’ll see inner, outer, left, and
right also referred to as join operations. For this tutorial, you can consider these
terms equivalent.

You’ll learn about these in detail below, but first take a look at this visual representation of
the different joins:

Visual Representation of Join Types

In this image, the two circles are your two datasets, and the labels point to which part or
parts of the datasets you can expect to see. While this diagram doesn’t cover all the nuance,
it can be a handy guide for visual learners.

If you have an SQL background, then you may recognize the merge operation names from
the JOIN syntax. Except for inner, all of these techniques are types of outer joins. With
outer joins, you’ll merge your data based on all the keys in the left object, the right object, or
both. For keys that only exist in one object, unmatched columns in the other object will be
filled in with NaN (Not a Number).

You can also see a visual explanation of the various joins in a SQL context on Coding Horror.
Now let’s take a look at the different joins in action.

Examples
Many Pandas tutorials provide very simple DataFrames to illustrate the concepts they are
trying to explain. This approach can be confusing since you can’t relate the data to anything
concrete. So, for this tutorial, you’ll use two real-world datasets as the DataFrames to be
merged:

1. Climate normals for California (temperatures)


2. Climate normals for California (precipitation)

You can explore these datasets and follow along with the examples below using the
interactive Jupyter Notebook and climate data CSVs that you downloaded at the start of
this tutorial.

If you’d like to learn how to use Jupyter Notebooks, then check out Jupyter Notebook: An
Introduction.

These two datasets are from the National Oceanic and Atmospheric Administration (NOAA)
and were derived from the NOAA public data repository. First, load the datasets into
separate DataFrames:

Python >>>

>>> import pandas as pd


>>> climate_temp = pd.read_csv("climate_temp.csv")
>>> climate_precip = pd.read_csv("climate_precip.csv")

In the code above, you used Pandas’ read_csv() to conveniently load your source CSV files
into DataFrame objects. You can then look at the headers and first few rows of the loaded
DataFrames with .head():

Python >>>

>>> climate_temp.head()
STATION STATION_NAME ... DLY-HTDD-BASE60 DLY-HTDD-NORMAL
0 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15
1 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15
2 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15
3 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15
4 GHCND:USC00049099 TWENTYNINE PALMS CA US ... 10 15

>>> climate_precip.head()
STATION ... DLY-SNOW-PCTALL-GE050TI
0 GHCND:USC00049099 ... -9999
1 GHCND:USC00049099 ... -9999
2 GHCND:USC00049099 ... -9999
3 GHCND:USC00049099 ... 0
4 GHCND:USC00049099 ... 0

Here, you used .head() to get the first five rows of each DataFrame. Make sure to try this on
your own, either with the interactive Jupyter Notebook or in your console, so that you can
explore the data in greater depth.

Next, take a quick look at the dimensions of the two DataFrames:

Python >>>

>>> climate_temp.shape
(127020, 21)
>>> climate_precip.shape
(151110, 29)

Note that .shape is a property of DataFrame objects that tells you the dimensions of the
DataFrame. For climate_temp, the output of .shape says that the DataFrame has 127,020
rows and 21 columns.

Inner Join
In this example, you’ll use merge() with its default arguments, which will result in an inner
join. Remember that in an inner join, you will lose rows that don’t have a match in the other
DataFrame’s key column.

With the two datasets loaded into DataFrame objects, you’ll select a small slice of the
precipitation dataset, and then use a plain merge() call to do an inner join. This will result in
a smaller, more focused dataset:

Python >>>

>>> precip_one_station = climate_precip[climate_precip["STATION"] == "GHCND:USC00045721"]


>>> precip_one_station.head()
STATION ... DLY-SNOW-PCTALL-GE050TI
1460 GHCND:USC00045721 ... -9999
1461 GHCND:USC00045721 ... -9999
1462 GHCND:USC00045721 ... -9999
1463 GHCND:USC00045721 ... -9999
1464 GHCND:USC00045721 ... -9999

Here you have created a new DataFrame called precip_one_station from the
climate_precip DataFrame, selecting only rows in which the STATION field is
"GHCND:USC00045721".

If you check the shape attribute, then you’ll see that it has 365 rows. When you do the
merge, how many rows do you think you’ll get in the merged DataFrame? Remember that
you’ll be doing an inner join:

Python >>>

>>> inner_merged = pd.merge(precip_one_station, climate_temp)


>>> inner_merged.head()
STATION STATION_NAME ... DLY-HTDD-BASE60 DLY-HTDD-NORMAL
0 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19
1 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19
2 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19
3 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19
4 GHCND:USC00045721 MITCHELL CAVERNS CA US ... 14 19

>>> inner_merged.shape
(365, 47)

If you guessed 365 rows, then you were correct! This is because merge() defaults to an inner
join, and an inner join will discard only those rows that do not match. Since all of your rows
had a match, none were lost. You should also notice that there are many more columns now:
47 to be exact.

With merge(), you also have control over which column(s) to join on. Let’s say you want to
merge both entire datasets, but only on Station and Date since the combination of the two
will yield a unique value for each row. To do so, you can use the on parameter:

Python

inner_merged_total = pd.merge(climate_temp, climate_precip, on=["STATION", "DATE"])


inner_merged_total.head()
inner_merged_total.shape

You can specify a single key column with a string or multiple key columns with a list. This
results in a DataFrame with 123,005 rows and 48 columns.

Why 48 columns instead of 47? Because you specified the key columns to join on, Pandas
doesn’t try to merge all mergeable columns. This can result in “duplicate” column names,
which may or may not have different values.

“Duplicate” is in quotes because the column names will not be an exact match. By default
they are appended with _x and _y. You can also use the suffixes parameter to control what
is appended to the column names.
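The effect of suffixes is easy to see on two small, invented DataFrames that share a non-key column name (the station IDs and values here are made up):

```python
import pandas as pd

stations_a = pd.DataFrame({"STATION": ["S1", "S2"], "VALUE": [10, 20]})
stations_b = pd.DataFrame({"STATION": ["S1", "S2"], "VALUE": [30, 40]})

# Without suffixes, the overlapping VALUE columns become VALUE_x and VALUE_y.
default = pd.merge(stations_a, stations_b, on="STATION")
print(list(default.columns))  # ['STATION', 'VALUE_x', 'VALUE_y']

# With suffixes, you control the names and keep track of each column's origin.
suffixed = pd.merge(stations_a, stations_b, on="STATION", suffixes=("_temp", "_precip"))
print(list(suffixed.columns))  # ['STATION', 'VALUE_temp', 'VALUE_precip']
```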

To prevent surprises, all following examples will use the on parameter to specify the column
or columns on which to join.

Outer Join
Here, you’ll specify an outer join with the how parameter. Remember from the diagrams
above that in an outer join (also known as a full outer join), all rows from both DataFrames
will be present in the new DataFrame.

If a row doesn’t have a match in the other DataFrame (based on the key column[s]), then you
won’t lose the row like you would with an inner join. Instead, the row will be in the merged
DataFrame with NaN values filled in where appropriate.

This is best illustrated in an example:

Python

outer_merged = pd.merge(precip_one_station, climate_temp, how="outer", on=["STATION", "DATE"])


outer_merged.head()
outer_merged.shape

If you remember from when you checked the .shape attribute of climate_temp, then you’ll
see that the number of rows in outer_merged is the same. With an outer join, you can expect
to have the same number of rows as the larger DataFrame. That’s because no rows are lost
in an outer join, even when they don’t have a match in the other DataFrame.

Left Join
In this example, you’ll specify a left join—also known as a left outer join—with the how
parameter. Using a left outer join will leave your new merged DataFrame with all rows from
the left DataFrame, while discarding rows from the right DataFrame that don’t have a match
in the key column of the left DataFrame.

You can think of this as a half-outer, half-inner merge. The example below shows you this in
action:

Python

left_merged = pd.merge(climate_temp, precip_one_station,


how="left", on=["STATION", "DATE"])
left_merged.head()
left_merged.shape

left_merged has 127,020 rows, matching the number of rows in the left DataFrame,
climate_temp. To prove that this only holds for the left DataFrame, run the same code, but
change the position of precip_one_station and climate_temp:

Python

left_merged_reversed = pd.merge(precip_one_station, climate_temp, how="left", on=["STATION", "DATE"])


left_merged_reversed.head()
left_merged_reversed.shape

This results in a DataFrame with 365 rows, matching the number of rows in
precip_one_station.

Right Join
The right join (or right outer join) is the mirror-image version of the left join. With this join,
all rows from the right DataFrame will be retained, while rows in the left DataFrame without
a match in the key column of the right DataFrame will be discarded.

To demonstrate how right and left joins are mirror images of each other, in the example
below you’ll recreate the left_merged DataFrame from above, only this time using a right
join:

Python

right_merged = pd.merge(precip_one_station, climate_temp, how="right", on=["STATION", "DATE"])


right_merged.head()
right_merged.shape

Here, you simply flipped the positions of the input DataFrames and specified a right join.
When you inspect right_merged, you might notice that it’s not exactly the same as
left_merged. The only difference between the two is the order of the columns: the first
input’s columns will always be the first in the newly formed DataFrame.

merge() is the most complex of the Pandas data combination tools. It’s also the foundation
on which the other tools are built. Its complexity is its greatest strength, allowing you to
combine datasets in every which way and to generate new insights into your data.

On the other hand, this complexity makes merge() difficult to use without an intuitive grasp
of set theory and database operations. In this section, you’ve learned about the various data
merging techniques, as well as many-to-one and many-to-many merges, which ultimately
come from set theory. For more information on set theory, check out Sets in Python.

Now, you’ll look at a simplified version of merge(): .join().

Pandas .join(): Combining Data on a Column or Index
While merge() is a module function, .join() is an object function that lives on your
DataFrame. This enables you to specify only one DataFrame, which will join the DataFrame
you call .join() on.

Under the hood, .join() uses merge(), but it provides a more efficient way to join
DataFrames than a fully specified merge() call. Before diving in to the options available to
you, take a look at this short example:

Python

precip_one_station.join(climate_temp, lsuffix="_left", rsuffix="_right")

With the indices visible, you can see a left join happening here, with precip_one_station
being the left DataFrame. You might notice that this example provides the parameters
lsuffix and rsuffix. Because .join() joins on indices and doesn’t directly merge
DataFrames, all columns, even those with matching names, are retained in the resulting
DataFrame.

If you flip the previous example around and instead call .join() on the larger DataFrame,
then you’ll notice that the DataFrame is larger, but data that doesn’t exist in the smaller
DataFrame (precip_one_station) is filled in with NaN values:

Python

climate_temp.join(precip_one_station, lsuffix="_left", rsuffix="_right")

How to .join()
By default, .join() will attempt to do a left join on indices. If you want to join on columns
like you would with merge(), then you’ll need to set the columns as indices.

Like merge(), .join() has a few parameters that give you more flexibility in your joins.
However, with .join(), the list of parameters is relatively short:

other: This is the only required parameter. It defines the other DataFrame to join. You
can also specify a list of DataFrames here, allowing you to combine a number of
datasets in a single .join() call.

on: This parameter specifies an optional column or index name for the left DataFrame
(climate_temp in the previous example) to join the other DataFrame’s index. If it’s set
to None, which is the default, then the join will be index-on-index.

how: This has the same options as how from merge(). The difference is that it is index-
based unless you also specify columns with on.

lsuffix and rsuffix: These are similar to suffixes in merge(). They specify a suffix
to add to any overlapping columns but have no effect when passing a list of other
DataFrames.

sort: Enable this to sort the resulting DataFrame by the join key.
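One feature worth sketching here is passing a list to other, which merge() can't do in a single call. This is a minimal example using invented DataFrames with distinct column names (remember that lsuffix and rsuffix have no effect when joining a list):

```python
import pandas as pd

temps = pd.DataFrame({"temp": [60, 70, 80]}, index=["S1", "S2", "S3"])
rain = pd.DataFrame({"rain": [1.2, 0.4, 0.0]}, index=["S1", "S2", "S3"])
snow = pd.DataFrame({"snow": [0.0, 0.0, 2.5]}, index=["S1", "S3", "S4"])

# A single .join() call combines several DataFrames on their indices,
# defaulting to a left join on the calling DataFrame.
combined = temps.join([rain, snow])
print(combined)
# S4 is dropped (it isn't in temps), and S2 gets NaN for snow.
```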

Examples
In this section, you’ll see examples showing a few different use cases for .join(). Some will
be simplifications of merge() calls. Others will be features that set .join() apart from the
more verbose merge() calls.

Since you already saw a short .join() call, in this first example you’ll attempt to recreate a
merge() call with .join(). What will this require? Take a second to think about a possible
solution, and then look at the proposed solution below:

Python

inner_merged_total = pd.merge(climate_temp, climate_precip, on=["STATION", "DATE"])


inner_merged_total.head()
inner_joined_total = climate_temp.join(
climate_precip.set_index(["STATION", "DATE"]),
lsuffix="_x",
rsuffix="_y",
on=["STATION", "DATE"],
)
inner_joined_total.head()

Because .join() works on indices, if we want to recreate merge() from before, then we
must set indices on the join columns we specify. In this example, you used .set_index() to
set your indices to the key columns within the join.

With this, the connection between merge() and .join() should be more clear.

Below you’ll see an almost-bare .join() call. Because there are overlapping columns, you’ll
need to specify a suffix with lsuffix, rsuffix, or both, but this example will demonstrate
the more typical behavior of .join():

Python

climate_temp.join(climate_precip, lsuffix="_left")

This example should be reminiscent of what you saw in the introduction to .join() earlier.
The call is the same, resulting in a left join that produces a DataFrame with the same number
of rows as climate_temp.

In this section, you have learned about .join() and its parameters and uses. You have also
learned about how .join() works under the hood and recreated a merge() call with
.join() to better understand the connection between the two techniques.

Pandas concat(): Combining Data Across Rows or Columns
Concatenation is a bit different from the merging techniques you saw above. With merging,
you can expect the resulting dataset to have rows from the parent datasets mixed in
together, often based on some commonality. Depending on the type of merge, you might
also lose rows that don’t have matches in the other dataset.

With concatenation, your datasets are just stitched together along an axis, either the row
axis or the column axis. Visually, a row-wise concatenation with no parameters simply
stacks one dataset below the other.

To implement this in code, you’ll use concat() and pass it a list of DataFrames that you
want to concatenate. Code for this task would look like this:

Python

concatenated = pandas.concat([df1, df2])

Note: This example assumes that your column names are the same. If your column
names are different while concatenating along rows (axis 0), then by default the
columns will also be added, and NaN values will be filled in as applicable.
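You can verify this default behavior with two tiny, invented DataFrames that only partially share column names:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# Concatenating along rows keeps the union of columns by default,
# filling in NaN where a frame lacks a column.
stacked = pd.concat([df1, df2])
print(list(stacked.columns))  # ['a', 'b', 'c']
print(stacked["a"].isna().sum())  # 2: df2 has no 'a' column
```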

What if instead you wanted to perform a concatenation along columns? Visually, this
operation places the datasets side by side, aligned on their row indices.

To accomplish this, you’ll use a concat() call like you did above, but you also will need to
pass the axis parameter with a value of 1:

Python

concatenated = pandas.concat([df1, df2], axis=1)

Note: This example assumes that your indices are the same between datasets. If they
are different while concatenating along columns (axis 1), then by default the extra
indices (rows) will also be added, and NaN values will be filled in as applicable.
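The same alignment behavior along columns can be checked with a small made-up example:

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"y": [3, 4]}, index=[1, 2])

# Concatenating along columns aligns on the index; index labels that
# exist in only one frame get NaN in the other frame's columns.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
# The result has three rows (index 0, 1, 2), with NaN in the gaps.
```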

You’ll learn more about the parameters for concat() in the section below. As you can see,
concatenation is a simpler way to combine datasets. It is often used to form a single, larger
set to do additional operations on.
Note: When you call concat(), a copy of all the data you are concatenating is made.
You should be careful with multiple concat() calls, as the many copies that are made
may negatively affect performance. Alternatively, you can set the optional copy
parameter to False.

When you concatenate datasets, you can specify the axis along which you will concatenate.
But what happens with the other axis?

Nothing. By default, a concatenation results in a set union, where all data is preserved.
You’ve seen this with merge() and .join() as an outer join, and you can specify this with
the join parameter.

If you use this parameter, then your options are outer (by default) and inner, which will
perform an inner join (or set intersection).

As with the other inner joins you saw earlier, some data loss can occur when you do an inner
join with concat(). Only where the axis labels match will you preserve rows or columns.

Note: Remember, the join parameter only specifies how to handle the axes that you
are not concatenating along.

Since you learned about the join parameter, here are some of the other parameters that
concat() takes:

objs: This parameter takes any sequence (typically a list) of Series or DataFrame
objects to be concatenated. You can also provide a dictionary. In this case, the keys will
be used to construct a hierarchical index.

axis: Like in the other techniques, this represents the axis you will concatenate along.
The default value is 0, which concatenates along the index (or row axis), while 1
concatenates along the columns (horizontally). You can also use the string values
"index" or "columns".

join: This is similar to the how parameter in the other techniques, but it only accepts
the values inner or outer. The default value is outer, which preserves data, while
inner would eliminate data that does not have a match in the other dataset.

ignore_index: This parameter takes a Boolean (True or False) and defaults to False.
If True, then the new combined dataset will not preserve the original index values in
the axis specified in the axis parameter. This lets you have entirely new index values.

keys: This parameter allows you to construct a hierarchical index. One common use
case is to have a new index while preserving the original indices so that you can tell
which rows, for example, come from which original dataset.

copy: This parameter specifies whether you want to copy the source data. The default
value is True. If the value is set to False, then Pandas won’t make copies of the source
data.

This list isn’t exhaustive. You can find the complete, up-to-date list of parameters in the
Pandas documentation.
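To make the keys and objs behavior concrete, here is a small sketch with invented DataFrames. Passing a dictionary to concat() is equivalent to passing its values with keys taken from the dictionary keys:

```python
import pandas as pd

temp = pd.DataFrame({"value": [60, 70]})
precip = pd.DataFrame({"value": [1.2, 0.4]})

# keys= builds a hierarchical (MultiIndex) row index on top of the originals.
labeled = pd.concat([temp, precip], keys=["temp", "precip"])
print(labeled.loc["temp"])  # just the rows that came from the temp frame

# Passing a dict does the same thing: keys come from the dict keys.
labeled_from_dict = pd.concat({"temp": temp, "precip": precip})
```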

How to Add to a DataFrame With append()


Before getting into concat() examples, you should know about .append(). This is a
shortcut to concat() that provides a simpler, more restrictive interface to concatenation.
You can use .append() on both Series and DataFrame objects, and both work the same
way. Note, however, that .append() was deprecated in pandas 1.4 and removed in pandas
2.0, so in current versions of Pandas you should call concat() directly.

To use .append(), you call it on one of the datasets you have available and pass the other
dataset (or a list of datasets) as an argument to the method:

Python

concatenated = df1.append(df2)

You did the same thing here as you did when you called pandas.concat([df1, df2]),
except you used the instance method .append() instead of the module method concat().

Examples
First, you’ll do a basic concatenation along the default axis using the DataFrames you’ve
been playing with throughout this tutorial:

Python

double_precip = pd.concat([precip_one_station, precip_one_station])

This one is very simple by design. Here, you created a DataFrame that is a double of a small
DataFrame that was made earlier. One thing to notice is that the indices repeat. If you want a
fresh, 0-based index, then you can use the ignore_index parameter:

Python

reindexed = pd.concat([precip_one_station, precip_one_station], ignore_index=True)

As noted before, if you concatenate along axis 0 (rows) but have labels in axis 1 (columns)
that don’t match, then those will be added and filled in with NaN values. This results in an
outer join:

Python

outer_joined = pd.concat([climate_precip, climate_temp])

With these two DataFrames, since you’re just concatenating along rows, very few columns
have the same name. That means you’ll see a lot of columns with NaN values.

To instead drop columns that have any missing data, use the join parameter with the value
"inner" to do an inner join:

Python

inner_joined = pd.concat([climate_temp, climate_precip], join="inner")

Using the inner join, you’ll be left with only those columns that the original DataFrames have
in common: STATION, STATION_NAME, and DATE.

You can also flip this by setting the axis parameter:

Python

inner_joined_cols = pd.concat([climate_temp, climate_precip], axis=1, join="inner")

Now you have only the rows that have data for all columns in both DataFrames. It’s no
coincidence that the number of rows corresponds with that of the smaller DataFrame.

Another useful trick for concatenation is using the keys parameter to create hierarchical axis
labels. This is useful if you want to preserve the indices or column names of the original
datasets but also to have new ones one level up:

Python

hierarchical_keys = pd.concat([climate_temp, climate_precip], keys=["temp", "precip"])

If you inspect hierarchical_keys, then you'll see that the higher-level axis labels temp
and precip were added above the original row indices.

Finally, take a look at the first concatenation example rewritten to use .append():

Python

appended = precip_one_station.append(precip_one_station)

Notice that the result of using .append() is the same as when you used concat() at the
beginning of this section.

Conclusion
You have now learned the three most important techniques for combining data in Pandas:

1. merge() for combining data on common columns or indices


2. .join() for combining data on a key column or an index
3. concat() for combining DataFrames across rows or columns

In addition to learning how to use these techniques, you also learned about set logic by
experimenting with the different ways to join your datasets. You also learned about the APIs
to the above techniques and some alternative calls like .append() that you can use to
simplify your code.

You saw these techniques in action on a real dataset obtained from the NOAA, which showed
you not only how to combine your data but also the benefits of doing so with Pandas’ built-in
techniques. If you haven’t downloaded the project files yet, then you can get them from the
download link near the start of this tutorial.

Did you learn something new? Figure out a creative way to solve a problem by combining
complex datasets? Let us know in the comments below!


About Kyle Stratis

Kyle is a self-taught developer working as a senior data engineer at Vizit Labs. In the past,
he has founded DanqEx (formerly Nasdanq: the original meme stock exchange) and
Encryptid Gaming.

Each tutorial at Real Python is created by a team of developers so that it meets our high
quality standards. The team members who worked on this tutorial are:

Aldren Bryan Geir Arne

Joanna Jacob

What Do You Think?

# Tweet $ Share % Email

Real Python Comment Policy: The most useful comments are those written
with the goal of learning from or helping out other readers—after reading the
whole article and all the earlier comments. Complaints and insults generally
won’t make the cut here.

What’s your #1 takeaway or favorite thing you learned? How are you going to put your
newfound skills to use? Leave a comment below and let us know.

Keep Learning

Related Tutorial Categories: data-science intermediate

© 2012–2021 Real Python ∙ Newsletter ∙ Podcast ∙ YouTube ∙ Twitter ∙ Facebook ∙ Instagram ∙


Python Tutorials ∙ Search ∙ Privacy Policy ∙ Energy Policy ∙ Advertise ∙ Contact Help

❤ Happy Pythoning!

You might also like