User-Defined Functions (UDFs)#

In pandas, User-Defined Functions (UDFs) provide a way to extend the library’s functionality by allowing users to apply custom computations to their data. While pandas comes with a set of built-in functions for data manipulation, UDFs offer flexibility when the built-in methods are not sufficient. These functions can be applied at different levels: element-wise, row-wise, column-wise, or group-wise, and their behavior depends on the method used.

Here’s a simple example to illustrate a UDF applied to a Series:

In [1]: s = pd.Series([1, 2, 3])

# Simple UDF that adds 1 to a value
In [2]: def add_one(x):
   ...:     return x + 1
   ...: 

# Apply the function element-wise using .map
In [3]: s.map(add_one)
Out[3]: 
0    2
1    3
2    4
dtype: int64

Why Not To Use User-Defined Functions#

While UDFs provide flexibility, they come with significant drawbacks, primarily related to performance and behavior. When using UDFs, pandas must perform inference on the result, and that inference could be incorrect. Furthermore, unlike vectorized operations, UDFs are slower because pandas can’t optimize their computations, leading to inefficient processing.
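
As a small illustration of the inference problem (a sketch with invented data, not part of the original example): a UDF that returns mixed types forces pandas to fall back to the generic object dtype, losing the efficient integer representation.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# A UDF returning mixed types: pandas cannot infer a better
# dtype than object for the result
result = s.map(lambda x: str(x) if x > 1 else x)

print(result.dtype)  # object
```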

Note

In general, most tasks can and should be accomplished using pandas’ built-in methods or vectorized operations.

Despite their drawbacks, UDFs can be helpful when:

  • Custom Computations Are Needed: Implementing complex logic or domain-specific calculations that pandas’ built-in methods cannot handle.

  • Extending pandas’ Functionality: Applying external libraries or specialized algorithms unavailable in pandas.

  • Handling Complex Grouped Operations: Performing operations on grouped data that standard methods do not support.

For example:

from sklearn.linear_model import LinearRegression

# Sample data
df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'x': [1, 2, 3, 1, 2, 3],
    'y': [2, 4, 6, 1, 2, 1.5]
})

# Function to fit a model to each group
def fit_model(group):
    model = LinearRegression()
    model.fit(group[['x']], group['y'])
    group['y_pred'] = model.predict(group[['x']])
    return group

result = df.groupby('group').apply(fit_model)

Methods that support User-Defined Functions#

User-Defined Functions can be applied across various pandas methods:

| Method | Function Input | Function Output | Description |
| --- | --- | --- | --- |
| Series.map() and DataFrame.map() | Scalar | Scalar | Apply a function to each element |
| Series.apply() and DataFrame.apply() (axis=0) | Column (Series) | Column (Series) | Apply a function to each column |
| Series.apply() and DataFrame.apply() (axis=1) | Row (Series) | Row (Series) | Apply a function to each row |
| Series.pipe() and DataFrame.pipe() | Series or DataFrame | Series or DataFrame | Chain functions together to apply to a Series or DataFrame |
| Series.filter() and DataFrame.filter() | Series or DataFrame | Boolean | Only accepts UDFs in group by. The function is called for each group, and the group is removed from the result if the function returns False |
| Series.agg() and DataFrame.agg() | Series or DataFrame | Scalar or Series | Aggregate and summarize values, e.g. sum or a custom reducer |
| Series.transform() and DataFrame.transform() (axis=0) | Column (Series) | Column (Series) | Same as apply() with axis=0, but raises an exception if the function changes the shape of the data |
| Series.transform() and DataFrame.transform() (axis=1) | Row (Series) | Row (Series) | Same as apply() with axis=1, but raises an exception if the function changes the shape of the data |

When applying UDFs in pandas, it is essential to select the appropriate method based on your specific task. Each method has its strengths and is designed for different use cases. Understanding the purpose and behavior of each method will help you make informed decisions, ensuring more efficient and maintainable code.

Note

Some of these methods can also be applied to groupby, resample, and various window objects. See Group by: split-apply-combine, resample(), rolling(), expanding(), and ewm() for details.

Series.map() and DataFrame.map()#

The map() method is used specifically to apply element-wise UDFs. This means the function will be called for each element in the Series or DataFrame, with the individual value or the cell as the function argument.

In [4]: temperature_celsius = pd.DataFrame({
   ...:     "NYC": [14, 21, 23],
   ...:     "Los Angeles": [22, 28, 31],
   ...: })
   ...: 

In [5]: def to_fahrenheit(value):
   ...:     return value * (9 / 5) + 32
   ...: 

In [6]: temperature_celsius.map(to_fahrenheit)
Out[6]: 
    NYC  Los Angeles
0  57.2         71.6
1  69.8         82.4
2  73.4         87.8

In this example, the function to_fahrenheit is called 6 times, once for each value in the DataFrame, and the result of each call is returned in the corresponding cell of the resulting DataFrame.

In general, map will be slow, as it does not make use of vectorization. Instead, a Python function call is required for each value, which slows things down significantly when working with medium or large data.

When to use: Use map() for applying element-wise UDFs to DataFrames or Series.
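
Note that map() also accepts an na_action parameter; with na_action="ignore", missing values are propagated as-is instead of being passed to the UDF. A small sketch:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# With na_action="ignore", NaN is propagated instead of being
# passed to the function
result = s.map(lambda x: x + 1, na_action="ignore")
```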

Series.apply() and DataFrame.apply()#

The apply() method allows you to apply UDFs for a whole column or row. This is different from map() in that the function will be called for each column (or row), not for each individual value.

In [7]: temperature_celsius = pd.DataFrame({
   ...:     "NYC": [14, 21, 23],
   ...:     "Los Angeles": [22, 28, 31],
   ...: })
   ...: 

In [8]: def to_fahrenheit(column):
   ...:     return column * (9 / 5) + 32
   ...: 

In [9]: temperature_celsius.apply(to_fahrenheit)
Out[9]: 
    NYC  Los Angeles
0  57.2         71.6
1  69.8         82.4
2  73.4         87.8

In the example, to_fahrenheit will be called only twice, as opposed to the 6 times with map(). This will be faster than using map(), since the operations for each column are vectorized, and the overhead of iterating over data in Python and calling Python functions is significantly reduced.

In some cases, the function may require all the data to be able to compute the result. So apply() is needed, since with map() the function can only access one element at a time.

In [10]: temperature = pd.DataFrame({
   ....:     "NYC": [14, 21, 23],
   ....:     "Los Angeles": [22, 28, 31],
   ....: })
   ....: 

In [11]: def normalize(column):
   ....:     return column / column.mean()
   ....: 

In [12]: temperature.apply(normalize)
Out[12]: 
        NYC  Los Angeles
0  0.724138     0.814815
1  1.086207     1.037037
2  1.189655     1.148148

In the example, the normalize function needs to compute the mean of the whole column in order to divide each element by it. So, we cannot call the function for each element, but we need the function to receive the whole column.

apply() can also execute a function row-wise by specifying axis=1.

In [13]: temperature = pd.DataFrame({
   ....:     "NYC": [14, 21, 23],
   ....:     "Los Angeles": [22, 28, 31],
   ....: })
   ....: 

In [14]: def hotter(row):
   ....:     return row["Los Angeles"] - row["NYC"]
   ....: 

In [15]: temperature.apply(hotter, axis=1)
Out[15]: 
0    8
1    7
2    8
dtype: int64

In the example, the function hotter is called 3 times, once for each row, and each call receives the whole row as the argument, allowing computations that require more than one value from the row.

apply is also available for SeriesGroupBy.apply(), DataFrameGroupBy.apply(), Rolling.apply(), Expanding.apply() and Resampler.apply(). You can read more about apply in groupby operations in Flexible apply.
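
A minimal sketch of the groupby variant, with invented data: the UDF receives one sub-DataFrame per group and can reduce it to a single value.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA"],
    "temp": [14, 21, 22, 28],
})

# The UDF receives one sub-DataFrame per group
def temp_range(group):
    return group["temp"].max() - group["temp"].min()

ranges = df.groupby("city").apply(temp_range)
```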

When to use: apply() is suitable when no alternative vectorized method or UDF method is available, but consider optimizing performance with vectorized operations wherever possible.

Series.pipe() and DataFrame.pipe()#

The pipe method is similar to map and apply, but the function receives the whole Series or DataFrame it is called on.

In [16]: temperature = pd.DataFrame({
   ....:     "NYC": [14, 21, 23],
   ....:     "Los Angeles": [22, 28, 31],
   ....: })
   ....: 

In [17]: def normalize(df):
   ....:     return df / df.mean().mean()
   ....: 

In [18]: temperature.pipe(normalize)
Out[18]: 
        NYC  Los Angeles
0  0.604317     0.949640
1  0.906475     1.208633
2  0.992806     1.338129

This is equivalent to calling the normalize function with the DataFrame as the parameter.

In [19]: normalize(temperature)
Out[19]: 
        NYC  Los Angeles
0  0.604317     0.949640
1  0.906475     1.208633
2  0.992806     1.338129

The main advantage of using pipe is readability. It allows method chaining and clearer code when calling multiple functions.

In [20]: temperature_celsius = pd.DataFrame({
   ....:     "NYC": [14, 21, 23],
   ....:     "Los Angeles": [22, 28, 31],
   ....: })
   ....: 

In [21]: def multiply_by_9(value):
   ....:     return value * 9
   ....: 

In [22]: def divide_by_5(value):
   ....:     return value / 5
   ....: 

In [23]: def add_32(value):
   ....:     return value + 32
   ....: 

# Without `pipe`:
In [24]: fahrenheit = add_32(divide_by_5(multiply_by_9(temperature_celsius)))

# With `pipe`:
In [25]: fahrenheit = (temperature_celsius.pipe(multiply_by_9)
   ....:                                  .pipe(divide_by_5)
   ....:                                  .pipe(add_32))
   ....: 

pipe is also available for SeriesGroupBy.pipe(), DataFrameGroupBy.pipe() and Resampler.pipe(). You can read more about pipe in groupby operations in Piping function calls.

When to use: Use pipe() when you need to create a pipeline of operations and want to keep the code readable and maintainable.
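
pipe also forwards extra positional and keyword arguments to the function, which keeps parameterized steps chainable. A sketch (the function below is invented for illustration):

```python
import pandas as pd

temperature_celsius = pd.DataFrame({
    "NYC": [14, 21, 23],
    "Los Angeles": [22, 28, 31],
})

# Arguments after the function are passed through by pipe
def scale_and_shift(df, factor, offset=0):
    return df * factor + offset

fahrenheit = temperature_celsius.pipe(scale_and_shift, 9 / 5, offset=32)
```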

Series.filter() and DataFrame.filter()#

The filter method is used to select a subset of rows that match certain criteria. Series.filter() and DataFrame.filter() do not support user-defined functions, but SeriesGroupBy.filter() and DataFrameGroupBy.filter() do. You can read more about filter in groupby operations in Filtration.
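
A minimal sketch of DataFrameGroupBy.filter() with invented data: the UDF returns one boolean per group, and groups for which it returns False are dropped from the result.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA"],
    "temp": [14, 21, 22, 28],
})

# Keep only the groups whose mean temperature is above 20
result = df.groupby("city").filter(lambda group: group["temp"].mean() > 20)
```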

Series.agg() and DataFrame.agg()#

The agg method is used to aggregate a set of data points into a single one. The most common aggregation functions, such as min, max, mean, and sum, are already implemented in pandas. agg allows you to implement other custom aggregation functions.

In [26]: temperature = pd.DataFrame({
   ....:     "NYC": [14, 21, 23],
   ....:     "Los Angeles": [22, 28, 31],
   ....: })
   ....: 

In [27]: def highest_jump(column):
   ....:     return column.pct_change().max()
   ....: 

In [28]: temperature.agg(highest_jump)
Out[28]: 
NYC            0.500000
Los Angeles    0.272727
dtype: float64

When to use: Use agg() for performing custom aggregations, where the operation reduces each input to a scalar value.
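
agg also accepts several aggregations at once, mixing built-in names with UDFs; the spread helper below is invented for illustration:

```python
import pandas as pd

temperature = pd.DataFrame({
    "NYC": [14, 21, 23],
    "Los Angeles": [22, 28, 31],
})

# A custom reducer alongside a built-in aggregation
def spread(column):
    return column.max() - column.min()

# One row per aggregation, one column per original column
result = temperature.agg(["mean", spread])
```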

Series.transform() and DataFrame.transform()#

The transform method is similar to an aggregation, with the difference that the result is broadcast back to the original data.

In [29]: temperature = pd.DataFrame({
   ....:     "NYC": [14, 21, 23],
   ....:     "Los Angeles": [22, 28, 31]},
   ....:     index=pd.date_range("2000-01-01", "2000-01-03"))
   ....: 

In [30]: def warm_up_all_days(column):
   ....:     return pd.Series(column.max(), index=column.index)
   ....: 

In [31]: temperature.transform(warm_up_all_days)
Out[31]: 
            NYC  Los Angeles
2000-01-01   23           31
2000-01-02   23           31
2000-01-03   23           31

In the example, the warm_up_all_days function computes the max like an aggregation, but instead of returning just the maximum value, it returns a DataFrame with the same shape as the original one, with the values of each day replaced by the maximum temperature of the city.

transform is also available for SeriesGroupBy.transform(), DataFrameGroupBy.transform() and Resampler.transform(), where it’s more common. You can read more about transform in groupby operations in Transformation.

When to use: Use transform() when you need to perform an aggregation whose result is broadcast back to match the original structure of the DataFrame.

Performance#

While UDFs provide flexibility, their use is generally discouraged as they can introduce performance issues, especially when written in pure Python. To improve efficiency, consider using built-in NumPy or pandas functions instead of UDFs for common operations.

Note

If performance is critical, explore vectorized operations before resorting to UDFs.

Vectorized Operations#

Below is a comparison of using UDFs versus using Vectorized Operations:

# User-defined function (df is assumed to contain numeric columns "one" and "two")
def calc_ratio(row):
    return 100 * (row["one"] / row["two"])

df["new_col"] = df.apply(calc_ratio, axis=1)

# Vectorized operation
df["new_col2"] = 100 * (df["one"] / df["two"])

Measuring how long each operation takes:

User-defined function:  5.6435 secs
Vectorized:             0.0043 secs

Vectorized operations in pandas are significantly faster than using DataFrame.apply() with UDFs because they leverage highly optimized C functions via NumPy to process entire arrays at once. This approach avoids the overhead of looping through rows in Python and making separate function calls for each row, which is slow and inefficient. Additionally, NumPy arrays benefit from memory efficiency and CPU-level optimizations, making vectorized operations the preferred choice whenever possible.
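
A sketch for reproducing such a measurement yourself (timings vary by machine; the data below is randomly generated for illustration):

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"one": rng.random(50_000), "two": rng.random(50_000) + 1})

def calc_ratio(row):
    return 100 * (row["one"] / row["two"])

# Row-wise UDF: one Python function call per row
start = time.perf_counter()
udf_result = df.apply(calc_ratio, axis=1)
udf_secs = time.perf_counter() - start

# Vectorized: a single operation over the whole columns
start = time.perf_counter()
vec_result = 100 * (df["one"] / df["two"])
vectorized_secs = time.perf_counter() - start
```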

Improving Performance with UDFs#

In scenarios where UDFs are necessary, there are still ways to mitigate their performance drawbacks. One approach is to use Numba, a Just-In-Time (JIT) compiler that can significantly speed up numerical Python code by compiling Python functions to optimized machine code at runtime.

By annotating your UDFs with @numba.jit, you can achieve performance closer to vectorized operations, especially for computationally heavy tasks.

Note

You may also refer to the user guide on Enhancing performance for a more detailed guide to using Numba.

Using DataFrame.pipe() for Composable Logic#

Another useful pattern for improving readability and composability, especially when mixing vectorized logic with UDFs, is to use the DataFrame.pipe() method.

DataFrame.pipe() doesn’t improve performance directly, but it enables cleaner method chaining by passing the entire object into a function. This is especially helpful when chaining custom transformations:

def add_ratio_column(df):
    df["ratio"] = 100 * (df["one"] / df["two"])
    return df

df = (
    df
    .query("one > 0")
    .pipe(add_ratio_column)
    .dropna()
)

This is functionally equivalent to calling add_ratio_column(df), but keeps your code clean and composable. The function you pass to DataFrame.pipe() can use vectorized operations, row-wise UDFs, or any other logic; DataFrame.pipe() is agnostic.

Note

While DataFrame.pipe() does not improve performance on its own, it promotes clean, modular design and allows both vectorized and UDF-based logic to be composed in method chains.