How to Filter Pandas DataFrame by Substring Criteria
As a data scientist or software engineer, it’s common to work with large datasets that require filtering based on specific criteria. One common task is filtering a DataFrame based on a substring criteria. In this blog post, we’ll explore how to achieve this using the popular Python library, pandas.
Table of Contents
What is pandas?
Pandas is a powerful open-source data analysis and manipulation library for Python. It provides data structures for efficiently storing and manipulating large datasets, as well as a wide range of functions for data analysis tasks such as filtering, aggregating, and transforming data.
The Problem
Suppose we have a pandas DataFrame with a column containing strings, and we want to filter this DataFrame to include only rows where the string in this column contains a specific substring.
For example, consider the following DataFrame:
import pandas as pd
data = {
'names': ['John', 'Jane', 'Bob', 'Alice', 'Mary'],
'emails': ['john@email.com', 'jane@email.com', 'bob@gmail.com',
'alice@yahoo.com', 'mary@hotmail.com']
}
df = pd.DataFrame(data)
names emails
0 John john@email.com
1 Jane jane@email.com
2 Bob bob@gmail.com
3 Alice alice@yahoo.com
4 Mary mary@hotmail.com
Suppose we want to filter this DataFrame to include only rows where the email address contains the substring gmail
. The expected output should be:
names emails
2 Bob bob@gmail.com
The Solution
To filter a pandas DataFrame based on substring criteria, we can use the str.contains()
method. This method returns a boolean mask indicating whether each element in the specified column contains the specified substring.
Here’s how we can use the str.contains()
method to filter the DataFrame in the previous example:
filtered_df = df[df['emails'].str.contains('gmail')]
This code first selects the ‘emails’ column of the DataFrame using the indexing operator ([]
). The str.contains()
method is then applied to this column, with the argument ‘gmail’ specifying the substring to search for. This method returns a boolean mask indicating which elements of the column contain the substring ‘gmail’.
Finally, this boolean mask is used to filter the original DataFrame using the indexing operator ([]
), resulting in the filtered DataFrame:
names emails
2 Bob bob@gmail.com
Advanced Filtering
The str.contains()
method supports several optional parameters that allow for more advanced filtering. Here are a few examples:
Case Sensitivity
By default, the str.contains()
method is case-sensitive. This means that it will only match substrings that have the same case as the specified substring.
To perform a case-insensitive search, we can set the case
parameter to False
:
filtered_df = df[df['emails'].str.contains('GMAIL', case=False)]
print(filtered_df)
This code filters the DataFrame to include only rows where the emails
column contains the substring GMAIL
, regardless of case.
Output:
names emails
2 Bob bob@gmail.com
Regular Expressions
The str.contains()
method also supports regular expressions. Regular expressions are a powerful tool for matching and manipulating text.
For example, suppose we want to filter the DataFrame to include only rows where the email address contains either gmail.com
or yahoo.com
. We can achieve this using a regular expression:
filtered_df = df[df['emails'].str.contains('gmail.com|yahoo.com')]
print(filtered_df)
This code filters the DataFrame to include only rows where the emails
column contains either the substring gmail.com
or yahoo.com
.
Output:
names emails
2 Bob bob@gmail.com
3 Alice alice@yahoo.com
Common Errors and Solutions
Error 1: Incorrect Column Name: Description: If the column name is misspelled, Pandas won’t find the column, resulting in a KeyError.
Solution: Double-check the column name for accuracy.
Example:
# Incorrect column name
df[df['name'].str.contains('Alice')] # KeyError: 'name'
Error 2: Case Sensitivity Issues: Description: Case-sensitive searches may lead to unexpected results.
Solution: Use the case
parameter or convert the column to lowercase before filtering.
Example:
# Case-sensitive search
df[df['names'].str.contains('alice')] # Empty DataFrame
Conclusion
Filtering a pandas DataFrame based on substring criteria is a common task in data analysis and manipulation. In this blog post, we explored how to achieve this using the str.contains()
method. We also looked at some advanced filtering options, such as case-insensitive search and regular expressions.
By mastering these techniques, you’ll be able to efficiently filter large datasets based on specific criteria, making your data analysis tasks more efficient and effective.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without
having to switch tools.