How to Apply Regex to a Pandas DataFrame
As a data scientist or software engineer, you may come across situations where you need to extract or manipulate specific data from a pandas DataFrame using regular expressions (regex). Pandas is a popular Python library for data manipulation and analysis, and regex is a powerful tool for pattern matching and text processing. In this article, we will explain how to apply regex to a pandas DataFrame, step-by-step.
What is Regex?
Regex is a sequence of characters that defines a search pattern. It is a powerful tool for pattern matching and text processing. In Python, regex is implemented in the re
module. Regex patterns can be used to match, search, replace or extract specific text from a string. The basic syntax for a regex pattern in Python is as follows:
import re
pattern = r'expression'
match = re.search(pattern, string)
pattern
is the regex pattern we want to match, string
is the input string we want to search in, and re.search()
function searches the input string for the pattern and returns the first match it finds.
What is Pandas?
Pandas is a popular Python library for data manipulation and analysis. It provides data structures for efficiently storing and manipulating large datasets, as well as tools for data cleaning, reshaping, merging, and aggregation. A pandas DataFrame is a 2-dimensional table with rows and columns, similar to a spreadsheet or SQL table. It is a powerful tool for data analysis, especially when combined with other Python libraries such as NumPy, Matplotlib, and Scikit-learn.
Applying Regex to a Pandas DataFrame
To apply regex to a pandas DataFrame, we need to use the pandas str
accessor. The str
accessor allows us to apply string methods to each element of a pandas Series or DataFrame. Here are the steps to apply regex to a pandas DataFrame:
- Import the pandas library and load the data into a pandas DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
- Select the column or columns that contain the text we want to match with regex.
text_column = df['text_column']
- Use the
str
accessor to apply regex to the text column.
import re
pattern = r'expression'
match = text_column.str.match(pattern)
text_column.str
returns a pandas Series with string methods, and match()
method applies the regex pattern to each element of the Series and returns a boolean Series with True or False values indicating whether the pattern matches the element. Along with match()
, we can use contains()
to get boolean for filtering and extract()
or replace()
to manipulate the content of the DataFrame.
- Use the boolean Series to filter the DataFrame.
df_filtered = df[match]
This will create a new DataFrame with only the rows that match the regex pattern in the text column.
Examples
Let’s see some examples of applying regex to a pandas DataFrame.
Example 1: Extracting Phone Numbers
Suppose we have a DataFrame with a column containing phone numbers in the format "(123) 456-7890"
. We want to extract the area code and the last 4 digits of each phone number. Here’s how we can do it with regex:
import pandas as pd
df = pd.DataFrame({'phone': ['(123) 456-7890', '(456) 789-0123', '(789) 012-3456']})
phone_column = df['phone']
area_code = phone_column.str.extract(r'\((\d{3})\)')
last_four_digits = phone_column.str.extract(r'-(\d{4})')
df['area_code'] = area_code
df['last_four_digits'] = last_four_digits
print(df)
This will produce the output:
phone area_code last_four_digits
0 (123) 456-7890 123 7890
1 (456) 789-0123 456 0123
2 (789) 012-3456 789 3456
Example 2: Cleaning Text Data
Suppose we have a DataFrame with a column containing text data with extra spaces and special characters. We want to clean the text data by removing extra spaces and special characters. Here’s how we can do it with regex:
import pandas as pd
df = pd.DataFrame({'text': ['hello world!', ' how are you? ', ' $%!# 123']})
text_column = df['text']
clean_text = text_column.str.replace(r'[^a-zA-Z0-9 ]+', '').str.replace(r'\s+', ' ').str.strip()
df['clean_text'] = clean_text
print(df)
This will produce the output:
text clean_text
0 hello world! hello world
1 how are you? how are you
2 $%!# 123 123
Example 3: Filtering Rows Based on Pattern
Let’s consider a DataFrame containing a column of email addresses, and you want to filter out rows where the email addresses follow a specific pattern. Let’s say you want to keep only those email addresses that end with "@example.com."
import pandas as pd
# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Email': ['alice@example.com', 'bob@gmail.com', 'charlie@example.com', 'david@example.com']}
df = pd.DataFrame(data)
# Applying regex to filter rows
pattern = r'@example\.com$'
filtered_df = df[df['Email'].str.contains(pattern)]
print(filtered_df)
Output:
Name Email
0 Alice alice@example.com
2 Charlie charlie@example.com
3 David david@example.com
Conclusion
In this article, we explained how to apply regex to a pandas DataFrame. We showed how to use the str
accessor to apply regex to a text column, and how to filter the DataFrame based on the regex results. We also provided some examples of using regex to extract or clean text data. By applying regex to a pandas DataFrame, you can efficiently manipulate and analyze text data for your data science or software engineering projects.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without
having to switch tools.