How to Filter NaN in Pandas
As a data scientist or software engineer, you often need to clean and process large datasets. One common issue is missing data, which Pandas represents as NaN (Not a Number). In this article, we will discuss how to filter NaN values in a Pandas DataFrame.
Table of Contents
- What is NaN?
- Understanding NaN values in Pandas
- Filtering NaN values in a Pandas DataFrame
- Common Errors and How to Handle Them
- Conclusion
What is NaN?
NaN is a special floating-point value that Pandas uses to represent missing or undefined data. It can appear for many reasons, such as incomplete data, errors in data collection, or data corruption. NaN also propagates through element-wise arithmetic: an operation that involves a missing value produces NaN.
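The short sketch below illustrates these properties. It assumes NumPy is installed alongside Pandas, and the variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary Python float, not a separate type
nan_is_float = isinstance(np.nan, float)

# Element-wise arithmetic propagates NaN to the result
s = pd.Series([1.0, np.nan, 3.0])
doubled = s * 2

# An integer column that gains a NaN is stored as float64
mixed_dtype = pd.Series([1, 2, np.nan]).dtype

print(nan_is_float)
print(doubled.isna().tolist())
print(mixed_dtype)
```

The float64 upcast is why integer-looking columns that contain a NaN print with a decimal point, such as 6.0 instead of 6.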
Understanding NaN values in Pandas
Before we dive into filtering NaN values, it is essential to understand how Pandas handles them. NaN is neither greater than, less than, nor equal to any other value, including another NaN. This means that comparison operators such as <, >, and == always return False when NaN is involved, so they cannot be used to find missing values. Instead, we use the functions Pandas provides for detecting and handling NaN.
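As a minimal illustration of this comparison behavior, the sketch below checks a single NaN value with the ordinary operators and then with pd.isna(). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

value = np.nan

# Ordinary comparisons with NaN evaluate to False, even NaN == NaN
equal_to_itself = value == value
less_than_one = value < 1.0
greater_than_one = value > 1.0

# pd.isna() is the reliable way to test for a missing value
detected = pd.isna(value)

print(equal_to_itself, less_than_one, greater_than_one)
print(detected)
```

A practical consequence is that an expression like df[df['B'] == np.nan] selects nothing, which is why the isna() family exists.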
Filtering NaN values in a Pandas DataFrame
To detect NaN values in a Pandas DataFrame, we use the isna() method or its alias isnull(). Both return a boolean mask of the same shape as the DataFrame that indicates whether each element is NaN. We can then use this mask to select or exclude rows or columns with NaN values.
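One way to use such a mask directly, sketched here with a small sample DataFrame (the variable names are illustrative), is to combine isna() or notna() with boolean indexing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, np.nan, 9, 10],
                   'C': [11, np.nan, 13, 14, 15]})

# Element-wise mask: True wherever a value is missing
mask = df.isna()

# Keep only the rows that contain at least one NaN
rows_with_nan = df[mask.any(axis=1)]

# Keep only the rows where column B is present
rows_b_present = df[df['B'].notna()]

# Count the missing values in each column
nan_counts = df.isna().sum()

print(rows_with_nan)
print(rows_b_present)
print(nan_counts)
```

Boolean indexing is handy when you want to inspect the incomplete rows rather than discard them.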
Filtering rows with NaN values
To remove rows with NaN values, we use the dropna() method. By default, dropna() removes every row that contains at least one NaN value and returns a new DataFrame with the remaining rows.
import numpy as np
import pandas as pd

# Creating a sample DataFrame
# (pd.np was removed from Pandas, so NaN comes from NumPy directly)
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, np.nan, 9, 10],
                   'C': [11, np.nan, 13, 14, 15]})

# Filtering rows with NaN values
filtered_df = df.dropna()
print(filtered_df)
Output:
   A     B     C
0  1   6.0  11.0
3  4   9.0  14.0
4  5  10.0  15.0
As you can see, rows 1 and 2, which contain NaN in column C and column B respectively, have been removed.
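Dropping every row with any NaN can be too aggressive. The sketch below shows three dropna() parameters that narrow the rule: subset, how, and thresh. The sample data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [6, np.nan, np.nan, 9],
                   'C': [11, 12, np.nan, np.nan]})

# Only look at column B when deciding which rows to drop
only_b = df.dropna(subset=['B'])

# Drop a row only if every value in it is NaN
not_all_missing = df.dropna(how='all')

# Keep rows that have at least 3 non-NaN values
mostly_complete = df.dropna(thresh=3)

print(only_b.index.tolist())
print(not_all_missing.index.tolist())
print(mostly_complete.index.tolist())
```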
Filtering columns with NaN values
To remove columns with NaN values, we call dropna() with the axis parameter set to 1. With this setting, dropna() removes every column that contains at least one NaN value and returns a new DataFrame with the remaining columns.
import numpy as np
import pandas as pd

# Creating a sample DataFrame
# (pd.np was removed from Pandas, so NaN comes from NumPy directly)
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, np.nan, 9, 10],
                   'C': [11, np.nan, 13, 14, 15]})

# Filtering columns with NaN values
filtered_df = df.dropna(axis=1)
print(filtered_df)
Output:
   A
0  1
1  2
2  3
3  4
4  5
As you can see, columns B and C, which each contain a NaN value, have been removed.
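Removing a whole column because of one missing cell often discards too much data. As a hedged sketch, the how and thresh parameters of dropna() can also be applied along axis=1 to keep columns that are mostly populated. The sample data here is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, np.nan, 9, 10],
                   'C': [np.nan, np.nan, np.nan, np.nan, 15],
                   'D': [np.nan] * 5})

# Drop only the columns that are entirely NaN
without_empty = df.dropna(axis=1, how='all')

# Keep columns that have at least 4 non-NaN values
well_populated = df.dropna(axis=1, thresh=4)

print(without_empty.columns.tolist())
print(well_populated.columns.tolist())
```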
Filling NaN values
Using fillna()
In some cases, it is preferable to fill NaN values with a specific value instead of removing them. To do this, we use the fillna() method, which replaces each NaN with the specified value and returns a new DataFrame with the filled values.
import numpy as np
import pandas as pd

# Creating a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, np.nan, 9, 10],
                   'C': [11, np.nan, 13, 14, 15]})

# Filling NaN values with 0
filled_df = df.fillna(0)
print(filled_df)
Output:
   A     B     C
0  1   6.0  11.0
1  2   7.0   0.0
2  3   0.0  13.0
3  4   9.0  14.0
4  5  10.0  15.0
As you can see, the NaN values have been replaced with 0.
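A single fill value is not always appropriate. The sketch below shows two common variations: passing a dictionary to fillna() so that each column gets its own value, and using ffill() to carry the previous row's value forward. Filling column C with its mean is only one possible strategy, picked here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, np.nan, 9, 10],
                   'C': [11, np.nan, 13, 14, 15]})

# Fill B with 0 and C with the mean of its non-missing values
per_column = df.fillna({'B': 0, 'C': df['C'].mean()})

# Carry the last valid value in each column forward
forward_filled = df.ffill()

print(per_column)
print(forward_filled)
```

Here the mean of column C is computed from the four present values (11, 13, 14, 15), because mean() skips NaN by default.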
Using interpolate()
Another way to replace NaN values is the interpolate() method, which fills each gap with a value estimated from its neighbors. By default it uses linear interpolation, which makes it well suited to ordered data such as time series.
# Interpolate NaN values
df_interpolated = df.interpolate()
print(df_interpolated)
Output:
   A     B     C
0  1   6.0  11.0
1  2   7.0  12.0
2  3   8.0  13.0
3  4   9.0  14.0
4  5  10.0  15.0
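For time-series data with unevenly spaced timestamps, the default linear method treats the rows as equally spaced. A minimal sketch of the alternative, assuming the Series has a DatetimeIndex, is to pass method='time' so the estimate is weighted by the actual time gaps. The dates and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Observations on Jan 1, Jan 2 (missing), and Jan 4
index = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04'])
s = pd.Series([1.0, np.nan, 4.0], index=index)

# Default: rows are treated as equally spaced
by_position = s.interpolate()

# method='time': the gap is weighted by elapsed time
by_time = s.interpolate(method='time')

print(round(float(by_position.iloc[1]), 6))
print(round(float(by_time.iloc[1]), 6))
```

The position-based estimate is the midpoint of the neighbors (2.5), while the time-based estimate is about 2.0 because Jan 2 lies one third of the way from Jan 1 to Jan 4.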
Common Errors and How to Handle Them
- Discarding the result: By default, dropna() and fillna() return a new DataFrame and leave the original unchanged. Calling them without using the return value, or without setting the inplace parameter to True, has no visible effect.
# Incorrect usage: the result is discarded
df.dropna() # This does not modify the original DataFrame
To avoid this, either set inplace=True or assign the result back to the original variable:
# Correct usage
df.dropna(inplace=True) # Modifies the original DataFrame
# or
df = df.dropna() # Assigns the result back to the original DataFrame
Conclusion
In this article, we discussed how to filter NaN values in a Pandas DataFrame. We learned that Pandas provides isna() and its alias isnull() to detect NaN values, dropna() to remove them, and fillna() and interpolate() to replace them. By understanding how to handle NaN values in Pandas, data scientists and software engineers can clean and process large datasets with ease.