How to change column type in Pandas
Changing a column’s data type is often a necessary step in the data cleaning process. There are several options for changing types in pandas - which to use depends on your data and what you want to accomplish.
to_numeric() and astype()
To convert one or more columns to numeric values, pandas.to_numeric() is often a good option. This function will attempt to convert non-numeric values, such as strings, to either float64 or int64 depending on the input data. Note that extremely large numbers may lose precision; see the documentation for more information.
Here's how to use it to convert one or more column types:
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 30'.split(),
'c': '100 200 300'.split(), 'd': '4.3 5.1 6.2'.split()})
#convert just column 'a' to numeric
data['a'] = pd.to_numeric(data['a'])
#convert columns 'b' and 'c' to numeric - note double brackets
data[['b', 'c']] = data[['b', 'c']].apply(pd.to_numeric)
#convert all columns to numeric
data = data.apply(pd.to_numeric)
Note: To check the data types of your DataFrame columns, you can use data.dtypes. If you check the data types of the example above after converting all columns, you should see that you now have three int64 columns and one float64 column.
One benefit of to_numeric() is built-in error handling, which comes in handy in cases with mixed dtypes. By default, this function raises an error if it encounters a value it can't convert to numeric. You can change this behavior with the errors parameter:
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 chicken'.split()})
#default behavior - raises an error
data['b'] = pd.to_numeric(data['b'])
#ignore invalid values and return the input unchanged (deprecated in recent pandas versions)
data['b'] = pd.to_numeric(data['b'], errors = 'ignore')
#convert invalid values to NaN
data['b'] = pd.to_numeric(data['b'], errors = 'coerce')
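To see the coerce option in action on the example data, note that the unparseable value becomes NaN and the column becomes float64 (since NaN is a float):

```python
import pandas as pd

# One value in 'b' cannot be parsed as a number
data = pd.DataFrame({'b': '10 20 chicken'.split()})

# With errors='coerce', 'chicken' is replaced by NaN instead of raising
coerced = pd.to_numeric(data['b'], errors='coerce')

# The result holds 10.0, 20.0, NaN and has dtype float64
print(coerced)
```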
Another feature of to_numeric() is the ability to downcast numeric values. This means that instead of converting values to float64 or int64, the function will pick the smallest possible numeric dtype (minimum np.int8, np.uint8, or np.float32 for integer, unsigned, and float data types respectively). Downcasting can help save memory when working with large datasets.
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 300'.split()})
#downcast to integer
data['a'] = pd.to_numeric(data['a'], downcast = 'integer')
#downcast to float
data['b'] = pd.to_numeric(data['b'], downcast = 'float')
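Checking the resulting dtypes from the example above shows the smaller types chosen by downcasting; since the values in 'a' fit in a single byte, it is downcast all the way to int8:

```python
import pandas as pd

data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 300'.split()})

# Values 1-3 fit in int8, the smallest signed integer type
a = pd.to_numeric(data['a'], downcast='integer')

# downcast='float' selects float32 instead of the default float64
b = pd.to_numeric(data['b'], downcast='float')

print(a.dtype, b.dtype)
```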
If you need a little more flexibility when converting column types, you can consider using DataFrame.astype(). This method can be used to convert a column to (almost) any data type. Here it is in action:
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 300'.split(),
'c': '100 200 300'.split(), 'd': '4.3 5.1 6.2'.split()})
#convert column 'a' to complex:
data = data.astype({'a': complex})
#convert column 'b' to int and column 'c' to float
data = data.astype({'b': int, 'c': float})
#convert whole dataframe back to object
data = data.astype(str)
Note that astype() allows for ignoring invalid values using errors = 'ignore', but does not allow for coercing invalid values.
A caveat: astype() is powerful, but its flexibility comes with a greater risk of unexpected behavior when converting data. Make sure to verify that your conversions worked as expected.
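As a quick illustration of the errors = 'ignore' behavior (using the same kind of mixed data as earlier): when any value in the column fails to convert, astype() simply returns the column unchanged rather than coercing the bad value to NaN:

```python
import pandas as pd

# 'chicken' cannot be converted to an integer
data = pd.DataFrame({'b': '10 20 chicken'.split()})

# With errors='ignore', the failed conversion is silently skipped and
# the original object column is returned unchanged
result = data['b'].astype(int, errors='ignore')

print(result.dtype)  # still object
```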
infer_objects()
To let pandas try to figure out the best data types to use, you can use DataFrame.infer_objects(). This method attempts soft conversion of all columns in a DataFrame, which is useful for cases where all columns have the unspecified object dtype. Here, infer_objects() will convert column 'b' to int64 but will not convert column 'a' from an object type:
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': [10, 20, 30]}, dtype = 'object')
data = data.infer_objects()
data.dtypes
convert_dtypes()
Finally, the method DataFrame.convert_dtypes() can also be called on an entire DataFrame. This method will attempt to convert each column to the "best possible" data type that still supports pd.NA missing values. Using our dataset from the previous example, column 'a' is converted from object to string, while column 'b' is converted from object to Int64. While infer_objects() would convert column 'b' to int64 (lowercase "i"), convert_dtypes() instead chooses Int64 (uppercase "I") because this type supports pd.NA values.
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': [10, 20, 30]}, dtype = 'object')
data = data.convert_dtypes()
data.dtypes
You can also control whether or not this method attempts to infer object types, and you can turn off conversion to certain data types using the various convert_* flags; see the documentation for more details.
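For example, turning off the integer (and floating) conversion flags keeps column 'b' out of the nullable Int64 type while 'a' is still converted to string:

```python
import pandas as pd

data = pd.DataFrame({'a': '1 2 3'.split(), 'b': [10, 20, 30]}, dtype='object')

# convert_integer=False (with convert_floating=False) prevents 'b' from
# being converted to the nullable Int64 extension type; 'a' still becomes string
data = data.convert_dtypes(convert_integer=False, convert_floating=False)

print(data.dtypes)
```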
In summary, there are several flexible options for converting column data types built into pandas. to_numeric() and astype() allow you to manually convert one or more columns to another type, while infer_objects() and convert_dtypes() attempt to select the best data types for your dataset.
Additional Resources:
How to drop Pandas DataFrame rows with NAs in a specific column