How to change column type in Pandas
Changing a column’s data type is often a necessary step in the data cleaning process. There are several options for changing types in pandas - which to use depends on your data and what you want to accomplish.
to_numeric() and astype()
To convert one or more columns to numeric values, pandas.to_numeric() is often a good option. This function will attempt to convert non-numeric values, such as strings, to either float64 or int64 depending on the input data. Note that extremely large numbers may lose precision; see the documentation for more information.
Here's how to use it to convert one or more column types:
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 30'.split(),
'c': '100 200 300'.split(), 'd': '4.3 5.1 6.2'.split()})
#convert just column 'a' to numeric
data['a'] = pd.to_numeric(data['a'])
#convert columns 'b' and 'c' to numeric - note double brackets
data[['b', 'c']] = data[['b', 'c']].apply(pd.to_numeric)
#convert all columns to numeric
data = data.apply(pd.to_numeric)
Note: To check the data types of your DataFrame columns, you can use data.dtypes. If you check the data types of the example above after converting all columns, you should see that you now have three int64 columns and one float64 column.
One benefit of to_numeric() is built-in error handling, which comes in handy in cases with mixed dtypes. By default, this function raises an error if it encounters a value it can't convert to numeric. You can change this behavior with the errors parameter:
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 chicken'.split()})
#default behavior - raises an error
data['b'] = pd.to_numeric(data['b'])
#ignore invalid values and return the input unchanged (deprecated in recent pandas versions)
data['b'] = pd.to_numeric(data['b'], errors = 'ignore')
#convert invalid values to NaN
data['b'] = pd.to_numeric(data['b'], errors = 'coerce')
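To see the coerce option in action on the example data, note that the unparseable value becomes NaN and the column becomes float64 (since NaN is a float):

```python
import pandas as pd

# One value in 'b' cannot be parsed as a number
data = pd.DataFrame({'b': '10 20 chicken'.split()})

# With errors='coerce', 'chicken' is replaced by NaN instead of raising
coerced = pd.to_numeric(data['b'], errors='coerce')

# The result holds 10.0, 20.0, NaN and has dtype float64
print(coerced)
```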
Another feature of to_numeric() is the ability to downcast numeric values. This means that instead of converting values to float64 or int64, the function will pick the smallest possible numeric dtype (minimum np.int8, np.uint8, or np.float32 for integer, unsigned, and float data types respectively). Downcasting can help save memory when working with large datasets.
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 300'.split()})
#downcast to integer
data['a'] = pd.to_numeric(data['a'], downcast = 'integer')
#downcast to float
data['b'] = pd.to_numeric(data['b'], downcast = 'float')
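Checking the resulting dtypes from the example above shows the smaller types chosen by downcasting; since the values in 'a' fit in a single byte, it is downcast all the way to int8:

```python
import pandas as pd

data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 300'.split()})

# Values 1-3 fit in int8, the smallest signed integer type
a = pd.to_numeric(data['a'], downcast='integer')

# downcast='float' selects float32 instead of the default float64
b = pd.to_numeric(data['b'], downcast='float')

print(a.dtype, b.dtype)
```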
If you need a little more flexibility when converting column types, you can consider using DataFrame.astype(). This method can be used to convert a column to (almost) any data type. Here it is in action:
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': '10 20 300'.split(),
'c': '100 200 300'.split(), 'd': '4.3 5.1 6.2'.split()})
#convert column 'a' to complex:
data = data.astype({'a': complex})
#convert column 'b' to int and column 'c' to float
data = data.astype({'b': int, 'c': float})
#convert whole dataframe back to object
data = data.astype(str)
Note that astype() allows for ignoring invalid values using errors = 'ignore', but does not allow for coercing invalid values.
A caveat: astype() is powerful, but its flexibility comes with a greater risk of unexpected behavior when converting data. Make sure to verify that your conversions worked as expected.
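As a quick illustration of the errors = 'ignore' behavior (using the same kind of mixed data as earlier): when any value in the column fails to convert, astype() simply returns the column unchanged rather than coercing the bad value to NaN:

```python
import pandas as pd

# 'chicken' cannot be converted to an integer
data = pd.DataFrame({'b': '10 20 chicken'.split()})

# With errors='ignore', the failed conversion is silently skipped and
# the original object column is returned unchanged
result = data['b'].astype(int, errors='ignore')

print(result.dtype)  # still object
```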
infer_objects()
To let pandas try to figure out the best data types to use, you can use DataFrame.infer_objects(). This method attempts soft conversion of all columns in a DataFrame, which is useful for cases where all columns have the unspecified object dtype. Here, infer_objects() will convert column 'b' to int64 but will not convert column 'a' from an object type:
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': [10, 20, 30]}, dtype = 'object')
data = data.infer_objects()
data.dtypes
convert_dtypes()
Finally, the method DataFrame.convert_dtypes() can also be called on an entire DataFrame. This method will attempt to convert each column to the "best possible" data type that still supports pd.NA missing values. Using our dataset from the previous example, column 'a' is converted from object to string, while column 'b' is converted from object to Int64. While infer_objects() would convert column 'b' to int64 (lowercase "i"), convert_dtypes() instead chooses Int64 (uppercase "I") because this type supports pd.NA values.
import pandas as pd
data = pd.DataFrame({'a': '1 2 3'.split(), 'b': [10, 20, 30]}, dtype = 'object')
data = data.convert_dtypes()
data.dtypes
You can also control whether or not this method attempts to infer object types, and you can turn off conversion to certain data types using the various convert_* flags; see the documentation for more details.
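For example, turning off the integer (and floating) conversion flags keeps column 'b' out of the nullable Int64 type while 'a' is still converted to string:

```python
import pandas as pd

data = pd.DataFrame({'a': '1 2 3'.split(), 'b': [10, 20, 30]}, dtype='object')

# convert_integer=False (with convert_floating=False) prevents 'b' from
# being converted to the nullable Int64 extension type; 'a' still becomes string
data = data.convert_dtypes(convert_integer=False, convert_floating=False)

print(data.dtypes)
```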
In summary, there are several flexible options for converting column data types built into pandas. to_numeric() and astype() allow you to manually convert one or more columns to another type, while infer_objects() and convert_dtypes() attempt to select the best data types for your dataset.
Additional Resources:
How to drop Pandas DataFrame rows with NAs in a specific column