Loading Multiple CSV Files from a Folder into One DataFrame: A Guide
Table of Contents
- Step 1: Importing Necessary Libraries
- Step 2: Generating Some Dummy Files
- Step 3: Getting the List of CSV Files
- Step 4: Loading the CSV Files into a DataFrame
- Step 5: Concatenating the DataFrames
- Common Errors
- Conclusion
Prerequisites
Before we start, make sure you have the following installed on your system:
- Python 3.6 or higher
- pandas library
If you haven’t installed pandas yet, you can do so using pip:
pip install pandas
Step 1: Importing Necessary Libraries
First, we need to import the necessary libraries. We will be using pandas, a powerful data manipulation library; os, a built-in Python module for interacting with the operating system; and numpy, to generate some CSV files to work with.
import pandas as pd
import os
import numpy as np
Step 2: Generating Some Dummy Files
Next, we will use numpy to generate some dummy CSV files and store them in the dummy_csv_files folder, which is created inside the current working directory. There is no need to change the working directory, because every later step builds its file paths from the folder name.
# Create a directory to store the dummy CSV files
os.makedirs("dummy_csv_files", exist_ok=True)

# Generate and save dummy CSV files
for i in range(1, 4):  # Create 3 dummy CSV files
    data = {
        'Column_A': np.random.randint(1, 100, 5),
        'Column_B': np.random.rand(5),
        'Column_C': np.random.choice(['Category1', 'Category2'], 5)
    }
    df = pd.DataFrame(data)
    csv_filename = f"dummy_file_{i}.csv"
    df.to_csv(os.path.join("dummy_csv_files", csv_filename), index=False)
The code above creates 3 CSV files, each of which looks similar to the following (the values are random, so yours will differ):
Column_A,Column_B,Column_C
23,0.7461118185309882,Category1
10,0.19218583714699145,Category1
64,0.5342754209988878,Category1
52,0.2466430926417844,Category2
62,0.16424525311965432,Category1
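To use your own files instead, replace "dummy_csv_files" in the steps below with the path to your folder. Alternatively, you can make that folder the working directory with the following command and then use '.' as the folder name in the steps below:
os.chdir('/path/to/your/csv/files')
Replace '/path/to/your/csv/files' with the actual path to your folder.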
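One way to switch folders without changing the working directory is to keep the folder path in a single variable and build full paths from it. The sketch below is only an illustration: the variable names data_dir and example_path are assumptions, not part of the original steps.

```python
import os

# Hypothetical variable: point this at your own folder of CSV files.
data_dir = "dummy_csv_files"

# Build a full path from data_dir instead of calling os.chdir().
example_path = os.path.join(data_dir, "dummy_file_1.csv")
print(example_path)
```

With this approach, switching to another folder means changing one line, and relative paths elsewhere in your script keep working.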
Step 3: Getting the List of CSV Files
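We can get a list of all CSV files in the directory using the os.listdir() function and a list comprehension. Because os.listdir() returns names in arbitrary order, we sort the result so the files are always processed in the same order. The output should be the CSV files we created.
csv_files = sorted(f for f in os.listdir("dummy_csv_files") if f.endswith('.csv'))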
print(csv_files)
Output:
['dummy_file_1.csv', 'dummy_file_2.csv', 'dummy_file_3.csv']
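As an alternative sketch, the standard-library pathlib module can list the same files and return full paths in one call. The helper name find_csv_files is hypothetical, and the demo runs on a temporary folder with made-up file names so that it does not depend on the dummy files.

```python
import tempfile
from pathlib import Path


def find_csv_files(folder):
    """Return the .csv files directly inside folder as sorted Path objects."""
    return sorted(Path(folder).glob("*.csv"))


# Demo on a throwaway folder containing two CSV files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.csv", "a.csv", "notes.txt"]:
        (Path(tmp) / name).write_text("Column_A\n1\n")
    found_names = [p.name for p in find_csv_files(tmp)]

print(found_names)  # ['a.csv', 'b.csv']
```

Because each Path already includes the folder, the results can be passed straight to pd.read_csv() without os.path.join().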
Step 4: Loading the CSV Files into a DataFrame
Now, we can load each CSV file into a DataFrame and append it to a list of DataFrames using a for loop.
dfs = []
for csv_file in csv_files:
    # Build the full path so the file is found regardless of the working directory
    df = pd.read_csv(os.path.join("dummy_csv_files", csv_file))
    dfs.append(df)
Step 5: Concatenating the DataFrames
Finally, we can concatenate all the DataFrames in the list into a single DataFrame using the pd.concat() function.
final_df = pd.concat(dfs, ignore_index=True)
print(final_df)
Output:
    Column_A  Column_B   Column_C
0         23  0.746112  Category1
1         10  0.192186  Category1
2         64  0.534275  Category1
3         52  0.246643  Category2
4         62  0.164245  Category1
5         81  0.840457  Category1
6          3  0.660372  Category1
7         98  0.785123  Category1
8         65  0.392288  Category2
9         38  0.670140  Category1
10        62  0.589897  Category2
11        90  0.283221  Category2
12        96  0.697957  Category2
13        65  0.882271  Category2
14        67  0.279740  Category1
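The ignore_index=True argument discards each file's own row index (0 to 4) and gives the final DataFrame a new, continuous index from 0 to 14. Without it, the index values 0 to 4 would repeat once per file.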
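The effect of ignore_index is easiest to see on two tiny DataFrames. This is a minimal illustration with made-up values, not the dummy files from Step 2.

```python
import pandas as pd

first = pd.DataFrame({"Column_A": [1, 2]})
second = pd.DataFrame({"Column_A": [3, 4]})

# Default behaviour: each DataFrame keeps its own index, so labels repeat.
kept = pd.concat([first, second])

# ignore_index=True: the result gets a fresh index starting at 0.
reset = pd.concat([first, second], ignore_index=True)

print(list(kept.index))   # [0, 1, 0, 1]
print(list(reset.index))  # [0, 1, 2, 3]
```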
And there you have it! You’ve successfully loaded multiple CSV files from a folder into a single DataFrame.
Common Errors
Incorrect file path: The folder path is wrong, or the working directory is not the one your relative path assumes. In that case os.listdir() and pd.read_csv() raise a FileNotFoundError.
Column mismatch: Assuming all files have the same column names. pd.concat() aligns columns by name, so files with different columns do not raise an error; instead, the result silently gains extra columns filled with NaN.
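Both problems can be caught before loading any data. The following is a minimal sketch under stated assumptions: the helper names check_csv_folder and mismatched_files are hypothetical, the first file (in sorted order) is treated as the reference layout, and the demo uses made-up files in a temporary folder.

```python
import os
import tempfile

import pandas as pd


def check_csv_folder(folder):
    """Return {file name: list of column names} for every CSV in folder."""
    if not os.path.isdir(folder):
        raise FileNotFoundError(f"Folder not found: {folder!r}")
    columns = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith(".csv"):
            # nrows=0 reads only the header row, so large files are cheap to inspect.
            header = pd.read_csv(os.path.join(folder, name), nrows=0)
            columns[name] = list(header.columns)
    return columns


def mismatched_files(columns):
    """Return the files whose columns differ from those of the first file."""
    if not columns:
        return []
    reference = next(iter(columns.values()))
    return [name for name, cols in columns.items() if cols != reference]


# Demo: two files share a layout, the third has a different second column.
with tempfile.TemporaryDirectory() as tmp:
    frames = {
        "part_1.csv": pd.DataFrame({"Column_A": [1], "Column_B": [2.0]}),
        "part_2.csv": pd.DataFrame({"Column_A": [3], "Column_B": [4.0]}),
        "part_3.csv": pd.DataFrame({"Column_A": [5], "Other": [6.0]}),
    }
    for name, frame in frames.items():
        frame.to_csv(os.path.join(tmp, name), index=False)
    odd_ones = mismatched_files(check_csv_folder(tmp))

print(odd_ones)  # ['part_3.csv']
```

Note that the comparison is order-sensitive, so a file with the same columns in a different order is also reported, even though pd.concat() would align it correctly by name.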
Conclusion
Loading multiple CSV files into one DataFrame is a common task in data science. With Python and pandas, this task becomes a breeze. We hope this guide has been helpful in your data science journey.