Loading Multiple CSV Files from a Folder into One DataFrame: A Guide

Data scientists often encounter situations where they need to load multiple CSV files from a folder into a single DataFrame. This process can be tedious and time-consuming, especially when dealing with a large number of files. In this blog post, we will walk you through a step-by-step guide on how to efficiently load multiple CSV files into one DataFrame using Python and pandas.

Table of Contents

  1. Step 1: Importing Necessary Libraries
  2. Step 2: Generating Some Dummy Files
  3. Step 3: Getting the List of CSV Files
  4. Step 4: Loading the CSV Files into a DataFrame
  5. Step 5: Concatenating the DataFrames
  6. Common Errors
  7. Conclusion

Prerequisites

Before we start, make sure you have the following installed on your system:

  • Python 3.6 or higher
  • pandas library

If you haven’t installed pandas yet, you can do so using pip:

pip install pandas

Step 1: Importing Necessary Libraries

First, we need to import the necessary libraries. We will be using pandas, a powerful data manipulation library; os, a built-in Python module for interacting with the operating system; and numpy, which we use to generate some CSV files to work with.

import pandas as pd
import os
import numpy as np

Step 2: Generating Some Dummy Files

Next, we will use numpy to generate some CSV files and store them in the dummy_csv_files folder (directory). If you already have your own CSV files, you can skip this step.

# Create a directory to store the dummy CSV files
os.makedirs("dummy_csv_files", exist_ok=True)

# Generate and save dummy CSV files
for i in range(1, 4):  # Create 3 dummy CSV files
    data = {
        'Column_A': np.random.randint(1, 100, 5),
        'Column_B': np.random.rand(5),
        'Column_C': np.random.choice(['Category1', 'Category2'], 5)
    }
    df = pd.DataFrame(data)

    csv_filename = f"dummy_file_{i}.csv"
    df.to_csv(os.path.join("dummy_csv_files", csv_filename), index=False)

The code above creates three CSV files, each of which looks like the following (the values are random, so yours will differ):

Column_A,Column_B,Column_C
23,0.7461118185309882,Category1
10,0.19218583714699145,Category1
64,0.5342754209988878,Category1
52,0.2466430926417844,Category2
62,0.16424525311965432,Category1

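To work with your own CSV files instead of the dummy ones, you can change the working directory to the folder that contains them:

os.chdir('/path/to/your/csv/files')

Replace '/path/to/your/csv/files' with the actual path to your folder. Once the working directory is that folder, use "." (the current directory) wherever the steps below use "dummy_csv_files".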

Step 3: Getting the List of CSV Files

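We can get a list of all CSV files in the directory using the os.listdir() function and a list comprehension. Because os.listdir() returns entries in arbitrary order, we sort the result so the files are always loaded in the same order. The output should be the CSV files we created.

csv_files = sorted([f for f in os.listdir("dummy_csv_files") if f.endswith('.csv')])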
print(csv_files)

Output:

['dummy_file_1.csv', 'dummy_file_2.csv', 'dummy_file_3.csv']
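As an alternative to os.listdir(), the standard-library pathlib module can filter by extension and build full paths in one step. The sketch below is not part of the original walkthrough: it creates a throwaway temporary folder with hypothetical file names so that it runs on its own, and you would point folder at your own directory instead.

```python
from pathlib import Path
import tempfile

# Hypothetical folder and file names, created only so this sketch is runnable.
folder = Path(tempfile.mkdtemp())
for name in ["dummy_file_2.csv", "dummy_file_1.csv", "notes.txt"]:
    (folder / name).write_text("Column_A\n1\n")

# glob("*.csv") keeps only the CSV files; sorted() makes the order deterministic.
csv_paths = sorted(folder.glob("*.csv"))
print([p.name for p in csv_paths])  # ['dummy_file_1.csv', 'dummy_file_2.csv']
```

Each entry in csv_paths is a full path, so it can be passed to pd.read_csv() directly without os.path.join().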

Step 4: Loading the CSV Files into a DataFrame

Now, we can load each CSV file into a DataFrame and append it to a list of DataFrames using a for loop.

dfs = []

for csv_file in csv_files:  # avoid naming the variable "csv", which is also a standard-library module
    df = pd.read_csv(os.path.join("dummy_csv_files", csv_file))
    dfs.append(df)
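Once rows from several files are combined, it is no longer obvious which file each row came from. One possible variation on the loop above, shown here as a sketch rather than a required step, tags every row with its file name while loading. The source_file column name and the a.csv and b.csv files are assumptions made up for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical input files, written to a temporary folder so the sketch runs on its own.
folder = tempfile.mkdtemp()
pd.DataFrame({"Column_A": [1, 2]}).to_csv(os.path.join(folder, "a.csv"), index=False)
pd.DataFrame({"Column_A": [3]}).to_csv(os.path.join(folder, "b.csv"), index=False)

csv_files = sorted(f for f in os.listdir(folder) if f.endswith(".csv"))

# Read each file and add a column recording where its rows came from.
dfs = [
    pd.read_csv(os.path.join(folder, name)).assign(source_file=name)
    for name in csv_files
]

print(dfs[0])
```

The list comprehension does the same work as the for loop; which form to use is a matter of taste.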

Step 5: Concatenating the DataFrames

Finally, we can concatenate all the DataFrames in the list into a single DataFrame using the pd.concat() function.

final_df = pd.concat(dfs, ignore_index=True)
print(final_df)

Output:

    Column_A  Column_B   Column_C
0         23  0.746112  Category1
1         10  0.192186  Category1
2         64  0.534275  Category1
3         52  0.246643  Category2
4         62  0.164245  Category1
5         81  0.840457  Category1
6          3  0.660372  Category1
7         98  0.785123  Category1
8         65  0.392288  Category2
9         38  0.670140  Category1
10        62  0.589897  Category2
11        90  0.283221  Category2
12        96  0.697957  Category2
13        65  0.882271  Category2
14        67  0.279740  Category1

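The ignore_index=True argument discards each file's own row index (0 to 4 in every file) and gives the final DataFrame a single continuous index, here 0 to 14. Without it, the index values 0 to 4 would repeat once per file.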
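To see the effect of ignore_index in isolation, here is a minimal sketch with two small hand-made DataFrames (first and second are illustrative names, not part of the walkthrough):

```python
import pandas as pd

first = pd.DataFrame({"Column_A": [1, 2]})
second = pd.DataFrame({"Column_A": [3, 4]})

# Default behaviour: each DataFrame keeps its own index, so labels repeat.
kept = pd.concat([first, second])
print(kept.index.tolist())   # [0, 1, 0, 1]

# With ignore_index=True the result gets a fresh index starting at 0.
reset = pd.concat([first, second], ignore_index=True)
print(reset.index.tolist())  # [0, 1, 2, 3]
```

Repeated index labels are legal in pandas, but they make label-based lookups such as kept.loc[0] return several rows, which is rarely what you want after combining files.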

And there you have it! You’ve successfully loaded multiple CSV files from a folder into a single DataFrame.

Common Errors

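  • Incorrect file path: Not setting the working directory correctly, or providing an invalid path to the folder containing the CSV files. Both os.listdir() and pd.read_csv() raise a FileNotFoundError in this case.

  • Column mismatch: Assuming all files have the same column names. pd.concat() aligns columns by name, so it does not raise an error when the files differ. Instead, any column missing from a file is filled with NaN for that file's rows, which can silently leave gaps in the combined DataFrame. A different column order alone is harmless.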
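Both problems can be caught early with a couple of small checks. The helpers below, check_folder and check_same_columns, are hypothetical names written for this sketch and are not pandas functions; treat them as one possible approach under the assumption that every file is supposed to share the same set of column names.

```python
import os

import pandas as pd


def check_folder(folder):
    """Fail early with a clear message if the folder does not exist."""
    if not os.path.isdir(folder):
        raise FileNotFoundError(f"Folder not found: {folder!r}")


def check_same_columns(frames):
    """Raise ValueError if any DataFrame's column names differ from the first one's."""
    expected = set(frames[0].columns)
    for position, frame in enumerate(frames[1:], start=1):
        if set(frame.columns) != expected:
            raise ValueError(
                f"DataFrame {position} has columns {list(frame.columns)}, "
                f"expected {sorted(expected)}"
            )


a = pd.DataFrame({"Column_A": [1], "Column_B": [0.5]})
b = pd.DataFrame({"Column_A": [2], "Column_C": ["Category1"]})

# pd.concat does not complain about the mismatch; it fills the gaps with NaN.
combined = pd.concat([a, b], ignore_index=True)
print(combined.isna().sum().tolist())  # [0, 1, 1]

# The explicit check turns the silent mismatch into a visible error.
try:
    check_same_columns([a, b])
except ValueError as err:
    print(err)
```

Calling check_folder() before os.listdir() and check_same_columns(dfs) before pd.concat() would slot these checks into the steps above.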

Conclusion

Loading multiple CSV files into one DataFrame is a common task in data science. With Python and pandas, this task becomes a breeze. We hope this guide has been helpful in your data science journey.

