Downloading a CSV from a URL and Converting it to a DataFrame using Python Pandas
In the world of data science, Python’s Pandas library is a powerful tool for data manipulation and analysis. One common task that data scientists often encounter is downloading a CSV file from a URL and converting it into a DataFrame for further processing. This blog post will guide you through this process step-by-step.
Table of Contents
- Prerequisites
- Step-by-Step downloading a CSV from URL
- Pros and Cons of This Method
- Common Errors and How to Handle Them
- Conclusion
Prerequisites
Before we start, make sure you have the following installed on your system:
- Python 3.6 or later
- Pandas library
- Requests library
If you haven’t installed them yet, you can do so using pip:
pip install pandas requests
Step-by-Step downloading a CSV from URL
Step 1: Importing the Required Libraries
The first step is to import the necessary libraries. We will need the pandas library for creating the DataFrame and the requests library for downloading the CSV file.
import pandas as pd
import requests
Step 2: Downloading the CSV File
Next, we will download the CSV file from the URL. We will use the requests library’s get method to do this. The get method sends a GET request to the specified URL and returns the response.
url = "https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv"
response = requests.get(url)
In this example, we will use a real-world dataset related to COVID-19, specifically country-wise aggregated data.
Step 3: Converting the CSV File to a DataFrame
After downloading the CSV file, we can convert it into a DataFrame using the pandas library’s read_csv method. The read_csv method reads a CSV file and converts it into a DataFrame.
response = requests.get(url)
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Save the content of the response to a local CSV file
    with open("downloaded_data.csv", "wb") as f:
        f.write(response.content)
    print("CSV file downloaded successfully")
    # Read the CSV file into a Pandas DataFrame (only after a successful download)
    df = pd.read_csv("downloaded_data.csv")
else:
    print("Failed to download CSV file. Status code:", response.status_code)
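The code above writes the response to a local file and then reads that file with read_csv. As an alternative that avoids the intermediate file, the StringIO class from Python’s io module can convert the response text into a file-like object, which can then be passed directly to the read_csv method.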
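The StringIO approach can be sketched as follows. This is a minimal illustration, not part of the original walkthrough: the helper name csv_text_to_dataframe and the sample text are made up here, and in practice you would pass response.text from Step 2 instead of the sample string.

```python
import io

import pandas as pd


def csv_text_to_dataframe(csv_text: str) -> pd.DataFrame:
    """Parse CSV text held in memory into a DataFrame (hypothetical helper)."""
    # StringIO wraps the string in a file-like object that read_csv accepts.
    return pd.read_csv(io.StringIO(csv_text))


# Stand-in for response.text, so the sketch runs without a network call.
sample_text = (
    "Date,Country,Confirmed\n"
    "2020-01-22,Afghanistan,0\n"
    "2020-01-23,Afghanistan,0\n"
)
sample_df = csv_text_to_dataframe(sample_text)
print(sample_df.shape)
```

Parsing in memory skips the intermediate file, which is convenient when you do not need to keep a local copy of the download.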
Step 4: Exploring the DataFrame
Now that we have our DataFrame, we can start exploring it. Here are a few methods you can use:
- df.head(): This method returns the first 5 rows of the DataFrame.
- df.describe(): This method provides a statistical summary of the DataFrame.
- df.info(): This method provides a concise summary of the DataFrame, including the number of non-null entries in each column.
print("\n--- HEAD ---")
print(df.head())
print("\n--- DESCRIBE ---")
print(df.describe())
print("\n--- INFO ---")
# info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()
Output:
--- HEAD ---
         Date      Country  Confirmed  Recovered  Deaths
0  2020-01-22  Afghanistan          0          0       0
1  2020-01-23  Afghanistan          0          0       0
2  2020-01-24  Afghanistan          0          0       0
3  2020-01-25  Afghanistan          0          0       0
4  2020-01-26  Afghanistan          0          0       0
--- DESCRIBE ---
          Confirmed     Recovered         Deaths
count  1.615680e+05  1.615680e+05  161568.000000
mean   7.361569e+05  1.453967e+05   13999.436089
std    3.578884e+06  9.748275e+05   59113.581271
min    0.000000e+00  0.000000e+00       0.000000
25%    1.220000e+03  0.000000e+00      17.000000
50%    2.369200e+04  1.260000e+02     365.000000
75%    2.558420e+05  1.797225e+04    4509.000000
max    8.062512e+07  3.097475e+07  988609.000000
--- INFO ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161568 entries, 0 to 161567
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   Date       161568 non-null  object
 1   Country    161568 non-null  object
 2   Confirmed  161568 non-null  int64
 3   Recovered  161568 non-null  int64
 4   Deaths     161568 non-null  int64
dtypes: int64(3), object(2)
memory usage: 6.2+ MB
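In the info() output above, the Date column has dtype object, which means the dates were read as plain strings. One way to get real datetime values is the parse_dates parameter of read_csv. The sketch below uses a small inline sample in place of downloaded_data.csv so that it runs on its own; the variable names are illustrative.

```python
import io

import pandas as pd

# Small stand-in for downloaded_data.csv, using the same column names.
sample_csv = io.StringIO(
    "Date,Country,Confirmed,Recovered,Deaths\n"
    "2020-01-22,Afghanistan,0,0,0\n"
    "2020-01-23,Afghanistan,0,0,0\n"
)

# parse_dates converts the named column to a datetime dtype while reading.
dated_df = pd.read_csv(sample_csv, parse_dates=["Date"])
print(dated_df.dtypes)

# With real datetimes, date-based filtering works directly.
after_first_day = dated_df[dated_df["Date"] > "2020-01-22"]
print(len(after_first_day))
```

With the real file, the same call would be pd.read_csv("downloaded_data.csv", parse_dates=["Date"]).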
Pros and Cons of This Method
Pros:
- Simple and straightforward implementation.
- Suitable for smaller datasets.
- No need for additional dependencies beyond Pandas and Requests.
Cons:
- Not ideal for large datasets, because the entire file is downloaded and held in memory before it is written to disk and parsed.
- Dependency on internet connectivity for downloading the file.
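For files that are too large to load comfortably at once, one option is the chunksize parameter of read_csv, which yields the data as a sequence of smaller DataFrames. This does not avoid the download itself, but it limits how much is parsed into memory at a time. The following is a minimal sketch that assumes the Country and Confirmed columns from the dataset above; the function name and the sample values are made up for illustration.

```python
import io

import pandas as pd


def max_confirmed_by_country(csv_source, chunksize: int = 3) -> dict:
    """Find the highest Confirmed value per country, one chunk at a time."""
    running_max = {}
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file into one DataFrame.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        chunk_max = chunk.groupby("Country")["Confirmed"].max()
        # Merge this chunk's per-country maximum into the running result.
        for country, value in chunk_max.items():
            value = int(value)
            running_max[country] = max(running_max.get(country, value), value)
    return running_max


# Illustrative sample (values are invented); a file path works the same way.
sample_csv = io.StringIO(
    "Date,Country,Confirmed\n"
    "2020-03-01,Afghanistan,1\n"
    "2020-03-02,Afghanistan,5\n"
    "2020-03-01,Albania,7\n"
    "2020-03-02,Albania,3\n"
)
result = max_confirmed_by_country(sample_csv, chunksize=3)
print(result)
```

The chunk size is a trade-off: larger chunks mean fewer iterations, smaller chunks mean lower peak memory use.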
Common Errors and How to Handle Them
Error 1: ConnectionError
try:
    # A timeout prevents the request from waiting indefinitely
    response = requests.get(url, timeout=10)
    # Raise an HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print("HTTP Error:", errh)
except requests.exceptions.ConnectionError as errc:
    print("Error Connecting:", errc)
except requests.exceptions.Timeout as errt:
    print("Timeout Error:", errt)
except requests.exceptions.RequestException as err:
    print("Error:", err)
This code snippet handles the errors that may occur during the download: HTTP error statuses (raised by raise_for_status), connection failures, timeouts, and any other request exception.
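If you download CSV files in more than one place, the download and the error handling can be wrapped in small functions. The sketch below is one possible arrangement, not a definitive implementation: the names fetch_csv and try_fetch_csv and the 10-second default timeout are assumptions made for this example.

```python
import io

import pandas as pd
import requests


def fetch_csv(url: str, timeout: float = 10.0) -> pd.DataFrame:
    """Download a CSV and return it as a DataFrame (hypothetical helper)."""
    # A timeout keeps the call from waiting indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status turns 4xx/5xx responses into requests.exceptions.HTTPError.
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))


def try_fetch_csv(url: str):
    """Return a DataFrame, or None if the download fails."""
    try:
        return fetch_csv(url)
    except requests.exceptions.RequestException as err:
        # RequestException is the base class of HTTPError,
        # ConnectionError and Timeout, so one clause covers all three.
        print("Download failed:", err)
        return None
```

Calling try_fetch_csv(url) with the URL from Step 2 would then return either the DataFrame or None, which the caller can check before continuing.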
Error 2: File Not Found
try:
    df = pd.read_csv("downloaded_data.csv")
except FileNotFoundError:
    print("The specified CSV file was not found.")
This snippet addresses the scenario where the downloaded file is not found.
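A missing file is not the only way reading can fail: the file may exist but be empty or malformed. As a hedged extension of the snippet above, the sketch below also catches pandas' EmptyDataError and ParserError; the function name load_csv_or_none is made up for this example.

```python
import io

import pandas as pd


def load_csv_or_none(source):
    """Read a CSV, returning None for the failure modes handled below."""
    try:
        return pd.read_csv(source)
    except FileNotFoundError:
        print("The specified CSV file was not found.")
    except pd.errors.EmptyDataError:
        # Raised when the file has no columns to parse, e.g. a zero-byte file.
        print("The CSV file is empty.")
    except pd.errors.ParserError as err:
        # Raised when rows cannot be tokenized, e.g. inconsistent field counts.
        print("The CSV file could not be parsed:", err)
    return None


# An empty in-memory "file" stands in for an empty download.
empty_result = load_csv_or_none(io.StringIO(""))
print(empty_result)
```

Returning None keeps the example short; in a larger project you might prefer to log the error and re-raise it.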
Conclusion
In this guide, we covered the process of downloading a CSV file from a URL and converting it into a Pandas DataFrame using Python. We discussed the pros and cons of this method, common errors, and provided detailed examples for handling potential issues. Incorporate these steps into your data analysis projects to efficiently work with remote datasets.