How to Scrape an HTML Table with Beautiful Soup into Pandas
As a data scientist or software engineer, you may often encounter the need to extract data from an HTML table on a website. This task can seem daunting at first, especially if you are not familiar with the necessary tools and techniques. Fortunately, with the help of Python and the Beautiful Soup library, extracting data from an HTML table is a relatively straightforward process.
In this article, we will walk through the steps of scraping an HTML table using Beautiful Soup and then importing the data into a Pandas DataFrame. By the end of this article, you will have a solid understanding of how to extract data from an HTML table and use it in your data science or software engineering projects.
What is Beautiful Soup?
Beautiful Soup is a Python library designed for web scraping purposes. It allows you to parse HTML and XML documents, extract data, and navigate the parse tree with ease. Beautiful Soup provides a simple interface for working with HTML and XML files, making it an ideal tool for web scraping.
Scraping an HTML table with Beautiful Soup
To scrape an HTML table using Beautiful Soup, you will need to follow these steps:
- Install Beautiful Soup
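Before you can start using Beautiful Soup, you need to install it. You can install it with pip, Python's package installer. The later steps also use the requests and pandas libraries, so install them at the same time if they are not already present:
pip install beautifulsoup4 requests pandas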
- Import the necessary libraries
After installing Beautiful Soup, you will need to import the necessary libraries into your Python script:
from bs4 import BeautifulSoup
import requests
import pandas as pd
The requests library is used to make HTTP requests to the website that hosts the HTML table. The pandas library is used to create a DataFrame from the scraped data.
- Make an HTTP request to the website
Next, make an HTTP request to the page that contains the HTML table. You can do this using the requests.get() function:
url = 'https://gcoins.net/en/catalog/view/45518'
response = requests.get(url)
Replace https://gcoins.net/en/catalog/view/45518 with the URL of the page that contains the table you want to scrape. In this tutorial, we will use a table that shows prices for some old coins.
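Requests can fail or hang, so it can help to check the response before parsing it. The sketch below shows one possible way to do that. The helper name fetch_html, the optional session parameter, and the 10-second timeout are assumptions made for illustration and are not part of the original steps.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html`, `session`, and `timeout` are illustrative names and
    defaults, not something the tutorial itself defines.
    """
    # Use the caller's session if one is given, otherwise the requests module.
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With a helper like this, html = fetch_html(url) would stand in for the requests.get(url) call and the later use of response.text.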
- Parse the HTML document
After making the HTTP request, you will need to parse the HTML document using Beautiful Soup. You can do this by passing the response.text attribute to the BeautifulSoup() constructor:
soup = BeautifulSoup(response.text, 'html.parser')
The html.parser argument tells Beautiful Soup to use Python's built-in HTML parser to parse the HTML document.
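To see what parsing produces without touching the network, you can feed Beautiful Soup a small HTML string. The markup below is hand-written for illustration only and is not taken from the coin catalog page.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document used only for illustration.
html = """
<html>
  <head><title>Coin prices</title></head>
  <body>
    <table id="prices"><tr><td>1882</td><td>108,000</td></tr></table>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup object.
page_title = soup.title.string
# find() returns the first matching tag, or None if there is none.
first_cell = soup.find("td").get_text()

print(page_title)
print(first_cell)
```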
- Find the HTML table
Once the HTML document has been parsed, you can find the HTML table by inspecting the HTML code of the website. You will need to find the HTML table using its tag name, ID, class, or other attributes.
For example, if the HTML table has the class attribute subs noBorders evenRows, you can find it using the soup.find() method:
table = soup.find('table', attrs={'class':'subs noBorders evenRows'})
table_rows = table.find_all('tr')
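Replace subs noBorders evenRows with the class of the HTML table you want to scrape. If no table matches, soup.find() returns None, so the following find_all() call will fail.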
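A table can usually be located in more than one way. The sketch below runs a few lookups against a hand-written sample; the id prices and the second table are assumptions added for illustration, not markup from the real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the id "prices" is an assumption for illustration.
html = """
<table id="prices" class="subs noBorders evenRows">
  <tr><td>1882</td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Look the table up by its id attribute.
by_id = soup.find("table", id="prices")
# class_ matches a single CSS class regardless of the others on the tag.
by_class = soup.find("table", class_="noBorders")
# select_one() takes a CSS selector and returns the first match or None.
by_selector = soup.select_one("table.subs.evenRows")
# A lookup that matches nothing returns None rather than raising.
missing = soup.find("table", id="does-not-exist")

if missing is None:
    print("No table with that id; check the page source.")
```

Checking for None before calling find_all() on the result avoids an AttributeError when the page layout differs from what you expected.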
- Extract the data from the HTML table
After finding the HTML table, you can extract the data from it using Beautiful Soup. You will need to loop through the rows and columns of the HTML table and extract the text from each cell.
For example, you can extract the data from the HTML table as follows:
data = []
# Walk every row (<tr>) in the table.
for row in table.find_all('tr'):
    row_data = []
    # Collect the text of each data cell (<td>) in this row.
    for cell in row.find_all('td'):
        row_data.append(cell.text)
    data.append(row_data)
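This code loops through each row of the HTML table, extracts the text from each td cell, and appends the row's values to the data list. A row that contains no td cells, such as a header row built from th cells, produces an empty list, which pandas later displays as a row of None values.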
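If the table has a header row, you may want to keep it separate and trim whitespace as you go. The following is a minimal sketch on a hand-written table; the sample markup and the headers variable are assumptions for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hand-written sample table with one header row and two data rows.
html = """
<table>
  <tr><th>Year</th><th>Mintage</th></tr>
  <tr><td> 1882 </td><td>108,000</td></tr>
  <tr><td>1883</td><td>786,000</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

headers = []
data = []
for row in table.find_all("tr"):
    header_cells = row.find_all("th")
    if header_cells:
        # Treat a row of <th> cells as the column names.
        headers = [cell.get_text(strip=True) for cell in header_cells]
        continue
    # get_text(strip=True) removes leading and trailing whitespace.
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:  # skip rows that have no data cells
        data.append(cells)

print(headers)
print(data)
```

The headers list can then be passed as the columns argument of pd.DataFrame().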
- Convert the data to a Pandas DataFrame
After extracting the data from the HTML table, you can convert it to a Pandas DataFrame using the pd.DataFrame() constructor:
df = pd.DataFrame(data)
print(df)
Output:
0 1 2 3 4 5 6
0 None None None None None None None
1 1882 108,000 UNC —
2 1883 786,000 UNC ~ $3.20
3 \n\n\n 1884 4,604,000 UNC ~ $1.67–$5.77
4 1885 1,314,000 UNC ~ $2.56
5 1886 444,000 UNC —
6 1888 413,000 UNC ~ $2.31
7 1889 568,000 UNC ~ $2.05
8 \n\n\n 1890 2,137,000 UNC ~ $1.03–$6.41
9 1891 605,000 UNC —
10 1892 205,000 UNC ~ $3.59
11 \n\n\n 1893 754,000 UNC ~ $3.84–$11.28
12 1894 532,000 UNC ~ $2.56
13 1895 423,000 UNC ~ $1.92
14 1896 174,000 UNC —
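The raw DataFrame above still contains an all-None row, cells with stray newline characters, and numbers stored as text. One possible cleanup is sketched below on a small hand-written sample shaped like that output. The column names mark, year, and mintage are assumptions chosen for illustration; the page itself may label these columns differently.

```python
import pandas as pd

# Hand-written rows imitating the scraped output; column names are assumed.
rows = [
    [None, None, None],
    ["\n\n\n", "1884", "4,604,000"],
    ["", "1885", "1,314,000"],
]
df = pd.DataFrame(rows, columns=["mark", "year", "mintage"])

# Drop rows in which every cell is missing (the empty header row).
df = df.dropna(how="all")
# Strip surrounding whitespace, including newlines, from every text cell.
df = df.apply(lambda col: col.str.strip())
# Convert the numeric columns from text to integers.
df["year"] = df["year"].astype(int)
df["mintage"] = df["mintage"].str.replace(",", "", regex=False).astype(int)
# Renumber the rows after dropping the empty one.
df = df.reset_index(drop=True)

print(df)
```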
Conclusion
In this article, we have walked through the steps of scraping an HTML table using Beautiful Soup and then importing the data into a Pandas DataFrame.
Scraping HTML tables is a common task in data science and software engineering, and Beautiful Soup provides a simple and effective way to accomplish this task. By following the steps outlined in this article, you should now be able to scrape HTML tables with ease and use the extracted data in your projects.