Python beautifulsoup

Introduction

Beautifulsoup is a Python package that allows you to easily parse and extract data from HTML and XML files.
With a few simple instructions you can convert the parsed file into a tree of Tag objects.
Basically, every HTML or XML tag corresponds to a beautifulsoup Tag.

The next sections will show you how to install this Python package and how to best use it!
Don’t worry, you will also find the link to GitHub with a practical example ready to download and use!

Beautifulsoup – Python package

As already mentioned, beautifulsoup is a Python package that parses and extracts data from HTML and XML files.
You can use the parser included in the Python standard library or an external one such as lxml or html5lib.
The official documentation has more details on how to choose the best parser, but if you don’t know which one to use, the default one will do just fine!
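In practice, the parser is just the second argument of the BeautifulSoup constructor. Here is a minimal sketch (lxml and html5lib are external parsers that must be installed separately with pip):

from bs4 import BeautifulSoup
html = "<p>Hello</p>"
soup = BeautifulSoup(html, "html.parser")  # parser from the standard library
# soup = BeautifulSoup(html, "lxml")       # faster, requires: pip install lxml
# soup = BeautifulSoup(html, "html5lib")   # most lenient, requires: pip install html5lib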

Since beautifulsoup is capable of parsing HTML pages, it is very often used for web scraping tasks.
In this post we will see a practical example of web scraping using beautifulsoup.

How to install Python beautifulsoup

Since we are installing an external package, I recommend that you create a virtualenv.
If you don’t know how to do it, there is a post here with the procedure on how to create a virtualenv.

Regardless of whether you have created the virtualenv or not (I recommend it), you can install beautifulsoup with the following command:

pip install beautifulsoup4

If you need to install a specific version, for example 4.10.0, use this command:

pip install beautifulsoup4==4.10.0

Feel free to replace 4.10.0 with the version you need.

IMPORTANT!
Be careful to install the package called beautifulsoup4.
There is also a package called just beautifulsoup, but it is not what you want: it is the old version (BeautifulSoup 3) of the same package and is no longer maintained.
It is kept alive only because many old projects are still tied to that version.
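To double-check that you installed the right package, you can print the version from the command line (a quick sanity check; any 4.x version means beautifulsoup4 is the one in use):

python -c "import bs4; print(bs4.__version__)"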

Python beautifulsoup – installation issues

Beautifulsoup is packaged as Python 2 code and converted to Python 3 during installation if needed.
This process can lead to some ImportError exceptions immediately after installation, such as:

No module named HTMLParser

Or

No module named html.parser

To fix these problems it is recommended to uninstall and reinstall the package from scratch (also delete any directories created when you unzipped the tarball).
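For example, with pip the clean reinstall looks like this:

pip uninstall beautifulsoup4
pip install beautifulsoup4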

If instead you get the following SyntaxError on the line ROOT_TAG_NAME = u'[document]':

Invalid syntax

You should convert the Python2 code to Python3.
You can do this either by installing the package:

python3 setup.py install

or by manually running Python’s 2to3 conversion script on the bs4 directory:

2to3-3.2 -w bs4

How to use Python beautifulsoup

This section will first show you how to use beautifulsoup to parse static web pages (HTML files) and then how to perform a small web scraping task.

Python beautifulsoup with HTML files

Here you can find an example of how to use beautifulsoup to parse an HTML file.
I invite you to clone the GitHub repository with the code you will see in this section.

With beautifulsoup you can parse an HTML file like the following one:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <h1>A Title</h1>
    <p id="my_paragraph">A paragraph in the HTML file.</p>
</body>
</html>

Let’s see a simple code example to parse the HTML file above:

from bs4 import BeautifulSoup
# html_file/simple_file.html is the file above
with open(r"html_file/simple_file.html") as html_file:
    # "make the soup": parse the file with the built-in html.parser
    soup = BeautifulSoup(html_file, 'html.parser')
print("The soup: \n{}".format(soup))
print("\nThe <p> element with its attributes: \n{}\n{}".format(soup.p, soup.p.attrs))
print("\nThe <h1> element with its attributes: \n{}\n{}".format(soup.h1, soup.h1.attrs))

Let’s go through the code I just showed you step by step.
It may look complicated, but it is simpler than it seems.

First of all, you will notice that BeautifulSoup is imported with the following statement:

from bs4 import BeautifulSoup

Then comes what the documentation calls making the soup: we pass a string or an open file to the BeautifulSoup constructor and let it convert everything into a tree of Tag objects.

with open(r"html_file/simple_file.html") as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

Once the soup has been created, we have everything we need to analyze our file.
In fact, as shown in the example, we can access the HTML elements (<p> and <h1>) with dot notation.
We can also see all the attributes of these elements simply by using the .attrs property.

print("The soup: \n{}".format(soup))
print("\nThe <p> element with its attributes: \n{}\n{}".format(soup.p, soup.p.attrs))
print("\nThe <h1> element with its attributes: \n{}\n{}".format(soup.h1, soup.h1.attrs))

These are just some of the features offered by beautifulsoup.
I invite you to take a look at the documentation for all the details.
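For example, on the same soup you can also read a single attribute with dictionary-style access, or get the text of a tag with the .string property (a minimal sketch):

print(soup.p['id'])       # prints: my_paragraph
print(soup.title.string)  # prints: Title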

Web scraping with Python beautifulsoup

If you are interested in learning how to do web scraping with the beautifulsoup Python package, you are in the right place.
Let’s start by saying that there is a GitHub repository ready to clone with a working example.
I recommend that you try to play around with the code a bit after reading the rest of the post.

To do web scraping, you first need to have a target site. In this case we will use the Hello World page of Wikipedia.
Then you have to use the DevTools console (press the F12 key or right-click and inspect the web page) to see the structure of the page.

Unlike the previous section, where we analyzed a static HTML page, here you need to make an HTTP request to the site.
We will do this via the Python package requests.

from bs4 import BeautifulSoup
import requests
# download the page and make the soup from the response body
response = requests.get(url="https://en.wikipedia.org/wiki/%22Hello,_World!%22_program")
soup = BeautifulSoup(response.text, "html.parser")

At this point, just like in the previous example, we have made the soup, so we can access any element of the page.
Let’s see how we can retrieve the page title via the find() method.

title = soup.find("h1", class_="firstHeading")
print("Title of the page: {}".format(title.get_text()))

The find() method returns the first occurrence of a web page element, in this case <h1>.
It is also possible to specify the attributes of the element we are looking for in order to refine the search.
In fact, in this example, we are looking for the <h1> element that has firstHeading as its class.

The get_text() method instead returns the human-readable text inside a document or tag.
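As a side note, get_text() also accepts some optional arguments: strip removes leading and trailing whitespace from each piece of text, and separator controls how the text of nested tags is joined. A minimal sketch:

print(title.get_text(strip=True))
print(title.get_text(separator=" "))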

Another popular method is find_all().
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

# find all the <h2> chapter headings in the page
chapters_list = soup.find_all("h2")
for chapter in chapters_list:
    chapter_text = chapter.get_text()
    # not every heading contains a link
    chapter_link = chapter.a
    if chapter_link:
        chapter_href = chapter_link.get("href")
    else:
        chapter_href = ""
    print(chapter_text, chapter_href)

This piece of code searches for all <h2> elements on the page and prints each chapter name with its href link, if present.
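find_all() accepts the same filters as find(), plus a few of its own. For instance, you can limit the number of results or filter by attribute (the span class below is just an illustrative assumption about the page structure):

# stop after the first three matches
first_chapters = soup.find_all("h2", limit=3)
# filter by class, exactly as with find()
headlines = soup.find_all("span", class_="mw-headline")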

Conclusion

Here we are at the end of this post. As always, I hope this article was useful to you and that you now know everything about the Python package beautifulsoup.
If you find that something is unclear or you have a problem that you don’t know how to solve, leave me a comment below and I will try to help you as soon as possible.

If this topic is clear to you, take a look at the latest posts!
