Extract text from PDF Python

Introduction

If you have ever wondered how to extract text from a PDF file using Python, you’re in the right place!
PDF is often seen as a format that is difficult to edit and that requires additional tools.

In this post we will see that managing a PDF in Python is very simple and requires very few lines of code.

Extract text from PDF using Python – PyPDF2

Extracting text from a PDF file using Python is very simple.
For this tutorial we will use PyPDF2, a Python package that allows you to read, merge and modify PDFs in a few lines of code.

Disclaimer:
This tutorial works well only if the PDF contains actual text, that is, if it is not an image or the result of a scan.

Having said that, let’s proceed and begin to see how to install the package we need!

PyPDF2 package installation

Since we are installing an external package, I recommend that you create a virtualenv.
If you don’t know how to do it, there is a separate post that explains how to create a virtualenv.

Whether or not you have created the virtualenv (I recommend it), you can install PyPDF2 with the following command:

pip install PyPDF2

If you need to install a specific version, for example 2.10.0, use this command:

pip install PyPDF2==2.10.0

Feel free to replace 2.10.0 with the version you need.

PyPDF2 can be used with both Python 2 and Python 3, but not every PyPDF2 release supports every Python version, so be careful which one you install.
The documentation tells you which version is suitable for your setup.

For simplicity, the table below shows the version scheme.

[Table: which PyPDF2 version to install based on the Python version]
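
Before choosing a version, it can help to check which Python interpreter you are running and which PyPDF2 release, if any, is already installed. Here is a minimal sketch that uses only the standard library. The helper name `installed_version` is made up for this example, and `importlib.metadata` assumes Python 3.8 or later.

```python
import sys
from importlib import metadata  # part of the standard library since Python 3.8


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Build a "major.minor.micro" string for the running interpreter.
python_version = "{}.{}.{}".format(*sys.version_info[:3])
print("Python version:", python_version)
print("PyPDF2 version:", installed_version("PyPDF2") or "not installed")
```

With both values in hand you can compare them against the version scheme in the documentation.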

How to extract text from a PDF file using Python

In this section we will see how you can extract text from a PDF file using Python.
To solve this problem we will use the PyPDF2 package.

First we need to import the necessary packages and read the file whose text we want to extract.

from PyPDF2 import PdfReader
reader = PdfReader("my_amazing_file.pdf")  # use the full path if the file is not in the same directory as the script

If the pdf you want to read is not in the same directory as the Python script, remember to put the full path instead of the filename.
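
One possible way to build that full path without hard-coding it is the standard `pathlib` module. This is only a sketch: `build_pdf_path` is a hypothetical helper written for this post, not something provided by PyPDF2, and it assumes the file lives in the current working directory unless you pass another folder.

```python
from pathlib import Path


def build_pdf_path(filename, base_dir=None):
    """Return an absolute path to filename inside base_dir.

    If base_dir is not given, the current working directory is used.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / filename).resolve()


pdf_path = build_pdf_path("my_amazing_file.pdf")
if pdf_path.is_file():
    print("Ready to read:", pdf_path)
else:
    print("File not found:", pdf_path)
```

Once the file is found, you can pass the path to the reader as a string, for example `PdfReader(str(pdf_path))`.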

The next step is to check whether the PDF file is empty or contains at least one page.
To do this we can use the .pages attribute.
This attribute returns a list-like sequence of all the pages contained in the PDF file, so we can pass it to len() to count them.

pages = reader.pages  # sequence of page objects, not a number
if pages:
    print("Found {} pages".format(len(pages)))
else:
    print("Empty PDF, nothing to read")

The last and most important step is to extract the text from the PDF file. This is done using the extract_text method of each page.
We can modify the code above like this:

pages = reader.pages  # sequence of page objects, not a number
if pages:
    print("Found {} pages".format(len(pages)))
    for page in pages:
        # extract_text() returns the text of a single page as a string
        text = page.extract_text()
        print(text)
else:
    print("Empty PDF, nothing to read")

At this point you should see in the terminal all the text present in the PDF file.
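
Printing is fine for a quick check, but you will often want the whole text in a single string, for example to save it or search it. Here is a small sketch of how that could look. `collect_text` and `FakePage` are names invented for this example: the helper accepts any iterable of objects that have an `extract_text()` method, and `FakePage` is just a stand-in so the snippet runs without a real PDF.

```python
def collect_text(pages, separator="\n"):
    """Join the text of every page into one string.

    pages can be any iterable of objects with an extract_text() method,
    such as reader.pages from PyPDF2.
    """
    chunks = []
    for page in pages:
        # Fall back to an empty string if a page yields no text.
        chunks.append(page.extract_text() or "")
    return separator.join(chunks)


class FakePage:
    """Stand-in for a PDF page, used only to demonstrate the helper."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


full_text = collect_text([FakePage("First page"), FakePage("Second page")])
print(full_text)
```

With a real file you would call `collect_text(reader.pages)` instead of passing fake pages.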

Remember that this is only possible if the PDF contains actual text, that is, if it is not an image or the result of a scan.
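
If you are not sure whether a file falls into that category, a rough check is to see whether any page yields non-blank text at all. The sketch below is a heuristic of my own, not a PyPDF2 feature: `has_extractable_text` and `StubPage` are hypothetical names, and an empty result only suggests that the PDF may be scanned, it does not prove it.

```python
def has_extractable_text(pages):
    """Return True if at least one page yields non-whitespace text."""
    return any((page.extract_text() or "").strip() for page in pages)


class StubPage:
    """Stand-in for a PDF page, used only to demonstrate the check."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


text_pdf = [StubPage("   "), StubPage("Some real text")]
blank_pdf = [StubPage(""), StubPage("\n")]

for name, pages in (("text_pdf", text_pdf), ("blank_pdf", blank_pdf)):
    if has_extractable_text(pages):
        print(name, "-> text found")
    else:
        print(name, "-> no text found, possibly a scan or an image")
```

With a real file you would pass `reader.pages` to the function.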

Conclusion

Here we are at the end of this post. As always, I hope this article is useful to you and that you now know how to extract text from a PDF with the Python package PyPDF2.
If something is unclear or you have a problem that you don’t know how to solve, leave me a comment below and I will try to help you as soon as possible.

If this topic is clear to you, take a look at the latest posts!
