Convert PDF to TXT file using Python

Most readers will already be familiar with PDFs, one of the most widely used formats for digital documents. PDF stands for Portable Document Format, and files in this format use the .pdf extension. The format is used to display and share documents reliably, regardless of software, hardware, or operating system.

Text Extraction from a PDF File
The Python module PyPDF2 can be used to achieve what we want (text extraction), but it can also do more: it can create, decrypt, and merge PDF files.

Why is PDF to TXT conversion needed?

Before we get into the meat of this post, I’ll go over some scenarios in which this type of PDF extraction is required.

One example is a job portal where candidates upload their CVs in PDF format. Recruiters search for specific keywords, such as Hadoop developer, big data developer, Python developer, or Java developer, and each keyword is matched against the skills listed in your resume. To do that, the portal adds a processing step: it extracts the text from your PDF document, matches it against the keyword the recruiter is looking for, and then returns your name, email, or other information to the recruiter. That is the use case for PDF text extraction.
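The matching step in this scenario can be sketched in a few lines. This is only an illustration of the idea, not how any particular job portal works: the function name match_keywords and the sample resume text are hypothetical, and the text is assumed to have already been extracted from the PDF.

```python
import re


def match_keywords(resume_text, keywords):
    """Return the keywords that occur in resume_text (case-insensitive, whole words)."""
    found = []
    for keyword in keywords:
        # \b keeps "java" from matching inside "javascript"
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, resume_text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


# Hypothetical text, standing in for what was extracted from a CV
resume_text = "Experienced Python developer with Hadoop and big data projects."
print(match_keywords(resume_text, ["Python", "Java", "Hadoop", "big data"]))
# prints ['Python', 'Hadoop', 'big data']
```

Whole-word matching is a design choice here; a plain substring test would also report "Java" for a resume that only mentions "JavaScript".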

Python has various libraries for PDF extraction, but we’ll look at the PyPDF2 module here. So, let’s look at how to extract text from a PDF file using this module.


1)PyPDF2 module

PyPDF2 is a pure-Python package designed as a PDF toolkit. It is capable of:

obtaining document information (title, author, etc)

separating documents page by page

merging documents page by page

cropping pages

merging several pages into a single page

encrypting and decrypting PDF files, and more!
So, now we’ll look at how to extract text from a PDF file using the PyPDF2 module. The steps below create a sample PDF, install the package, and then run the extraction code in your Python IDE.

2)Creating a Pdf file

Make a new document in Word.
Fill the Word document with whatever material you choose.
Now, go to File > Print and save the document as a PDF.
Remember to save your PDF file in the same folder as your Python script.
Your .pdf file has now been created and saved, and it will be converted to a .txt file later.

3)Install PyPDF2

First, we’ll install an external module called PyPDF2.

According to the PyPDF2 website, the package is a pure-Python PDF library that can split, merge, crop, and transform PDF files. It can also add data, viewing options, and passwords to PDFs.

To install the PyPDF2 package, open a command prompt in Windows and run pip install PyPDF2.
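One way to confirm that the installation worked is to ask Python whether it can locate the module. The sketch below uses only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate module_name, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("PyPDF2"):
    print("PyPDF2 is available")
else:
    print("PyPDF2 is missing; run: pip install PyPDF2")
```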

4)Creating and opening new Python Project

Open Python IDLE and press Ctrl + N. This opens a new editor window.

You are free to use any other text editor of your choosing.

Save the file under a name of your choice with the .py extension, such as your_pdf_file_name.py.

Save this .py file in the same folder as your PDF.

5)Implementation

Below is the implementation:

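import PyPDF2

# Open the PDF in read-binary ('rb') mode
pdffile = open(r'C:\Users\Vikram\Desktop\samplepdf.pdf', 'rb')

# Create a reader object for the opened PDF file.
# PdfFileReader, numPages, getPage and extractText are the legacy PyPDF2 names;
# PyPDF2 3.x replaces them with PdfReader, len(reader.pages), reader.pages[i]
# and extract_text().
pdfReader = PyPDF2.PdfFileReader(pdffile)

# Store the number of pages in the PDF file
num = pdfReader.numPages

# Pages are zero-indexed, so the last page is num - 1 (num + 1 is out of range)
pageobj = pdfReader.getPage(num - 1)

# Extract the text of the selected page
resulttext = pageobj.extractText()

# Append the extracted text to the output file, then close both files
newfile = open(r"C:\Users\Vikram\Desktop\Calender\sample.txt", "a")
newfile.write(resulttext)
newfile.close()
pdffile.close()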

Output:

Python Programming Online
Tutorial | Free Beginners’ Guide on
Python Programming Language
Do you Love to Program in Python Language? Are you completely new to the Phyton programming language? Then, refer to this ultimate guide on Python Programming and become the top programmer. For detailed information such as What is Python? Why we use it? Tips to Learn Python Programming Language, Applications for Python dive into this article.

6)Explanation

We start by creating a Python file object, opening the PDF file in “read binary (rb)” mode.
The PdfFileReader object is then created, which will read the file opened in the previous step.
The number of pages in the file is stored in a variable.
Because pages are numbered from zero, a page object is then selected by its index and its text is extracted with extractText().
The final step writes the extracted text to a text file you designate.
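The walkthrough above handles a single page. As a rough sketch of how the same steps could cover every page, the version below assumes a newer PyPDF2 release (2.x or later), where the reader class is PdfReader, pages are available as reader.pages, and text is extracted with extract_text(). The function names extract_all_text and pdf_to_txt are hypothetical.

```python
def extract_all_text(reader):
    """Join the text of every page in a reader-like object."""
    page_texts = []
    for page in reader.pages:
        # Scanned or image-only pages may yield no text at all
        page_texts.append(page.extract_text() or "")
    return "\n".join(page_texts)


def pdf_to_txt(pdf_path, txt_path):
    """Write the text of every page of pdf_path to txt_path."""
    # Imported here so extract_all_text can be reused without PyPDF2 installed
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as pdf_file:
        text = extract_all_text(PdfReader(pdf_file))
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
```

With the PDF in the same folder as the script, a call such as pdf_to_txt("samplepdf.pdf", "sample.txt") would produce the text file. The with blocks close both files automatically, even if extraction fails partway through.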