You are probably familiar with PDFs. They are one of the most widely used digital document formats. PDF stands for Portable Document Format, and PDF files use the .pdf extension. The format is used to display and share documents reliably, regardless of software, hardware, or operating system.
Text Extraction from a PDF File
The Python module PyPDF2 can be used for what we want (text extraction), but it can do more: it can also create, decrypt, and merge PDF files.
Why is PDF-to-TXT conversion needed?
Before we get into the meat of this post, I’ll go over some scenarios in which this type of PDF extraction is required.
One example is a job portal where candidates upload their CVs in PDF format. Recruiters search for specific keywords, such as Hadoop developer, big data developer, Python developer, Java developer, and so on. Each keyword is matched against the skills listed in your resume. To do this, the portal runs a processing step that extracts the text from your PDF document and compares it with the keywords the recruiter is looking for, then returns your name, email, or other information to the recruiter. This is a typical use case for PDF text extraction.
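The matching step described above can be sketched roughly as follows. This is only an illustration: the function name `match_keywords` and the sample text are hypothetical, and a real job portal would likely use a more elaborate search.

```python
import re

def match_keywords(resume_text, keywords):
    """Return the keywords that occur in resume_text, ignoring case."""
    found = []
    for keyword in keywords:
        # The lookarounds keep "java" from matching inside "javascript"
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, resume_text, flags=re.IGNORECASE):
            found.append(keyword)
    return found

# Hypothetical text standing in for what was extracted from a CV
sample_text = "Senior Python developer with Hadoop and big data experience."
print(match_keywords(sample_text, ["Python", "Java", "Hadoop", "big data"]))
```

In practice, `sample_text` would be the text extracted from the uploaded PDF.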
Python has various libraries for PDF extraction, but we’ll look at the PyPDF2 module here. So, let’s look at how to
extract text from a PDF file using this module.
Convert PDF to TXT file using Python
- PyPDF2 module
- Creating a PDF file
- Install PyPDF2
- Creating and opening a new Python project
- Implementation
- Explanation
1)PyPDF2 module
PyPDF2 is a pure-Python package designed as a PDF toolkit. It is capable of:
- obtaining document information (title, author, etc.)
- splitting documents page by page
- merging documents page by page
- cropping pages
- merging several pages into a single page
- encrypting and decrypting PDF files, and more
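As a small illustration of the first capability, the sketch below collects the page count, title, and author from a reader object. It assumes a recent PyPDF2 release in which `PdfReader` exposes `pages` and `metadata`; the helper names `summarize` and `summarize_pdf` are hypothetical, and the demonstration uses a stand-in object so it runs without a PDF file.

```python
from types import SimpleNamespace

def summarize(reader):
    """Collect basic facts from a reader object exposing .pages and .metadata."""
    info = reader.metadata  # may be None when the PDF carries no metadata
    return {
        "pages": len(reader.pages),
        "title": getattr(info, "title", None),
        "author": getattr(info, "author", None),
    }

def summarize_pdf(path):
    # Imported here so summarize() can be tried without PyPDF2 installed
    from PyPDF2 import PdfReader
    with open(path, "rb") as pdf_file:
        return summarize(PdfReader(pdf_file))

# Stand-in object for demonstration only
fake_reader = SimpleNamespace(
    pages=[object(), object()],
    metadata=SimpleNamespace(title="Sample", author="Example Author"),
)
print(summarize(fake_reader))
```

With a real file, the call would be `summarize_pdf("samplepdf.pdf")`.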
So, now we’ll look at how to extract text from a PDF file using the PyPDF2 module. You can write the code in any Python IDE.
2)Creating a PDF file
- Make a new document in Word.
- Fill up the word document with whatever material you choose.
- Now go to File > Print > Save and save the document as a PDF.
- Remember to save your PDF file in the same folder as your Python script.
- Your .pdf file has now been created and saved, and it will be converted to a .txt file later.
3)Install PyPDF2
First, we’ll add an external module called PyPDF2.
The PyPDF2 package is a pure-Python PDF library that can be used to split, merge, crop, and transform PDF files. According to the PyPDF2 website, it can also add custom data, viewing options, and passwords to PDFs.
To install the PyPDF2 package, open a command prompt in Windows and run the pip command: pip install PyPDF2
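To confirm that the installation worked, a quick check like the following can be run from Python. This is a minimal sketch using the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util

def is_installed(module_name):
    """Return True if module_name can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("PyPDF2"):
    print("PyPDF2 is available")
else:
    print("PyPDF2 is missing; run: pip install PyPDF2")
```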
4)Creating and opening a new Python project
Open Python IDLE and press Ctrl + N. This opens a new editor window.
You are free to use any other text editor of your choosing.
Save the file with a .py extension, for example under the same name as your PDF (file_name.py).
Save this .py file in the same folder as your PDF.
5)Implementation
Below is the implementation:
import PyPDF2

# Note: PyPDF2 3.0 and later replaced these names with PdfReader,
# len(reader.pages), reader.pages[i], and extract_text()

# Open the PDF file in read-binary ('rb') mode
pdffile = open(r'C:\Users\Vikram\Desktop\samplepdf.pdf', 'rb')
# Create a reader object that will read the PDF file
pdfReader = PyPDF2.PdfFileReader(pdffile)
# Store the number of pages in this PDF file
num = pdfReader.numPages
# Open the output text file in append mode
newfile = open(r"C:\Users\Vikram\Desktop\Calender\sample.txt", "a")
# Page indices run from 0 to num - 1, so getPage(num + 1) would raise an error
for i in range(num):
    pageobj = pdfReader.getPage(i)
    resulttext = pageobj.extractText()
    newfile.write(resulttext)
# Close both files so the text is flushed to disk
newfile.close()
pdffile.close()
Output:
Python Programming Online
Tutorial | Free Beginners’ Guide on
Python Programming Language
Do you Love to Program in Python Language? Are you completely new to the Phyton programming language? Then, refer to this ultimate guide on Python Programming and become the top programmer. For detailed information such as What is Python? Why we use it? Tips to Learn Python Programming Language, Applications for Python dive into this article.
6)Explanation
We start by creating a Python file object, opening the PDF file in “read binary (rb)” mode.
The PdfFileReader object is then created, which reads the file opened in the previous step.
The number of pages in the file is stored in a variable.
A page object is then retrieved with getPage(), and its text is extracted with extractText(). Page numbering starts at 0, so the last valid index is one less than the number of pages.
The final step writes the extracted text to the text file you designate.
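For readers on a newer PyPDF2 release, the same steps might look like the sketch below. It assumes PyPDF2 3.0 or later, where the reader class is `PdfReader`, pages are accessed through `reader.pages`, and text comes from `extract_text()`. The function names `pages_to_text` and `pdf_to_txt` are hypothetical, and the demonstration uses stand-in page objects so it runs without a PDF file.

```python
from types import SimpleNamespace

def pages_to_text(pages):
    """Join the text of every page; a page with no text contributes ''."""
    return "\n".join((page.extract_text() or "") for page in pages)

def pdf_to_txt(pdf_path, txt_path):
    # Imported lazily so pages_to_text() can be tried without PyPDF2
    from PyPDF2 import PdfReader
    # The with-blocks close both files even if extraction fails
    with open(pdf_path, "rb") as pdf_file:
        text = pages_to_text(PdfReader(pdf_file).pages)
    with open(txt_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)

# Demonstration with stand-in pages instead of a real PDF
fake_pages = [
    SimpleNamespace(extract_text=lambda: "page one"),
    SimpleNamespace(extract_text=lambda: "page two"),
]
print(pages_to_text(fake_pages))
```

With a real file, the call would be `pdf_to_txt("samplepdf.pdf", "sample.txt")`, with both files in the same folder as the script.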