{"id":8008,"date":"2021-06-07T19:16:39","date_gmt":"2021-06-07T13:46:39","guid":{"rendered":"https:\/\/python-programs.com\/?p=8008"},"modified":"2021-11-22T18:40:41","modified_gmt":"2021-11-22T13:10:41","slug":"convert-pdf-to-txt-file-using-python","status":"publish","type":"post","link":"https:\/\/python-programs.com\/convert-pdf-to-txt-file-using-python\/","title":{"rendered":"Convert PDF to TXT file using Python"},"content":{"rendered":"
You are probably already familiar with PDFs. They are one of the most widely used formats for digital documents. PDF stands for Portable Document Format, and PDF files use the .pdf extension. The format is used to display and share documents reliably, regardless of software, hardware, or operating system.<\/p>\n
Text Extraction from a PDF File
\nThe Python module PyPDF2 can be used for text extraction, but it can do more than that: it can also create, decrypt, and merge PDF files.<\/p>\n
Why is PDF to TXT conversion needed?<\/strong><\/p>\n Before getting to the main part of this post, here is a scenario in which this kind of PDF extraction is required.<\/p>\n Consider a job portal where applicants upload their CVs in PDF format. Recruiters search for specific keywords, such as Hadoop developer, big data developer, Python developer, or Java developer. As an additional processing step, the portal extracts the text from each PDF and matches it against the keyword the recruiter is looking for, then returns the candidate’s name, email, or other information.<\/p>\n Python has several libraries for PDF extraction, but we’ll look at the PyPDF2 module here.<\/p>\n PyPDF2 is a pure-Python package designed as a PDF toolkit. It is capable of:<\/p>\n obtaining document information (title, author, etc.)<\/p>\n splitting documents page by page<\/p>\n merging documents page by page<\/p>\n cropping pages<\/p>\n merging several pages into a single page<\/p>\n encrypting and decrypting PDF files, and more.<\/p>\n According to the PyPDF2 website, it can also be used to add data, viewing options, and passwords to PDFs.<\/p>\n First, we’ll add the external module PyPDF2. To install it, open a command prompt in Windows and use the pip command to install PyPDF2.<\/p>\n Open Python IDLE and press Ctrl + N. 
This opens a new editor window.<\/p>\n You are free to use any other text editor of your choice.<\/p>\n Save the file under a name of your choice with the .py extension.<\/p>\n Save this .py file in the same folder as your PDF.<\/p>\n Below is the implementation:<\/strong><\/p>\n Output:<\/strong><\/p>\n Python Programming Online<\/p>\n We start by creating a Python file object and then opening the PDF file in \u201cread binary (rb)\u201d mode.<\/p>\nConvert PDF to TXT file using Python<\/h2>\n
\n
1)PyPDF2 module<\/h3>\n
\nNow we’ll look at how to extract text from a PDF file using the PyPDF2 module. Enter the following code in your Python IDE.<\/p>\n2)Creating a PDF file<\/h3>\n
\n
3)Install PyPDF2<\/h3>\n
4)Creating and opening a new Python project<\/h3>\n
5)Implementation<\/h3>\n
import PyPDF2\r\n\r\n# Open the PDF file in read-binary ('rb') mode\r\npdffile = open(r'C:\\Users\\Vikram\\Desktop\\samplepdf.pdf', 'rb')\r\n\r\n# Create a reader object that will read the PDF file\r\npdfReader = PyPDF2.PdfFileReader(pdffile)\r\n\r\n# Store the number of pages in this PDF file\r\nnum = pdfReader.numPages\r\n\r\n# Select a page; pages are zero-indexed, so the last page is num - 1\r\npageobj = pdfReader.getPage(num - 1)\r\n\r\n# Extract the text of the selected page\r\nresulttext = pageobj.extractText()\r\n\r\n# Append the extracted text to the output file, then close both files\r\nnewfile = open(\r\n    r\"C:\\Users\\Vikram\\Desktop\\Calender\\sample.txt\", \"a\")\r\nnewfile.write(resulttext)\r\nnewfile.close()\r\npdffile.close()\r\n<\/pre>\n
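The implementation above reads a single page. As a minimal sketch, assuming the same PyPDF2 interface used in this post (`PdfFileReader`, `numPages`, `getPage`, `extractText`), the text of every page could be collected in a loop. The helper names `extract_all_text` and `pdf_to_txt` are hypothetical and not part of PyPDF2. Newer PyPDF2 releases rename these calls (`PdfReader`, `pages`, `extract_text`), so the sketch may need adjusting for your installed version.

```python
def extract_all_text(reader):
    """Return the text of every page, joined by newlines.

    `reader` is any object exposing the PdfFileReader-style interface
    used in this post: `numPages`, `getPage(i)` and `extractText()`.
    """
    pages = []
    # Page indexes are zero-based: valid values are 0 .. numPages - 1.
    for index in range(reader.numPages):
        pages.append(reader.getPage(index).extractText())
    return "\n".join(pages)


def pdf_to_txt(pdf_path, txt_path):
    """Hypothetical helper: write the text of all pages to txt_path."""
    import PyPDF2  # imported here so extract_all_text works without it

    # The with-statements close both files even if extraction fails.
    with open(pdf_path, "rb") as pdf_file:
        text = extract_all_text(PyPDF2.PdfFileReader(pdf_file))
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
```

Calling `pdf_to_txt("samplepdf.pdf", "sample.txt")` with your own file names would then write the whole document rather than one page. Opening the output in `"w"` mode overwrites the file on each run, whereas the `"a"` mode used above appends to it.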
\nTutorial | Free Beginners\u2019 Guide on
\nPython Programming Language
\nDo you Love to Program in Python Language? Are you completely new to the Phyton programming language? Then, refer to this ultimate guide on Python Programming and become the top programmer. For detailed information such as What is Python? Why we use it? Tips to Learn Python Programming Language, Applications for Python dive into this article.<\/p>\n6)Explanation<\/h3>\n
\nThe PdfFileReader object is then created, which will read the file opened in the previous step.
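\nThe number of pages in the file is stored in a variable and used to select a page with getPage(), which counts pages from zero.
\nThe text of the selected page is extracted with extractText().
\nThe final step writes the extracted text to the text file you designate.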
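Once the text is in a plain string, the job-portal scenario described earlier (matching recruiter keywords against a CV) can be sketched with the standard library alone. This is only an illustration: the function name `find_keywords` and the sample resume text are made up, and a real portal would likely need more careful matching.

```python
import re


def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = []
    for keyword in keywords:
        # \b keeps "java" from matching inside "javascript".
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


# Hypothetical text, standing in for what was extracted from a PDF.
resume_text = "Experienced Python developer with Hadoop and big data projects."
print(find_keywords(resume_text, ["python", "java", "hadoop"]))
# prints ['python', 'hadoop']
```

Whole-word matching is a design choice here; keywords that begin or end with punctuation, such as "C++", would need a different pattern.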
\nRelated Programs<\/strong>:<\/p>\n\n