{"id":26751,"date":"2022-03-31T02:21:04","date_gmt":"2022-03-30T20:51:04","guid":{"rendered":"https:\/\/python-programs.com\/?p=26751"},"modified":"2022-03-31T02:21:04","modified_gmt":"2022-03-30T20:51:04","slug":"how-to-extract-images-from-a-pdf-in-python","status":"publish","type":"post","link":"https:\/\/python-programs.com\/how-to-extract-images-from-a-pdf-in-python\/","title":{"rendered":"How to Extract Images from a PDF in Python?"},"content":{"rendered":"
What is a PDF?<\/strong><\/p>\n PDF, short for Portable Document Format, is a popular format for distributing documents. It uses the\u00a0.pdf<\/strong>\u00a0file extension and was designed by Adobe Systems in the early 1990s.<\/p>\n Reading PDF documents in Python lets you automate a wide range of tasks.<\/p>\n Let us now see how to extract images from a PDF file in Python. For this purpose, we use the PyMuPDF and Pillow modules.<\/p>\n Installation<\/strong><\/p>\n Approach:<\/strong><\/p>\n Below is the implementation:<\/strong><\/p>\n Output:<\/strong><\/p>\n <\/p>\n <\/p>\n <\/p>\n","protected":false},"excerpt":{"rendered":" What is a PDF? PDF, short for Portable Document Format, is a popular format for distributing documents. It uses the\u00a0.pdf\u00a0file extension and was designed by Adobe Systems in the early 1990s. Reading PDF documents in Python lets you automate a wide range of tasks. Let us now see how to extract images …<\/p>\npip\u00a0install\u00a0PyMuPDF<\/pre>\n
pip\u00a0install\u00a0Pillow<\/pre>\n
Collecting PyMuPDF\r\nDownloading PyMuPDF-1.19.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014\r\n_x86_64.whl (8.8 MB)\r\n|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 8.8 MB 4.5 MB\/s \r\nInstalling collected packages: PyMuPDF\r\nSuccessfully installed PyMuPDF-1.19.6<\/pre>\n<\/div>\n
Extract Images from a PDF in Python<\/h2>\n
\n
# Import the fitz module (provided by the PyMuPDF package)\r\nimport fitz\r\n# Import the io (input-output) module\r\nimport io\r\n# Import Image from the PIL module (provided by the Pillow package)\r\nfrom PIL import Image\r\n\r\n# Open the PDF file using the open() function of the fitz module\r\n# by passing the filename\/path as an argument to it.\r\ngvn_pdf = fitz.open(\"samplepdf_file.pdf\")\r\n# Get the number of pages in the PDF file using the len() function\r\nno_of_pages = len(gvn_pdf)\r\n\r\n# Loop over every page of the PDF by index\r\nfor k in range(no_of_pages):\r\n    # Get the page at the current index\r\n    page = gvn_pdf[k]\r\n    # Get all image entries on this page using the get_images() function\r\n    # (the old name getImageList() is deprecated)\r\n    img_lst = page.get_images()\r\n    # Loop over the image entries, counting from 1 with enumerate()\r\n    for imgcount, image in enumerate(img_lst, start=1):\r\n        # The first item of each entry is the XREF of the image\r\n        img_xref = image[0]\r\n        # Extract the image information by passing the xref to extract_image()\r\n        # (the old name extractImage() is deprecated)\r\n        imageinformation = gvn_pdf.extract_image(img_xref)\r\n        # Get the raw image bytes from the \"image\" key\r\n        img_bytes = imageinformation[\"image\"]\r\n        # Get the image file extension from the \"ext\" key\r\n        img_extension = imageinformation[\"ext\"]\r\n        # Load the image into PIL from an in-memory buffer of the image bytes\r\n        rslt_img = Image.open(io.BytesIO(img_bytes))\r\n        # Save the image under a name built from the page number, image count and extension.\r\n        # Passing a path (not an open file object) lets Pillow close the file itself.\r\n        rslt_img.save(f\"page{k+1}_img{imgcount}.{img_extension}\")<\/pre>\n
Note: on PyMuPDF 1.19, calling the older names getImageList() and extractImage() prints these warnings:\r\nDeprecation: 'getImageList' removed from class 'Page' after v1.19 - use\r\n'get_images'.\r\nDeprecation: 'extractImage' removed from class 'Document' after v1.19 - \r\nuse 'extract_image'.<\/pre>\n
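The loop above can also be wrapped in a reusable function. The following is only a minimal sketch under a few assumptions: the names `save_embedded_images` and `extract_images_from_pdf` are hypothetical helpers, not part of PyMuPDF. The sketch writes the extracted bytes straight to disk instead of going through Pillow, and it skips an xref it has already seen, since a PDF may reference the same image on several pages. The only PyMuPDF calls it relies on are `fitz.open()`, `page.get_images()`, `doc.extract_image()` and `doc.close()`.

```python
from pathlib import Path


def save_embedded_images(doc, out_dir="."):
    """Write each distinct embedded image of `doc` into `out_dir`.

    `doc` is assumed to behave like a PyMuPDF Document: it supports len(),
    indexing by page number, and extract_image(xref); each page supports
    get_images(). Returns the list of file paths that were written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    seen_xrefs = set()  # xrefs already saved, so shared images are written once
    written = []

    for page_no in range(len(doc)):
        page = doc[page_no]
        for img_no, image in enumerate(page.get_images(), start=1):
            xref = image[0]  # first item of each entry is the image xref
            if xref in seen_xrefs:
                continue
            seen_xrefs.add(xref)

            info = doc.extract_image(xref)
            # Same naming scheme as the loop above: page number, image count, extension
            target = out_path / f"page{page_no + 1}_img{img_no}.{info['ext']}"
            target.write_bytes(info["image"])
            written.append(target)

    return written


def extract_images_from_pdf(pdf_path, out_dir="."):
    """Hypothetical convenience wrapper: open the PDF, save its images, close it."""
    import fitz  # PyMuPDF; imported here so the helper above has no hard dependency

    doc = fitz.open(pdf_path)
    try:
        return save_embedded_images(doc, out_dir)
    finally:
        doc.close()
```

Calling `extract_images_from_pdf("samplepdf_file.pdf", "images")` would write the files into an `images` folder and return their paths. Writing the raw bytes keeps each image in its original encoding. Opening it with Pillow first, as in the loop above, is still useful when you want to inspect or convert the image before saving.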