How to Extract Images from a PDF in Python?

What is PDF?

PDF is a popular format for distributing documents. The name is an abbreviation for Portable Document Format, and its files use the .pdf extension. Adobe Systems designed it in the early 1990s.

Reading PDF documents in Python can help you automate a wide range of tasks.

Let us now see how to extract images from a PDF file in Python. For this purpose, we use the PyMuPDF and Pillow modules.

Installation

pip install PyMuPDF
pip install Pillow
Output:
Collecting PyMuPDF
Downloading PyMuPDF-1.19.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.8 MB)
|████████████████████████████████| 8.8 MB 4.5 MB/s 
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.19.6
PyMuPDF module: PyMuPDF is a Python binding for MuPDF, a lightweight PDF viewer. It is imported under the name fitz.
Pillow module: Pillow is a fork of the Python Imaging Library (PIL) that allows you to open, manipulate, and save images in a variety of formats.
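
To confirm that both installs succeeded before going further, a small check like the one below can help. This is only a sketch: the helper name installed_version is made up for this example, and it assumes Python 3.8 or newer, where importlib.metadata is part of the standard library.

```python
# Minimal sketch: report whether PyMuPDF and Pillow are importable.
# installed_version is a hypothetical helper, not part of either library.
from importlib import metadata, util


def installed_version(dist_name, import_name):
    """Return the installed version string, or None if the module is not importable."""
    if util.find_spec(import_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this name.
        return "unknown"


# PyMuPDF is imported as "fitz"; Pillow is imported as "PIL".
for dist_name, import_name in [("PyMuPDF", "fitz"), ("Pillow", "PIL")]:
    version = installed_version(dist_name, import_name)
    print(dist_name, "->", version or "not installed")
```

If either line reports "not installed", rerun the matching pip install command.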

Extract Images from a PDF in Python

Approach:

  • Import the fitz module (PyMuPDF) using the import keyword.
  • Import the io (input-output) module using the import keyword.
  • Import Image from the PIL module using the import keyword.
  • Open the PDF file using the open() function of the fitz module by passing the filename/path as an argument to it.
  • Calculate the number of pages in the PDF file using the len() function by passing the opened PDF as an argument to it.
  • Loop over the page indices of the PDF using a for loop.
  • Get the page at the current index.
  • Get all image objects present on this page using the get_images() function (getImageList() in older PyMuPDF versions) and store the result in a variable.
  • Loop over these image objects using a for loop and the enumerate() function.
  • Get the XREF of the image, which is the first item of each image entry.
  • Extract the image information by passing the above XREF to the extract_image() function (extractImage() in older PyMuPDF versions).
  • Get the image bytes by reading the “image” key of the above image information.
  • Get the image extension by reading the “ext” key of the above image information.
  • Load the image into PIL by wrapping the image bytes in io.BytesIO() and passing the result to Image.open().
  • Save the resulting image using the save() function, with a file name built from the page number, image count, and extension.
  • The Exit of the Program.
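
The steps above can also be wrapped in a reusable function. The following is a sketch rather than a definitive implementation: the names image_filename and extract_images and the out_dir parameter are made up for this example, and it assumes a PyMuPDF version that provides the snake_case methods get_images() and extract_image().

```python
import io
from pathlib import Path


def image_filename(page_index, img_index, ext):
    """Build a name like page1_img2.png from a zero-based page index."""
    return f"page{page_index + 1}_img{img_index}.{ext}"


def extract_images(pdf_path, out_dir="."):
    """Save every embedded image of pdf_path into out_dir and return the saved paths."""
    # Imported lazily so image_filename() can be used without these packages.
    import fitz  # PyMuPDF
    from PIL import Image

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    pdf = fitz.open(pdf_path)
    try:
        for page_index in range(len(pdf)):
            page = pdf[page_index]
            for img_index, img in enumerate(page.get_images(), start=1):
                info = pdf.extract_image(img[0])  # img[0] is the image XREF
                picture = Image.open(io.BytesIO(info["image"]))
                target = out_dir / image_filename(page_index, img_index, info["ext"])
                picture.save(target)  # Pillow picks the format from the extension
                saved.append(target)
    finally:
        pdf.close()
    return saved
```

A call such as extract_images("samplepdf_file.pdf", "images") would then write the files into an images folder, assuming that PDF exists in the working directory.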

Below is the implementation:

# Import the fitz module (PyMuPDF) using the import keyword
import fitz
# Import the io (input-output) module using the import keyword
import io
# Import Image from the PIL module using the import keyword
from PIL import Image

# Open the PDF file using the open() function of the fitz module
# by passing the filename/path as an argument to it.
gvn_pdf = fitz.open("samplepdf_file.pdf")
# Calculate the number of pages in the PDF file using the len() function
# by passing the opened PDF as an argument to it.
no_of_pages = len(gvn_pdf)

# Loop over the page indices of the PDF using a for loop
for k in range(no_of_pages):
    # Get the page at the current index
    page = gvn_pdf[k]
    # Get all image objects present on this page using the get_images() function
    # (getImageList() in older PyMuPDF versions) and store them in a variable
    img_lst = page.get_images()
    # Loop over these image objects using a for loop and the enumerate() function
    for imgcount, image in enumerate(img_lst, start=1):
        # Get the XREF of the image (the first item of each image entry)
        img_xref = image[0]
        # Extract the image information by passing the XREF to the extract_image()
        # function (extractImage() in older PyMuPDF versions)
        imageinformation = gvn_pdf.extract_image(img_xref)
        # Get the image bytes from the "image" key of imageinformation
        img_bytes = imageinformation["image"]
        # Get the image extension from the "ext" key of imageinformation
        img_extension = imageinformation["ext"]
        # Load the image into PIL by wrapping the image bytes in io.BytesIO()
        rslt_img = Image.open(io.BytesIO(img_bytes))
        # Save the image under a name built from the page number, image count and
        # extension. Passing the file name lets Pillow open and close the file itself.
        rslt_img.save(f"page{k+1}_img{imgcount}.{img_extension}")

# Close the PDF file once all pages have been processed
gvn_pdf.close()
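
The bytes stored under the "image" key are already encoded in the format named by the "ext" key, so when no image processing is needed they can be written to disk directly instead of going through Pillow. The sketch below shows that alternative. The helper name write_image_bytes is made up for this example, and sample_info is placeholder data standing in for a real image information dictionary.

```python
from pathlib import Path


def write_image_bytes(info, stem, out_dir="."):
    """Write info["image"] to out_dir/<stem>.<info["ext"]> and return the path."""
    target = Path(out_dir) / f"{stem}.{info['ext']}"
    target.write_bytes(info["image"])
    return target


# Stand-in with the same two keys the code above reads from the image
# information; the bytes are placeholder data, not a real image.
sample_info = {"image": b"\x89PNG placeholder", "ext": "png"}

# Intended use inside the inner loop of the code above (not run here):
#   write_image_bytes(imageinformation, f"page{k+1}_img{imgcount}")
```

Writing the bytes unchanged avoids re-encoding the image, while the Pillow route remains useful when the image has to be converted or resized before saving.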

Output:

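These messages appear when the older camelCase names getImageList() and extractImage() are called; they name get_images() and extract_image() as the replacements:

Deprecation: 'getImageList' removed from class 'Page' after v1.19 - use 'get_images'.
Deprecation: 'extractImage' removed from class 'Document' after v1.19 - use 'extract_image'.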

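Output images in Google Colab: the extracted images are saved in the current working directory with names such as page1_img1.png.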
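
If the same script has to run on PyMuPDF versions from both sides of the rename described in the deprecation messages, one option is to try the new method name first and fall back to the old one. This is a sketch under that assumption: call_first_available is a made-up helper, and the two small classes are stand-ins used only to try it out without a PDF.

```python
def call_first_available(obj, names, *args, **kwargs):
    """Call the first method listed in names that obj actually provides."""
    for name in names:
        method = getattr(obj, name, None)
        if callable(method):
            return method(*args, **kwargs)
    raise AttributeError(f"{type(obj).__name__} has none of: {', '.join(names)}")


# Intended use with the objects from the code above (not run here):
#   img_lst = call_first_available(page, ("get_images", "getImageList"))
#   info = call_first_available(gvn_pdf, ("extract_image", "extractImage"), img_xref)


# Hypothetical stand-ins that mimic an old-style and a new-style page object.
class OldStylePage:
    def getImageList(self):
        return ["old"]


class NewStylePage:
    def get_images(self):
        return ["new"]


print(call_first_available(OldStylePage(), ("get_images", "getImageList")))
print(call_first_available(NewStylePage(), ("get_images", "getImageList")))
```

Listing the new name first means the deprecated name is only used when the new one is missing.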