Python: Find and Highlight Text in PDF

Highlighting important text with vibrant colors is a commonly employed method for navigating and emphasizing content in PDF documents. Particularly in lengthy PDFs, emphasizing key information aids readers swiftly comprehending the document content, thereby enhancing reading efficiency. Utilizing Python programs enables document creators to effortlessly and expeditiously execute the highlighting process. This article will explain how to use Spire.PDF for Python to find and highlight text in PDF documents with Python programs.

Install Spire.PDF for Python

This scenario requires Spire.PDF for Python and plum-dispatch v1.7.4. They can be easily installed in your Windows through the following pip command.

pip install Spire.PDF

If you are unsure how to install, please refer to this tutorial: How to Install Spire.PDF for Python on Windows

Find and Highlight Specific Text in PDF with Python

Spire.PDF for Python enables developers to find all occurrences of specific text on a page with PdfPageBase.FindText() method and apply highlight color to an occurrence with ApplyHighLight() method. Below is an example of using Spire.PDF for Python to highlight all occurrence of specific text:

  • Create an object of PdfDocument class and load a PDF document using PdfDocument.LoadFromFile() method.
  • Loop through the pages in the document.
  • Get a page using PdfDocument.Pages.get_Item() method.
  • Find all occurrences of specific text on the page using PdfPageBase.FindText() method.
  • Loop through the occurrences and apply a highlight color to each occurrence using ApplyHighLight() method.
  • Save the document using PdfDocument.SaveToFile() method.
  • Python
from spire.pdf import *
from spire.pdf.common import*

# Create an object of PdfDocument class and load a PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Loop through the pages in the PDF document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Find all occurrences of specific text on the page
    result = page.FindText("cloud server", TextFindParameter.IgnoreCase).Finds
    # Highlight all the occurrences
    for text in result:
        text.ApplyHighLight(Color.get_Cyan())

# Save the document
pdf.SaveToFile("output/FindHighlight.pdf")
pdf.Close()

Python: Find and Highlight Text in PDF

Find and Highlight Text in a Specified PDF Page Area with Python

In addition to finding and highlighting specified text on the entire PDF page, Spire.PDF for Python also supports finding and highlighting specified text in specified areas on the page by passing a RectangleF instance as parameter to the PdfPageBase.FindText() method. The detailed steps are as follows:

  • Create an object of PdfDocument class and load a PDF document using PdfDocument.LoadFromFile() method.
  • Get the first page of the document using PdfDocument.Pages.get_Item() method.
  • Define a rectangular area.
  • Find all occurrences of specific text in the specified rectangular area on the first page using PdfPageBase.FindText() method.
  • Loop through the occurrences and apply a highlight color to each occurrence using ApplyHighLight() method.
  • Save the document using PdfDocument.SaveToFile() method.
  • Python
from spire.pdf.common import *
from spire.pdf import *

# Create an objetc of PdfDocument and load a PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Get a page
pdfPageBase = pdf.Pages.get_Item(0)

# Define a rectangular area
rctg = RectangleF(0.0, 0.0, pdfPageBase.ActualSize.Width, 300.0)

# Find all the occurrences of specified text in the rectangular area
findCollection = pdfPageBase.FindText(rctg,"cloud server",TextFindParameter.IgnoreCase)

# Find text in the rectangle
for find in findCollection.Finds:
        #Highlight searched text
        find.ApplyHighLight(Color.get_Green())

# Save the document
pdf.SaveToFile("output/FindHighlightArea.pdf")
pdf.Close()

Python: Find and Highlight Text in PDF

Find and Highlight Text in PDF using Regular Expression with Python

Sometimes the text that needs to be highlighted is not exactly the same words. In this case, the use of regular expressions allows more flexibility in text search. By passing TextFindParameter.Regex as parameter to the PdfPageBase.FindText() method, we can find text using regular expression in PDF documents. The detailed steps are as follows:

  • Create an object of PdfDocument class and load a PDF document using PdfDocument.LoadFromFile() method.
  • Specify the regular expression.
  • Get a page using PdfDocument.Pages.get_Item() method.
  • Find matched text with the regular expression on the page using PdfPageBase.FindText() method.
  • Loop through the matched text and apply Highlight color to the text using ApplyHighLight() method.
  • Save the document using PdfDocument.SaveToFile() method.
  • Python
from spire.pdf import *
from spire.pdf.common import*

# Create an object of PdfDocument class and load a PDF document
pdf = PdfDocument()
pdf.LoadFromFile("Sample.pdf")

# Specify the regular expression that matches two words after *
regex = "\\*(\\w+\\s+\\w+)"

# Get the second page
page = pdf.Pages.get_Item(1)

# Find matched text on the page using regular expression
result = page.FindText(regex, TextFindParameter.Regex)

# Highlight the matched text
for text in result.Finds:
    text.ApplyHighLight(Color.get_DeepPink())

# Save the document
pdf.SaveToFile("output/FindHighlightRegex.pdf")

Python: Find and Highlight Text in PDF

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents, or to get rid of the function limitations, please request a 30-day trial license for yourself.