Spire.PDF is a professional PDF library applied to creating, writing, editing, handling and reading PDF files without any external dependencies. Get free and professional technical support for Spire.PDF for .NET, Java, Android, C++, Python.

Fri Jan 03, 2025 9:31 am

Is there any way I can search for a text in the PDF by removing the special characters and spaces in the text to retrieve the coordinates of the text?

vmsubramanian1979
 
Posts: 6
Joined: Tue Jul 16, 2024 4:34 am

Fri Jan 03, 2025 10:04 am

Hello,

Thank you for your inquiry.

I recommend that you try using a regular expression to search through the text. Here is an example code snippet:

Code:
PdfDocument doc = new PdfDocument();

// Read a pdf file
doc.LoadFromFile(input);

// Get the first page of pdf file
PdfPageBase page = doc.Pages[0];

// Create a PdfTextFinder object for searching text within the first page
PdfTextFinder finder = new PdfTextFinder(page);
finder.Options.Parameter = Spire.Pdf.Texts.TextFindParameter.Regex;

// Find occurrences of the specified text within the first page
List<PdfTextFragment> finds = finder.Find("hello[\\s\\S]*world");

// Iterate through each found text fragment
foreach (PdfTextFragment find in finds)
{
    // Get the bounds of the found text on the page
    RectangleF rec = find.Bounds[0];

    // Top-left corner of the match, in page coordinates
    float x = rec.X;
    float y = rec.Y;
}
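
A note on the pattern "hello[\\s\\S]*world": the sketch below uses only .NET's own System.Text.RegularExpressions on a made-up string, to show how [\s\S]* behaves compared with \s*. It assumes PdfTextFinder evaluates patterns the same way as the .NET Regex class, which is not confirmed here, so treat it as an illustration rather than a description of the library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyPatternDemo
{
    public static void Main()
    {
        // Made-up text containing two separate "hello ... world" phrases.
        string text = "hello world, and later: hello, big world";

        // [\s\S]* is greedy and matches any character, so a single match
        // runs from the first "hello" to the last "world".
        Match greedy = Regex.Match(text, @"hello[\s\S]*world");
        Console.WriteLine(greedy.Value.Length == text.Length); // True

        // The lazy form [\s\S]*? stops at the nearest "world" instead.
        Console.WriteLine(Regex.Matches(text, @"hello[\s\S]*?world").Count); // 2

        // \s* only allows whitespace between the two words.
        Console.WriteLine(Regex.Matches(text, @"hello\s*world").Count); // 1
    }
}
```

If only spaces should be ignored between the words, \s* is the narrower choice; [\s\S]* will also swallow any other text that sits between them.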


If you need further assistance or have more questions, feel free to ask!

Sincerely,
Amy
E-iceblue support team

amy.zhao
 
Posts: 3011
Joined: Wed Jun 27, 2012 8:50 am

Tue Jan 07, 2025 5:34 am

Hi Amy,

I have the SearchTerm in a JSON file and need to search for that term in the PDF file, ignoring spaces and special characters, to find the coordinates of the text in the PDF. I can remove the special characters and spaces from the SearchTerm, but is it possible to ignore them in the PDF text as well when matching, so that I can retrieve the coordinates of the text from the PDF? Can you let me know how this can be done? Below is an example of a text that needs to be searched for in the PDF:

Search Term in json file: BMI : 31.0 - 39.0 , adult
PDF Text: BMI:31.0-39.0, adult

I am currently using the code below to search for text in the PDF:

PdfTextFinder finder = new PdfTextFinder(page);
PdfTextFindOptions options = new PdfTextFindOptions();
options.Parameter = Spire.Pdf.Texts.TextFindParameter.IgnoreCase;
finder.Options = options;
List<PdfTextFragment> fragments = finder.Find(SearchKeyTerm);

vmsubramanian1979
 
Posts: 6
Joined: Tue Jul 16, 2024 4:34 am

Tue Jan 07, 2025 7:50 am

Hello,

Thank you for your feedback.

Based on your requirements, we have prepared the following demo for your reference. Please note that it is not possible to remove the original text from the PDF content directly. Instead, the demo draws a white rectangle over the target text, which makes it invisible on the page.

Code:
PdfDocument doc = new PdfDocument();

// Read a pdf file
doc.LoadFromFile(path+"1.pdf");

// Get the first page of pdf file
PdfPageBase page = doc.Pages[0];

// Create a PdfTextFinder object for searching text within the first page
PdfTextFinder finder = new PdfTextFinder(page);
finder.Options.Parameter = Spire.Pdf.Texts.TextFindParameter.Regex;

String regex = "BMI\\s*:\\s*\\d+(?:\\.\\d+)?\\s*-\\s*\\d+(?:\\.\\d+)?\\s*,\\s*adult";
// Find occurrences of the specified text within the first page
List<PdfTextFragment> finds = finder.Find(regex);

// Creates a brush
PdfBrush brush = new PdfSolidBrush(Color.DarkBlue);

// Defines a font
PdfTrueTypeFont font = new PdfTrueTypeFont(new Font("Arial", 7f, FontStyle.Bold));

// Defines text horizontal/vertical center format
PdfStringFormat centerAlign = new PdfStringFormat(PdfTextAlignment.Center, PdfVerticalAlignment.Middle);

RectangleF rec;

// Iterate through each found text fragment
foreach (PdfTextFragment find in finds)
{
    // Gets the bound of the found text in page
    rec = find.Bounds[0];
    page.Canvas.DrawRectangle(PdfBrushes.White, rec);

    // Draws new text as defined font and color
    page.Canvas.DrawString("", font, brush, rec);

}
doc.SaveToFile(path+"result.pdf");
doc.Close();
doc.Dispose();
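
Since the search terms come from a JSON file, the pattern could also be generated from each term instead of being written by hand. The following is a minimal sketch of that idea using only the .NET base class library. SearchPatternBuilder and both of its methods are hypothetical helpers, not part of Spire.PDF. The first method ignores whitespace differences only. The second also ignores punctuation and other special characters, so it is looser and would, for example, treat "31.0" and "310" as the same.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of Spire.PDF.
public static class SearchPatternBuilder
{
    // Builds a regex that matches the term regardless of how whitespace
    // is placed in the term or in the text being searched.
    public static string BuildWhitespaceTolerantPattern(string searchTerm)
    {
        // Drop whitespace, then escape each remaining character so that
        // symbols such as '.' are matched literally.
        var escapedChars = searchTerm
            .Where(c => !char.IsWhiteSpace(c))
            .Select(c => Regex.Escape(c.ToString()));

        // Allow any amount of whitespace between consecutive characters.
        return string.Join(@"\s*", escapedChars);
    }

    // Builds a regex that compares letters and digits only, allowing any run
    // of non-word characters or underscores between them.
    public static string BuildSymbolTolerantPattern(string searchTerm)
    {
        var escapedChars = searchTerm
            .Where(char.IsLetterOrDigit)
            .Select(c => Regex.Escape(c.ToString()));

        return string.Join(@"[\W_]*", escapedChars);
    }

    public static void Main()
    {
        string term = "BMI : 31.0 - 39.0 , adult";   // search term from the JSON file
        string pdfText = "BMI:31.0-39.0, adult";     // text as it appears in the PDF

        string spaces = BuildWhitespaceTolerantPattern(term);
        string symbols = BuildSymbolTolerantPattern(term);

        Console.WriteLine(Regex.IsMatch(pdfText, spaces));            // True
        Console.WriteLine(Regex.IsMatch(pdfText, symbols));           // True
        Console.WriteLine(Regex.IsMatch("BMI 31.0 39.0 adult", spaces));  // False (':' and '-' missing)
        Console.WriteLine(Regex.IsMatch("BMI 31.0 39.0 adult", symbols)); // True
    }
}
```

The resulting string would then be passed to finder.Find(...) with finder.Options.Parameter set to Spire.Pdf.Texts.TextFindParameter.Regex, as in the demo above. Case-insensitive matching is not covered by this sketch; prefixing the pattern with (?i) is the usual .NET approach, but whether PdfTextFinder honours it is an assumption that should be tested.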

If the above example does not meet your needs, please provide us with further feedback.

Sincerely,
Amy
E-iceblue support team

amy.zhao
 
Posts: 3011
Joined: Wed Jun 27, 2012 8:50 am

Mon Feb 17, 2025 10:47 am

Hi. When I extract text with the code below using FreeSpire.PDF, the section of the attached PDFs marked in red is not extracted. Please advise.

PdfTextExtractor textExtractor = new PdfTextExtractor(page);

//Create a PdfTextExtractOptions object
PdfTextExtractOptions extractOptions = new PdfTextExtractOptions();

//Simple extraction is left disabled
//extractOptions.IsSimpleExtraction = true;

//Set IsExtractAllText to true
extractOptions.IsExtractAllText = true;

//Extract text from the page
string pdftextcontent = textExtractor.ExtractText(extractOptions);

vmsubramanian1979
 
Posts: 6
Joined: Tue Jul 16, 2024 4:34 am

Tue Feb 18, 2025 2:28 am

Hi,

Thanks for your inquiry.
Please provide your input PDF file to help me investigate your issue accurately and quickly. You can upload your file here or send it to my email ([email protected]).

Sincerely,
Nina
E-iceblue support team

Nina.Tang
 
Posts: 1379
Joined: Tue Sep 27, 2016 1:06 am
