Spire.PDF is a professional PDF library applied to creating, writing, editing, handling and reading PDF files without any external dependencies. Get free and professional technical support for Spire.PDF for .NET, Java, Android, C++, Python.

Tue Mar 04, 2025 11:32 am

Hello,
I'm using Spire.Office version 9.10.0 to extract the content of PDF files. All of the text is extracted as expected, but I also need the X and Y coordinates of each piece of text.

I used the sample test program from the site https://www.e-iceblue.com/Tutorials/JAVA/Spire.PDF-for-JAVA/Program-Guide/Extract/Read/Extract/Read-Text-from-PDF-in-Java.html#:~:text=%3C/dependencies%3E-,Extract%20All%20Text%20from%20a%20Specified%20Page,-Spire.PDF%20for

Kindly guide me to extract all the text coordinates.
Thanks in advance.

Usha.Thavasiappan
 
Posts: 40
Joined: Mon Nov 06, 2023 8:16 am

Wed Mar 05, 2025 2:50 am

Hello,

Thanks for your inquiry.
Please note that the text in a PDF is stored as text fragments. You can use the following code to get the coordinates of each text fragment. If you want to get the location of a specific word, please refer to the example in this tutorial.
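Code:
// Package names as used in the Spire.PDF for Java tutorials.
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.texts.PdfTextFinder;
import com.spire.pdf.texts.PdfTextFragment;
import com.spire.pdf.texts.TextFindParameter;

import java.awt.geom.Rectangle2D;
import java.util.EnumSet;
import java.util.List;

// Load the document and visit every page.
PdfDocument doc = new PdfDocument();
doc.loadFromFile("test.pdf");
for (Object pageObj : doc.getPages())
{
    PdfPageBase page = (PdfPageBase)pageObj;
    // One finder per page, configured to match whole words.
    PdfTextFinder finder = new PdfTextFinder(page);
    finder.getOptions().setTextFindParameter(EnumSet.of(TextFindParameter.WholeWord));
    List<PdfTextFragment> allText = finder.findAllText(page);
    for (PdfTextFragment textFragment: allText)
    {
        // getBounds() returns an array; the first rectangle gives the fragment's starting position.
        Rectangle2D[] bounds = textFragment.getBounds();
        double x = bounds[0].getX();
        double y = bounds[0].getY();
        System.out.println(
                textFragment.getText() + ":  x:" + x + "  y:" + y
        );
    }
}
doc.close();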

Sincerely,
William
E-iceblue support team

William.Zhang
 
Posts: 732
Joined: Mon Dec 27, 2021 2:23 am
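
A note on the snippet above: it reads only bounds[0], although getBounds() returns an array of rectangles. The sketch below is not part of the original reply. It shows one way to collapse such an array into a single enclosing box using only the standard java.awt.geom classes. The class name FragmentBounds and the sample rectangles are made up for illustration, and the idea that a fragment can carry more than one rectangle (for example when it wraps onto a second line) is an assumption, not something confirmed in this thread.

```java
import java.awt.geom.Rectangle2D;

public class FragmentBounds {

    /**
     * Returns the smallest rectangle that encloses every rectangle in
     * bounds, or null when the array is null or empty.
     */
    static Rectangle2D union(Rectangle2D[] bounds) {
        if (bounds == null || bounds.length == 0) {
            return null;
        }
        // Start from a copy so the caller's first rectangle is not modified.
        Rectangle2D box = (Rectangle2D) bounds[0].clone();
        for (int i = 1; i < bounds.length; i++) {
            box = box.createUnion(bounds[i]);
        }
        return box;
    }

    public static void main(String[] args) {
        // Illustrative values only: two rectangles on consecutive lines.
        Rectangle2D[] bounds = {
            new Rectangle2D.Double(70.0, 16.0, 30.0, 8.0),
            new Rectangle2D.Double(20.0, 26.0, 40.0, 8.0)
        };
        Rectangle2D box = union(bounds);
        System.out.println("x:" + box.getX() + "  y:" + box.getY()
                + "  width:" + box.getWidth() + "  height:" + box.getHeight());
    }
}
```

In the loop from the reply, the result of textFragment.getBounds() could be passed to union(...) in place of reading bounds[0] directly, which would also give the width and height of each fragment.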

Wed Mar 05, 2025 4:34 am

Hello,
It saved my time, thanks, William!

Usha.Thavasiappan
 
Posts: 40
Joined: Mon Nov 06, 2023 8:16 am

Wed Mar 05, 2025 6:23 am

Hello,

Thanks for your reply.
If you have any further questions, please feel free to write back.

Sincerely,
William
E-iceblue support team

William.Zhang
 
Posts: 732
Joined: Mon Dec 27, 2021 2:23 am

Wed Mar 05, 2025 11:52 am

Hi William,

I reviewed several PDF documents, and most extracted text with coordinates perfectly. However, one document only extracted text as individual characters (see example below). How can this be addressed?
Code:
{
    "Page 1": [
        {
            "text": "4",
            "x": 70.98759460449219,
            "y": 16.934228897094727,
            "width": 4.447999477386475,
            "height": 7.999999523162842
        },
        {
            "text": "6",
            "x": 75.43559265136719,
            "y": 16.934228897094727,
            "width": 4.447999477386475,
            "height": 7.999999523162842
        },
        {
            "text": "P",
            "x": 79.88359069824219,
            "y": 16.934228897094727,
            "width": 5.335999488830566,
            "height": 7.999999523162842
        },
        {
            "text": "M",
            "x": 85.21958923339844,
            "y": 16.934228897094727,
            "width": 6.663999557495117,
            "height": 7.999999523162842
        }
    ]
}


I will send the confidential document via email since it cannot be shared publicly.

Usha.Thavasiappan
 
Posts: 40
Joined: Mon Nov 06, 2023 8:16 am

Thu Mar 06, 2025 1:59 am

Hello,

Thanks for your feedback.
This issue is likely related to your file. To help us investigate further, please provide the PDF file. You can upload it as an attachment or send it to this email: [email protected] . Thanks in advance.

Sincerely,
William
E-iceblue support team

William.Zhang
 
Posts: 732
Joined: Mon Dec 27, 2021 2:23 am

Thu Mar 06, 2025 4:00 am

Hey William,
I have sent the document to your email address. Kindly go through it and let me know how to proceed.

Usha.Thavasiappan
 
Posts: 40
Joined: Mon Nov 06, 2023 8:16 am

Thu Mar 06, 2025 6:58 am

Hello,

Thanks for your file.
Kindly note that a "PdfTextFragment" corresponds to the text output by a single "TJ operator" as defined in the PDF specification. "TJ" is a text-showing operator that outputs text with adjustable character spacing. We analyzed your document and found that each character is drawn by its own TJ operator, so each character is extracted as an independent fragment. For this type of PDF file, we cannot process the results any further. Thank you for your understanding.

Sincerely,
William
E-iceblue support team

William.Zhang
 
Posts: 732
Joined: Mon Dec 27, 2021 2:23 am

Fri Mar 07, 2025 7:31 am

Hello,
Thank you for the information.
Do you have a workaround, or any plan to support extracting text from these kinds of documents in the future?

Usha.Thavasiappan
 
Posts: 40
Joined: Mon Nov 06, 2023 8:16 am

Fri Mar 07, 2025 8:03 am

Hello,

Thanks for your reply.
As I explained before, the cause of this issue is that the document itself uses a separate TJ operator to draw each character. Currently, Spire.PDF can only extract text based on the TJ operator, so we are unable to process this kind of file any further. Thank you for your understanding.

Sincerely,
William
E-iceblue support team

William.Zhang
 
Posts: 732
Joined: Mon Dec 27, 2021 2:23 am
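
A possible client-side workaround, which is not a Spire.PDF feature and was not proposed by the support team: since each character fragment comes with x, y, width and height, neighbouring characters can be joined after extraction. The sketch below uses the four fragments from the JSON posted earlier in the thread. It assumes the fragments arrive in reading order, and it treats two fragments as part of the same run when their y values match within yTol and the next x starts where the previous fragment ends, within gapTol. The Fragment holder class, the merge method and both tolerance values are hypothetical and would need tuning per document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FragmentMerger {

    /** Hypothetical holder for one extracted fragment; not a Spire.PDF type. */
    static final class Fragment {
        final String text;
        final double x, y, width, height;

        Fragment(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Joins consecutive fragments that sit on the same line (y within yTol)
     * and touch horizontally (gap within gapTol). Assumes reading order.
     */
    static List<Fragment> merge(List<Fragment> fragments, double yTol, double gapTol) {
        List<Fragment> merged = new ArrayList<>();
        Fragment current = null;
        for (Fragment next : fragments) {
            boolean adjacent = current != null
                    && Math.abs(next.y - current.y) <= yTol
                    && Math.abs(next.x - (current.x + current.width)) <= gapTol;
            if (adjacent) {
                // Extend the current run so that it also covers the next fragment.
                current = new Fragment(current.text + next.text, current.x, current.y,
                        next.x + next.width - current.x,
                        Math.max(current.height, next.height));
            } else {
                if (current != null) {
                    merged.add(current);
                }
                current = next;
            }
        }
        if (current != null) {
            merged.add(current);
        }
        return merged;
    }

    public static void main(String[] args) {
        // The four single-character fragments from the JSON in this thread.
        List<Fragment> chars = Arrays.asList(
            new Fragment("4", 70.98759460449219, 16.934228897094727, 4.447999477386475, 7.999999523162842),
            new Fragment("6", 75.43559265136719, 16.934228897094727, 4.447999477386475, 7.999999523162842),
            new Fragment("P", 79.88359069824219, 16.934228897094727, 5.335999488830566, 7.999999523162842),
            new Fragment("M", 85.21958923339844, 16.934228897094727, 6.663999557495117, 7.999999523162842));
        for (Fragment f : merge(chars, 0.5, 0.5)) {
            System.out.println(f.text + ":  x:" + f.x + "  y:" + f.y + "  width:" + f.width);
        }
    }
}
```

With these inputs the four characters collapse into a single "46PM" fragment that starts at the x of the "4". A small gapTol keeps words separate, because the space between words is wider than the tolerance. A larger value would join whole lines, but the sketch does not insert space characters when it does so.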
