Extract PDF text content with XY coordinates

Tue Mar 04, 2025 11:32 am

Hello,
I'm using Spire.Office version 9.10.0 to extract the content of a PDF file. All the text is extracted as expected. However, I also want to extract the X and Y coordinates of each piece of text.

I used the sample test program from the site https://www.e-iceblue.com/Tutorials/JAVA/Spire.PDF-for-JAVA/Program-Guide/Extract/Read/Extract/Read-Text-from-PDF-in-Java.html#:~:text=%3C/dependencies%3E-,Extract%20All%20Text%20from%20a%20Specified%20Page,-Spire.PDF%20for

Kindly guide me to extract all the text coordinates.
Thanks in advance.

Wed Mar 05, 2025 2:50 am

Hello,

Thanks for your inquiry.
Please note that the text in a PDF is made up of text fragments. You can use the following code to get the coordinates of each text fragment. If you want the location of a specific word, please refer to the example in this tutorial.

Code: Select all:
    PdfDocument doc = new PdfDocument();
    doc.loadFromFile("test.pdf");

    // Walk every page of the document
    for (Object pageObj : doc.getPages()) {
        PdfPageBase page = (PdfPageBase) pageObj;
        PdfTextFinder finder = new PdfTextFinder(page);
        finder.getOptions().setTextFindParameter(EnumSet.of(TextFindParameter.WholeWord));

        // Collect every text fragment on the page
        List<PdfTextFragment> allText = finder.findAllText(page);
        for (PdfTextFragment textFragment : allText) {
            // Only the first bounding rectangle of each fragment is used here
            Rectangle2D[] bounds = textFragment.getBounds();
            double x = bounds[0].getX();
            double y = bounds[0].getY();
            System.out.println(textFragment.getText() + ": x:" + x + " y:" + y);
        }
    }

Sincerely,
William
E-iceblue support team

Wed Mar 05, 2025 4:34 am

Hello,
That saved me time, thanks, William!

Wed Mar 05, 2025 6:23 am

Hello,

Thanks for your reply.
If you have any further questions, please feel free to write back.

Sincerely,
William
E-iceblue support team

Wed Mar 05, 2025 11:52 am

Hi William,

I reviewed several PDF documents, and most extracted text with coordinates perfectly. However, one document only extracted text as individual characters (see example below). How can this be addressed?

Code: Select all:
    {
      "Page 1": [
        { "text": "4", "x": 70.98759460449219, "y": 16.934228897094727, "width": 4.447999477386475, "height": 7.999999523162842 },
        { "text": "6", "x": 75.43559265136719, "y": 16.934228897094727, "width": 4.447999477386475, "height": 7.999999523162842 },
        { "text": "P", "x": 79.88359069824219, "y": 16.934228897094727, "width": 5.335999488830566, "height": 7.999999523162842 },
        { "text": "M", "x": 85.21958923339844, "y": 16.934228897094727, "width": 6.663999557495117, "height": 7.999999523162842 }
      ]
    }

I will send the confidential document via email since it cannot be shared publicly.

Thu Mar 06, 2025 1:59 am

Hello,

Thanks for your feedback.
This issue is probably related to your file. To help us investigate further, please provide us with the PDF file in question. You can upload it as an attachment or send it to this email: [email protected] . Thanks in advance.

Sincerely,
William
E-iceblue support team

Thu Mar 06, 2025 4:00 am

Hey William,
I have sent the document to your email address. Kindly go through it and let me know how to proceed.

Thu Mar 06, 2025 6:58 am

Hello,

Thanks for your file.
Kindly note that a "PdfTextFragment" corresponds to the text output by one "TJ" operator, as defined in the PDF specification. "TJ" is a text-showing operator that outputs text with adjustable character spacing. We analyzed your document and found that each character is drawn by its own TJ operator, so each character is extracted as a separate fragment. For this type of PDF file, we cannot process the results any further. Thank you for your understanding.

Sincerely,
William
E-iceblue support team

Fri Mar 07, 2025 7:31 am

Hello,
Thank you for the information.
Do you have any workaround, or any plan to support this kind of document in the future?

Fri Mar 07, 2025 8:03 am

Hello,

Thanks for your reply.
As I explained before, the reason for this issue is that the document itself uses a separate TJ operator to draw each character. Currently Spire.PDF can only extract text on a per-TJ-operator basis. Sorry, for this kind of file we are currently unable to process the results further. Thank you for your understanding.

Sincerely,
William
E-iceblue support team
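
Note for readers who hit the same kind of file: as stated above, Spire.PDF does not regroup these per-character fragments, but they can be merged on the caller's side using the coordinates that are already extracted. The following is only a minimal sketch of that idea and is not part of the Spire.PDF API. The names FragmentMerger, Box, merge, yTol and maxGap are hypothetical, the fragments are assumed to arrive in reading order, and the tolerances are guesses that would need tuning per document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FragmentMerger {

    /** Hypothetical holder for one extracted fragment and its bounding box. */
    public static final class Box {
        public final String text;
        public final double x, y, width, height;

        public Box(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Merges consecutive fragments that share a line (y within yTol) and whose
     * horizontal gap to the running fragment is at most maxGap.
     * Assumption: the input list is already in reading order.
     */
    public static List<Box> merge(List<Box> input, double yTol, double maxGap) {
        List<Box> out = new ArrayList<>();
        Box cur = null;
        for (Box b : input) {
            boolean sameLine = cur != null && Math.abs(b.y - cur.y) <= yTol;
            boolean adjacent = cur != null
                    && b.x >= cur.x
                    && b.x - (cur.x + cur.width) <= maxGap;
            if (sameLine && adjacent) {
                // Extend the running fragment to cover the new character.
                double right = Math.max(cur.x + cur.width, b.x + b.width);
                double height = Math.max(cur.height, b.height);
                cur = new Box(cur.text + b.text, cur.x, cur.y, right - cur.x, height);
            } else {
                if (cur != null) {
                    out.add(cur);
                }
                cur = b;
            }
        }
        if (cur != null) {
            out.add(cur);
        }
        return out;
    }

    public static void main(String[] args) {
        // The four single-character fragments quoted earlier in this thread.
        List<Box> chars = Arrays.asList(
                new Box("4", 70.98759460449219, 16.934228897094727, 4.447999477386475, 7.999999523162842),
                new Box("6", 75.43559265136719, 16.934228897094727, 4.447999477386475, 7.999999523162842),
                new Box("P", 79.88359069824219, 16.934228897094727, 5.335999488830566, 7.999999523162842),
                new Box("M", 85.21958923339844, 16.934228897094727, 6.663999557495117, 7.999999523162842));
        for (Box b : merge(chars, 1.0, 1.0)) {
            System.out.println(b.text + ": x:" + b.x + " y:" + b.y);
        }
    }
}
```

With these tolerances the four sample fragments collapse into a single box with the text "46PM". The sketch cannot recover spaces that the PDF never drew, so a larger gap between two boxes is the only hint of a word boundary; lowering maxGap would split on such gaps instead of merging across them.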