Getting word coordinates from PDF

Sat Jan 25, 2014 7:32 pm

Hi all,

I'm trying to get the coordinates (and bounds) of every word in a PDF. I found no easier way to do so than searching for each unique word on every page with:

Code:
    PdfTextFindCollection words = page.FindText(word);
    foreach (PdfTextFind find in words.Finds)
    {
        ...
        bound.X = find.Position.X;
        bound.Y = find.Position.Y;
    }

For some reason sometimes bound.Y is a very large negative number like -89463.002
(I haven't noticed bound.X having such values, they are usually seemingly normal, but I haven't checked all of them)
Why is that?

(The page dimensions are Width = 612.0 Height = 792.0)

My end goal here is to draw the page image into a picturebox and put red boxes around certain words.
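For the drawing step, here is a minimal sketch of mapping a word's bounds from page units to pixel coordinates on the rendered page image. It assumes the bounds are in PDF points with a top-left origin, which this thread does not confirm for Spire.PDF; `PageToImage` and `ToPixels` are hypothetical names used only for illustration.

```csharp
using System.Drawing;

static class PageToImage
{
    // Scales a rectangle given in page units (assumed: PDF points, top-left origin)
    // to pixel coordinates of an image rendered from the same page.
    public static RectangleF ToPixels(RectangleF wordBounds, SizeF pageSize, Size imageSize)
    {
        // Separate factors per axis in case the image was not rendered at a uniform scale.
        float scaleX = imageSize.Width / pageSize.Width;
        float scaleY = imageSize.Height / pageSize.Height;

        return new RectangleF(
            wordBounds.X * scaleX,
            wordBounds.Y * scaleY,
            wordBounds.Width * scaleX,
            wordBounds.Height * scaleY);
    }
}
```

With the 612 x 792 page above, the result could be passed to `Graphics.DrawRectangle(Pens.Red, r.X, r.Y, r.Width, r.Height)` on the bitmap shown in the picture box. If the library reports a bottom-left origin instead, Y would have to be flipped first.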

Any help?

P.S. I really don't like this method of getting word locations, as it also finds matches inside longer words: e.g. if I try to find the location of "may", it also finds the "may" in "mayhem" and "maybe". If anyone has a better method, please suggest it.
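On the substring problem, one possible approach is to filter matches by word boundaries. The sketch below works on plain text only and returns the character offsets of whole-word matches. `WholeWord.FindOffsets` is a hypothetical helper, and tying these offsets back to page coordinates would require the library to expose the text surrounding each find, which is not shown in this thread.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WholeWord
{
    // Returns the start offsets of every whole-word, case-insensitive
    // occurrence of `word` in `text`.
    public static List<int> FindOffsets(string text, string word)
    {
        // \b anchors the match at word boundaries, so "may" does not match
        // inside "mayhem" or "maybe". Regex.Escape keeps any special
        // characters in the search word literal.
        string pattern = @"\b" + Regex.Escape(word) + @"\b";

        var offsets = new List<int>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            offsets.Add(m.Index);
        }
        return offsets;
    }
}
```

For example, `WholeWord.FindOffsets("may mayhem maybe May", "may")` keeps only the first and last words and skips the two longer ones.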

Sun Jan 26, 2014 8:23 am

Hello,

Thanks for your inquiry.
Sorry for the inconvenience. At present, Spire.PDF does not fully support getting word coordinates from a PDF.
We will continue to improve it and will let you know when a version that supports it well is available.

Feel free to write to us again with any further questions.

Best wishes,
Amy
E-iceblue support team

Sun Jan 26, 2014 8:53 pm

So what you're saying is you have the feature implemented, however it is broken, and you simply distribute your product with a completely broken major feature? That's very unfortunate.

Thank you for your response, but unless by "in the future" you mean the next few days, I will be looking elsewhere for this functionality.

Mon Jan 27, 2014 4:12 am

Hello,

Thanks for your reply.
Sorry for the inconvenience again. Our dev team is working on the feature. Would you please provide us with your pdf file?
It will help us to do the research and fix the issues soon. Thank you.

You could also send your file to support@e-iceblue.com.
We promise to keep your document confidential and will not use it for any other purpose. You could also remove any sensitive data from your document before sending it to us.

Best wishes,
Amy
E-iceblue support team

Mon Jan 27, 2014 4:39 am

Here is a document I tested with that exhibits the glitch with the broken Y coordinate: (url)ge. tt/ 4KGGWeG1 /v /0?c(/url)
(remove spaces, sorry for weird url format - I don't have permissions to post urls)
I noticed that the first word it finds has a seemingly normal Y coordinate, however all the rest are -700 and subsequently larger negative numbers (even though that document has only one line).

Mon Jan 27, 2014 8:06 am

Hello,

Thanks for your support.
I just tested your PDF file with Spire.PDF Pack (Hot Fix) version 2.9.13 (http://www.e-iceblue.com/Download/downl ... t-now.html). The coordinates of all words were normal.
Our test code and some screenshots are attached below.

Code:
    PdfDocument document = new PdfDocument();
    document.LoadFromFile(@"..\..\pdftest.pdf");
    PdfTextFindCollection words = document.Pages[0].FindText("nine");
    List<PointF> bounds = new List<PointF>();
    foreach (PdfTextFind find in words.Finds)
    {
        PointF bound = new PointF();
        bound.X = find.Position.X;
        bound.Y = find.Position.Y;
        bounds.Add(bound);
    }

Could you please try version 2.9.13? Please feel free to contact us if you have any problems.

Thank you.
Best wishes,
Amy
E-iceblue support team

Mon Jan 27, 2014 2:08 pm

That's very strange. I am already using the hotfix version: http://pbrd.co/1mNNgVC
Here are the results I'm getting: http://pbrd.co/1mNNWdP
And here: http://pbrd.co/1mNOJLP

Very strange you are not getting the same error with the same file.

Tue Jan 28, 2014 3:54 am

Hello,

Thanks for your feedback.
It may be caused by a different culture setting. Please try the following:

Code:
    // Requires: using System.Globalization; using System.Threading;
    CultureInfo cc = Thread.CurrentThread.CurrentCulture;
    Thread.CurrentThread.CurrentCulture = CultureInfo.InvariantCulture;
    List<PointF> bounds = new List<PointF>();
    try
    {
        PdfDocument document = new PdfDocument();
        document.LoadFromFile(@"..\..\pdftest.pdf");
        PdfTextFindCollection words = document.Pages[0].FindText("two");
        foreach (PdfTextFind find in words.Finds)
        {
            PointF bound = new PointF();
            bound.X = find.Position.X;
            bound.Y = find.Position.Y;
            bounds.Add(bound);
        }
    }
    finally
    {
        // Restore the original culture even if an exception is thrown.
        Thread.CurrentThread.CurrentCulture = cc;
    }
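As background on why the culture could matter at all: .NET number parsing depends on the decimal and group separators of the culture in use. The sketch below illustrates this with a hand-built `NumberFormatInfo` standing in for a comma-decimal culture. `CultureDemo` and its methods are hypothetical names, and whether Spire.PDF actually parses coordinates in a culture-sensitive way is an assumption not confirmed in this thread.

```csharp
using System.Globalization;

static class CultureDemo
{
    // Parses with the invariant culture: "." is the decimal separator.
    public static double ParseInvariant(string number)
    {
        return double.Parse(number, CultureInfo.InvariantCulture);
    }

    // Parses with a format where "," is the decimal separator and "." is the
    // group separator, as in many European cultures.
    public static double ParseCommaDecimal(string number)
    {
        var format = new NumberFormatInfo
        {
            NumberDecimalSeparator = ",",
            NumberGroupSeparator = "."
        };
        return double.Parse(number, format);
    }
}
```

The same string gives different values depending on the format: `ParseInvariant("792.5")` yields 792.5, while `ParseCommaDecimal("792.5")` treats the dot as a group separator and yields 7925. A mismatch of this kind is one way coordinates can end up far outside the page.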

If you have any questions, feel free to get back to us.
Sincerely,
Gary
E-iceblue support team

Tue Jan 28, 2014 4:34 am

I added the thread culture code, it did not help. If threading may pose issues, I should note that the function is running inside a System.ComponentModel.BackgroundWorker.

Thanks for all your support.

EDIT:
I just tried to run it on the main (UI) thread and it has the same issue.

In an attempt to eliminate as many variables as possible I would also like to point out that I am running this in .NET Framework 4.0, targeting the x86 platform.

Tue Jan 28, 2014 6:21 am

Hello,

Thanks for your prompt reply.
What are your operating system and regional settings? For example:

1. Windows 7 Enterprise Edition SP1 x64
2. Regional and Language Options

Thanks,
Gary
E-iceblue support team

Tue Jan 28, 2014 3:11 pm

Windows 7 Professional SP1 x64
Regional/Language is attached

reglang.PNG

If you need any specific Regional or Language setting please let me know.

Wed Jan 29, 2014 9:29 am

Hi rpolyano,

We cannot reproduce this issue. Have you tried converting the page to an image? Please try the code below:

Code:
    PdfDocument doc = new PdfDocument(@"..\..\pdftest.pdf");
    doc.SaveAsImage(0).Save("test.png");

Then check whether it works correctly on your computer.

Regards,

Wed Jan 29, 2014 1:44 pm

I have been using the Document Imaging code for quite some time now, it works great.

Thu Jan 30, 2014 9:12 am

Hi rpolyano,

Thanks for your feedback.
To help us reproduce the word coordinates issue quickly, could you please provide us with your project?
You could also send it to support@e-iceblue.com.
Many thanks.

Best wishes,
Amy
E-iceblue support team

Thu Jan 30, 2014 8:19 pm

I'm afraid at this time I cannot disclose the source code of this project.
If the issue cannot be reproduced easily at your end, then it must be something on my side. I will try looking into the issue further and testing it in various environments, and although I am very grateful for your efforts thus far, I think anything further is a waste of your time. I have found alternative components that provide this functionality, and although they are more expensive, I'm kinda out of options. Thanks again very much for your help thus far.
