Spire.PDF is a professional PDF library applied to creating, writing, editing, handling and reading PDF files without any external dependencies. Get free and professional technical support for Spire.PDF for .NET, Java, Android, C++, Python.

Wed Jan 10, 2018 3:10 pm

Hi,
Is there an easy way of getting all text, with positioning, on a page? Something like page.FindText("") or page.FindText(null) to return a collection of all text with their bounds? Currently this just seems to return a null collection.

As I see it, the only way to accomplish this would be to get all text on a page, eliminate duplicates, then do a FindText for each entry and add the results to a word collection (and, in my case, sort them by Y value so they end up ordered line by line).
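The last step of that workaround, sorting results by Y to recover lines, can be sketched independently of the library. This is only a rough illustration: `Fragment` is a hypothetical stand-in for one search result (it is not a Spire.PDF type), the `tolerance` value is an assumption to tune per document, and Y is assumed to grow down the page (reverse the sort if it does not).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineGrouping
{
    // Hypothetical stand-in for one search result: its text and position.
    public sealed class Fragment
    {
        public string Text { get; }
        public float X { get; }
        public float Y { get; }
        public Fragment(string text, float x, float y) { Text = text; X = x; Y = y; }
    }

    // Sorts fragments by Y, starts a new line whenever Y differs from the
    // current line's first fragment by more than the tolerance, then
    // orders each line left to right by X.
    public static List<List<Fragment>> GroupIntoLines(IEnumerable<Fragment> fragments, float tolerance)
    {
        var lines = new List<List<Fragment>>();
        float lineY = 0f;
        foreach (Fragment frag in fragments.OrderBy(item => item.Y))
        {
            if (lines.Count == 0 || Math.Abs(frag.Y - lineY) > tolerance)
            {
                lines.Add(new List<Fragment>());
                lineY = frag.Y;
            }
            lines[lines.Count - 1].Add(frag);
        }
        return lines.Select(line => line.OrderBy(item => item.X).ToList()).ToList();
    }

    public static void Main()
    {
        var fragments = new[]
        {
            new Fragment("world", 50f, 10.2f),
            new Fragment("Hello", 10f, 10f),
            new Fragment("Next", 10f, 30f),
        };
        foreach (List<Fragment> line in GroupIntoLines(fragments, 2f))
        {
            Console.WriteLine(string.Join(" ", line.Select(frag => frag.Text)));
        }
    }
}
```

A tolerance is used rather than an exact Y comparison because text on the same visual line can sit at slightly different Y values.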

Seems like something that should already be built in since you essentially already do it, then just filter the word passed in, but I can't seem to find any other way.

Thanks,

mfreeze
 
Posts: 9
Joined: Fri Dec 01, 2017 12:45 pm

Thu Jan 11, 2018 7:52 am

Hello,

Thanks for your inquiry. Sorry, there is currently no easy way to get a collection of all text with its position. However, I have added it to our schedule as a new feature. We will inform you as soon as there is any news.

Best regards,
Simon
E-iceblue support team

Simon.yang
 
Posts: 620
Joined: Wed Jan 11, 2017 2:03 am

Mon Jan 15, 2018 1:44 am

Hello,

Glad to inform you that the new function has been implemented. Once it passes testing, we will prepare a hotfix for you.

Best regards,
Simon
E-iceblue support team

Simon.yang
 
Posts: 620
Joined: Wed Jan 11, 2017 2:03 am

Mon Jan 15, 2018 10:41 am

Great!
Thank you for the update Simon!

mfreeze
 
Posts: 9
Joined: Fri Dec 01, 2017 12:45 pm

Wed Jan 17, 2018 7:12 pm

As a temporary workaround, I'm doing as I suggested above.
I'm getting all text on the page, eliminating the duplicates, and doing a FindText for each "word" found. However I'm running into some errors with FindText.

Just doing a quick test, I found errors with 2 "words" (although one is just a symbol):
"+" and "filter"

"filter" throws

VER: 01/17/2018 15:00:18: Getting collection for: 'filter'
ERR: 01/17/2018 15:00:18: error message in getPageCoordinatesData: Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index
ERR: 01/17/2018 15:00:18: stack trace in getPageCoordinatesData: at System.ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument argument, ExceptionResource resource)
at Spire.Pdf.PdfPageBase.ᜀ(PdfTextFindCollection A_0, String A_1)
at Spire.Pdf.PdfPageBase.ᜀ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜁ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜀ(String[] A_0, PdfPageResources A_1, sprᤩ A_2)
at Spire.Pdf.PdfPageBase.ᜀ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜁ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜀ(String[] A_0, PdfPageResources A_1, sprᤩ A_2)
at Spire.Pdf.PdfPageBase.ᜀ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜁ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ExecuteCommandFindText(String searchPatternText)
at Spire.Pdf.PdfPageBase.FindText(String searchPatternText)


And "+" throws
VER: 1/17/2018 2:59:37 PM: Getting collection for: '+'
ERR: 01/17/2018 14:59:37: error message in getPageCoordinatesData: parsing "+" - Quantifier {x,y} following nothing.
ERR: 01/17/2018 14:59:37: stack trace in getPageCoordinatesData: at System.Text.RegularExpressions.RegexParser.ScanRegex()
at System.Text.RegularExpressions.RegexParser.Parse(String re, RegexOptions op)
at System.Text.RegularExpressions.Regex..ctor(String pattern, RegexOptions options, TimeSpan matchTimeout, Boolean useCache)
at System.Text.RegularExpressions.Regex..ctor(String pattern, RegexOptions options)
at Spire.Pdf.PdfPageBase.ᜀ(PdfTextFindCollection A_0, String A_1)
at Spire.Pdf.PdfPageBase.ᜀ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜁ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜀ(String[] A_0, PdfPageResources A_1, sprᤩ A_2)
at Spire.Pdf.PdfPageBase.ᜀ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜁ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜀ(String[] A_0, PdfPageResources A_1, sprᤩ A_2)
at Spire.Pdf.PdfPageBase.ᜀ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ᜁ(String A_0, sprᤩ A_1, sprᩓ A_2)
at Spire.Pdf.PdfPageBase.ExecuteCommandFindText(String searchPatternText)
at Spire.Pdf.PdfPageBase.FindText(String searchPatternText)


Just letting you know in case others are having the same problem. If this isn't something that is easily fixed, it might be nice to have some "words" to avoid listed in the documentation.

mfreeze
 
Posts: 9
Joined: Fri Dec 01, 2017 12:45 pm

Thu Jan 18, 2018 3:39 am

Hello,

Thanks for sharing. I have reproduced the issue when searching for "+" and logged it in our bug tracking system. If there is any update, we will let you know. As for the word "filter", it works fine on my side. I am using Spire.PDF Pack(Hot Fix) Version:3.9.584.

Best regards,
Simon
E-iceblue support team

Simon.yang
 
Posts: 620
Joined: Wed Jan 11, 2017 2:03 am

Fri Jan 19, 2018 10:03 am

Hello,

Glad to inform you that the hotfix is available now. Please download Spire.PDF Pack(Hot Fix) Version:3.9.604 and refer to the code snippet below to use it.
Code:
// i is the zero-based index of the page to read
PdfDocument doc = new PdfDocument(file);
PdfTextFindCollection allTextFind = doc.Pages[i].FindAllText();

Regarding the special symbols ("+", "*" and "?"), which are regex quantifiers, you need to escape them with "\\" when searching for them, for example "\\+".
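Rather than escaping each symbol by hand, the search string can be passed through .NET's `Regex.Escape` first. The sketch below assumes, as the stack trace earlier in this thread suggests, that FindText hands its argument to `System.Text.RegularExpressions.Regex`; the helper name `ToLiteralPattern` is made up for illustration and is not part of Spire.PDF.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchPatterns
{
    // Escapes regex metacharacters so the text is matched literally.
    // Illustrative helper only, not a Spire.PDF API.
    public static string ToLiteralPattern(string text)
    {
        return Regex.Escape(text);
    }

    public static void Main()
    {
        foreach (string word in new[] { "+", "*", "?", "filter" })
        {
            string pattern = ToLiteralPattern(word);
            // new Regex("+") would throw; the escaped pattern parses fine.
            bool matches = new Regex(pattern).IsMatch("a " + word + " b");
            Console.WriteLine(word + " -> " + pattern + " (matches: " + matches + ")");
        }
    }
}
```

The escaped string would then be what gets passed to FindText, for example `page.FindText(SearchPatterns.ToLiteralPattern("+"))`. Ordinary words such as "filter" pass through unchanged.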

Best regards,
Simon
E-iceblue support team

Simon.yang
 
Posts: 620
Joined: Wed Jan 11, 2017 2:03 am

Tue Jan 23, 2018 8:16 am

Hello,

Greetings from E-iceblue.
Has the hotfix resolved your issue?
Your feedback will be greatly appreciated.

Best regards,
Simon
E-iceblue support team

Simon.yang
 
Posts: 620
Joined: Wed Jan 11, 2017 2:03 am

Tue Jan 23, 2018 11:36 am

Hi Simon,

Yes, that seems to be working perfectly. Thank you!

mfreeze
 
Posts: 9
Joined: Fri Dec 01, 2017 12:45 pm

Wed Jan 24, 2018 1:41 am

Hello,

Glad to hear that. If you have any other questions, please write back to us.

Best regards,
Simon
E-iceblue support team

Simon.yang
 
Posts: 620
Joined: Wed Jan 11, 2017 2:03 am

Thu Jul 05, 2018 12:11 pm

Hi Simon,

I just noticed that this seems to return all text, but what's the logic behind how it groups text?

Since we're getting all text, I originally assumed it was going to be each word with its bounds, as we would want each word to be its own object. But currently, it seems to be anything from a single word, to a group of words, to a full sentence.

mfreeze
 
Posts: 9
Joined: Fri Dec 01, 2017 12:45 pm

Fri Jul 06, 2018 7:01 am

Dear mfreeze,

Thanks for your inquiry.
The method returns the text with its bounds according to the TJ elements in the PDF.
I am afraid it is difficult to return each word and its position, since a TJ element does not necessarily contain exactly one complete word. It could hold a few characters, or several words. There is no other good way to achieve your target at present.
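For fragments that happen to contain whole words, a caller could roughly estimate per-word positions outside the library. The sketch below is only an approximation under loud assumptions: every character is treated as equally wide, the fragment's text, X position and width are supplied by the caller, and `WordBox` and `SplitFragment` are hypothetical names rather than Spire.PDF APIs. It does not handle the harder case described above, where a single word is spread across several TJ elements.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitting
{
    // Hypothetical result type: a word with its estimated horizontal extent.
    public sealed class WordBox
    {
        public string Text { get; }
        public float X { get; }
        public float Width { get; }
        public WordBox(string text, float x, float width) { Text = text; X = x; Width = width; }
    }

    // Splits one fragment on whitespace and gives each word a slice of the
    // fragment's width in proportion to its character offsets.
    // Assumes uniform character width, which real fonts rarely have.
    public static List<WordBox> SplitFragment(string text, float x, float width)
    {
        var words = new List<WordBox>();
        if (string.IsNullOrEmpty(text)) return words;
        float charWidth = width / text.Length;
        int i = 0;
        while (i < text.Length)
        {
            if (char.IsWhiteSpace(text[i])) { i++; continue; }
            int start = i;
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;
            words.Add(new WordBox(text.Substring(start, i - start),
                                  x + start * charWidth,
                                  (i - start) * charWidth));
        }
        return words;
    }

    public static void Main()
    {
        foreach (WordBox word in SplitFragment("ab cd", 100f, 50f))
        {
            Console.WriteLine(word.Text + " x=" + word.X + " width=" + word.Width);
        }
    }
}
```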

Sincerely,
Betsy
E-iceblue support team

Betsy.jiang
 
Posts: 3099
Joined: Tue Sep 06, 2016 8:30 am

Fri Jul 06, 2018 12:44 pm

Oh. Well I guess that means Spire isn't going to work as I had hoped.
Please keep me updated if you plan on adding that in the future.

mfreeze
 
Posts: 9
Joined: Fri Dec 01, 2017 12:45 pm

Mon Jul 09, 2018 1:33 am

Dear mfreeze,

Thanks for your feedback.
If there is any good news in the future, we will let you know.

Sincerely,
Betsy
E-iceblue support team

Betsy.jiang
 
Posts: 3099
Joined: Tue Sep 06, 2016 8:30 am

Wed Jul 25, 2018 6:26 am

Dear mfreeze,

After repeated attempts and further investigation, I am afraid the feature you want cannot be achieved at present. Spire.PDF follows the Adobe standard, and under that standard a PDF has no capacity to recognize whether a group of characters forms a word.

Sincerely,
Betsy
E-iceblue support team

Betsy.jiang
 
Posts: 3099
Joined: Tue Sep 06, 2016 8:30 am
