Spire.PDF is a professional PDF library applied to creating, writing, editing, handling and reading PDF files without any external dependencies. Get free and professional technical support for Spire.PDF for .NET, Java, Android, C++, Python.

Wed Sep 02, 2015 7:25 pm

I am trying to use the "Extract Text from a PDF file" feature. For some PDF files the extracted text is incorrect: some characters are duplicated. For example, the sentence "this letter is to inform you that your request" comes out as "this letter is to infform you thatt your requestt". This is a serious problem for me because I am trying to capture account numbers, and the extracted numbers end up with extra duplicated digits in the middle.

I am currently using version 3.5.0.5040. Please let me know if there is any solution for this issue.

Thanks

sherry_xiaoxiao@hotmail.com
 
Posts: 2
Joined: Wed Sep 02, 2015 6:51 pm

Thu Sep 03, 2015 2:05 am

Hello,

Thanks for your inquiry.
Could you please send us a sample PDF document?
It would help us replicate the issue and work out a solution for you ASAP.
If the information is confidential, you can email it to us (Support@e-iceblue.com).

Best Regards,
Sweety
E-iceblue support team

sweety1
 
Posts: 539
Joined: Wed Mar 11, 2015 1:14 am

Thu Sep 03, 2015 5:06 pm

I have the text extracting, but it all comes out as one long text string. Is there a way to keep the paragraph separation that the PDF has?

nmobley
 
Posts: 3
Joined: Wed Sep 02, 2015 4:16 pm

Thu Sep 03, 2015 6:23 pm

Hi there,

Thanks for the reply. I have sent the sample PDF files and the output text files through Support@e-iceblue.com. Hopefully they help to resolve this issue.

Thanks!

sherry_xiaoxiao@hotmail.com
 
Posts: 2
Joined: Wed Sep 02, 2015 6:51 pm

Fri Sep 04, 2015 2:09 am

Hello,

Thanks for sharing the files. I have reproduced your issue and reported it to our Dev team. We will inform you when it is resolved.

Sincerely,
Betsy
E-iceblue support team

Betsy
 
Posts: 802
Joined: Mon Jan 19, 2015 6:14 am

Wed Sep 09, 2015 6:17 am

Hello,

After investigation, we found that your sample document really does contain doubled characters. Only one character of each pair is visible because the other is covered or clipped. Our product does not handle this situation, so the extracted text contains both characters.
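One possible client-side workaround, sketched here in Python, is to drop a character when the same character has already been seen at almost the same position. This is only an illustration and not a Spire.PDF feature: the `Glyph` record, the coordinates, and the `tolerance` value are assumptions, and it presumes the extraction tool in use can report a position for each character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical record for one extracted character and its position."""
    char: str
    x: float  # left edge, in page units
    y: float  # baseline, in page units


def dedupe_overlapping(glyphs, tolerance=1.0):
    """Keep a glyph only if the same character was not already kept
    within `tolerance` units of it on both axes."""
    kept = []
    for g in glyphs:
        duplicate = any(
            k.char == g.char
            and abs(k.x - g.x) <= tolerance
            and abs(k.y - g.y) <= tolerance
            for k in kept
        )
        if not duplicate:
            kept.append(g)
    return kept


def glyphs_to_text(glyphs):
    """Join glyph characters in their given order."""
    return "".join(g.char for g in glyphs)


# "inform" with the "f" drawn twice, 0.3 units apart (made-up coordinates).
sample = [
    Glyph("i", 0.0, 100.0),
    Glyph("n", 5.0, 100.0),
    Glyph("f", 10.0, 100.0),
    Glyph("f", 10.3, 100.0),
    Glyph("o", 15.0, 100.0),
    Glyph("r", 20.0, 100.0),
    Glyph("m", 25.0, 100.0),
]

print(glyphs_to_text(sample))
print(glyphs_to_text(dedupe_overlapping(sample)))
```

The tolerance has to stay well below the normal spacing between characters. Otherwise genuinely repeated letters or digits, such as the "tt" in "letter" or a repeated digit in an account number, would be removed as well.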

Sincerely,
Betsy
E-iceblue support team

Betsy
 
Posts: 802
Joined: Mon Jan 19, 2015 6:14 am

Thu Jan 05, 2017 11:46 am

The ExtractText function seems to have a problem: it returns all the text on a page, including hidden and invisible text. What's worse, it merges all of these kinds of text together by position, from top to bottom. Is there any solution?

MakeSense
 
Posts: 7
Joined: Thu Jan 05, 2017 4:14 am

Fri Jan 06, 2017 4:03 am

Dear MakeSense,

Thanks for your inquiry.
Sorry, at present Spire.PDF extracts invisible and hidden text when the ExtractText method is used.
Could you please provide a sample file that includes that sort of text? We will investigate whether we can add a new feature.

Thanks,
Betsy
E-iceblue support team

Betsy.jiang
 
Posts: 3099
Joined: Tue Sep 06, 2016 8:30 am

Wed Jan 18, 2017 8:36 am

Well, I used the free version. Now I have another issue with the ExtractText method. This PDF file has fixed vertical splits and non-fixed horizontal splits, which in most cases cut a page into 4 parts. I don't know much about PDF, so I don't know what these splits are. I just want to get all the text in this file in the right order, but ExtractText reads the text line by line and ignores the splits, and I don't know how to deal with that. Can Spire.PDF solve this?

Thanks,
Kevin

MakeSense
 
Posts: 7
Joined: Thu Jan 05, 2017 4:14 am

Wed Jan 18, 2017 9:09 am

Dear Kevin,

Thanks for your feedback.
I have tested the file with the latest Spire.PDF Pack (Hot Fix) Version 3.8.158 but could not find the issue you mentioned. Please try this version.

Sincerely,
Betsy
E-iceblue support team

Betsy.jiang
 
Posts: 3099
Joined: Tue Sep 06, 2016 8:30 am

Wed Jan 18, 2017 9:40 am

I have tried the new version and the issue still exists. I think you didn't understand what I meant. This PDF (Capture.PNG) should be read in this order: part① -> part② -> part③ -> part④ (red marks; in fact part② and part③ logically belong together), not line by line (like the blue marks, or like the Capture2.PNG attachment).

Capture.PNG

Capture2.PNG
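
To make the difference between the two orders concrete, here is a minimal Python sketch of region-by-region reading. Everything in it is an assumption made for illustration: the `Fragment` and `Region` records, the coordinates, and the four-rectangle layout are made up and are not taken from the attached file, and the sketch presumes the text fragments are available with their positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical piece of extracted text with its position."""
    text: str
    x: float
    y: float  # assumed to grow downward from the top of the page


@dataclass(frozen=True)
class Region:
    """Hypothetical rectangle describing one part of the page."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, f):
        return self.left <= f.x < self.right and self.top <= f.y < self.bottom


def read_line_by_line(fragments):
    """Order fragments by row, then by column, ignoring any splits."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def read_by_region(fragments, regions):
    """Read each region completely, in the given order, before the next."""
    ordered = []
    for region in regions:
        inside = sorted(
            (f for f in fragments if region.contains(f)),
            key=lambda f: (f.y, f.x),
        )
        ordered.extend(f.text for f in inside)
    return ordered


# A 200 x 200 page cut into four parts (made-up layout).
regions = [
    Region(0, 0, 100, 100),      # part 1
    Region(100, 0, 200, 100),    # part 2
    Region(0, 100, 100, 200),    # part 3
    Region(100, 100, 200, 200),  # part 4
]
fragments = [
    Fragment("A1", 10, 10), Fragment("B1", 110, 10),
    Fragment("A2", 10, 30), Fragment("B2", 110, 30),
    Fragment("C1", 10, 110), Fragment("D1", 110, 110),
]

print(read_line_by_line(fragments))
print(read_by_region(fragments, regions))
```

Line-by-line reading interleaves parts 1 and 2, while region-by-region reading finishes part 1 before starting part 2. The hard step in practice is finding the region rectangles, which this sketch simply takes as given.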

MakeSense
 
Posts: 7
Joined: Thu Jan 05, 2017 4:14 am

Wed Jan 18, 2017 10:12 am

Betsy.jiang wrote:Dear MakeSense,

Thanks for your inquiry.
Sorry, at present Spire.PDF extracts invisible and hidden text when the ExtractText method is used.
Could you please provide a sample file that includes that sort of text? We will investigate whether we can add a new feature.

Thanks,
Betsy
E-iceblue support team


Do you mean this overwriting method? It doesn't work (see the test PDF file): it only removes the white space and does not deal with the invisible text.
Overwrite.PNG

MakeSense
 
Posts: 7
Joined: Thu Jan 05, 2017 4:14 am

Thu Jan 19, 2017 3:11 am

Dear MakeSense,

Thanks for your detailed information.
Sorry that I misunderstood you before. Regarding the order of the text, our product extracts text according to the order of the document flow, and that is the order in the document you provided. Sorry, there is no way to change that order.
Regarding the invisible text issue, the overwriting method only preserves white space. We found that some text in the file (egn201620507123.pdf) is not displayed for some reason, but it is not invisible text (the textrendermode value would be 3 for invisible text). This sort of text in your file will be extracted. Could you please provide another sample file so that we can check the invisible text?
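
The distinction can be illustrated with a small Python sketch. In the PDF format, text rendering mode 3 means the text is neither filled nor stroked, so it is never painted. The `TextRun` record and the sample data below are assumptions made for illustration and are not part of the Spire.PDF API.

```python
from dataclasses import dataclass

INVISIBLE = 3  # PDF text rendering mode 3: neither fill nor stroke


@dataclass(frozen=True)
class TextRun:
    """Hypothetical run of text with the rendering mode it was drawn in."""
    text: str
    render_mode: int


def visible_text(runs):
    """Join all runs except those drawn in rendering mode 3."""
    return " ".join(r.text for r in runs if r.render_mode != INVISIBLE)


runs = [
    TextRun("Invoice", 0),       # normal filled text
    TextRun("hidden-layer", 3),  # invisible by rendering mode
    TextRun("covered", 0),       # painted normally, then hidden by something drawn on top
]

print(visible_text(runs))
```

A filter like this removes only the mode 3 run. The "covered" run is drawn in a normal mode, so a check on the rendering mode alone cannot tell that it is not visible on the page.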

Thanks,
Betsy
E-iceblue support team

Betsy.jiang
 
Posts: 3099
Joined: Tue Sep 06, 2016 8:30 am

Thu Jan 19, 2017 4:44 am

Dear Betsy,

First, I want to say that I really don't know much about PDF, so some of what I say may be wrong or unprofessional. If so, I am sorry.

As you can see in the first PDF, there are fixed vertical splits and non-fixed horizontal splits, and they cannot be selected. These splits must be some technique within the scope of PDF. So I think there must be a way to reverse-engineer the file, get rid of these splits, and make the text display in the usual way; then I could get the text in the right order. So first we should know what they are. I didn't find much useful information with a basic Internet search... Do you know what these splits are?

The invisible text (egn201620507123.pdf) may not be the usual kind of invisible text (where the text render mode value is 3); I couldn't see it, so I called it invisible text. I deal with it by looping over each page and filtering to get the information I want. Here is another example, the Chinese version of this file. I hope it will help improve Spire.PDF to handle this rare case.

Best Regards,
Kevin

MakeSense
 
Posts: 7
Joined: Thu Jan 05, 2017 4:14 am

Thu Jan 19, 2017 7:43 am

Dear Kevin,

Thanks for your information.
The file you just provided is the same as the file (egn201620507123.pdf). Sorry, I am afraid there is no way to achieve your two goals.

Sincerely,
Betsy
E-iceblue support team

Betsy.jiang
 
Posts: 3099
Joined: Tue Sep 06, 2016 8:30 am
