Specifying the encoding for ExtractText

Mon Nov 19, 2018 8:23 am

I have an issue extracting text from a PDF that was scanned and OCR'd in Greece. When I call ExtractText, the returned string includes a bunch of undisplayable characters, meaning that although I can cut and paste the contents from Adobe Reader, I can't get at them from within my software.

Is there a way to specify the codepage or encoding of the string returned by ExtractText, or do you have any alternative advice?

Many thanks in advance,

Darren.

Mon Nov 19, 2018 9:02 am

Hi Darren Wray,

Thank you for your message.
Spire.PDF uses Unicode as the default encoding. At present, it does not support specifying a different encoding.
In any case, could you please share your sample PDF document so that we can look into it?
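
For readers who need a specific codepage: since the extracted string is Unicode, one option is to convert it in your own code after extraction. The sketch below is in Python and is not Spire.PDF code. The sample string, the helper name `to_codepage`, and the choice of Windows-1253 (a common Greek codepage) are all assumptions for illustration. It only helps when the extracted string is already correct Unicode; it cannot repair text that was mapped to the wrong characters during extraction.

```python
# Assumption: `extracted` stands in for the string returned by your PDF library.
extracted = "Καλημέρα κόσμε"


def to_codepage(text, codepage="cp1253"):
    """Encode a Unicode string into a legacy codepage.

    Characters that the codepage cannot represent are replaced with '?'
    instead of raising an error.
    """
    return text.encode(codepage, errors="replace")


# Hypothetical usage: produce Windows-1253 bytes for a downstream system.
greek_bytes = to_codepage(extracted)
```

If many characters come out as `?`, the problem is probably in the extraction step rather than in the target codepage.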

Sincerely,
Jane
E-iceblue support team

Mon Nov 19, 2018 2:06 pm

An example page is attached. I can cut and paste from this document in both Adobe Reader (Windows and Mac) and the Mac Preview app, but as shown in the screenshot in Example.zip, the values returned by ExtractText are unintelligible.

Thanks in advance,

Darren.

Tue Nov 20, 2018 1:50 am

Dear Darren,

Thank you for providing the sample.
When I copy the content directly in Adobe or extract the text in Adobe, the content is still incorrect. Note the spaces between the characters.

Adobe.jpg

Sincerely,
Jane
E-iceblue support team

Tue Nov 20, 2018 7:01 am

Actually, the text in your screenshot is pretty accurate. Please see the attached image showing your screenshot and the PDF side by side.

Side by Side.png

OCR quality aside, is there a way to get the text from this PDF without what appear to be codepage errors?

I'm not sure if it is related, but when I look at the fonts used in the PDF, the encoding is set to custom. This seems to be a common trait among the PDFs that I can't process.
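
A quick way to flag affected documents is to scan the extracted string for characters that rarely appear in real text. The Python sketch below is not part of Spire.PDF, and the helper names are hypothetical. It assumes that a failed glyph-to-Unicode mapping tends to surface as private-use code points, unassigned code points, stray control characters, or U+FFFD, which is common with custom font encodings but not guaranteed.

```python
import unicodedata

# Hypothetical helpers for triaging extracted text; not a Spire.PDF API.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def suspicious_chars(text):
    """Return the characters that usually signal a failed Unicode mapping."""
    found = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch == "\ufffd":
            # U+FFFD REPLACEMENT CHARACTER
            found.append(ch)
        elif category in ("Co", "Cn"):
            # Private-use or unassigned code points
            found.append(ch)
        elif category == "Cc" and ch not in ALLOWED_CONTROLS:
            # Control characters other than ordinary whitespace
            found.append(ch)
    return found


def suspicious_ratio(text):
    """Fraction of characters in `text` that look like mapping failures."""
    if not text:
        return 0.0
    return len(suspicious_chars(text)) / len(text)
```

A ratio near zero suggests the text is usable as-is, while a high ratio points to a document worth reporting with a sample file.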

Any and all help appreciated,

Darren.

Tue Nov 20, 2018 8:13 am

Dear Darren,

After further investigation, we found that you are right.
I have logged the issue in our bug tracking system with a high priority and our dev team is now investigating the issue.
Once there's an update, I will inform you.
Sorry for the inconvenience caused.

Sincerely,
Jane
E-iceblue support team

Tue Nov 20, 2018 10:08 am

Thanks, Jane, that is perfect.

Wed Nov 28, 2018 5:49 am

Dear Darren,

Thanks for your patience.
I am glad to inform you that your issue has been resolved. Please download the hotfix from the following link.
https://www.e-iceblue.com/downloads/Tem ... 3.11.6.zip

Sincerely,
Jane
E-iceblue support team

Sat Dec 01, 2018 11:39 pm

Thank you - I'll download and implement.

Mon Dec 03, 2018 2:11 am

Hi Darren,

Thank you for your response.
I look forward to your feedback.

Sincerely,
Jane
E-iceblue support team

Fri Dec 07, 2018 8:38 am

Hi Darren,

Greetings from e-iceblue!
Have you tried the hotfix?
Your feedback would be greatly appreciated!

Sincerely,
Jane
E-iceblue support team

Wed Apr 10, 2019 3:11 pm

I have the same problem. Could you resend the version that you uploaded before?

Thu Apr 11, 2019 1:44 am

Hello Adch,

Thank you for contacting us.
Please download and test our latest Spire.PDF Pack (Hot Fix) Version 5.4.1, which includes all the fixes and new features. If the problem still occurs, please provide your input file, your full test code and your output file so that we can look into it further. You can send them to us via email (support@e-iceblue.com).

Sincerely,
Lisa
E-iceblue support team

Wed Apr 17, 2019 7:02 am

Hello Adch,

Greetings from E-iceblue.
Did the latest version work for you? Thanks in advance for your feedback and time.

Sincerely,
Lisa
E-iceblue support team

Fri Apr 19, 2019 7:17 pm

Hi there, I may have posted in the wrong place.

I am using Spire for a Java app, so I need the JAR file. Could you help with that?

Thanks in advance
