
Thu Nov 09, 2017 3:51 pm

Hi

I am currently evaluating libraries to extract text from PDF files.
With Spire.PDF I have the following issue: when a file appears rotated by 90 degrees in a PDF viewer, the text is not extracted in the right order. The text is basically extracted as if the page were oriented correctly. Something like this (imagine all letters rotated by 90° to the right):

Code: Select all
P   S   T
a   o   i
g   m   t
e   e   l
        e
1   T
    e
    x
    t


What i'd expect is the following extracted text:
Title
Some Text
Page 1

But what i get is:
Page Some Title
1 Text


Is there a way to solve this with Spire.PDF? Another library i'm evaluating extracts the text from this file just fine (but has serious issues with other files which in turn Spire.PDF handles perfectly).

I'm sorry i can't provide the PDF in question since we received it from a customer of ours and it contains confidential information.
Last edited by Naryoril on Fri Nov 10, 2017 7:58 am, edited 1 time in total.

Naryoril
 
Posts: 6
Joined: Wed Sep 20, 2017 10:18 am

Fri Nov 10, 2017 3:29 am

Hello,

Thanks for your inquiry.
I have tested some cases on my side with the latest commercial version (Spire.PDF Pack (Hot Fix) Version: 3.9.431), but I could not reproduce the issue. Please try this version; you may also try the other overloaded method shown below. If this doesn't resolve your issue, I am afraid you will have to provide us with a document that demonstrates it so we can investigate further, as every document has a different structure.
Code: Select all
   
// Summary:
// Extracts text from the given PDF Page by SimpleTextExtractionStrategy.
// Returns:
//The Extracted Text.
public string ExtractText(SimpleTextExtractionStrategy strategy);


Sincerely,
Gary
E-iceblue support team

Gary.zhang
 
Posts: 1380
Joined: Thu Apr 04, 2013 1:30 am

Mon Nov 13, 2017 12:55 pm

I used Spire.PDF to save the last page from one of the documents in question as a separate file (thus the evaluation warning) and then used another tool (Foxit Phantom trial) to delete the sensitive information so i now have a PDF file i can provide you.

I attached the zip that includes the PDF, the results with ExtractText without an argument, the results with ExtractText with a SimpleTextExtractionStrategy from the edited and from the original file and the result i get from another library.
Rotated PDF.zip


The result from the other library is what i would expect to get.
In the edited PDF ExtractText with the SimpleTextExtractionStrategy results in an empty file, apart from the evaluation warning. On the original file it isn't empty, but the result is pretty much unusable. I also added a file where i extracted the text from the original file and deleted everything that isn't in the edited PDF file. All other conversions i tried look the same, no matter whether they are done on the edited or the original file.

I hope this helps you to reproduce the issue.

Naryoril
 

Tue Nov 14, 2017 3:45 am

Hello,

Thanks for the information. I have noticed the issue and posted it to our Dev team. Once it is resolved or we have any other update, we will let you know ASAP.

Sincerely,
Gary
E-iceblue support team

Gary.zhang
 

Tue Nov 21, 2017 8:30 am

Is there any new information on this issue? Because i encountered something that seems to be the same problem in another set of documents. Is it likely that you can/will fix it? This issue is the only thing preventing us from buying a Site Enterprise Subscription.

Naryoril
 

Tue Nov 21, 2017 10:02 am

Hello,

The issue has been resolved, but the hotfix is still in the testing phase. Once it is available, we will let you know ASAP.

Sincerely,
Gary
E-iceblue support team

Gary.zhang
 

Tue Nov 21, 2017 10:25 am

Amazing, thank you very much.

Naryoril
 

Thu Nov 30, 2017 3:10 am

Hello,

Thanks for waiting. After repeated testing, we found that our product uses an absolute positioning method and cannot detect whether the text has been rotated, as it is in the PDF document you provided.
As a workaround, rotate the page first and then extract the text. Here is the code for your reference.
Code: Select all
using System.IO;
using System.Text;
using Spire.Pdf;

PdfDocument doc = new PdfDocument("rotated PDF.pdf");
StringBuilder sb = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
    // Rotate the page so that the text is upright before extracting it.
    page.Rotation = PdfPageRotateAngle.RotateAngle270;
    sb.Append(page.ExtractText());
}
// Write the extracted text of all pages to a single file.
File.WriteAllText("1295.txt", sb.ToString());
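
To illustrate why rotating first changes the result, here is a minimal, self-contained sketch of position-based ordering. It is not Spire.PDF's implementation: the `Word` record, the `ReadingOrderSketch` class and the coordinates are hypothetical, chosen only to mirror the "Title / Some Text / Page 1" example from the first post (y grows downward, as on screen).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of a positioned text fragment; not a Spire.PDF type.
public record Word(string Text, int X, int Y);

public static class ReadingOrderSketch
{
    // Assumed coordinates for the example in the first post:
    // three columns ("Page 1", "Some Text", "Title") that each read top to bottom.
    public static readonly Word[] Sample =
    {
        new Word("Page", 0, 0), new Word("1", 0, 5),
        new Word("Some", 4, 0), new Word("Text", 4, 5),
        new Word("Title", 8, 0),
    };

    // Purely position-based ordering: rows top to bottom, then left to right.
    public static string Naive(IEnumerable<Word> words) =>
        string.Join("\n", words
            .GroupBy(w => w.Y)
            .OrderBy(g => g.Key)
            .Select(g => string.Join(" ", g.OrderBy(w => w.X).Select(w => w.Text))));

    // Rotate every position by 90 degrees counter-clockwise (270 degrees clockwise)
    // and then apply the same ordering: the rightmost column becomes the top row.
    public static string AfterRotation(IEnumerable<Word> words) =>
        Naive(words.Select(w => new Word(w.Text, w.Y, -w.X)));
}
```

With these assumed coordinates, `ReadingOrderSketch.Naive(ReadingOrderSketch.Sample)` gives "Page Some Title" followed by "1 Text", the order reported in the first post, while `ReadingOrderSketch.AfterRotation(ReadingOrderSketch.Sample)` gives "Title", "Some Text", "Page 1".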

Sincerely,
Gary
E-iceblue support team

Gary.zhang
 

Thu Nov 30, 2017 1:51 pm

Hi

Thank you very much. I had already tried this before, but it didn't change anything. With the new hotfix, though, it has an effect and solved the issue for me. I have found a different issue now, but I'll make a separate thread for that.

Naryoril
 

Fri Dec 01, 2017 9:07 am

Hello,

Thanks for your feedback.
Please feel free to contact us if you need any assistance.

Sincerely,
Jane
E-iceblue support team

Jane.Bai
 
Posts: 1156
Joined: Tue Nov 29, 2016 1:47 am
