Spire.PDF is a professional PDF library applied to creating, writing, editing, handling and reading PDF files without any external dependencies. Get free and professional technical support for Spire.PDF for .NET, Java, Android, C++, Python.

Wed Aug 24, 2022 12:05 pm

Converting a PDF to Word produces images rather than text. The source PDF originally consisted of images, but it has been OCR'd and now clearly contains text: I can copy the text out of the PDF manually, so the PDF treats it as text rather than images. Yet when I convert it to Word using Spire.PDF, the resulting DOCX contains images rather than text.

Any thoughts, please? I am new to Spire.PDF and only purchased a Developer OEM Subscription in the last few weeks. It works fine for all other files, just not these ones.

I have attached an example PDF and the resulting DOCX file.

Thanks

darren.rose@pcassistonline.co.uk
 
Posts: 7
Joined: Thu Jun 13, 2019 7:53 pm

Thu Aug 25, 2022 5:52 am

Hi,

Thank you for your inquiry.
I reproduced the behavior you described and logged it in our bug tracking system as ticket SPIREPDF-5448. Our dev team will investigate this issue, and I will let you know as soon as there is any news. We apologize for the inconvenience.

Sincerely,
Kylie
E-iceblue support team

kylie.tian
 
Posts: 412
Joined: Mon Mar 07, 2022 2:30 am

Fri Sep 09, 2022 1:13 pm

Any update on this issue?

darren.rose@pcassistonline.co.uk
 
Posts: 7
Joined: Thu Jun 13, 2019 7:53 pm

Tue Sep 13, 2022 10:46 am

Hi,

Sorry for the late reply due to the weekend.
We parsed your PDF document further and found that it is composed of multiple images. The text you see is actually part of those images, and the images are overlaid with invisible text, which is what misled us at first. Spire.PDF performs the conversion according to the actual visible content of the PDF, hence you got images rather than text. This behavior is caused by the PDF document itself. We hope you can understand.

Sincerely,
Kylie
E-iceblue support team

kylie.tian
 
Posts: 412
Joined: Mon Mar 07, 2022 2:30 am

Tue Sep 13, 2022 11:31 am

I am confused. Yes, the document originally consisted of images, but it has been OCR'd, so the images have been converted to text. It is clearly text, as I can extract it from the PDF using the Spire text extraction methods (e.g. PdfTextExtractor.ExtractText) or copy/paste it from the PDF using any PDF viewer.

So if your PdfTextExtractor can see the text then why can't your Word / Excel conversion see the "invisible" text which is clearly visible to everything?

This is very disappointing after spending just over £2000

darren.rose@pcassistonline.co.uk
 
Posts: 7
Joined: Thu Jun 13, 2019 7:53 pm

Wed Sep 14, 2022 10:18 am

Hi,

Thanks for your feedback.
As I explained before, the text you see is actually part of the images, and these images are overlaid with invisible text. It is not really plain text in the PDF document. Please note that PdfTextExtractor and the PDF to Word conversion work differently internally: the extraction method takes invisible text into account, but the conversion method does not. If we forced the invisible text to be shown when converting PDF to Word, the result could be messy and would not actually conform to the specification. In addition, even if you convert your PDF to Word using Adobe, the result is also pictures; you can verify this yourself. We hope you can understand.

Sincerely,
Kylie
E-iceblue support team

kylie.tian
 
Posts: 412
Joined: Mon Mar 07, 2022 2:30 am

Wed Sep 14, 2022 12:00 pm

Okay, understood. Is there any way, using Spire.PDF, to analyze the PDF before attempting conversion and determine that it contains just images and no text (other than the invisible text)? Then I could at least show a warning to my users saying the converted document may only contain images because the source PDF doesn't contain any text.

darren.rose@pcassistonline.co.uk
 
Posts: 7
Joined: Thu Jun 13, 2019 7:53 pm

Thu Sep 15, 2022 11:04 am

Hi,

Thank you for your reply.
Spire.PDF provides a PdfTextExtractOptions object that controls whether invisible text is extracted; please refer to the code below. However, I tested with your document and found that invisible text is still extracted even when IsShowHiddenText is set to false. I logged this in our bug tracking system as ticket SPIREPDF-5479. Our dev team will investigate this issue, and I will let you know as soon as there is any news. We apologize for the inconvenience.
Code:

            // 'page' is a PdfPageBase taken from a loaded PdfDocument
            PdfTextExtractOptions options = new PdfTextExtractOptions();
            // Do not include hidden (invisible) text in the result
            options.IsShowHiddenText = false;
            // Extract the text
            string str = page.ExtractText(options);


Sincerely,
Kylie
E-iceblue support team

kylie.tian
 
Posts: 412
Joined: Mon Mar 07, 2022 2:30 am

Thu Sep 15, 2022 11:27 am

I think you have missed my point completely. I will try explaining it again from scratch.

My original issue was that converting a PDF (which had been OCR'd) to Word produced an image rather than text, despite me being able to copy/paste text from the PDF and PdfTextExtractor being able to extract the text. Your explanation, that this is "invisible text" which is handled differently during a conversion to Word (or Excel) and that this is why the converted document is an image rather than text, makes sense to me and I accept it. (It is a shame, though, and I can't understand why you can't at least offer an option to convert "invisible text": if PdfTextExtractor can do it, then I can't see why "doc.SaveToFile("ToDocx2.docx", FileFormat.DOCX);" couldn't as well, even if the option is false by default. But that is a moot point.)

My further question was this:-

I have users of my product who tried converting some PDF files (scanned documents that had been OCR'd) to Word, and they couldn't understand why the results came out as images. You and I now understand why, as explained above, but a general user doesn't. Just by looking at a PDF, a user can't always tell whether it contains only an image (plus "invisible text") or real text, i.e. whether the conversion will produce a perfectly editable Word document with text or a useless Word document with just an image.

So, for the conversion of a PDF to Word (or Excel), is there any way in my code to analyze the PDF first and determine whether it contains only images and no text (or only "invisible text")? Then I could at least display a prompt telling my users that the PDF they are trying to convert contains only images, so the converted Word/Excel file will contain only images. That way they will be aware of it and not just think my product has not worked correctly.

Hope this now makes sense?

darren.rose@pcassistonline.co.uk
 
Posts: 7
Joined: Thu Jun 13, 2019 7:53 pm

Fri Sep 16, 2022 11:17 am

Hi,

Thank you for your detailed explanation.
Sorry, Spire.PDF doesn't provide a method to detect whether a document contains only invisible text. However, the PdfTextExtractOptions object can restrict extraction to visible text when IsShowHiddenText is set to false. If the visible text is empty and the PDF contains images, the visible content of the PDF consists only of images. Please refer to the sample code below. As I said before, though, I tested your document and found an issue with getting only the visible text. Our dev team is investigating it, and I'll let you know as soon as there are any meaningful updates.
Code:
            // Load the PDF document
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile("test1.pdf");
            List<Image> ListImage = new List<Image>();
            StringBuilder sb = new StringBuilder();
            foreach (PdfPageBase page in doc.Pages)
            {
                // Collect all images on this page
                Image[] images = page.ExtractImages();
                if (images != null && images.Length > 0)
                {
                    ListImage.AddRange(images);
                }
                PdfTextExtractOptions options = new PdfTextExtractOptions();
                // Do not include hidden (invisible) text in the result
                options.IsShowHiddenText = false;
                // Append the visible text of this page
                sb.Append(page.ExtractText(options).Trim());
            }
            // Images exist and the visible text is empty
            if (ListImage.Count > 0 && sb.Length == 0)
            {
                // e.g. warn the user that the converted file will contain only images
            }
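
The final check in the sample above can also be made per page, so that a warning can name the affected pages instead of the whole document. The following is only a sketch and not part of Spire.PDF: `ImageOnlyCheck` and `FindImageOnlyPages` are hypothetical names, and it assumes that for each page you have already collected the number of images returned by `ExtractImages()` and the visible text returned by `ExtractText(options)` with `IsShowHiddenText = false`. While SPIREPDF-5479 is unresolved, that visible text may still include hidden text, so the result is only as reliable as the extraction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Spire.PDF.
public static class ImageOnlyCheck
{
    // Returns the 1-based numbers of pages that contain at least one image
    // but no visible text. The inputs are assumed to come from
    // ExtractImages() and ExtractText(options) as in the sample above.
    public static List<int> FindImageOnlyPages(
        IReadOnlyList<(int ImageCount, string VisibleText)> pages)
    {
        var result = new List<int>();
        for (int i = 0; i < pages.Count; i++)
        {
            bool hasImages = pages[i].ImageCount > 0;
            // Null, empty, or whitespace-only text counts as "no visible text".
            bool hasVisibleText = !string.IsNullOrWhiteSpace(pages[i].VisibleText);
            if (hasImages && !hasVisibleText)
            {
                result.Add(i + 1);
            }
        }
        return result;
    }
}
```

Calling `FindImageOnlyPages` with one entry per page returns the page numbers to mention in the warning; an empty list means that no page is image-only.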


Sincerely
Kylie
E-iceblue support team

kylie.tian
 
Posts: 412
Joined: Mon Mar 07, 2022 2:30 am

Fri Oct 28, 2022 4:26 pm

Any update on SPIREPDF-5479? It has been quite a while now.

darren.rose@pcassistonline.co.uk
 
Posts: 7
Joined: Thu Jun 13, 2019 7:53 pm

Mon Oct 31, 2022 9:32 am

Hi,

Our dev team has implemented a fix for SPIREPDF-5479, and it is now entering the testing stage. If testing goes well, we will provide a hotfix as soon as possible. Thank you for your understanding.

Sincerely,
Kylie
E-iceblue support team

kylie.tian
 
Posts: 412
Joined: Mon Mar 07, 2022 2:30 am

Mon Nov 28, 2022 9:45 pm

Another month has now passed since I last asked for an update......

darren.rose@pcassistonline.co.uk
 
Posts: 7
Joined: Thu Jun 13, 2019 7:53 pm

Tue Nov 29, 2022 1:41 am

Hello,

Thanks for following up.
According to our development team, the fix for SPIREPDF-5479 affected other functions, and they are now resolving those side effects. Please allow us more time to solve these issues. In addition, I have given this issue the highest priority. Once there are any updates, I'll inform you promptly. Sorry for the inconvenience caused.

Sincerely
Abel
E-iceblue support team

Abel.He
 
Posts: 860
Joined: Tue Mar 08, 2022 2:02 am

Thu Dec 15, 2022 8:21 am

Hello,

Thanks for your patience!
We are glad to inform you that we have just released Spire.PDF 8.12.5, which fixes issue SPIREPDF-5479.
Please download the new version from the following links to test.
Website download link: https://www.e-iceblue.com/Download/down ... t-now.html
Nuget download link: https://www.nuget.org/packages/Spire.Pdf/8.12.5

Sincerely
Abel
E-iceblue support team

Abel.He
 
Posts: 860
Joined: Tue Mar 08, 2022 2:02 am
