Spire.Doc is a professional Word .NET library specifically designed for developers to create, read, write, convert and print Word document files. Get free and professional technical support for Spire.Doc for .NET, Java, Android, C++, Python.

Fri Mar 12, 2021 1:29 pm

Hello,
My .doc file with Cyrillic text is not converted to PDF properly: all the words in a paragraph are joined into one.
Source:
Image
Result:
Image
Code example
Code:
import java.io.ByteArrayOutputStream

import com.spire.doc.{ Document, FileFormat }
import com.spire.pdf.{ PdfDocument, PdfPageBase }

import scala.jdk.CollectionConverters._

object SpireDocToPDFExample extends App {

  val pathToDoc = "/home/pavel/template.doc"

  def test(): Unit = {
    // Load the .doc template and append one paragraph programmatically
    val document = new Document(pathToDoc)
    document.getLastSection.addParagraph().appendText("четыре пять шесть")
    println(document.getText)
    // Convert the document to PDF in memory
    val baos = new ByteArrayOutputStream()
    document.saveToStream(baos, FileFormat.PDF)
    // Reopen the PDF bytes with Spire.PDF and print the text extracted from each page
    val doc      = new PdfDocument(baos.toByteArray)
    val headPage = doc.getPages.iterator().asScala.collect { case p: PdfPageBase => p.extractText() }.toList
    headPage.foreach(println)
  }

  test()
}


The programmatically added paragraph is converted correctly.
Thanks
spire.doc.free 3.9.0
spire.pdf.free 3.9.0

truthSerruf
 
Posts: 13
Joined: Mon Feb 01, 2021 7:08 am

Fri Mar 12, 2021 7:22 pm

I have found a workaround: copy the text into a new document.
Code:
import java.io.ByteArrayOutputStream

import com.spire.doc.{ Document, FileFormat }
import com.spire.pdf.{ PdfDocument, PdfPageBase }

import scala.jdk.CollectionConverters._

object SpireDocToPDFExample extends App {

  val pathToDoc = "/home/pavel/template.doc"

  def test(): Unit = {
    val document = new Document(pathToDoc)
    // Workaround: copy the plain text of the original into a fresh document
    val newDocument = new Document()
    newDocument.addSection().addParagraph().appendText(document.getText)
    // Convert the new document to PDF in memory
    val baos = new ByteArrayOutputStream()
    newDocument.saveToStream(baos, FileFormat.PDF)
    // Reopen the PDF bytes with Spire.PDF and print the text extracted from each page
    val doc      = new PdfDocument(baos.toByteArray)
    val headPage = doc.getPages.iterator().asScala.collect { case p: PdfPageBase => p.extractText() }.toList
    headPage.foreach(println)
  }

  test()
}

Result:
Image
But this is not convenient.
Any help would be much appreciated.

truthSerruf
 
Posts: 13
Joined: Mon Feb 01, 2021 7:08 am

Sun Mar 14, 2021 6:29 pm

Another observation: spaces are randomly deleted in a paragraph with justified horizontal alignment.
Code:
import java.io.ByteArrayOutputStream

import com.spire.doc.documents.HorizontalAlignment
import com.spire.doc.{Document, FileFormat}
import com.spire.pdf.{PdfDocument, PdfPageBase}

import scala.jdk.CollectionConverters._

object JustifyParagraphToPDF extends App {

  // Build a one-paragraph document with justified alignment
  val document = new Document()
  val paragraph = document.addSection().addParagraph()
  paragraph.appendText("Общество с ограниченной ответственностью «Сименс» (ООО «Сименс»), именуемое в дальнейшем «Поставщик»")
  paragraph.getFormat.setHorizontalAlignment(HorizontalAlignment.Justify)

  // Convert to PDF in memory, then print the text extracted from each page
  val baos = new ByteArrayOutputStream()
  document.saveToStream(baos, FileFormat.PDF)
  val doc      = new PdfDocument(baos.toByteArray)
  val headPage = doc.getPages.iterator().asScala.collect { case p: PdfPageBase => p.extractText() }.toList
  headPage.foreach(println)
}

If HorizontalAlignment is set to Left, everything is fine.
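
To check for this kind of space loss without comparing the output by eye, the source text can be compared with the extracted text. Below is a minimal sketch in plain Scala. The names `SpaceLossCheck`, `sameIgnoringWhitespace` and `missingWords` are hypothetical helpers made up for illustration; they are not part of Spire.Doc or Spire.PDF.

```scala
object SpaceLossCheck {

  // Remove every run of (ASCII) whitespace from a string.
  private def stripWhitespace(s: String): String = s.replaceAll("\\s+", "")

  // Split a string into its whitespace-separated words, dropping empty pieces.
  private def words(s: String): List[String] =
    s.split("\\s+").filter(_.nonEmpty).toList

  // True when both strings contain the same non-whitespace characters in the
  // same order, i.e. they differ only in spacing.
  def sameIgnoringWhitespace(expected: String, extracted: String): Boolean =
    stripWhitespace(expected) == stripWhitespace(extracted)

  // Words of `expected` that no longer appear as standalone words in
  // `extracted`, for example because they were joined with a neighbour.
  def missingWords(expected: String, extracted: String): List[String] = {
    val extractedWords = words(extracted).toSet
    words(expected).filterNot(extractedWords.contains)
  }
}
```

If `sameIgnoringWhitespace` returns true while `missingWords` is non-empty, the characters survived the conversion and only the spacing was lost, which matches what is seen with Justify alignment.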

truthSerruf
 
Posts: 13
Joined: Mon Feb 01, 2021 7:08 am

Mon Mar 15, 2021 8:28 am

Hello,

Thanks for your inquiry.
After thorough analysis and debugging, I found that the main issue is that the text extracted by the p.extractText() method is incorrect, rather than the Doc being converted to PDF improperly.
You can add the line "document.saveToFile("temp.pdf", FileFormat.PDF)" to save the document to a local PDF file, then open it in a PDF editor (such as Adobe) and extract its text. You will find that the extracted text is correctly separated by spaces.

Anyway, I have logged the issue in our bug tracking system with the ticket SPIREPDF-4104. If there are any updates, we will inform you. Sorry for the inconvenience caused.

Sincerely,
Elena
E-iceblue support team

Elena.Zhang
 
Posts: 279
Joined: Thu Jul 23, 2020 1:18 am

Mon Mar 15, 2021 9:04 am

Hello Elena,
Thanks for the reply.
We need to convert doc/docx to PDF and render the result in the browser with the pdf.js library (test viewer: https://mozilla.github.io/pdf.js/web/viewer.html). It renders fine (Image),
but if you copy the text, you will see it without spaces:
Image
So maybe it is a conversion issue after all?

truthSerruf
 
Posts: 13
Joined: Mon Feb 01, 2021 7:08 am

Mon Mar 15, 2021 9:42 am

Also, the text from Adobe looks weird if you copy and paste it:
Image
Image
If HorizontalAlignment is set to Left, for example, everything is fine.

truthSerruf
 
Posts: 13
Joined: Mon Feb 01, 2021 7:08 am

Mon Mar 15, 2021 11:08 am

Hello,

Thanks for your feedback.
For the two documents you provided (template.doc and 2.doc), I used Spire.Doc to convert them to PDF directly and then viewed them in the web page you provided (https://mozilla.github.io/pdf.js/web/viewer.html). I did find that there were no spaces after copying the text.
Meanwhile, for the PDF converted from 2.doc, after opening it in Adobe and copying and pasting the text, I noticed extra spaces between some characters. I have logged these issues in our bug tracking system with the ticket SPIREDOC-5689. If there is any update, we will inform you immediately. Sorry for the inconvenience caused.

Sincerely,
Elena
E-iceblue support team

Elena.Zhang
 
Posts: 279
Joined: Thu Jul 23, 2020 1:18 am

Fri Apr 30, 2021 1:59 am

Hello,

Thanks for your patience!

Glad to inform you that we have just released Spire.Office for Java version 4.4.6, which fixes the issue of incorrectly extracted text (SPIREPDF-4104).

Please download the fixed version from the following link and test it.
Website link: https://www.e-iceblue.com/Download/office-for-java.html

Sincerely,
Marcia
E-iceblue support team

Marcia.Zhou
 
Posts: 858
Joined: Wed Nov 04, 2020 2:29 am
