
Tue Aug 30, 2022 9:40 pm

I have a document with many content controls, and each content control gets converted to an HTML string. When converting to HTML, I noticed that the DOCTYPE added at the top of the HTML is:

Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">........


Below is the code I am using to convert to HTML:
Code:
using System;
using System.IO;
using System.Text;
using Spire.Doc;
using Spire.Doc.Documents;

private static readonly Encoding LocalEncoding = Encoding.UTF8;

private string ConvertDocumentContentToHTML(StructureDocumentTag content)
{
    try
    {
        // Clone the content control into a fresh document so it can be saved on its own.
        DocumentObject docObj = content.Clone();
        Spire.Doc.Document currComponentDoc = new Spire.Doc.Document();
        Section section = currComponentDoc.AddSection();
        section.Body.ChildObjects.Add(docObj);

        using (MemoryStream memoryStream = new MemoryStream())
        {
            currComponentDoc.SaveToStream(memoryStream, FileFormat.Html);
            string htmlText = LocalEncoding.GetString(memoryStream.ToArray());
            Console.WriteLine(htmlText);
            // Temporary workaround: strip the DOCTYPE from the HTML string. Please ignore this line.
            htmlText = HtmlParser.RemoveDocType(htmlText);
            return htmlText;
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine($"{nameof(ConvertDocumentContentToHTML)} - Failed - Error : {ex.Message}");
        throw; // Rethrow without resetting the stack trace.
    }
}


My concern is that XHTML, although it has similarities with HTML, really is not HTML: its MIME type is different, its parsing mode is different, and there are many other differences between the two. When I save to a stream with
Code:
doc.SaveToStream(memoryStream, FileFormat.Html);
does this convert to XHTML and not HTML?

The second part of my question: after converting a content control to HTML, when I convert the same HTML string back to DOCX using Spire.Doc, it adds an extra line of text to the Word document, as follows (screenshot also attached):
Screenshot 2022-08-30 142636.png

Code:
html xmlns="http://www.w3.org/1999/xhtml">

How do I avoid the above without stripping the <!DOCTYPE> from the HTML string?

Please note that I am using Spire.Doc 10.7.0 on .NET Core 3.1.

jalpaashara
 
Posts: 21
Joined: Wed Jun 22, 2022 8:12 pm

Wed Aug 31, 2022 10:02 am

Hi,

Thank you for your inquiry.
Yes, the .html file produced by Spire.Doc follows the XHTML specification. I also ran an initial test on my side, but I did not see an extra line of text when converting the HTML to a Word document. Please provide your input document and the .html file so that we can investigate further. You can attach your files here or send them to us via email (support@e-iceblue.com).

Sincerely,
Kylie
E-iceblue support team

kylie.tian
 
Posts: 412
Joined: Mon Mar 07, 2022 2:30 am
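
For readers who want to check what kind of DOCTYPE the saved string carries, or to swap it for the short HTML5 one instead of removing it, here is a minimal sketch that uses only the .NET base class library. The class and method names (DocTypeSketch, HasXhtmlDocType, UseHtml5DocType) are hypothetical and are not part of Spire.Doc. The sketch assumes the DOCTYPE sits at the very start of the string, possibly after a byte order mark that survived decoding. It only changes the declaration: the rest of the markup stays XML-style, and nothing here is confirmed to affect the extra line seen on the way back to DOCX.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DocTypeSketch
{
    // Matches a DOCTYPE declaration at the start of the string,
    // allowing an optional byte order mark and whitespace before it.
    private static readonly Regex LeadingDocType = new Regex(
        @"\A\uFEFF?\s*<!DOCTYPE[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // True when the leading DOCTYPE mentions an XHTML DTD.
    public static bool HasXhtmlDocType(string html)
    {
        Match m = LeadingDocType.Match(html ?? string.Empty);
        return m.Success
            && m.Value.IndexOf("XHTML", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Replaces a leading DOCTYPE with the HTML5 one.
    // Input without a leading DOCTYPE is returned unchanged.
    public static string UseHtml5DocType(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }
        return LeadingDocType.Replace(html, "<!DOCTYPE html>", 1);
    }
}

public static class Program
{
    public static void Main()
    {
        string saved =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" " +
            "\"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";

        Console.WriteLine(DocTypeSketch.HasXhtmlDocType(saved));
        Console.WriteLine(DocTypeSketch.UseHtml5DocType(saved));
    }
}
```

In the method from the first post, UseHtml5DocType(htmlText) could stand in for the HtmlParser.RemoveDocType call, so the string keeps a DOCTYPE while no longer declaring an XHTML DTD.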
