Spire.Doc is a professional Word .NET library specifically designed for developers to create, read, write, convert and print Word document files. Get free and professional technical support for Spire.Doc for .NET, Java, Android, C++, Python.

Fri Mar 12, 2021 9:42 am

Hi Team,

We need to read and extract the text, together with its formatting, from the table of contents in a DOCX file.

Please advise on the solution.

A sample DOCX is attached; we want to extract the "Contents" heading and the list of entries from it.

pr20080798
 
Posts: 149
Joined: Wed Jan 20, 2021 1:15 pm

Fri Mar 12, 2021 10:16 am

Hello,

Please refer to the following code to read the "Contents" heading and the list in the table of contents of your DOCX file.
Code:
// Namespaces: Spire.Doc, Spire.Doc.Collections, Spire.Doc.Documents
Document doc = new Document();
doc.LoadFromFile(@"C:\Heading_Mismatch.docx");
Section section = doc.Sections[0];
for (int i = 0; i < section.Body.ChildObjects.Count; i++)
{
    if (section.Body.ChildObjects[i].DocumentObjectType == DocumentObjectType.StructureDocumentTag)
    {
        StructureDocumentTag tag = section.Body.ChildObjects[i] as StructureDocumentTag;
        // Holds the "Contents" heading and the entries of the table of contents
        DocumentObjectCollection objects = tag.ChildObjects;
    }
}


Sincerely,
Amy
E-iceblue support team
amy.zhao
 
Posts: 2766
Joined: Wed Jun 27, 2012 8:50 am

Fri Mar 12, 2021 11:20 am

Thank you for your response.

We tried the same file and observed the count of children.

StructureDocumentTag tag = section.Body.ChildObjects[i] as StructureDocumentTag;
DocumentObjectCollection objects = tag.ChildObjects;
// objects.getCount() returns 10, but the table of contents has 9 items, including the "Contents" heading.

Please let us know how to get the exact "Contents" word from the document and subsequent list items (like Heading-1, Heading-1.1, and so on).

The given code seems incomplete for us to proceed further.

pr20080798
 

Mon Mar 15, 2021 6:37 am

Hello,

Thanks for your feedback.
The table of contents in your Word file actually contains 10 paragraphs; the last paragraph is blank.
001.png

Please refer to the following code to get the exact "Contents" word and the list items in the table of contents.
Code:
// Namespaces: System.Drawing, System.IO, System.Text,
// Spire.Doc, Spire.Doc.Collections, Spire.Doc.Documents, Spire.Doc.Fields
Document doc = new Document();
doc.LoadFromFile(@"C:\Heading_Mismatch.docx");
Section section = doc.Sections[0];
StringBuilder stringBuilder = new StringBuilder();
String text;
Paragraph paragraph;
TextRange textRange;
for (int i = 0; i < section.Body.ChildObjects.Count; i++)
{
    if (section.Body.ChildObjects[i].DocumentObjectType == DocumentObjectType.StructureDocumentTag)
    {
        StructureDocumentTag tag = section.Body.ChildObjects[i] as StructureDocumentTag;
        // Holds the "Contents" heading and the entries of the table of contents
        DocumentObjectCollection objects = tag.ChildObjects;
        for (int j = 0; j < objects.Count; j++)
        {
            if (objects[j].DocumentObjectType == DocumentObjectType.Paragraph)
            {
                paragraph = objects[j] as Paragraph;
                // Read the text of the paragraph
                text = paragraph.Text;
                stringBuilder.AppendLine(text);
                for (int k = 0; k < paragraph.ChildObjects.Count; k++)
                {
                    if (paragraph.ChildObjects[k].DocumentObjectType == DocumentObjectType.TextRange)
                    {
                        // Read some of the text formatting (the values are not used further here)
                        textRange = paragraph.ChildObjects[k] as TextRange;
                        String fontName = textRange.CharacterFormat.FontName;
                        float fontSize = textRange.CharacterFormat.FontSize;
                        Color textColor = textRange.CharacterFormat.TextColor;
                    }
                }
            }
        }
    }
}
// Write the collected paragraph text to a file
File.WriteAllText("TOC.txt", stringBuilder.ToString());
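
If the trailing blank paragraph should not be counted as an entry, the collected lines could be filtered before they are written out. The following is only a minimal sketch: TocLines.NonBlank is a hypothetical helper, not part of Spire.Doc, and it works on plain strings such as the paragraph.Text values read above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Spire.Doc.
public static class TocLines
{
    // Returns only the lines that contain something other than whitespace.
    public static List<string> NonBlank(IEnumerable<string> lines)
    {
        return lines.Where(line => !string.IsNullOrWhiteSpace(line)).ToList();
    }
}
```

For the sample file, passing the 10 paragraph texts through such a filter would leave the 9 visible items.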


Sincerely,
Amy
E-iceblue support team
amy.zhao
 

Thu Mar 18, 2021 5:50 am

Hello,

Does the above code meet your needs?
Looking forward to your feedback.

Sincerely,
Amy
E-iceblue support team
amy.zhao
 

Fri Mar 19, 2021 1:29 pm

Hi Amy,

Thanks for the response.

The suggested solution works, but in some TOC cases we get the contents as a FIELD_MARK document object type, so we cannot extract the exact text since it is no longer a paragraph.

Please advise how to get the text value when childObj.getDocumentObjectType().equals(DocumentObjectType.Field_Mark).

Thanks in advance.

pr20080798
 

Mon Mar 22, 2021 5:56 am

Hi,

Thanks for your feedback.
You can read the text through the paragraph.Text property, or go through the TextRange objects of the paragraph and read the text through the textRange.Text property.
You do not need to handle the Field Mark object type; just look for TextRange objects in the paragraph's collection of child objects.

Code:
if (objects[j].DocumentObjectType == DocumentObjectType.Paragraph)
{
    paragraph = objects[j] as Paragraph;
    // Read the text of the whole paragraph
    text = paragraph.Text;
    for (int k = 0; k < paragraph.ChildObjects.Count; k++)
    {
        if (paragraph.ChildObjects[k].DocumentObjectType == DocumentObjectType.TextRange)
        {
            // Read the text of a single text range
            textRange = paragraph.ChildObjects[k] as TextRange;
            String text1 = textRange.Text;
        }
    }
}


If you still cannot read the text in some of the TOC cases you're facing, please provide a sample file and I will prepare a demo for you.

Sincerely,
Amy
E-iceblue support team
amy.zhao
 

Tue Mar 23, 2021 9:13 am

Thanks for the solution. We are able to read the contents as Paragraph now.

In another case, we get HYPERLINK \l "_Toc<RandomNumberSequence>" prepended to the actual word when we extract the text from the TOC.

Example:
1. Purpose………………………………………
When we try to extract the word “Purpose” through paragraph, we receive HYPERLINK \l "_Toc139704637" Purpose.

Please let us know why this hardcoded HYPERLINK string appears in the extracted text, since it does not happen for the TOCs of other Word files. Is there any way to avoid this unwanted string and get only the word ‘Purpose’?

pr20080798
 

Tue Mar 23, 2021 9:28 am

Hi Pradeep,

To help me investigate the case quickly, please share a sample file (you can send it via email) and I will provide you with sample code.
Thanks in advance!

Sincerely,
Amy
E-iceblue support team
amy.zhao
 

Tue Mar 23, 2021 10:24 am

Hello Pradeep,

Thanks for sharing your sample file via email.
Please refer to the following sample code.
Code:
// Namespaces: Spire.Doc, Spire.Doc.Documents, Spire.Doc.Fields
// input is the path to the DOCX file
Document doc = new Document();
doc.LoadFromFile(input);
Body body = doc.Sections[0].Body;
Paragraph paragraph;
for (int i = 0; i < body.ChildObjects.Count; i++)
{
    if (body.ChildObjects[i].DocumentObjectType == DocumentObjectType.Paragraph)
    {
        paragraph = (Paragraph)body.ChildObjects[i];
        // Only handle paragraphs whose style name contains "TOC"
        if (paragraph.StyleName.Contains("TOC"))
        {
            for (int j = 0; j < paragraph.ChildObjects.Count; j++)
            {
                if (paragraph.ChildObjects[j].DocumentObjectType == DocumentObjectType.Field)
                {
                    // Read the text of the field instead of the paragraph text
                    String text = (paragraph.ChildObjects[j] as Field).FieldText;
                }
            }
        }
    }
}
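
If only the raw paragraph text is available, the field code could also be stripped with a regular expression. This is a minimal sketch that assumes the unwanted prefix always has the form HYPERLINK \l "_Toc<number>", as in the example you reported; TocText and StripFieldCode are illustrative names, not part of Spire.Doc.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Spire.Doc.
public static class TocText
{
    // Matches a field code such as: HYPERLINK \l "_Toc139704637"
    // (assumed form, taken from the reported example)
    private static readonly Regex FieldCode =
        new Regex(@"HYPERLINK\s+\\l\s+""[^""]*""\s*");

    // Removes the field code and returns the remaining entry text.
    public static string StripFieldCode(string text)
    {
        return FieldCode.Replace(text, string.Empty).Trim();
    }
}
```

A call such as TocText.StripFieldCode(paragraph.Text) would then leave text without such a prefix unchanged, apart from trimming surrounding whitespace.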


Sincerely,
Amy
E-iceblue support team
amy.zhao
 

Tue Mar 23, 2021 1:11 pm

Hi Amy,

Thanks for the sample code. It works perfectly.

pr20080798
 

Wed Mar 24, 2021 2:01 am

Hello,

Glad to hear that!
If you encounter any issues related to our products in the future, please feel free to contact us.
Have a nice day!

Sincerely,
Rachel
E-iceblue support team
rachel.lei
 
Posts: 1571
Joined: Tue Jul 09, 2019 2:22 am
