Read Table of Content from Word file

Fri Mar 12, 2021 9:42 am

Hi Team,

We need to read and extract the text, with its formatting, from the table of contents in a DOCX file.

Please advise on the solution.

A sample DOCX is attached, from which we want to extract the "Contents" title and the list as well.

Fri Mar 12, 2021 10:16 am

Hello,

Please refer to the following code to read the "Contents" title and the list in the table of contents of your DOCX file.

Code:
Document doc = new Document();
doc.LoadFromFile(@"C:\Heading_Mismatch.docx");
Section section = doc.Sections[0];
for (int i = 0; i < section.Body.ChildObjects.Count; i++)
{
    if (section.Body.ChildObjects[i].DocumentObjectType == DocumentObjectType.StructureDocumentTag)
    {
        StructureDocumentTag tag = section.Body.ChildObjects[i] as StructureDocumentTag;
        // Contains the "Contents" title and the list in the table of contents
        DocumentObjectCollection objects = tag.ChildObjects;
    }
}

Sincerely,
Amy
E-iceblue support team

Fri Mar 12, 2021 11:20 am

Thank you for your response.

We tried the same file and observed the count of children.

StructureDocumentTag tag = section.Body.ChildObjects[i] as StructureDocumentTag;
DocumentObjectCollection objects = tag.ChildObjects;
// objects.getCount() returns 10, but the table of contents has only 9 items, including the word "Contents".

Please let us know how to get the exact "Contents" word from the document and subsequent list items (like Heading-1, Heading-1.1, and so on).

The given code seems incomplete for us to proceed further.

Mon Mar 15, 2021 6:37 am

Hello,

Thanks for your feedback.
In the table of contents of your Word file there are in fact 10 paragraphs; the last paragraph is blank.

001.png

Please refer to the following code to get the exact "Contents" word and the list items in the table of contents.

Code:
Document doc = new Document();
doc.LoadFromFile(@"C:\Heading_Mismatch.docx");
Section section = doc.Sections[0];
StringBuilder stringBuilder = new StringBuilder();
String text;
Paragraph paragraph;
TextRange textRange;
for (int i = 0; i < section.Body.ChildObjects.Count; i++)
{
    if (section.Body.ChildObjects[i].DocumentObjectType == DocumentObjectType.StructureDocumentTag)
    {
        StructureDocumentTag tag = section.Body.ChildObjects[i] as StructureDocumentTag;
        // Contains the "Contents" title and the list in the table of contents
        DocumentObjectCollection objects = tag.ChildObjects;
        for (int j = 0; j < objects.Count; j++)
        {
            if (objects[j].DocumentObjectType == DocumentObjectType.Paragraph)
            {
                paragraph = objects[j] as Paragraph;
                // Read the text of the paragraph
                text = paragraph.Text;
                stringBuilder.AppendLine(text);
                for (int k = 0; k < paragraph.ChildObjects.Count; k++)
                {
                    if (paragraph.ChildObjects[k].DocumentObjectType == DocumentObjectType.TextRange)
                    {
                        // Read some formatting of the text
                        textRange = paragraph.ChildObjects[k] as TextRange;
                        String fontName = textRange.CharacterFormat.FontName;
                        float fontSize = textRange.CharacterFormat.FontSize;
                        Color textColor = textRange.CharacterFormat.TextColor;
                    }
                }
            }
        }
    }
}
File.WriteAllText("TOC.txt", stringBuilder.ToString());

Sincerely,
Amy
E-iceblue support team
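
Note: the count mismatch discussed above (10 child paragraphs for 9 visible entries) comes from the trailing blank paragraph. A minimal sketch of dropping such entries, assuming the paragraph texts have already been collected into a list of strings; the `TocTextFilter` class and `NonBlankEntries` method are hypothetical helper names, not part of Spire.Doc:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TocTextFilter
{
    // Hypothetical helper: keeps only entries that contain visible text
    // and trims surrounding whitespace from each kept entry.
    public static List<string> NonBlankEntries(IEnumerable<string> paragraphTexts)
    {
        return paragraphTexts
            .Where(t => !string.IsNullOrWhiteSpace(t))
            .Select(t => t.Trim())
            .ToList();
    }
}
```

Applied to the texts read via paragraph.Text, the resulting list's Count would then reflect only the entries that actually carry text.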

Thu Mar 18, 2021 5:50 am

Hello,

Does the above code meet your needs?
Looking forward to your feedback.

Sincerely,
Amy
E-iceblue support team

Fri Mar 19, 2021 1:29 pm

Hi Amy,

Thanks for the response.

The suggested solution works, but in some TOC cases we get the contents as a FIELD_MARK document object type, so we are not able to extract the exact text since it is no longer a paragraph.

Please advise on how to get the text value when childObj.getDocumentObjectType().equals(DocumentObjectType.Field_Mark) is true.

Thanks in advance.

Mon Mar 22, 2021 5:56 am

Hi,

Thanks for your feedback.
You can read the text via the paragraph.Text property, or iterate over the TextRange objects of the paragraph and read each one's text via the textRange.Text property.
You do not need to handle the Field Mark object type; just look for TextRange objects in the collection of child objects of the paragraph.

Code:
if (objects[j].DocumentObjectType == DocumentObjectType.Paragraph)
{
    paragraph = objects[j] as Paragraph;
    // Read the text of the whole paragraph
    text = paragraph.Text;
    for (int k = 0; k < paragraph.ChildObjects.Count; k++)
    {
        if (paragraph.ChildObjects[k].DocumentObjectType == DocumentObjectType.TextRange)
        {
            // Read the text of each TextRange; other child types such as Field Mark are skipped
            textRange = paragraph.ChildObjects[k] as TextRange;
            String text1 = textRange.Text;
        }
    }
}

If you still don't know how to read text in some TOC cases you're facing, please provide your sample file. I will do a demo for you.

Sincerely,
Amy
E-iceblue support team

Tue Mar 23, 2021 9:13 am

Thanks for the solution. We are able to read the contents as Paragraph now.

In another case, we get HYPERLINK \l "_Toc<RandomNumberSequence>" prepended to the actual word when we extract the text from the TOC.

Example:
1. Purpose………………………………………
When we try to extract the word “Purpose” through the paragraph, we receive HYPERLINK \l "_Toc139704637" Purpose.

Please let us know why this HYPERLINK string appears in the extracted text, since it does not happen for the TOCs in other Word files. Is there any way to avoid this unwanted string and get only the word ‘Purpose’?

Tue Mar 23, 2021 9:28 am

Hi Pradeep,

To help me investigate the case quickly, please share a sample file (you can send it via email) and I will provide you with sample code.
Thanks in advance!

Sincerely,
Amy
E-iceblue support team

Tue Mar 23, 2021 10:24 am

Hello Pradeep,

Thanks for sharing your sample file via email.
Please refer to the following sample code.

Code:
Document doc = new Document();
doc.LoadFromFile(input);
Body body = doc.Sections[0].Body;
Paragraph paragraph;
for (int i = 0; i < body.ChildObjects.Count; i++)
{
    if (body.ChildObjects[i].DocumentObjectType == DocumentObjectType.Paragraph)
    {
        paragraph = (Paragraph)body.ChildObjects[i];
        if (paragraph.StyleName.Contains("TOC"))
        {
            for (int j = 0; j < paragraph.ChildObjects.Count; j++)
            {
                if (paragraph.ChildObjects[j].DocumentObjectType == DocumentObjectType.Field)
                {
                    String text = (paragraph.ChildObjects[j] as Field).FieldText;
                }
            }
        }
    }
}

Sincerely,
Amy
E-iceblue support team
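
Note: if the text has already been extracted with the HYPERLINK prefix attached, it could also be cleaned up afterwards with a regular expression. The following is only a sketch under stated assumptions: the prefix is assumed to always look like the one quoted in this thread (HYPERLINK \l "_Toc" followed by digits), and `TocFieldCodeCleaner` and `Strip` are hypothetical names that are not part of Spire.Doc.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TocFieldCodeCleaner
{
    // Assumed pattern, based on the example in this thread:
    //   HYPERLINK \l "_Toc139704637"
    // followed by optional whitespace before the visible entry text.
    private static readonly Regex HyperlinkCode =
        new Regex(@"HYPERLINK\s+\\l\s+""_Toc\d+""\s*", RegexOptions.Compiled);

    // Hypothetical helper: removes the field-code prefix and trims the result.
    public static string Strip(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }
        return HyperlinkCode.Replace(text, string.Empty).Trim();
    }
}
```

For the example above, passing the string HYPERLINK \l "_Toc139704637" Purpose to Strip leaves only Purpose. Text that does not contain the prefix is returned unchanged apart from trimming, so the same call can be applied to TOCs from other Word files where the prefix does not appear.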

Tue Mar 23, 2021 1:11 pm

Hi Amy,

Thanks for the sample code. It works perfectly.

Wed Mar 24, 2021 2:01 am

Hello,

Glad to hear that!
If you encounter any issues related to our products in the future, just feel free to contact us.
Have a nice day!

Sincerely,
Rachel
E-iceblue support team
