Spire.Doc is a professional Word .NET library specifically designed for developers to create, read, write, convert and print Word document files. Get free and professional technical support for Spire.Doc for .NET, Java, Android, C++, Python.

Tue Sep 14, 2021 7:43 am

Hi Team,

We need to read and extract the embedded attachments (.docx, .pdf, etc.) from an MS Word file. Please let us know whether this is possible with the Spire product.

Thanks.

pr20080798
 
Posts: 146
Joined: Wed Jan 20, 2021 1:15 pm

Tue Sep 14, 2021 10:01 am

Hello,

Thank you for your inquiry.
Please refer to the following code to extract the attachments embedded in the Word file. If this code does not meet your needs, please provide your sample Word file for our reference. You can attach it here or send it to us via email (support@e-iceblue.com). Thanks in advance.
Code:
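using System.IO;
using Spire.Doc;
using Spire.Doc.Documents;
using Spire.Doc.Fields;

Document doc = new Document();
//Load file from disk ("filePath" is a placeholder for the path of your Word file)
doc.LoadFromFile("filePath");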
//Traverse through all sections of the word document
foreach (Section sec in doc.Sections)
{
    //Traverse through all Child Objects in the body of each section
    foreach (DocumentObject obj in sec.Body.ChildObjects)
    {
        //Find the paragraph
        if (obj is Paragraph)
        {
            Paragraph par = obj as Paragraph;
            foreach (DocumentObject o in par.ChildObjects)
            {
                //Check whether the object is OLE
                if (o.DocumentObjectType == DocumentObjectType.OleObject)
                {
                    DocOleObject Ole = o as DocOleObject;
                    string s = Ole.ObjectType;
                    //If s == "AcroExch.Document.DC", means it's a PDF document
                    if (s == "AcroExch.Document.DC")
                    {
                        File.WriteAllBytes("Result.pdf", Ole.NativeData);
                    }
                    //If s == "Excel.Sheet.12", means it's an Excel workbook
                    else if (s == "Excel.Sheet.12")
                    {
                        File.WriteAllBytes("Result.xlsx", Ole.NativeData);
                    }
                    //If s == "PowerPoint.Show.12", means it's a PowerPoint File
                    else if (s == "PowerPoint.Show.12")
                    {
                        File.WriteAllBytes("PPTResult.pptx", Ole.NativeData);
                    }
                    //If s == "Word.Document.12", means it's a Word document
                    else if (s == "Word.Document.12")
                    {
                        File.WriteAllBytes("WordResult.docx", Ole.NativeData);
                    }
                }
            }
        }
    }
}
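
If more embedded types need to be handled, the if/else chain above could be replaced by a lookup table. The following is only a sketch in Java (the language used later in this thread). The names `OleTypeExtensions` and `extensionFor` are made up for illustration and are not part of Spire.Doc, and the table contains only the four type strings from the code above.

```java
import java.util.Map;
import java.util.Optional;

public class OleTypeExtensions {
    // OLE object type -> file extension, limited to the four types named above.
    private static final Map<String, String> EXTENSIONS = Map.of(
            "AcroExch.Document.DC", ".pdf",
            "Excel.Sheet.12", ".xlsx",
            "PowerPoint.Show.12", ".pptx",
            "Word.Document.12", ".docx");

    // Returns the extension for a known object type, or empty for anything else.
    public static Optional<String> extensionFor(String objectType) {
        if (objectType == null) {
            return Optional.empty(); // Map.of(...) does not accept null keys in get()
        }
        return Optional.ofNullable(EXTENSIONS.get(objectType));
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("Excel.Sheet.12").orElse("(unknown)")); // .xlsx
        System.out.println(extensionFor("Package").orElse("(unknown)"));        // (unknown)
    }
}
```

An unknown type returns an empty `Optional`, so the caller can decide whether to skip the object or save it under a generic name.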

Sincerely,
Annika
E-iceblue support team

Annika.Zhou
 
Posts: 1643
Joined: Wed Apr 07, 2021 2:50 am

Thu Sep 16, 2021 5:54 am

Hi Team,

Please let me know if there is any possibility to extract the file name of the embedded attachments while extracting the attachments from the MS word file.

Thanks in advance

pr20080798
 

Thu Sep 16, 2021 8:29 am

Hi Team,

File.WriteAllBytes("Result.pdf", Ole.NativeData);

This code is not working for me. I changed it to Files.write(Paths.get(fileName),data); which works fine for both PDF and DOCX files but not for XLSX files: I cannot see the contents of the extracted Excel file.

Thanks in advance

pr20080798
 

Thu Sep 16, 2021 11:36 am

Hi,

Thanks for your feedback.
1) It is impossible to obtain the file names of the embedded attachments. For example, if your Word document has an Excel file named Test.xlsx embedded, you will find after parsing the Word data that the name "Test.xlsx" is not stored in the parsed XML. Hence, our product has no way to extract the name.
2) I did not find anything wrong with the extracted Excel file. Please provide your sample Word file for further investigation. You can upload it here or send it to us via email (support@e-iceblue.com). Thanks in advance.

Sincerely,
Annika
E-iceblue support team

Annika.Zhou
 

Thu Sep 16, 2021 1:25 pm

Hi Team

Thanks for your response

pr20080798
 

Fri Sep 17, 2021 12:36 pm

Hello,

Thank you for sharing your file.
I tested your Word file and found that the extracted Excel file is actually correct. After parsing the data of your Word file, I found that the embedded Excel file has the "Windows Hide" feature enabled by default when it is opened in MS Excel. This is what caused the misunderstanding: if you deselect this feature as shown in the attached screenshot, you will see the correct content of the generated Excel file. If you have any questions, please feel free to write back.

Sincerely,
Annika
E-iceblue support team

Annika.Zhou
 

Wed Sep 22, 2021 11:51 am

Hi Team

I have a requirement to filter out certain file extensions (e.g. .exe files) while extracting embedded documents from an MS Word file. Please let me know whether it is possible to ignore attached files based on their file extension.

Thanks in advance

pr20080798
 

Thu Sep 23, 2021 4:34 am

Hi Team,

Please let me know whether embedded attachments can be extracted from an MS Word file under their original names. The extracted files should not be saved under different names.

Thanks in advance

pr20080798
 

Thu Sep 23, 2021 10:04 am

Hi,

Thanks for your inquiry.
1. You could use the following code snippet to skip .exe files based on their extension. If you have any questions, please feel free to write back.
Code:
 ...
//Check whether the object type is "Package"
else if ("Package".equals(type)){
     //Get file name
     String fileName = ole.getPackageFileName();
     //Extension including the dot, or "" when the name contains no dot
     int dot = fileName.lastIndexOf('.');
     String extension = dot >= 0 ? fileName.substring(dot) : "";
     if (extension.equalsIgnoreCase(".exe")){
         //Skip executables: nothing is written
     }
     else if (extension.equalsIgnoreCase(".dll")){
         byte[] bytes = ole.getNativeData();
         Files.write(Paths.get("extract.dll"),bytes);
     }
 }
...

2. Do you want to extract the file name of the attachment from the Word file? Sorry, there is no way to do that. As I mentioned in my previous post, the file name of an embedded attachment is not stored in the Word document. For example, if your Word document has an Excel file named Test.xlsx embedded, you will find after parsing the Word data that the name "Test.xlsx" is replaced with "Microsoft_Excel____1" in the embeddings; the original name "Test" is not stored in the parsed XML. I hope you understand.
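
The extension check from point 1 could also be pulled into a small helper so that more blocked extensions can be added later. This is only a sketch: `ExtensionFilter`, `extensionOf`, `isAllowed` and the contents of the blocked list are assumptions for illustration, not part of the Spire.Doc API. In the snippet above, the file name would come from ole.getPackageFileName().

```java
import java.util.Locale;
import java.util.Set;

public class ExtensionFilter {
    // Extensions that should not be extracted (assumed list; extend as needed).
    private static final Set<String> BLOCKED = Set.of(".exe");

    // Returns the lower-cased extension including the dot, or "" if there is none.
    public static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 ? fileName.substring(dot).toLowerCase(Locale.ROOT) : "";
    }

    // True when the file may be written to disk, false when it should be skipped.
    public static boolean isAllowed(String fileName) {
        return !BLOCKED.contains(extensionOf(fileName));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("setup.EXE"));  // false
        System.out.println(isAllowed("report.pdf")); // true
        System.out.println(isAllowed("README"));     // true
    }
}
```

Lower-casing the extension means that "setup.EXE" and "setup.exe" are treated the same, and a name without a dot no longer causes an exception.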

Sincerely,
Annika
E-iceblue support team

Annika.Zhou
 

Thu Sep 23, 2021 10:45 am

Thank you for your response.

I have written the code below to extract images from the MS Word file, but it treats other embedded file types (PDF, XLSX, DOCX) as images (picture type). I need to extract only the images.

//Load document
Document document = new Document(@"sample.docx");
int index = 0;
//Choose or create a folder by user
FolderBrowserDialog dialog = new FolderBrowserDialog();
dialog.Description = "Choose a folder";
string foldPath = "";
if (dialog.ShowDialog() == DialogResult.OK)
{
    foldPath = dialog.SelectedPath + "\\";
}
//Get Each Section of Document
foreach (Section section in document.Sections)
{
    //Get Each Paragraph of Section
    foreach (Paragraph paragraph in section.Paragraphs)
    {
        //Get Each Document Object of Paragraph Items
        foreach (DocumentObject docObject in paragraph.ChildObjects)
        {
            //If Type of Document Object is Picture, Extract.
            if (docObject.DocumentObjectType == DocumentObjectType.Picture)
            {
                DocPicture picture = docObject as DocPicture;
                picture.Image.Save(foldPath + string.Format("image_{0}.png", index), System.Drawing.Imaging.ImageFormat.Png);
                index++;
            }
        }
    }
}


Thanks,

pr20080798
 

Fri Sep 24, 2021 7:27 am

Hello,

Please refer to the following code.
Code:
            //Load document
            Document document = new Document(@"sample.docx");
            int index = 0;

            List<DocPicture> olePictures = new List<DocPicture>();
            foreach (Section section in document.Sections)
            {
                foreach (Paragraph paragraph in section.Paragraphs)
                {
                    foreach (DocumentObject docObject in paragraph.ChildObjects)
                    {
                        if (docObject.DocumentObjectType == DocumentObjectType.OleObject)
                        {
                            DocOleObject Ole = docObject as DocOleObject;
                            olePictures.Add(Ole.OlePicture);
                        }
                    }
                }
            }

            //Get Each Section of Document
            foreach (Section section in document.Sections)
            {
                //Get Each Paragraph of Section
                foreach (Paragraph paragraph in section.Paragraphs)
                {
                    //Get Each Document Object of Paragraph Items
                    foreach (DocumentObject docObject in paragraph.ChildObjects)
                    {
                        if (docObject.DocumentObjectType == DocumentObjectType.Picture)
                        {
                            DocPicture picture = docObject as DocPicture;
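                            //Skip pictures that belong to an OLE object
                            bool isOlePicture = false;
                            foreach (DocPicture olePicture in olePictures)
                            {
                                if (picture == olePicture)
                                {
                                    isOlePicture = true;
                                    break;
                                }
                            }
                            if (!isOlePicture)
                            {
                                picture.Image.Save(string.Format("image_{0}.png", index), System.Drawing.Imaging.ImageFormat.Png);
                                index++;
                            }
                        }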
                    }
                }
            }
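
The two-pass idea in the code above (first collect the pictures that belong to OLE objects, then skip exactly those instances when saving images) can be illustrated without Spire.Doc. The sketch below is plain Java with ordinary objects standing in for the pictures; `IdentityFilter` and `withoutInstances` are hypothetical names, and an identity-based set is used here in place of the nested comparison loop.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class IdentityFilter {
    // Returns the items of 'all' that are not the same instance as any item in 'excluded'.
    // Comparison is by reference (like == in the code above), not by equals().
    public static <T> List<T> withoutInstances(List<T> all, List<T> excluded) {
        Set<T> skip = Collections.newSetFromMap(new IdentityHashMap<>());
        skip.addAll(excluded);
        List<T> kept = new ArrayList<>();
        for (T item : all) {
            if (!skip.contains(item)) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Object olePreview = new Object(); // stands in for a picture owned by an OLE object
        Object photo = new Object();      // stands in for a normal picture
        List<Object> kept = withoutInstances(List.of(olePreview, photo), List.of(olePreview));
        System.out.println(kept.size()); // 1
    }
}
```

Whether this removes the unwanted images in a real document depends on the picture found in the paragraph being the same object instance as the one returned for the OLE object, which this sketch simply assumes.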

If this cannot meet your needs, please provide your input file for further investigation. You can upload it here or send it to us (support@e-iceblue.com) via email. Thanks in advance.

Sincerely,
Brian
E-iceblue support team

Brian.Li
 
Posts: 1271
Joined: Mon Oct 19, 2020 3:04 am

Thu Sep 30, 2021 10:06 am

Hi Brian.Li

Thank you for your reply.

The code above is not working for me. Please refer to the attached sample Word file.

I need to extract all attached files from the Word document.

if(extracted file type is oleType) {

//code for ole type
}
else if(extracted file is picture) {

// code for picture type
}

pr20080798
 

Thu Sep 30, 2021 10:09 am

Hello,

Thanks for your feedback.
I am sorry, but I could not find the file you mentioned attached here. Please check and upload it again.

Sincerely,
Brian
E-iceblue support team

Brian.Li
 

Thu Sep 30, 2021 11:02 am

Hi
Last edited by pr20080798 on Thu Sep 30, 2021 11:31 am, edited 2 times in total.

pr20080798
 
