C++: Extract Text and Images from Word Documents

Extracting text and images is a common requirement when working with Word documents. It lets you pull useful content out of the original document and reuse it in a new document or for other purposes. In this article, you will learn how to extract text or images from a Word document using Spire.Doc for C++.

Install Spire.Doc for C++

There are two ways to integrate Spire.Doc for C++ into your application. One is to install it through NuGet; the other is to download the package from our website and copy the libraries into your program. Installation via NuGet is simpler and is the recommended approach. You can find more details by visiting the following link.

Integrate Spire.Doc for C++ in a C++ Application

Extract Text from a Word Document in C++

To extract the text content from an existing Word document, Spire.Doc for C++ provides the Document->GetText() method. The following steps extract the text and save it to a TXT file.

  • Create a Document instance.
  • Load a sample Word document using the Document->LoadFromFile() method.
  • Get the text from the document using the Document->GetText() method.
  • Create a new TXT file and write the extracted text to it.
#include "Spire.Doc.o.h"
#include <fstream>
#include <string>

using namespace Spire::Doc;

int main() {
	//Specify input file path and name
	std::wstring data_path = L"Data\\";
	std::wstring inputFile = data_path + L"input.docx";

	//Specify output file path and name
	std::wstring outputPath = L"Output\\";
	std::wstring outputFile = outputPath + L"GetText.txt";

	//Create a Document instance
	Document* document = new Document();

	//Load a sample Word document from disk
	document->LoadFromFile(inputFile.c_str());

	//Get text from the document
	std::wstring text = document->GetText();

	//Create a new TXT File to save the extracted text
	std::wofstream write(outputFile);
	write << text;
	write.close();
	document->Close();
	delete document;
}
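
One caveat with the example above: a std::wofstream that uses the default locale may fail to convert characters outside the basic ASCII range, which can cut the output short for documents containing accented or non-Latin text. The following is a minimal, library-independent sketch of one possible workaround: encode the wide string as UTF-8 yourself and write the bytes through a narrow stream. ToUtf8 and WriteUtf8File are hypothetical helper names, not part of Spire.Doc, and the narrow std::string path is an assumption made for portability.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper (not a Spire.Doc API): encode a wide string as UTF-8.
// Handles 16-bit wchar_t (UTF-16, as on Windows) and 32-bit wchar_t (UTF-32).
std::string ToUtf8(const std::wstring& wide)
{
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);

        // With 16-bit wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size())
        {
            std::uint32_t low = static_cast<std::uint32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }

        if (cp < 0x80)
        {
            // 1-byte sequence (ASCII)
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            // 2-byte sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            // 3-byte sequence
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            // 4-byte sequence
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: write the text to a file as UTF-8 bytes.
// Returns false if the file cannot be opened or the write fails.
bool WriteUtf8File(const std::string& path, const std::wstring& text)
{
    std::ofstream file(path, std::ios::binary);
    if (!file)
    {
        return false;
    }
    const std::string bytes = ToUtf8(text);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return file.good();
}
```

With helpers like these, the extracted text could be saved by calling WriteUtf8File("Output/GetText.txt", text) in place of the std::wofstream lines, assuming the Output folder already exists.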

Extract Images from a Word Document in C++

For a Word document that contains many images, saving them manually one by one is time-consuming. The steps below extract all images at once using Spire.Doc for C++.

  • Load a sample Word document using the Document->LoadFromFile() method.
  • Add the document to a deque of objects to visit, and create a vector to hold the extracted images.
  • Traverse all child objects of the document.
  • Check whether each object is a picture. If it is, get the image using the DocPicture->GetImage() method and add it to the vector.
  • Save the extracted images to the specified output folder.
#include "Spire.Doc.o.h"
#include <deque>
#include <string>
#include <vector>

using namespace Spire::Doc;

int main() {
	//Specify input file path and name
	std::wstring data_path = L"Data\\";
	std::wstring inputFile = data_path + L"input.docx";

	//Specify the output folder for the extracted images
	std::wstring outputPath = L"Output\\";
	std::wstring outputFile = outputPath + L"ExtractImage/";

	//Load a sample Word document
	Document* document = new Document();
	document->LoadFromFile(inputFile.c_str());

	//Queue the document as the first object to visit
	std::deque<ICompositeObject*> nodes;
	nodes.push_back(document);

	//Create a vector to hold the extracted images
	std::vector<Image*> images;

	//Traverse through all child objects of the document
	while (nodes.size() > 0)
	{
		ICompositeObject* node = nodes.front();
		nodes.pop_front();
		for (int i = 0; i < node->GetChildObjects()->GetCount(); i++)
		{
			IDocumentObject* child = node->GetChildObjects()->GetItem(i);

			//If the child is a picture, get its image and add it to the list
			if (child->GetDocumentObjectType() == DocumentObjectType::Picture)
			{
				DocPicture* picture = dynamic_cast<DocPicture*>(child);
				if (picture != nullptr)
				{
					images.push_back(picture->GetImage());
				}
			}
			//Otherwise, if the child can contain other objects, queue it for a later visit
			else if (ICompositeObject* composite = dynamic_cast<ICompositeObject*>(child))
			{
				nodes.push_back(composite);
			}

		}
	}
	//Save each extracted image as a PNG file
	for (size_t i = 0; i < images.size(); i++)
	{
		std::wstring fileName = L"Image-" + std::to_wstring(i) + L".png";
		images[i]->Save((outputFile + fileName).c_str(), ImageFormat::GetPng());
	}
	document->Close();
	delete document;
}
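
The traversal in the example above is a breadth-first walk over the document tree: objects wait in a deque, each one is taken from the front, pictures are collected, and any child that can hold further objects is queued for a later visit. The sketch below reproduces that pattern without Spire.Doc so that it can be compiled on its own. Node, BuildSampleTree, and CollectPictures are hypothetical stand-ins rather than library types, and the sample tree is made up for illustration.

```cpp
#include <deque>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document object; not a Spire.Doc type.
struct Node
{
    Node(std::string n, bool picture) : name(std::move(n)), isPicture(picture) {}

    // Append a child and return a pointer to it so calls can be chained.
    Node* Add(std::string n, bool picture)
    {
        children.push_back(std::unique_ptr<Node>(new Node(std::move(n), picture)));
        return children.back().get();
    }

    std::string name;
    bool isPicture;
    std::vector<std::unique_ptr<Node>> children;
};

// Breadth-first walk: returns the names of all picture nodes in visit order.
std::vector<std::string> CollectPictures(const Node& root)
{
    std::vector<std::string> found;
    std::deque<const Node*> pending;
    pending.push_back(&root);

    while (!pending.empty())
    {
        const Node* node = pending.front();
        pending.pop_front();

        for (const auto& child : node->children)
        {
            if (child->isPicture)
            {
                found.push_back(child->name);
            }
            else if (!child->children.empty())
            {
                // Stand-in for the "is this a composite object?" check.
                pending.push_back(child.get());
            }
        }
    }
    return found;
}

// Made-up sample: a section with two paragraphs and a table between them.
Node BuildSampleTree()
{
    Node document("document", false);
    Node* section = document.Add("section", false);

    Node* p1 = section->Add("p1", false);
    p1->Add("text", false);
    p1->Add("image-a", true);

    Node* cell = section->Add("table", false)->Add("row", false)->Add("cell", false);
    cell->Add("p2", false)->Add("image-b", true);

    section->Add("p3", false)->Add("image-c", true);
    return document;
}
```

For this sample tree, CollectPictures(BuildSampleTree()) yields image-a, image-c, image-b: the picture nested inside the table is reached last because it sits deeper in the tree. If the same effect occurs with a real document, the saved files may not be numbered in reading order. A depth-first walk would be one way to keep document order, if that matters for your use case.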

Apply for a Temporary License

If you'd like to remove the evaluation message from the generated documents or lift the functional limitations, please request a 30-day trial license.