Spire.PDF is a professional PDF library applied to creating, writing, editing, handling and reading PDF files without any external dependencies. Get free and professional technical support for Spire.PDF for .NET, Java, Android, C++, Python.

Thu Jun 29, 2023 4:11 pm

Hello,

I am trying to extract text from a PDF file with multiple pages, and have them stored in individual .txt files for each page. My current technique is to split the PDF into multiple PDFs first, then loop through each PDF file, converting it to a .txt. However, when running the code I am getting a NullReferenceException error. The relevant code is pasted below...

Code:
wstring inputFile = finalFileName;
wstring pattern = L"SplitDocument-{0}.pdf";
   
PdfDocument* doc = new PdfDocument();

doc->LoadFromFile(finalFileName);

doc->Split(pattern.c_str());

doc->Close();
delete doc;
   
for (int a = 0; a < numOfSheets; a++)
{
   PdfDocument* newDoc = new PdfDocument();

   string individualFile = "SplitDocument-" + to_string(a) + ".pdf";

   wstring temp2 = wstring(individualFile.begin(), individualFile.end());
   LPCWSTR finalIndivFile = temp2.c_str();
      
   newDoc->LoadFromFile(finalIndivFile);

   wstring buffer = L"";

   buffer = newDoc->GetPages()->GetItem(0)->ExtractText();

   string textName = "ExtractText-" + to_string(a) + ".txt";
   wstring temp3 = wstring(textName.begin(), textName.end());
   LPCWSTR finalTextName = temp3.c_str();

   wofstream write(finalTextName);
   auto LocUtf8 = locale(locale(""), new std::codecvt_utf8<wchar_t>);
   write.imbue(LocUtf8);
   write << buffer;
   write.close();
   newDoc->Close();
   delete newDoc;
}


I more or less combined the sample code given on this website for the two tasks and applied it as necessary. Each task works fine on its own (splitting the PDF works, and extracting text works), but extracting text from a file after it has been split causes the error. Unfortunately I am unable to post the PDFs. The original file has the correct encoding, but when I open the files created by "doc->Split(pattern.c_str())", Adobe says it is unable to display some characters, which leads me to believe that the combination of the two tasks is causing the issue.

Perhaps it's just something simple that I'm not seeing or don't understand, but I've been stuck on this for a few days now and would really appreciate any help!

Note: "finalFileName" is just the LPCWSTR version of a string I am reading in from the console
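
The loop above builds each file name as a narrow string and then copies it character by character into a wstring, which is only safe while every character is ASCII. A minimal sketch of building the name as a wide string directly is shown here. The helper formatIndexedName is hypothetical and not part of Spire.PDF, and it assumes that Split replaces "{0}" with a zero-based page index, as the loop above also assumes.

```cpp
#include <string>

// Hypothetical helper (not a Spire.PDF API): replaces the first "{0}" in
// `pattern` with `index`. If there is no placeholder, the pattern is
// returned unchanged.
std::wstring formatIndexedName(const std::wstring& pattern, int index)
{
    std::wstring name = pattern;
    const std::wstring placeholder = L"{0}";
    const std::size_t pos = name.find(placeholder);
    if (pos != std::wstring::npos)
    {
        name.replace(pos, placeholder.size(), std::to_wstring(index));
    }
    return name;
}
```

Inside the loop, formatIndexedName(pattern, a).c_str() could then be passed wherever an LPCWSTR is expected, provided the returned wstring is stored in a variable that outlives the call.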

Thanks,
Tristan

tristan.c
 
Posts: 2
Joined: Thu Jun 29, 2023 3:50 pm

Fri Jun 30, 2023 6:56 am

Hello,

Thanks for your inquiry.
I tested your sample code and reproduced the issue you mentioned. After investigation, we found that the issue stems from the generated split documents. I have logged it in our bug tracking system with the ticket number SPIREPDF-6110. Our development team will investigate and fix it, and I will inform you once it is resolved. Sorry for the inconvenience caused.
In addition, to store the text of each page in an individual .txt file, you can also use the following sample code to extract the text directly and save it to a .txt file. I have verified that this approach works.
Code:
wstring input_path = L"F:\\";
wstring inputFile = input_path + L"Sample.pdf";
wstring output_path = L"F:\\output\\";

intrusive_ptr<PdfDocument> doc = new PdfDocument();
doc->LoadFromFile(inputFile.c_str());

int pageCount = doc->GetPages()->GetCount();

for (int i = 0; i < pageCount; i++)
{
   intrusive_ptr<PdfPageBase> page = doc->GetPages()->GetItem(i);
   wstring outputFile = output_path + L"Page_" + to_wstring(i) + L".txt";

   // Extract text from page and write to file
   wstring text = page->ExtractText(true);

   wofstream os(outputFile, ios::trunc);
   os << text;
   os.close();
}

doc->Close();
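
The sample above writes through a wofstream without imbuing a UTF-8 locale, so depending on the platform the output may stop at the first non-ASCII character. The following is a minimal sketch of one alternative, using only the standard library: encode the wide string as UTF-8 by hand and write the bytes through a narrow stream. The names toUtf8 and writeUtf8File are hypothetical helpers, not Spire.PDF APIs, and the path is taken as a narrow string for portability (an assumption; the sample above uses wide paths).

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8. Handles 16-bit wchar_t
// (UTF-16, as on Windows) and 32-bit wchar_t (UTF-32, as on most Linux
// toolchains).
std::string toUtf8(const std::wstring& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(text[i]);
        // With 16-bit wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size())
        {
            const std::uint32_t low = static_cast<std::uint32_t>(text[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: writes `text` to `path` as UTF-8 bytes.
// Returns false if the file cannot be opened or the write fails.
bool writeUtf8File(const std::string& path, const std::wstring& text)
{
    std::ofstream os(path, std::ios::binary | std::ios::trunc);
    if (!os)
    {
        return false;
    }
    const std::string bytes = toUtf8(text);
    os.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(os);
}
```

In the loop above, the three wofstream lines could then be replaced by a single call such as writeUtf8File(path, text), with path built as a narrow string; checking the return value makes a failed write visible instead of silently producing an empty file.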


Sincerely,
Wenly
E-iceblue support team

Wenly.Zhang
 
Posts: 149
Joined: Tue May 16, 2023 2:19 am

Thu Jul 06, 2023 8:24 pm

This is exactly what I needed, thank you so much!

Thanks,
Tristan


Fri Jul 07, 2023 1:34 am

Hello,

You're welcome.
Glad to hear that your issue has been resolved. If you encounter other issues related to our products in the future, please feel free to contact us.
Have a nice day!

Sincerely,
Wenly
E-iceblue support team


Fri Oct 24, 2025 4:18 am

Hello,

We are pleased to announce that the latest version of Spire.PDF.Cpp, 11.10.0, fixes issue SPIREPDF-6110. You are welcome to test it, and we look forward to your feedback.
Website:
https://www.e-iceblue.com/Download/pdf-for-cpp.html
NuGet:
https://www.nuget.org/packages/Spire.PDF.Cpp/11.10.0
https://www.nuget.org/packages/Spire.PDF.Cpp.Linux/11.10.0
Sincerely,
Talia
E-iceblue support team

talia.liu
 
Posts: 331
Joined: Mon Apr 14, 2025 3:33 am
