I am trying to extract text from a multi-page PDF file and store each page's text in its own .txt file. My current approach is to split the PDF into single-page PDFs first, then loop through each of those files and convert it to a .txt. However, when I run the code I get a NullReferenceException. The relevant code is pasted below:
wstring inputFile = finalFileName;
wstring pattern = L"SplitDocument-{0}.pdf";
PdfDocument* doc = new PdfDocument();
doc->LoadFromFile(finalFileName);
doc->Split(pattern.c_str());
doc->Close();
delete doc;
for (int a = 0; a < numOfSheets; a++)
{
    // Load the single-page PDF produced by the split step
    PdfDocument* newDoc = new PdfDocument();
    string individualFile = "SplitDocument-" + to_string(a) + ".pdf";
    wstring temp2 = wstring(individualFile.begin(), individualFile.end());
    LPCWSTR finalIndivFile = temp2.c_str();
    newDoc->LoadFromFile(finalIndivFile);

    // Extract the text of the only page in this file
    wstring buffer = L"";
    buffer = newDoc->GetPages()->GetItem(0)->ExtractText();

    // Write the text to a .txt file (this extension was ".pdf" by mistake)
    string textName = "ExtractText-" + to_string(a) + ".txt";
    wstring temp3 = wstring(textName.begin(), textName.end());
    LPCWSTR finalTextName = temp3.c_str();
    wofstream write(finalTextName);
    auto LocUtf8 = locale(locale(""), new std::codecvt_utf8<wchar_t>);
    write.imbue(LocUtf8);
    write << buffer;
    write.close();

    newDoc->Close();
    delete newDoc;
}
I more or less combined the sample code given on this website for the two tasks and adapted it as necessary. Used individually, everything works fine (splitting the PDF works on its own, and extracting text works on its own), but extracting text after the file has been split causes the error. Unfortunately I am unable to post the PDFs. The original file has the correct encoding, but when I open the files created by "doc->Split(pattern.c_str())", Adobe says it is unable to display some characters, which leads me to believe that the combination of the two tasks is causing the issue.
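To rule out the file-naming and output side of the loop, that part can be exercised on its own, without the PDF library. Below is a minimal sketch of how I would isolate it. The helper names makePageFileName, toUtf8 and writeUtf8Text are hypothetical and not part of any library, and the sketch encodes UTF-8 by hand rather than using std::codecvt_utf8 (which is deprecated since C++17), so it is not the same code path as my loop above.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: builds a per-page file name such as "ExtractText-0.txt".
std::string makePageFileName(const std::string& prefix, int index,
                             const std::string& extension)
{
    return prefix + "-" + std::to_string(index) + extension;
}

// Hypothetical helper: converts a wide string to UTF-8 bytes.
// Handles both 32-bit wchar_t and 16-bit wchar_t with UTF-16 surrogate pairs.
std::string toUtf8(const std::wstring& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(text[i]);

        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()) {
            std::uint32_t low = static_cast<std::uint32_t>(text[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: writes the text as UTF-8 and reports whether it succeeded.
bool writeUtf8Text(const std::string& path, const std::wstring& text)
{
    std::ofstream out(path, std::ios::binary);
    if (!out.is_open()) {
        return false;
    }
    out << toUtf8(text);
    return out.good();
}
```

Inside the loop this would be called as writeUtf8Text(makePageFileName("ExtractText", a, ".txt"), buffer). If that call succeeds with a hard-coded test string in place of buffer, the problem is more likely in the split or extraction step than in the file writing.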
Perhaps it's just something simple that I'm not seeing or don't understand, but I've been stuck on this for a few days now and would really appreciate any help!
Note: "finalFileName" is just the LPCWSTR version of a string I read in from the console.
Thanks,
Tristan