Spire.PDF is a professional PDF library applied to creating, writing, editing, handling and reading PDF files without any external dependencies. Get free and professional technical support for Spire.PDF for .NET, Java, Android, C++, Python.

Wed Nov 17, 2021 11:22 am

Hi,

I'm working on a project where I have to extract specific text from a PDF so that I can send this information to an Excel file.
At first I tried to convert my PDF into a .txt file, thinking that a .txt file would be easier to convert into JSON.
But the result is not at all what I need (a dictionary-style JSON format); instead I get one giant, messy string.
The pdf sample looks like this:

Analysis
Some text

Reference Date 11/17/2021
Reference Price USD 745
Client id 4572845

I'd like to have something like this at the end:

{Analysis:Some text, Reference Date:11/17/2021, Reference Price:USD 745, Client id:4572845 }

Currently, all the information in the result is mixed together.

Here is my code:

First, I created a "Global" class with the method "Extract_RowInfo_TS", which loads the first page of the document (called a TS, or termsheet), extracts the text from the PDF, and stores it in a text file called "result.txt":

Code: Select all
class Global
{
   public static void Extract_RowInfo_TS(string doc_Type, string docPath, int? nbrPage = null)
   {
      switch (doc_Type)
      {
         case "Pdf":
            Spire.Pdf.PdfDocument doc = new Spire.Pdf.PdfDocument();
            doc.LoadFromFile(docPath);
            StringBuilder buffer = new StringBuilder();
            
            //Extract text from the first page only
            Spire.Pdf.PdfPageBase pagefirst = doc.Pages[0];
            buffer.Append(pagefirst.ExtractText());
               
            doc.Close();
            
            //save text
            String fileName = @"my_disk:\my_path\result.txt";
            File.WriteAllText(fileName, buffer.ToString());
            
            //Load File
            System.Diagnostics.Process.Start(fileName);
            break;

         case "Excel":
            Spire.Xls.Workbook Wb = new Spire.Xls.Workbook();
            break;
         case "Word":
            Spire.Doc.Document doc_word = new Spire.Doc.Document();
            break;
      }
   }
}


Back in my main form, I call the method "Extract_RowInfo_TS" from the Global class above, and once it has created "result.txt" from the PDF content, I try to convert "result.txt" into JSON format:

Code: Select all

public partial class Form1 : Form
{
   public Form1()
   {
      InitializeComponent();
   }

   private void btn_Extract_PDF_Click(object sender, EventArgs e)
   {
      Global.Extract_RowInfo_TS("Pdf", @"my_disk:\my_path\my_doc.pdf");
      Convert_To_Json_Format(@"my_disk:\my_path\result.txt");
   }

   private void Convert_To_Json_Format(string baseTextFile)
   {
      
      // Read the whole file; File.ReadAllText also closes the file handle
      string streamText = File.ReadAllText(baseTextFile);
      
      // Serialize Json Data.
      string serializeData = Serialize_into_Json(streamText);
      string newFile = @"my_disk:\my_path\NEW_text_file_2.txt";
      File.WriteAllText(newFile, serializeData);
      System.Diagnostics.Process.Start(newFile);

   }
   private static string Serialize_into_Json(string json)
   {
      string jsonData = JsonConvert.SerializeObject(json);
      return jsonData;
   }
}



I'm stuck here, trying to create a proper JSON file that I can use to send the data to my Excel file. Any help would be much appreciated!

Thanks in advance for helping and have a great day.

Pgiugla06
 
Posts: 4
Joined: Wed Nov 10, 2021 10:46 am

Thu Nov 18, 2021 7:21 am

Hello,

Thanks for your inquiry.
Sorry, our product does not support converting PDF content to JSON directly, but Spire.PDF does support converting a PDF to an Excel file. I suggest using the following code to convert your PDF directly to an Excel file.
Code: Select all
PdfDocument doc = new PdfDocument();
doc.LoadFromFile("input.pdf");
doc.SaveToFile("output.xlsx", FileFormat.XLSX);

In addition, since JSON handling is outside our area of expertise, the conversion from the extracted string to JSON data needs to be implemented on your side. If you have other questions about our products, please feel free to contact us.
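
As a rough starting point for that step, here is a minimal sketch. It is not a Spire.PDF feature. Passing the raw extracted string to JsonConvert.SerializeObject only produces a single quoted JSON string, so the text has to be split into key/value pairs first. The sketch assumes Json.NET (Newtonsoft.Json, already used in the code above) and assumes the extracted text has one label/value pair per line, as in the sample; real ExtractText output may contain extra whitespace or lines and need more cleanup. The names TermsheetParser, KnownLabels, Parse and ToJson are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

public static class TermsheetParser
{
    // Hypothetical list of the labels expected on the page; adjust to the real termsheet.
    private static readonly string[] KnownLabels =
        { "Reference Date", "Reference Price", "Client id" };

    // Turns the extracted page text into label -> value pairs.
    public static Dictionary<string, string> Parse(string extractedText)
    {
        var fields = new Dictionary<string, string>();
        string pendingHeading = null;

        string[] lines = extractedText.Split(
            new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0) continue;

            // Does the line start with one of the known labels?
            string label = Array.Find(
                KnownLabels,
                l => line.StartsWith(l, StringComparison.OrdinalIgnoreCase));

            if (label != null)
            {
                // "Reference Date 11/17/2021" -> key "Reference Date", value "11/17/2021"
                fields[label] = line.Substring(label.Length).Trim();
                pendingHeading = null;
            }
            else if (pendingHeading == null)
            {
                // Assumption: an unlabeled line is a heading such as "Analysis"
                pendingHeading = line;
            }
            else
            {
                // Assumption: the line after a heading is its value, e.g. "Some text"
                fields[pendingHeading] = line;
                pendingHeading = null;
            }
        }
        return fields;
    }

    // Serializes the dictionary, not the raw string, so the output is a JSON object.
    public static string ToJson(string extractedText)
    {
        return JsonConvert.SerializeObject(Parse(extractedText), Formatting.Indented);
    }
}
```

In the code above, Serialize_into_Json could then return TermsheetParser.ToJson(json) instead of serializing the raw string.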

Sincerely,
Annika
E-iceblue support team
Annika.Zhou
 
Posts: 1651
Joined: Wed Apr 07, 2021 2:50 am

Thu Nov 18, 2021 1:01 pm

Hi Annika,

Thanks for answering. I actually do not have the file format "XLSX" available: I'm using the Spire.Pdf.FileFormat enum from the Free v4.3.1 Spire.Office package. I wanted to test the free version of Spire.Office to help me decide whether to move forward with the paid version.

Best regards,

Pgiugla06
 
Posts: 4
Joined: Wed Nov 10, 2021 10:46 am

Fri Nov 19, 2021 2:00 am

Hello,

Thank you for your feedback.
Yes, Spire.PDF in the free version of Spire.Office does not contain "FileFormat.XLSX". Since we only maintain the free version irregularly, I suggest that you use our commercial version (the latest is Spire.Office Platinum (Hotfix) Version 6.10.3), which contains more fixes and new features than the free one. We are willing to provide you with a one-month temporary license so that you can evaluate it without any watermarks or restrictions; please apply for a temporary license at this link: https://www.e-iceblue.com/TemLicense.html. If you have any questions, please feel free to write back.

Sincerely,
Annika
E-iceblue support team
Annika.Zhou
 
Posts: 1651
Joined: Wed Apr 07, 2021 2:50 am
