This section introduces a solution for reading PDF files with a .NET PDF component in C# and VB.NET. With this solution, you can complete a PDF reading task in only three steps. You can also save the extracted images in many commonly used formats, such as JPG, JPEG, PNG, BMP, TIFF and GIF. Special text, such as text written from right to left (for example, Hebrew or Arabic), can also be read from a PDF.
Spire.PDF for .NET, a professional .NET PDF component for reading, editing and manipulating PDF files, enables you to read a PDF file quickly. First, call the PdfDocument.LoadFromFile(string filename) method to load your PDF file from disk. Then, call the ExtractText and ExtractImages methods on each page to extract the PDF text and images. Finally, use System.IO.File.WriteAllText(string path, string contents) and Image.Save(string filename, ImageFormat format) to save the extracted text and images respectively. Please download Spire.PDF for .NET and view the picture below:
Detailed code:
// C#
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Text;
using Spire.Pdf;

namespace extract_pdf
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a pdf document.
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile(@"..\Sample_image.pdf");

            //Collect the text and images of every page.
            StringBuilder buffer = new StringBuilder();
            IList<Image> images = new List<Image>();
            foreach (PdfPageBase page in doc.Pages)
            {
                buffer.Append(page.ExtractText());
                foreach (Image image in page.ExtractImages())
                {
                    images.Add(image);
                }
            }
            doc.Close();

            //save text
            String fileName = "TextInPdf.txt";
            File.WriteAllText(fileName, buffer.ToString());

            //save image
            int index = 0;
            foreach (Image image in images)
            {
                String imageFileName = String.Format("Image-{0}.png", index++);
                image.Save(imageFileName, ImageFormat.Png);
            }

            //Launching the Text file.
            System.Diagnostics.Process.Start(fileName);
        }
    }
}
' VB.NET
Imports System.Collections.Generic
Imports System.Drawing
Imports System.Drawing.Imaging
Imports System.IO
Imports System.Text
Imports Spire.Pdf

Namespace extract_pdf
    Class Program
        Private Shared Sub Main(ByVal args() As String)
            'Create a pdf document.
            Dim doc As PdfDocument = New PdfDocument()
            doc.LoadFromFile("..\Sample_image.pdf")

            'Collect the text and images of every page.
            Dim buffer As StringBuilder = New StringBuilder()
            Dim images As IList(Of Image) = New List(Of Image)()
            For Each page As PdfPageBase In doc.Pages
                buffer.Append(page.ExtractText())
                For Each image As Image In page.ExtractImages()
                    images.Add(image)
                Next
            Next
            doc.Close()

            'save text
            Dim fileName As String = "TextInPdf.txt"
            File.WriteAllText(fileName, buffer.ToString())

            'save image
            Dim index As Integer = 0
            For Each image As Image In images
                Dim imageFileName As String = String.Format("Image-{0}.png", index)
                'VB.NET has no ++ operator, so increment explicitly.
                index += 1
                image.Save(imageFileName, ImageFormat.Png)
            Next

            'Launching the Text file.
            System.Diagnostics.Process.Start(fileName)
        End Sub
    End Class
End Namespace
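The code above saves every extracted image as PNG. To save in one of the other formats mentioned at the start of this section, you could pick the ImageFormat from the file extension. The following is only a minimal sketch: ImageFormatHelper, FromExtension and SaveAll are hypothetical names introduced here for illustration and are not part of Spire.PDF. It relies only on System.Drawing and assumes the images were already collected as in the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;

// Hypothetical helper (not part of Spire.PDF): maps a file extension to a
// System.Drawing ImageFormat so extracted images can be saved in that format.
public static class ImageFormatHelper
{
    private static readonly Dictionary<string, ImageFormat> Formats =
        new Dictionary<string, ImageFormat>(StringComparer.OrdinalIgnoreCase)
        {
            { ".jpg",  ImageFormat.Jpeg },
            { ".jpeg", ImageFormat.Jpeg },
            { ".png",  ImageFormat.Png },
            { ".bmp",  ImageFormat.Bmp },
            { ".tiff", ImageFormat.Tiff },
            { ".gif",  ImageFormat.Gif }
        };

    // Returns the ImageFormat for an extension such as ".jpg" (case-insensitive).
    public static ImageFormat FromExtension(string extension)
    {
        ImageFormat format;
        if (!Formats.TryGetValue(extension, out format))
        {
            throw new ArgumentException("Unsupported image extension: " + extension);
        }
        return format;
    }

    // Saves each image as Image-0<ext>, Image-1<ext>, ... in the current directory.
    public static void SaveAll(IEnumerable<Image> images, string extension)
    {
        ImageFormat format = FromExtension(extension);
        int index = 0;
        foreach (Image image in images)
        {
            string fileName = String.Format("Image-{0}{1}", index++, extension);
            image.Save(fileName, format);
        }
    }
}
```

For example, replacing the "save image" loop with ImageFormatHelper.SaveAll(images, ".jpg") would write Image-0.jpg, Image-1.jpg and so on. Keeping the mapping in one table means an unsupported extension fails early with an exception instead of silently producing a file in the wrong format.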
Spire.PDF is a .NET PDF component that enables users to perform a wide range of PDF document processing tasks directly, such as generating, reading, writing and modifying PDF documents in WPF, .NET and Silverlight.