Read PDF Images and Text in C#, VB.NET

This section introduces a solution for reading PDF files in C# and VB.NET with a .NET PDF component. With this solution, you can complete a PDF reading task in only three steps. You can also save the extracted images in many commonly used formats, such as jpg, jpeg, png, bmp, tiff and gif. Text written from right to left, such as Hebrew or Arabic, can also be read from the PDF.

Spire.PDF for .NET, a professional .NET PDF component for reading, editing and manipulating PDF files, enables you to read a PDF file quickly. First, call the PdfDocument.LoadFromFile(string filename) method to load your PDF file from the file system. Then, call the ExtractText and ExtractImages methods on each page to extract the PDF text and images. Finally, use System.IO.File.WriteAllText(string path, string contents) and Image.Save(string filename, ImageFormat format) to save the extracted text and images respectively. Please download Spire.PDF for .NET and view the picture below:

Read PDF Text and Images

Detailed Code:

[C#]
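using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Text;
using Spire.Pdf;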

namespace extract_pdf
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a pdf document.
            PdfDocument doc = new PdfDocument();
            doc.LoadFromFile(@"..\Sample_image.pdf");

            StringBuilder buffer = new StringBuilder();
            IList<Image> images = new List<Image>();

            foreach (PdfPageBase page in doc.Pages)
            {
                buffer.Append(page.ExtractText());
                foreach (Image image in page.ExtractImages())
                {
                    images.Add(image);
                }
            }

            doc.Close();
            //save text
            String fileName = "TextInPdf.txt";
            File.WriteAllText(fileName, buffer.ToString());
            //save image
            int index = 0;
            foreach (Image image in images)
            {
                String imageFileName
                    = String.Format("Image-{0}.png", index++);
                image.Save(imageFileName, ImageFormat.Png);
            }
            //Launching the Text file.
            System.Diagnostics.Process.Start(fileName);
        }

    }
}
[VB.NET]
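Imports System.Collections.Generic
Imports System.Drawing
Imports System.Drawing.Imaging
Imports System.IO
Imports System.Text
Imports Spire.Pdf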

Namespace extract_pdf
    Class Program
        Private Shared Sub Main(ByVal args() As String)
            'Create a pdf document.
            Dim doc As PdfDocument = New PdfDocument
            doc.LoadFromFile("..\Sample_image.pdf")
            Dim buffer As StringBuilder = New StringBuilder
            Dim images As IList(Of Image) = New List(Of Image)
            For Each page As PdfPageBase In doc.Pages
                buffer.Append(page.ExtractText)
                For Each image As Image In page.ExtractImages
                    images.Add(image)
                Next
            Next
            doc.Close
            'save text
            Dim fileName As String = "TextInPdf.txt"
            File.WriteAllText(fileName, buffer.ToString)
            'save image
            Dim index As Integer = 0
            For Each image As Image In images
                Dim imageFileName As String = String.Format("Image-{0}.png", index)
                index += 1
                image.Save(imageFileName, ImageFormat.Png)
            Next
            'Launching the Text file.
            System.Diagnostics.Process.Start(fileName)
        End Sub
    End Class
End Namespace
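
The sample code saves every extracted image as PNG. Since the introduction mentions other formats such as jpg, bmp, tiff and gif, here is a minimal sketch of how the target format might be chosen from a file extension. The ImageSaver class and its FormatFromExtension and SaveAll methods are hypothetical names used only for illustration and are not part of Spire.PDF; the sketch relies only on System.Drawing's Image.Save and ImageFormat.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

// Hypothetical helper, not part of Spire.PDF.
static class ImageSaver
{
    // Maps a file name's extension to a System.Drawing ImageFormat.
    // Unknown or missing extensions fall back to PNG (an assumption of this sketch).
    public static ImageFormat FormatFromExtension(string fileName)
    {
        switch (Path.GetExtension(fileName).ToLowerInvariant())
        {
            case ".jpg":
            case ".jpeg":
                return ImageFormat.Jpeg;
            case ".bmp":
                return ImageFormat.Bmp;
            case ".tif":
            case ".tiff":
                return ImageFormat.Tiff;
            case ".gif":
                return ImageFormat.Gif;
            default:
                return ImageFormat.Png;
        }
    }

    // Saves each image as Image-0<ext>, Image-1<ext>, ... in the current directory.
    // The extension is expected to include the leading dot, for example ".jpg".
    public static void SaveAll(IEnumerable<Image> images, string extension)
    {
        int index = 0;
        foreach (Image image in images)
        {
            string imageFileName = String.Format("Image-{0}{1}", index++, extension);
            image.Save(imageFileName, FormatFromExtension(imageFileName));
        }
    }
}
```

With the images list collected in the sample above, a call such as ImageSaver.SaveAll(images, ".jpg") would write Image-0.jpg, Image-1.jpg and so on in place of the PNG loop.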

Spire.PDF is a .NET PDF component that enables users to perform a wide range of PDF document processing tasks directly, such as generating, reading, writing and modifying PDF documents in WPF, .NET and Silverlight.