Reading text from a PDF
In C#, a few lines of code are enough to build a simple PDF text reader.
To do so, add the NuGet package iTextSharp to the project.
A small Windows program using iTextSharp in C# and WPF
C#, WPF: PDF text reader with iTextSharp
In this example, the PDF document shown on the right was read, and its
extracted text was passed to the C# WPF application.
MainWindow
XAML code, MainWindow.xaml
<Window x:Class="PDF_TextReader.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:local="clr-namespace:PDF_TextReader"
        mc:Ignorable="d"
        Title="MainWindow" Height="700" Width="800">
    <Grid >
        <Button x:Name="btnStart" Content="Read PDF" Click="btnStart_Click" HorizontalAlignment="Left" Margin="15,9,0,0" VerticalAlignment="Top" Width="86" Height="33"/>
        <TextBox x:Name="tbxFilename" Text="C:\_Daten\Desktop\VS_Projects\Office\PDF_TextReader\_Test_PDF\test_pdf_import.pdf"
                 Width="631" Height="27" Margin="115,12,0,0" TextWrapping="Wrap" VerticalAlignment="Top" HorizontalAlignment="Left" />
        <ScrollViewer Height="584" Margin="16,71,25.6,0" VerticalAlignment="Top" >
            <TextBlock x:Name="lblPDF_Output" Text=""
                       TextWrapping="Wrap" HorizontalAlignment="Stretch" VerticalAlignment="Stretch"
                       />
        </ScrollViewer>
    </Grid>
</Window>
C# code-behind of the window
PdfReader(filename) binds the iTextSharp reader to a PDF document.
PdfReader pdf_Reader = new PdfReader(sFilename);
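Opening the file can fail, for example when the path in the text box is wrong. The following is a minimal sketch and not part of the original program: the class PdfOpenHelper and the method TryGetPageCount are hypothetical names chosen for illustration. It assumes iTextSharp 5.x, where PdfReader offers NumberOfPages and Close().

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf; // iTextSharp 5.x (assumed)

// Hypothetical helper for illustration; not part of the original program.
public static class PdfOpenHelper
{
    // Returns the page count, or -1 if the file is missing or an
    // I/O error occurs while reading it. Other exceptions propagate.
    public static int TryGetPageCount(string sFilename)
    {
        if (!File.Exists(sFilename))
        {
            return -1; // wrong path: do not even try to open it
        }

        PdfReader pdf_Reader = null;
        try
        {
            pdf_Reader = new PdfReader(sFilename);
            return pdf_Reader.NumberOfPages;
        }
        catch (IOException)
        {
            return -1; // I/O error while opening or parsing the file
        }
        finally
        {
            if (pdf_Reader != null)
            {
                pdf_Reader.Close(); // release the file again
            }
        }
    }
}
```

A click handler could call PdfOpenHelper.TryGetPageCount(tbxFilename.Text) first and show a message instead of starting the extraction when the result is -1.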
The C# call PdfTextExtractor.GetTextFromPage reads the complete text of one
PDF page as a string, including line breaks.
Non-text content such as images, scans,
and empty tables is skipped.
sText = PdfTextExtractor.GetTextFromPage(pdf_Reader, 1);
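GetTextFromPage also has an overload that takes an extraction strategy. The sketch below assumes iTextSharp 5.x and is not part of the original program; PageTextHelper and ReadPage are hypothetical names. SimpleTextExtractionStrategy emits text in the order in which it appears in the page's content stream, while LocationTextExtractionStrategy sorts the text chunks by their position on the page, which may help when the reading order comes out jumbled.

```csharp
using System;
using iTextSharp.text.pdf;        // PdfReader
using iTextSharp.text.pdf.parser; // PdfTextExtractor and the strategies

// Hypothetical helper for illustration; not part of the original program.
public static class PageTextHelper
{
    // Extracts the text of one page (page numbers start at 1).
    public static string ReadPage(PdfReader pdf_Reader, int iPage, bool sortByPosition)
    {
        if (iPage < 1 || iPage > pdf_Reader.NumberOfPages)
        {
            throw new ArgumentOutOfRangeException("iPage", "Page numbers start at 1.");
        }

        // Use a fresh strategy object for each page.
        ITextExtractionStrategy strategy = sortByPosition
            ? (ITextExtractionStrategy)new LocationTextExtractionStrategy()
            : new SimpleTextExtractionStrategy();

        return PdfTextExtractor.GetTextFromPage(pdf_Reader, iPage, strategy);
    }
}
```

Which strategy gives the more readable result depends on how the PDF was produced, so it is worth trying both on a sample document.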
Code in C#, .NET Framework 4.7
In MainWindow.xaml.cs
using System;
using System.Text;
using System.Windows;
using iTextSharp.text.pdf;         // iTextSharp: PdfReader
using iTextSharp.text.pdf.parser;  // iTextSharp: PdfTextExtractor

namespace PDF_TextReader
{
    /// <summary>
    /// Demo PDF text reader
    /// </summary>
    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
        }

        private void btnStart_Click(object sender, RoutedEventArgs e)
        {
            // Path of the PDF file, taken from the text box
            string sFilename = tbxFilename.Text;

            // Open the PDF and collect the text of every page (pages are 1-based)
            PdfReader pdf_Reader = new PdfReader(sFilename);
            StringBuilder sbText = new StringBuilder();
            try
            {
                for (int i = 1; i <= pdf_Reader.NumberOfPages; i++)
                {
                    // AppendLine keeps the text of consecutive pages apart
                    sbText.AppendLine(PdfTextExtractor.GetTextFromPage(pdf_Reader, i));
                }
            }
            finally
            {
                pdf_Reader.Close(); // release the file again
            }

            // Show the extracted text in the scrollable text block
            lblPDF_Output.Text = sbText.ToString();
        }
    }
}
NuGet package: iTextSharp
The iTextSharp package has to be added to the WPF project via NuGet.
iTextSharp is free to use under the AGPL license, for example in private
projects. Software that is sold or distributed without publishing its source
code requires a commercial license.
Description of iTextSharp from the NuGet package:
iText is a PDF library that allows
you to CREATE, ADAPT, INSPECT and MAINTAIN documents in the Portable Document
Format (PDF), allowing you to add PDF functionality to your software projects
with ease. We even have documentation
to help you get coding.
We have two currently supported
versions: iText 5 and iText 7. Both are available under AGPL and Commercial
license.
* iText 5 AGPL
* iText 7 community:
https://www.nuget.org/packages/itext7/
iText 5 is a one solution library
that is complex, but well documented to help you create your solutions.
iText 7 is a complete re-write of
iText 5, allowing you to choose your adventure with add-ons, all based on a
simple, modular code structure that is easy to use and well documented.
Both versions allow you to:
- Generate documents and reports
based on data from an XML file or a database
- Create maps and books, exploiting
numerous interactive features available in PDF
- Add bookmarks, page numbers,
watermarks, and other features to existing PDF documents
- Split or concatenate pages from
existing PDF files
- Fill out interactive forms
- Serve dynamically generated or
manipulated PDF documents to a web browser
iText 7 includes pdfDebug, the
first debugging tool that gives you a clear overview of your content streams
and document structure as well as pdfCalligraph, allowing you to leverage
advanced typography.
iText is available for Java, .NET
in both versions, and Android and GAE for iText 5 only.
iTextSharp is the .NET port of
iText 5.
Several iText engineers are
actively supporting the project on StackOverflow:
http://stackoverflow.com/questions/tagged/itext