I've used pdftohtml to successfully strip tables out of PDFs into CSV. It's based on Xpdf, a more general-purpose toolkit that includes pdftotext. I just wrap it as a Process.Start call from C#.
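For what it's worth, a Process.Start wrapper around pdftotext could look roughly like the sketch below. The class and method names are made up for illustration, and it assumes pdftotext is installed and that you pass either its full path or just "pdftotext" if it is on the PATH. The flags used are pdftotext's -layout (keep the physical layout), -enc UTF-8 (output encoding) and "-" as the output file, which sends the text to standard output.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical helper, not part of Xpdf: shells out to pdftotext and captures its output.
public static class PdfToTextRunner
{
    // Builds the command line: UTF-8 output, layout preserved, text written to stdout ("-").
    public static string BuildArguments(string pdfPath)
    {
        return "-enc UTF-8 -layout \"" + pdfPath + "\" -";
    }

    // exePath is an assumption: the full path to pdftotext, or "pdftotext" if it is on the PATH.
    public static string ExtractText(string exePath, string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(pdfPath),
            UseShellExecute = false,          // required for output redirection
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read everything before waiting so a full pipe cannot block the child process.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            }
            return text;
        }
    }
}
```

You would call it as PdfToTextRunner.ExtractText("pdftotext", @"C:\test\report.pdf") and then parse the returned string however your data requires.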
If you're looking for something a little more DIY, there's the iTextSharp library - a port of Java's iText - and PDFBox (yes, it says Java - but they have a .NET version by way of IKVM.NET). Here are some CodeProject articles on using iTextSharp and PDFBox from C#.
And, if you're really a masochist, you could call into Adobe's PDF IFilter with COM interop. The IFilter spec is pretty simple, but I would guess that the interop overhead would be significant.
Edit: After re-reading the question and subsequent answers, it's become clear that the OP is dealing with images in his PDF. In that case, you'll need to extract the images (the PDF libraries above are able to do that fairly easily) and run them through an OCR engine.
I've used MODI interactively before, with decent results. It's COM, so calling it from C# via interop is also doable and pretty simple (the sample here is VB.NET):
' lifted from http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging
Dim inputFile As String = "C:\test\multipage.tif"
Dim strRecText As String = ""
Dim Doc1 As MODI.Document
Doc1 = New MODI.Document
Doc1.Create(inputFile)
Doc1.OCR() ' this will OCR all pages of a multi-page TIFF file
Doc1.Save() ' this will save the deskewed, reoriented images and the OCR text back to the inputFile
For imageCounter As Integer = 0 To (Doc1.Images.Count - 1) ' work your way through each page of results
    strRecText &= Doc1.Images(imageCounter).Layout.Text ' this appends the page's OCR results to the string
Next
System.IO.File.AppendAllText("C:\test\testmodi.txt", strRecText) ' write the OCR text out to disk
Doc1.Close() ' clean up
Doc1 = Nothing
Others like Tesseract, but I have no direct experience with it. I've heard both good and bad things about it, so I imagine it greatly depends on your source quality.
You can't extract scanned text from a PDF. You need OCR software. The good news is that there are a few open source applications you can try, and the OCR route will most likely be easier than using a PDF library to extract text. Check out Tesseract and GOCR.
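If you go the Tesseract route from C#, one low-tech option is to shell out to its command-line tool. The sketch below is only an illustration under some assumptions: the class and method names are invented, the tesseract executable is assumed to be installed, and the page images are assumed to have been extracted from the PDF already. It relies on the basic "tesseract imagefile outputbase" invocation, which writes the recognized text to outputbase.txt.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper: runs the tesseract command-line tool on one image file.
public static class TesseractRunner
{
    // tesseract takes the image path and an output base name; it appends ".txt" itself.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\"";
    }

    // exePath is an assumption: the full path to tesseract, or "tesseract" if it is on the PATH.
    public static string Recognize(string exePath, string imagePath)
    {
        // Use a unique temporary base name so parallel runs do not overwrite each other.
        string outputBase = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(imagePath, outputBase),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "tesseract exited with code " + process.ExitCode);
            }
        }

        string textFile = outputBase + ".txt";
        try
        {
            return File.ReadAllText(textFile);
        }
        finally
        {
            File.Delete(textFile); // remove the temporary output file
        }
    }
}
```

Something like TesseractRunner.Recognize("tesseract", @"C:\test\page1.tif") would then return the text of one page. How good that text is depends heavily on the scan quality.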
I have posted about parsing PDFs in one of my blogs.
Well, the following is based on popular examples available on the web.
What this does is "read" the PDF file and output it as text in the
rich text box control on the form. The PDFBox for .NET library can be
downloaded from SourceForge.
You need to add references to IKVM.GNU.Classpath and PDFBox-0.7.3. Also,
FontBox-0.1.0-dev.dll and PDFBox-0.7.3.dll need to be added to
the bin folder of your application. For some reason I can't recall
(maybe it's from one of the tutorials), I also added
IKVM.GNU.Classpath.dll to the bin folder.
On a side note, I just got my copy of "Head First C#" (on Keith's
suggestion) from Amazon. The book is cool! It is really written for
beginners. This edition covers VS2008 and .NET Framework 3.5.
Here you go...
/* Marlon Ribunal
 * Convert PDF To Text
 * *******************/
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Drawing.Printing;
using System.IO;
using System.Text;
using System.ComponentModel.Design;
using System.ComponentModel;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

namespace MarlonRibunal.iPdfToText
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            InitializeComponent();
        }

        // Click handler: extracts the text of the PDF and shows it in the rich text box.
        void Button1Click(object sender, EventArgs e)
        {
            PDDocument doc = PDDocument.load("C:\\pdftoText\\myPdfTest.pdf");
            try
            {
                PDFTextStripper stripper = new PDFTextStripper();
                richTextBox1.Text = stripper.getText(doc);
            }
            finally
            {
                doc.close(); // release the underlying file
            }
        }
    }
}
If the PDF is a scan of printed text, it will be hard to do it yourself (it involves image processing, character recognition, etc.). PDFs generally store scanned documents as JPEGs internally. You are better off using a third-party tool (an OCR tool) that does this.
If I get it right, sheebz is asking how to extract PDF fields and load the data into a database. Have you looked at iTextSharp? - http://sourceforge.net/projects/itextsharp/
My blog post quoted above was originally at this link:
http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx
Edit: The link no longer works. The post is quoted from the archived copy at http://web.archive.org/web/20130507084207/http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx
At a company I used to work for, we used ActivePDF toolkit with some success:
http://www.activepdf.com/products/serverproducts/toolkit/index.cfm
I think you'd need at least the Standard or Pro version but they have trials so you can see if it'll do what you want it to.
A quick Google search shows this promising result.
http://www.pdftron.com/net/index.html
You can use a module like Perl's PDF to extract the text, then use another tool to import the pertinent info into the database.
I am sure there are PDF components for .NET, but I have not tried any, so I don't know which ones are good.
I've recently found ReportLab for Python.
Based on Mark Brackett's answer, I created a NuGet package to wrap pdftotext.
It's open source, targeting .NET Standard 1.6 and .NET Framework 4.5.
Usage: