Programmatically recognizing scanned text in PDF files

Posted on 2024-07-06 07:53:12

Comments (10)

衣神在巴黎 2024-07-13 07:53:12

I've used pdftohtml to successfully strip tables out of PDFs into CSV. It's based on Xpdf, a more general-purpose toolkit that also includes pdftotext. I just wrap it in a Process.Start call from C#.
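
As a rough illustration of the Process.Start approach, here is a minimal sketch that shells out to pdftotext and reads the result from standard output. It assumes a `pdftotext` executable is on the PATH and a runtime where `ProcessStartInfo.ArgumentList` is available (.NET Core 2.1 or later); the class and method names are hypothetical and are not taken from the answer.

```csharp
using System;
using System.Diagnostics;

public static class PdfToTextRunner
{
    // Builds the pdftotext arguments: optional -layout, the input file,
    // then "-" so the extracted text is written to standard output.
    public static string[] BuildArguments(string pdfPath, bool keepLayout)
    {
        return keepLayout
            ? new[] { "-layout", pdfPath, "-" }
            : new[] { pdfPath, "-" };
    }

    // Runs pdftotext (assumed to be on the PATH) and returns its output.
    public static string ExtractText(string pdfPath, bool keepLayout = true)
    {
        var startInfo = new ProcessStartInfo("pdftotext")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        foreach (var arg in BuildArguments(pdfPath, keepLayout))
            startInfo.ArgumentList.Add(arg);

        using (var process = Process.Start(startInfo))
        {
            // Read everything before waiting so a full pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            return text;
        }
    }
}
```

Calling `PdfToTextRunner.ExtractText("input.pdf")` would return whatever text layer the file has. For a purely scanned PDF the result is typically empty, which is why OCR comes up later in this thread.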

If you're looking for something a little more DIY, there's the iTextSharp library (a port of Java's iText) and PDFBox (yes, it says Java, but there is a .NET version by way of IKVM.NET). CodeProject has articles on using both iTextSharp and PDFBox from C#.

And, if you're really a masochist, you could call into Adobe's PDF IFilter with COM interop. The IFilter spec is pretty simple, but I would guess that the interop overhead would be significant.

Edit: After re-reading the question and subsequent answers, it's become clear that the OP is dealing with images in his PDF. In that case, you'll need to extract the images (the PDF libraries above can do that fairly easily) and run them through an OCR engine.

I've used MODI interactively before, with decent results. It's COM, so calling it from C# via interop is also doable and pretty simple:

' lifted from http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging
' needs a COM reference to the MODI type library, plus Imports System.IO for File
Dim inputFile As String = "C:\test\multipage.tif"
Dim strRecText As String = ""
Dim Doc1 As MODI.Document

Doc1 = New MODI.Document
Doc1.Create(inputFile)
Doc1.OCR()  ' this will ocr all pages of a multi-page tiff file
Doc1.Save() ' this will save the deskewed reoriented images, and the OCR text, back to the inputFile

For imageCounter As Integer = 0 To (Doc1.Images.Count - 1) ' work your way through each page of results
   strRecText &= Doc1.Images(imageCounter).Layout.Text    ' this puts the ocr results into a string
Next

File.AppendAllText("C:\test\testmodi.txt", strRecText)     ' write the OCR file out to disk

Doc1.Close() ' clean up
Doc1 = Nothing

Others like Tesseract, but I have no direct experience with it. I've heard both good and bad things about it, so I imagine it greatly depends on your source quality.

养猫人 2024-07-13 07:53:12

You can't extract scanned text from a PDF directly, because the scanned pages are stored as images. You need OCR software. The good news is that there are a few open source applications you can try, and the OCR route will most likely be easier than using a PDF library to extract text. Check out Tesseract and GOCR.
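
One way to wire the OCR route together is sketched below: dump the page images with pdfimages, then pass each one to the Tesseract command-line tool. This assumes Poppler's `pdfimages` (for the `-png` option) and `tesseract` are both on the PATH, and a runtime with `ProcessStartInfo.ArgumentList` (.NET Core 2.1 or later). The class and method names are made up for illustration, and nothing here has been tuned for OCR quality.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

public static class ScannedPdfOcr
{
    // Runs an external tool and returns its standard output.
    static string Run(string fileName, params string[] args)
    {
        var startInfo = new ProcessStartInfo(fileName)
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        foreach (var arg in args) startInfo.ArgumentList.Add(arg);

        using (var process = Process.Start(startInfo))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    fileName + " exited with code " + process.ExitCode);
            return output;
        }
    }

    // pdfimages writes zero-padded names (page-000.png, page-001.png, ...),
    // so an ordinal sort keeps the images in extraction order.
    public static string[] OrderPageImages(string[] paths)
    {
        return paths.OrderBy(p => p, StringComparer.Ordinal).ToArray();
    }

    // Extracts every embedded image into a temp folder and OCRs each one.
    public static string OcrPdf(string pdfPath, string language = "eng")
    {
        string workDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(workDir);
        try
        {
            Run("pdfimages", "-png", pdfPath, Path.Combine(workDir, "page"));

            var text = new StringBuilder();
            foreach (string image in OrderPageImages(Directory.GetFiles(workDir, "*.png")))
            {
                // "stdout" as the output base makes tesseract print the recognized text.
                text.AppendLine(Run("tesseract", image, "stdout", "-l", language));
            }
            return text.ToString();
        }
        finally
        {
            Directory.Delete(workDir, recursive: true);
        }
    }
}
```

A call such as `ScannedPdfOcr.OcrPdf("scan.pdf")` would return the concatenated OCR text. How accurate it is depends heavily on scan resolution and cleanliness.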

反目相谮 2024-07-13 07:53:12

I have posted about parsing PDFs on one of my blogs. Follow this link:

http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx

Edit: The link no longer works. The following is quoted from http://web.archive.org/web/20130507084207/http://devpinoy.org/blogs/marl/archive/2008/03/04/pdf-to-text-using-open-source-library-pdfbox-another-sample-for-grade-1-pupils.aspx

Well, the following is based on popular examples available on the web.
What this does is "read" the PDF file and output it as text in the
rich text box control on the form. The PDFBox for .NET library can be
downloaded from SourceForge.

You need to add references to IKVM.GNU.Classpath and PDFBox-0.7.3.
FontBox-0.1.0-dev.dll and PDFBox-0.7.3.dll also need to be added to
the bin folder of your application. For some reason I can't recall
(maybe it's from one of the tutorials), I also added
IKVM.GNU.Classpath.dll to the bin folder.

On a side note, I just got my copy of "Head First C#" (on Keith's
suggestion) from Amazon. The book is cool! It is really written for
beginners. This edition covers VS2008 and .NET Framework 3.5.

Here you go...

/* Marlon Ribunal
 * Convert PDF To Text
 * *******************/

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Drawing.Printing;
using System.IO;
using System.Text;
using System.ComponentModel.Design;
using System.ComponentModel;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

namespace MarlonRibunal.iPdfToText
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            InitializeComponent(); 
        }

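        void Button1Click(object sender, EventArgs e)
        {
            // Load the PDF, pull out its text layer, and show it in the rich text box.
            PDDocument doc = PDDocument.load("C:\\pdftoText\\myPdfTest.pdf");
            try
            {
                PDFTextStripper stripper = new PDFTextStripper();
                richTextBox1.Text = stripper.getText(doc);
            }
            finally
            {
                doc.close(); // release the underlying file
            }
        }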

     }
}
属性 2024-07-13 07:53:12

At a company I used to work for, we used ActivePDF toolkit with some success:

http://www.activepdf.com/products/serverproducts/toolkit/index.cfm

I think you'd need at least the Standard or Pro version, but they have trials, so you can see whether it'll do what you want.

听风吹 2024-07-13 07:53:12

A quick Google search turns up this promising result:
http://www.pdftron.com/net/index.html

狼亦尘 2024-07-13 07:53:12

If the PDF is a scan of printed text, it will be hard to do it yourself (it involves image processing, character recognition, etc.). PDFs generally store scanned documents internally as JPEGs. You are better off using a third-party OCR tool that does this.
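
A related practical question is how to tell, in code, whether a given PDF is a scan at all. One common heuristic, sketched here under stated assumptions, is to run a normal text extraction first and treat the file as scanned if almost no visible characters come back. The `ScanDetector` name and the default threshold of 20 characters per page are arbitrary illustrative choices, not something from this thread.

```csharp
using System;
using System.Linq;

public static class ScanDetector
{
    // Returns true when the extracted text layer is too sparse to be a real
    // text PDF, which suggests the pages are images that need OCR.
    public static bool LooksScanned(string extractedText, int pageCount, int minCharsPerPage = 20)
    {
        if (pageCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(pageCount));

        // Count only visible characters; extractors often emit form feeds
        // and newlines even for pages that contain nothing but an image.
        int visibleChars = (extractedText ?? string.Empty).Count(c => !char.IsWhiteSpace(c));
        return visibleChars < (long)minCharsPerPage * pageCount;
    }
}
```

The text and page count would come from whichever extractor is already in use (pdftotext, PDFBox, iTextSharp). Files flagged by the check can then be routed to the OCR tool.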

云雾 2024-07-13 07:53:12

You can use a module like Perl's PDF to extract the text, and then use another tool to import the pertinent info into the database.

I am sure there are PDF components for .NET, but I have not tried any, so I don't know which ones are good.

失眠症患者 2024-07-13 07:53:12

I've recently found ReportLab for Python.

掀纱窥君容 2024-07-13 07:53:12

If I get it right, sheebz is asking how to extract PDF fields and load the data into a database. Have you looked at iTextSharp? - http://sourceforge.net/projects/itextsharp/

心凉怎暖 2024-07-13 07:53:12

Based on Mark Brackett's answer, I created a NuGet package to wrap pdftotext.

It's open source, targeting .NET Standard 1.6 and .NET Framework 4.5.

Usage:

using XpdfNet;

var pdfHelper = new XpdfHelper();

string content = pdfHelper.ToText("./pathToFile.pdf");