I want to take a PDF and extract any text from it, then index that text with ColdFusion's built-in Verity search so the contents can be searched.
Are there any libraries that already do this well? I am including Java and .NET libraries (Java preferred) in scope, since they can be called from CF.
Any insights or experiences would be greatly appreciated... thanks!
Edit: As far as I know, CF can index PDF files when the text is embedded in the PDF. The PDFs I have to deal with contain text that was scanned in as images.
Comments (4)
Ray Camden has an eight-part series on working with PDFs in ColdFusion 8.
Part 7 of the series covers using DDX to get text out of a PDF.
I'm not sure this will meet your OCR needs, but it may still be worth a look.
On a semi-related note, I found a very neat post about encoding and reading 2D matrix barcodes in ColdFusion.
http://www.stillnetstudios.com/2007/12/15/2d-barcodes-coldfusion/
This might solve some of my issues with extracting encoded information, but I am still after the body of the text.
Regarding tessnet, I found a .NET version too: http://www.pixel-technology.com/freeware/tessnet2/ If only I could natively feed in PDFs instead of TIFFs... :)
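One possible way around the TIFF-only input is to rasterize each PDF page to a TIFF first and hand those files to the OCR engine. The sketch below assumes Ghostscript is installed and that its `gs` executable is on the PATH (on Windows the console binary is usually named `gswin64c` instead). Ghostscript is not mentioned anywhere in this thread, and the class name `PdfRasterizer`, the output pattern, and the 300 dpi resolution are hypothetical choices for illustration. It needs Java 9 or later for `List.of`.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: renders every page of a PDF to a grayscale TIFF
// by shelling out to Ghostscript (assumed to be installed as "gs").
final class PdfRasterizer {

    // Builds the Ghostscript argument list.
    // outputPattern may contain %03d, which Ghostscript replaces with the page number.
    static List<String> buildCommand(Path pdf, String outputPattern, int dpi) {
        return List.of(
                "gs",
                "-dNOPAUSE", "-dBATCH", "-dSAFER", // run non-interactively, then exit
                "-sDEVICE=tiffgray",               // 8-bit grayscale TIFF output
                "-r" + dpi,                        // render resolution in dpi
                "-sOutputFile=" + outputPattern,
                pdf.toString());
    }

    // Runs Ghostscript and fails loudly if it returns a non-zero exit code.
    static void rasterize(Path pdf, String outputPattern, int dpi)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, outputPattern, dpi))
                .inheritIO() // pass Ghostscript's messages through to the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Ghostscript exited with code " + exitCode);
        }
    }
}
```

Calling `PdfRasterizer.rasterize(Paths.get("scan.pdf"), "page-%03d.tif", 300)` should leave one TIFF per page (page-001.tif, page-002.tif, and so on) in the working directory. Because it is plain Java, the class could also be loaded from ColdFusion once it is on the classpath.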
If you have the ability to run your own software (i.e. a dedicated server or VPS), you could investigate using Tesseract OCR with cfexecute to convert the PDFs to text.
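As a rough illustration of the Tesseract suggestion, here is a minimal Java sketch that runs the `tesseract` command-line tool on one page image and reads back the text it writes. It assumes `tesseract` is installed and on the PATH, and that the scanned page has already been converted to an image file, since Tesseract reads images rather than PDFs. The class name `TesseractOcr` and its method names are hypothetical. The same executable and arguments are what a cfexecute call would pass through its name and arguments attributes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: OCRs a single page image with the Tesseract CLI.
final class TesseractOcr {

    // Builds: tesseract <image> <outputBase> -l <language>
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs Tesseract and returns the recognized text.
    static String ocr(Path image, Path outputBase, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                .inheritIO() // pass Tesseract's messages through to the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        // Tesseract appends ".txt" to the output base name it was given.
        Path textFile = outputBase.resolveSibling(outputBase.getFileName() + ".txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }
}
```

The returned string (or the .txt file itself) could then be stored wherever the Verity collection is built from, so the scanned text becomes searchable alongside the original PDF.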
Verity should be able to index PDF files by default:
http://livedocs.adobe.com/coldfusion/6/Developing_ColdFusion_MX_Applications_with_CFML/indexSearch2.htm#1142322