Extracting information from millions of simple but inconsistent text files
We have millions of simple txt documents containing various data structures extracted from PDFs. The text is printed line by line, so all formatting is lost (the tools we tried that preserve formatting just messed it up). We need to extract the fields and their values from these text documents, but there is some variation in the structure of the files (a new line here and there, noise on some sheets so spellings are incorrect).
I was thinking we would create some sort of template structure holding the coordinates (line number, word number) of keywords and values, and use that information to locate and collect the keyword values, with various algorithms to compensate for the inconsistent formatting.
Is there any standard way of doing this? Any links that might help? Any other ideas?
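For illustration, here is a minimal Python sketch of this template idea; the field names, coordinates, slack window, and similarity threshold below are all made-up assumptions, not a definitive design:

    import difflib

    # Hypothetical template: field -> (expected line, expected word index, keyword)
    TEMPLATE = {
        "invoice_no": (2, 0, "Invoice"),
        "total": (15, 0, "Total"),
    }
    LINE_SLACK = 2        # tolerate a few extra or missing newlines
    MIN_SIMILARITY = 0.8  # tolerate OCR misspellings in keywords

    def similar(word, keyword):
        # fuzzy comparison to absorb noise like "Tota1" vs "Total"
        ratio = difflib.SequenceMatcher(None, word.lower(), keyword.lower()).ratio()
        return ratio >= MIN_SIMILARITY

    def extract(lines):
        # search a window of lines around each field's expected coordinates
        result = {}
        for field, (exp_line, word_idx, keyword) in TEMPLATE.items():
            lo = max(0, exp_line - LINE_SLACK)
            hi = min(len(lines), exp_line + LINE_SLACK + 1)
            for i in range(lo, hi):
                words = lines[i].split()
                if word_idx < len(words) and similar(words[word_idx], keyword):
                    # assume the value is whatever follows the keyword on that line
                    result[field] = " ".join(words[word_idx + 1:])
                    break
        return result

    with open("document.txt", encoding="utf-8") as f:
        print(extract(f.read().splitlines()))

The line-offset window absorbs stray newlines, and the fuzzy comparison absorbs misspelled keywords, which seem to be the two failure modes described above.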
5 Answers
The noise can be corrected or ignored by using fuzzy text matching tools like agrep: http://www.tgries.de/agrep/
However, the problem with extra newlines will remain.
One technique I would suggest is to limit error propagation the way compilers do. For example, you try to match your template or a pattern and fail. Later on in the text there is a sure match, but it might be part of the currently unmatched pattern.
In this case, the sure match should be accepted and the chunk of text that was unmatched should be set aside for future processing. This lets you skip over errors that are too hard to parse.
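A rough Python sketch of that recovery loop (the anchor patterns here are invented for illustration): lines that hit a high-confidence pattern are accepted immediately, and everything else is parked for a later pass instead of derailing the parse.

    import re

    # Invented "sure match" anchors: patterns trusted even amid noise
    ANCHORS = [
        ("date", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
        ("amount", re.compile(r"\b\d+\.\d{2}\b")),
    ]

    def parse_with_recovery(lines):
        # accept sure matches; park un-matched text for future processing
        matches, pending = [], []
        for line in lines:
            for name, pattern in ANCHORS:
                m = pattern.search(line)
                if m:
                    matches.append((name, m.group()))
                    break
            else:
                pending.append(line)
        return matches, pending

    text = ["garbled hdr ##", "2011-03-14 payment", "n0ise", "total 129.95"]
    print(parse_with_recovery(text))
    # ([('date', '2011-03-14'), ('amount', '129.95')], ['garbled hdr ##', 'n0ise'])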
Larry Wall's Perl is your friend here. This is precisely the sort of problem domain at which it excels.
Sed is OK, but for this sort of thing, Perl is the bee's knees.
While I second the recommendations for the Unix command-line and for Perl, a higher-level tool that may help is Google Refine. It is meant to handle messy real-world data.
I would recommend using graph regular expressions here, with very weak rules and a final acceptance predicate. That way you can write fuzzy matching on the token level, then on the line level, and so on.
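The graph machinery itself isn't shown here, but the layering can be illustrated with a plain Python sketch (thresholds, patterns, and the sample input are assumptions): a weak fuzzy rule at the token level, a weak ordering rule at the line level, and a final acceptance predicate over the whole document.

    import difflib

    def token_match(token, target, threshold=0.75):
        # weak token-level rule: fuzzy equality, so "Invoce" still matches "invoice"
        return difflib.SequenceMatcher(None, token.lower(), target).ratio() >= threshold

    def line_match(line, pattern):
        # weak line-level rule: every pattern token appears fuzzily, in order
        words, i = line.split(), 0
        for target in pattern:
            while i < len(words) and not token_match(words[i], target):
                i += 1
            if i == len(words):
                return False
            i += 1
        return True

    def accept(parts):
        # final acceptance predicate: the parse counts only if all parts matched
        return all(parts.values())

    doc = ["Invoce number 42", "Tota1 amount 99.50"]  # noisy input, assumed shape
    parts = {
        "header": any(line_match(l, ["invoice", "number"]) for l in doc),
        "total": any(line_match(l, ["total", "amount"]) for l in doc),
    }
    print(accept(parts))  # True: both weak rules fired despite the misspellings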
I suggest the Talend data integration tool. It is open source (i.e. FREE!). It is built on Java, and you can customize your data integration project any way you like by modifying the underlying Java code.
I used it and found it very helpful on low-budget, highly complex data integration projects. Here's the link to their website: Talend
Good luck.