Background reading for parsing sloppy/quirky/"almost structured" data?

Published 2024-08-04 03:09:39


I'm maintaining a program that needs to parse out data that is present in an "almost structured" form in text. I.e., various programs that produce it use slightly different formats; it may have been printed out and OCR'd back in (yeah, I know) with errors, etc., so I need to use heuristics that guess how it was produced and apply different quirks modes, etc. It's frustrating, because I'm somewhat familiar with the theory and practice of parsing if things are well behaved, and there are nice parsing frameworks etc. out there, but the unreliability of the data has led me to write some very sloppy ad-hoc code. It's OK at the moment, but I'm worried that as I expand it to process more variations and more complex data, things will get out of hand. So my question is:

Since there are a fair number of existing commercial products that do related things ("quirks modes" in web browsers, error interpretation in compilers, even natural language processing and data mining, etc.) I'm sure some smart people have put thought into this, and tried to develop a theory, so what are the best sources for background reading on parsing unprincipled data in as principled a manner as possible?

I realize this is somewhat open-ended, but my problem is that I think I need more background to even know what the right questions to ask are.
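For concreteness, the "guess how it was produced, then apply a quirks mode" approach from the question can be sketched roughly as follows. This is a hypothetical illustration only: the producer names, regex signatures, and field formats are all invented, and real detection rules would come from studying the actual data.

```python
import re

# Invented signatures for guessing which producer/quirk generated a line.
# Real rules would be derived from the corpus, not made up like these.
QUIRK_SIGNATURES = {
    # OCR output often confuses similar glyphs: O/0, rn/m, etc.
    "ocr": re.compile(r"[0O]{3,}|rn(?=[a-z])"),
    # Hypothetical "producer A" uses semicolon-delimited fields.
    "producer_a": re.compile(r"^[^;]+;[^;]+;"),
    # Hypothetical "producer B" uses fixed-width, space-padded columns.
    "producer_b": re.compile(r"^.{10} {2,}.{8} {2,}"),
}

def detect_quirk(line: str) -> str:
    """Return the first quirk whose signature matches, else 'unknown'."""
    for name, pattern in QUIRK_SIGNATURES.items():
        if pattern.search(line):
            return name
    return "unknown"

def parse_producer_a(line: str) -> list[str]:
    return [field.strip() for field in line.split(";")]

def parse_fallback(line: str) -> list[str]:
    # Last resort: split on runs of whitespace.
    return line.split()

# Quirks without a dedicated parser fall through to the fallback.
PARSERS = {
    "producer_a": parse_producer_a,
    "unknown": parse_fallback,
}

def parse_line(line: str) -> list[str]:
    quirk = detect_quirk(line)
    return PARSERS.get(quirk, parse_fallback)(line)
```

The point of the dispatch-table shape is that each quirks mode stays a small, separately testable function, which is one way to keep the ad-hoc code from sprawling as more variants appear.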


Comments (1)

皇甫轩 2024-08-11 03:09:39


Given the choice between what you've proposed and fighting a hungry crocodile while covered in raw-beef-flavored marmalade and both hands tied behind my back, I'd choose the ...

Well, OK, on a more serious note: if you have data that doesn't abide by any "sane" structure, you have to study the data, find the frequencies of quirks in it, and correlate the data with the given context (i.e. how it was generated).
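The "study the data and find frequencies of quirks" step amounts to a corpus survey before writing any parser. A minimal sketch, where the probes are invented examples of quirks one might count:

```python
from collections import Counter

# Invented probes for quirks one might see in OCR'd or mixed-source text.
# Each probe answers: "does this line exhibit the quirk?"
QUIRK_PROBES = {
    "tab_delimited": lambda s: "\t" in s,
    "trailing_whitespace": lambda s: s != s.rstrip(),
    # Tokens mixing digits and letters, e.g. OCR turning "xly" into "x1y".
    "digit_letter_mix": lambda s: any(
        t.isalnum() and not t.isalpha() and not t.isdigit()
        for t in s.split()
    ),
}

def quirk_frequencies(lines):
    """Count how often each quirk probe fires across the corpus."""
    counts = Counter()
    for line in lines:
        for name, probe in QUIRK_PROBES.items():
            if probe(line):
                counts[name] += 1
    return counts
```

Running this over a sample of the real data tells you which quirks are frequent enough to deserve a dedicated handling mode and which are rare enough to route to manual review.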

Print-to-OCR as a way to get data in is almost always going to lead to heartbreak. The company I work for employs a veritable army of people who manually read such documents and hand-"code" (i.e. enter by hand) the data for known problematic OCR scenarios, or for documents our customers detect the original OCR failed on.

As for leveraging "parsing frameworks": these tend to expect data that will always follow the grammar rules you've laid out. The data you've described has no such guarantees. If you go that route, be prepared for unexpected (though not always obvious) failures.

By all means, if there is any way possible to get the original data files, do so. Or if you can demand that those providing the data deliver it in a single well-defined format, even better. (It might not be "YOUR" format, but at least it's a regular and predictable format you can convert from.)
