Parsing semi-structured data - can I use any classifiers?

Posted on 2024-09-30 04:02:10

I've got a set of documents which have a semi-regular format. Rows are typically separated by newline characters, and the main components of each row are separated by spaces. Some examples are a set of furniture assembly instructions, a set of tables of contents, a set of recipes and a set of bank statements.

The problem is that each specimen in each set is different from its peer members in ways which make RegEx parsing infeasible: the quantity of an item may come before or after the item name, the same items may have different names between specimens, expository text or notes may exist between rows, etc.

I've used classifiers (Neural Nets, Bayesian, GA and GP) to deal with whole documents or data sets, but not to extract items from documents and classify them within a context. Can this be done? Is there a more feasible approach?

Comments (2)

蘑菇王子 2024-10-07 04:02:10

If your data has structure, arguably you can use a grammar to describe some of that structure. (Classically, you use a grammar to recognize what it can, which is often too much, and then apply extra-grammatical checks to prune away what the grammar cannot eliminate.)

If you use a parser that can pursue several potential parses in parallel and discards each one as it becomes infeasible,
you can handle different orderings straightforwardly. (A GLR parser can do this nicely.)

Imagine you have NUMBERS describing amounts, NOUNS describing various objects, and VERBS for actions.
Then a grammar that can accept varying orders of items might be:

 G = SENTENCE '.' ;
 SENTENCE = VERB NUMBER NOUN ;
 SENTENCE = NOUN VERB NUMBER ;
 VERB = 'ORDER' | 'ORDERED' | 'SAW' ;
 NUMBER = '1' | '2' | '10' ;
 NOUN = 'JOE' | 'TABLE' | 'SAW' ;

This sample is extremely simple, but it will handle:

 JOE ORDERED 10.
 JOE SAW 1.
 ORDER 2 SAW.

It will also accept:

 SAW SAW 10.

You can eliminate this by adding an external constraint that actors must be people.
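To make the idea concrete, here is a minimal Python sketch, not a GLR parser. It simply tries every SENTENCE alternative against the token list and keeps each one that fits, which is a brute-force stand-in for pursuing parses in parallel. It assumes the rules NOUN VERB NUMBER and VERB NUMBER NOUN, with 'ORDERED' among the verbs, so that the three sample sentences are covered. The names `LEXICON`, `SENTENCE_RULES`, `PEOPLE`, `parse` and `valid_parses` are made up for illustration.

```python
# Token categories and the words each one accepts (assumed lexicon).
LEXICON = {
    "VERB": {"ORDER", "ORDERED", "SAW"},
    "NUMBER": {"1", "2", "10"},
    "NOUN": {"JOE", "TABLE", "SAW"},
}

# Each alternative for SENTENCE is a sequence of categories.
SENTENCE_RULES = [
    ("VERB", "NUMBER", "NOUN"),
    ("NOUN", "VERB", "NUMBER"),
]

# Extra-grammatical knowledge: which nouns denote people (assumption).
PEOPLE = {"JOE"}


def parse(text):
    """Return every reading of `text` that some SENTENCE rule allows.

    Each reading is a list of (category, token) pairs.
    """
    text = text.strip()
    if not text.endswith("."):  # G = SENTENCE '.'
        return []
    tokens = text[:-1].split()
    readings = []
    for rule in SENTENCE_RULES:
        if len(rule) == len(tokens) and all(
            token in LEXICON[category] for category, token in zip(rule, tokens)
        ):
            readings.append(list(zip(rule, tokens)))
    return readings


def actor_is_person(reading):
    """External constraint: a leading NOUN is the actor and must be a person."""
    category, token = reading[0]
    return category != "NOUN" or token in PEOPLE


def valid_parses(text):
    """Keep only the readings that survive the external constraint."""
    return [reading for reading in parse(text) if actor_is_person(reading)]


if __name__ == "__main__":
    for sentence in ["JOE ORDERED 10.", "JOE SAW 1.", "ORDER 2 SAW.", "SAW SAW 10."]:
        print(sentence, "->", valid_parses(sentence))
```

In this sketch the grammar alone accepts "SAW SAW 10." as NOUN VERB NUMBER, and the check in `actor_is_person` is what rejects it. Keeping the constraint outside the grammar keeps the rules small; a real GLR parser would do the same pruning while the parse is still in progress rather than afterwards.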

恋竹姑娘 2024-10-07 04:02:10

There are plenty of methods to do that. This is an active research area called information extraction, and in particular information extraction from semi-structured sources.
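As one illustration of what such an approach can look like (not necessarily what the literature would recommend for your data), the sketch below classifies each line as an item row or a note with a tiny Naive Bayes model over two token-shape features, then pulls the quantity and name out of item rows regardless of their order. The training lines, the features and all names (`features`, `NaiveBayes`, `extract`) are invented for the example.

```python
import math
from collections import Counter, defaultdict


def features(line):
    """Two binary token-shape features (an assumed, deliberately tiny set)."""
    tokens = line.split()
    return {
        "has_number": any(t.isdigit() for t in tokens),
        "short": len(tokens) <= 4,
    }


class NaiveBayes:
    """Naive Bayes over binary features with add-one (Laplace) smoothing."""

    def fit(self, lines, labels):
        self.label_counts = Counter(labels)
        # (label, feature name) -> how often each feature value was seen
        self.feature_counts = defaultdict(Counter)
        for line, label in zip(lines, labels):
            for name, value in features(line).items():
                self.feature_counts[(label, name)][value] += 1
        return self

    def predict(self, line):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for name, value in features(line).items():
                seen = self.feature_counts[(label, name)][value]
                score += math.log((seen + 1) / (count + 2))  # two possible values
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data: item rows with the quantity on either side, plus notes.
TRAIN_LINES = [
    "2 eggs",
    "flour 200",
    "4 screws M6",
    "butter 50",
    "Mix the ingredients until the batter is smooth",
    "Make sure all parts are present before you start",
]
TRAIN_LABELS = ["ITEM", "ITEM", "ITEM", "ITEM", "NOTE", "NOTE"]

model = NaiveBayes().fit(TRAIN_LINES, TRAIN_LABELS)


def extract(lines, model):
    """Return (name, quantity) for every line the model labels as an item row."""
    items = []
    for line in lines:
        if model.predict(line) != "ITEM":
            continue
        tokens = line.split()
        quantity = next((int(t) for t in tokens if t.isdigit()), None)
        name = " ".join(t for t in tokens if not t.isdigit())
        items.append((name, quantity))
    return items


if __name__ == "__main__":
    document = ["2 eggs", "Whisk until the mixture turns pale and fluffy", "sugar 100"]
    print(extract(document, model))
```

With real documents you would need far more labelled lines and richer features, such as neighbouring lines, units, or a vocabulary of known item names, but the division of labour stays the same: classify each row in its context first, then extract fields from the rows that qualify.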
