pyparsing ambiguity

Posted 2024-09-04 14:25:57

I'm trying to parse some text using pyparsing. The problem is that my names can contain whitespace, so my input might look like this. First, a list of names:

Joe
bob
Jimmy X
grjiaer-rreaijgr Y

Then, things they do:

Joe A
bob B
Jimmy X C

The problem, of course, is that a thing they do can be the same as the end of a name:

Jimmy X X
grjiaer-rreaijgr Y Y

How can I create a parser for the action lines? Parsing Joe A should give [Joe, A], parsing Jimmy X C should give [Jimmy X, C], and parsing Jimmy X X should give [Jimmy X, X]. That is, [name, action] pairs.

If I create my name parser naively, meaning something like OneOrMore(Regex(r"\S+")), then it will match the entire line, giving me [Jimmy X X], followed by a parsing error for not seeing an action (since the action was already consumed by the name parser).

NOTE: Sorry for the ambiguous phrasing earlier that made this look like an NLP question.

世界如花海般美丽 2024-09-11 14:25:57

You pretty much need more than a simple parser. Parsers use the symbols in a string to define which pieces of the string represent different elements of a grammar. This is why FM asked for some clue to indicate how you know what part is the name and what part is the rest of the sentence. If you could say that names are made up of one or more capitalized words, then the parser would know when the name stops and the rest of the sentence starts.

But what about a name like "jimmy foo decides"? How can the parser know, just by looking at the symbols in "decides", whether "decides" is part of the name or not? Even a human reading your "jimmy foo decides decides to eat" sentence would have some trouble determining where the name starts or stops, and whether this was some sort of typo.
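
To make the ambiguity concrete, here is a minimal sketch in plain Python (not pyparsing; `candidate_splits` is a hypothetical helper invented for this illustration). It lists every way a line could be cut into a name and a remainder at a word boundary. Without extra information, such as a list of known names, nothing in the text itself singles out one of these splits.

```python
def candidate_splits(line):
    """Return every (name, rest) pair obtained by cutting the line between two words."""
    words = line.split()
    return [
        (" ".join(words[:i]), " ".join(words[i:]))
        for i in range(1, len(words))
    ]


# Every split below is a structurally valid reading of the line.
for name, rest in candidate_splits("jimmy foo decides decides to eat"):
    print(repr(name), "|", repr(rest))
```

A six-word line has five such readings, and a grammar that only looks at word shapes has no reason to prefer any one of them.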

If your input is really this unpredictable, then you need to use a tool such as NLTK (the Natural Language Toolkit). I've not used it myself, but it approaches this problem from the standpoint of parsing sentences in a language, as opposed to trying to parse structured data or mathematical formats.

I would not recommend pyparsing for this kind of language interpretation.

仙女 2024-09-11 14:25:57

Have fun:

from pyparsing import Regex, oneOf

THE_NAMES = \
"""Joe
bob
Jimmy X
grjiaer-rreaijgr Y
"""

THE_THINGS_THEY_DO = \
"""Joe A
bob B
Jimmy X C
Jimmy X X
grjiaer-rreaijgr Y Y
"""

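# The action is whatever follows the name on the line.
ACTION = Regex('.*')
NAMES = THE_NAMES.splitlines()
print(NAMES)
# oneOf makes sure longer names are tested before shorter ones they start with.
GRAMMAR = oneOf(NAMES) + ACTION
for line in THE_THINGS_THEY_DO.splitlines():
    print(GRAMMAR.parseString(line))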
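
The same idea (treat the known names as the only legal names, and try longer names first) can also be sketched with the standard library alone. This is an illustrative variant, not part of the answer above: `NAME_RE` and `parse_action_line` are made-up names, and the sketch assumes that the full list of names is available before the action lines are parsed and that whitespace separates the name from the action.

```python
import re

NAMES = ["Joe", "bob", "Jimmy X", "grjiaer-rreaijgr Y"]

# Longest names first, so a multi-word name is tried before any shorter
# name it might start with. re.escape keeps characters like "-" literal.
NAME_RE = re.compile(
    "("
    + "|".join(re.escape(n) for n in sorted(NAMES, key=len, reverse=True))
    + r")\s+(.*)"
)


def parse_action_line(line):
    """Split an action line into [name, action]; raise ValueError for unknown names."""
    match = NAME_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError("line does not start with a known name: %r" % line)
    return [match.group(1), match.group(2)]


for line in ["Joe A", "bob B", "Jimmy X C", "Jimmy X X", "grjiaer-rreaijgr Y Y"]:
    print(parse_action_line(line))
```

Because the name alternatives come from the list rather than from a generic "one or more words" rule, the trailing X in Jimmy X X is left over for the action.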

屋檐 2024-09-11 14:25:57

Looks like you need nltk, not pyparsing. Looks like you need a tractable problem to work on. How do YOU know how to parse 'jimmy foo decides decides to eat'? What rules do YOU use to deduce (contrary to what most people would assume) that "decides decides" is not a typo?

Re "names that can contain whitespace": Firstly, I'd hope that you'd normalise that into one space. Secondly: this is unexpected?? Thirdly: names can contain apostrophes and hyphens (O'Brien, Montagu-Douglas-Scott) and may have components that aren't capitalised (e.g. Georg von und zu Hohenlohe), and we won't mention Unicode.
