Getting a whole Unicode sentence

Posted on 2024-12-08 22:16:06

I'm trying to parse a sentence like Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras. I want to: first, split the text by periods, then, use whatever is before the colon as a label for the sentence after the colon.
Right now I have the following definition:

from pyparsing import *

unicode_printables = u''.join(unichr(c) for c in xrange(65536) 
                                    if not unichr(c).isspace())

def parse_test(text):
    label = Word(alphas)+Suppress(':')
    value = OneOrMore(Word(unicode_printables)|Literal(','))
    group = Group(label.setResultsName('label')+value.setResultsName('value'))
    exp = delimitedList(
        group,
        delim='.'
    )

    return exp.parseString(text)

It kind of works, but it drops the Unicode characters (and anything else that is not in alphanums), and I think I would like to have the value as a whole sentence rather than this: 'value': [(([u'Lote', u'Numero', u'1', ',', u'Marcelo', u'T', u'de', u'Alvear', u'500'], {}), 1).

Is there a simple way to tackle this?

Comments (2)

只是在用心讲痛 2024-12-15 22:16:06

To directly answer your question, wrap your value definition with originalTextFor, and this will give you back the string slice that the matching tokens came from, as a single string. You could also add a parse action, like:

value.setParseAction(lambda t : ' '.join(t))

But this would explicitly put a single space between each item, when there might have been no spaces (in the case of a ',' after a word), or more than one space. originalTextFor will give you the exact input substring. But even simpler, if you are just reading everything after the ':', would be to use restOfLine. (Of course, the simplest would be just to use split(':'), but I assume you are specifically asking how to do this with pyparsing.)
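For comparison, the plain split approach mentioned above could look roughly like the following. This is a minimal Python 3 sketch using only the standard library; the function name split_labeled_sentences is made up for illustration, and it assumes that every '.' ends a sentence, so an abbreviation such as "T." or a decimal number would break it.

```python
def split_labeled_sentences(text):
    """Return (label, sentence) pairs from 'Label: sentence. Label: sentence.' text."""
    pairs = []
    for chunk in text.split('.'):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after a trailing period
        # partition splits on the first ':' only, so later colons stay in the value
        label, sep, value = chunk.partition(':')
        if not sep:
            continue  # no colon in this chunk, so there is no label to use
        pairs.append((label.strip(), value.strip()))
    return pairs


text = "Base: Lote Número 1, Marcelo T de Alvear 500. Demanda: otras palabras."
print(split_labeled_sentences(text))
# [('Base', 'Lote Número 1, Marcelo T de Alvear 500'), ('Demanda', 'otras palabras')]
```

Because str methods work on code points, non-ASCII characters such as "ú" pass through untouched.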

A couple of other notes:

  • xxx.setResultsName('yyy') can be shortened to just xxx('yyy'), improving the readability of your parser definition.

  • Your definition of value as OneOrMore(Word(unicode_printables) | Literal(',')) has a couple of problems. For one thing, ',' is in the set of characters in unicode_printables, so ',' will be included in any parsed words. The best way to solve this is to use the excludeChars parameter to Word, so that your sentence words do not include commas: OneOrMore(Word(unicode_printables, excludeChars=',') | ','). You can also exclude other possible punctuation, like ';', '-', etc., just by adding them to the excludeChars string. (I just noticed that you are using '.' as the delimiter for a delimitedList; for this to work, you will have to include '.' as an excluded character too.) Pyparsing is not like a regular expression in this regard: it does not do any lookahead to try to match the next token in the parser if the next character continues to match the current token. That is why you have to do some extra work of your own to avoid reading too much. In general, something as open-ended as OneOrMore(Word(unicode_printables)) is very likely to eat up the entire rest of your input string.
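Putting these suggestions together, the parser from the question might be rewritten along these lines. Treat it as a sketch, not a definitive version: it assumes Python 3 (chr and range instead of unichr and xrange) and a pyparsing release that still accepts the camelCase names used in this thread (originalTextFor, excludeChars, delimitedList, parseString). The names word and parse_labeled are invented for the example.

```python
import pyparsing as pp

# Python 3 equivalent of the question's unicode_printables:
# every non-whitespace character in the Basic Multilingual Plane.
unicode_printables = ''.join(
    chr(c) for c in range(65536) if not chr(c).isspace()
)

# Labels stay ASCII-only, as in the question; the results name uses the short form.
label = pp.Word(pp.alphas)('label') + pp.Suppress(':')

# Exclude ',' and '.' so a word stops before punctuation and before the delimiter.
word = pp.Word(unicode_printables, excludeChars=',.')

# originalTextFor returns the matched slice of the input as a single string.
value = pp.originalTextFor(pp.OneOrMore(word | ','))

group = pp.Group(label + value('value'))
exp = pp.delimitedList(group, delim='.')


def parse_labeled(text):
    """Return a list of (label, sentence) tuples."""
    return [(g['label'], g['value']) for g in exp.parseString(text)]


text = "Base: Lote Número 1, Marcelo T de Alvear 500. Demanda: otras palabras."
for name, sentence in parse_labeled(text):
    print(name, '->', sentence)
```

The trailing period is simply left unparsed, because parseString does not require the whole input to match unless parseAll=True is passed.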

爱,才寂寞 2024-12-15 22:16:06

You should look into PyICU which provides access to the rich Unicode text library provided by ICU, including the BreakIterator class that provides a sentence finder.
