Getting a whole unicode sentence
I'm trying to parse a sentence like Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras.
I want to: first, split the text by periods, then, use whatever is before the colon as a label
for the sentence after the colon.
Right now I have the following definition:
from pyparsing import *

unicode_printables = u''.join(unichr(c) for c in xrange(65536)
                              if not unichr(c).isspace())

def parse_test(text):
    label = Word(alphas) + Suppress(':')
    value = OneOrMore(Word(unicode_printables) | Literal(','))
    group = Group(label.setResultsName('label') + value.setResultsName('value'))
    exp = delimitedList(
        group,
        delim='.'
    )
    return exp.parseString(text)
This kind of works, but it drops the unicode characters (and anything that is not in alphanums), and I would like to have the value as a whole sentence and not this:
'value': [(([u'Lote', u'Numero', u'1', ',', u'Marcelo', u'T', u'de', u'Alvear', u'500'], {}), 1)
Is there a simple way to tackle this?
Comments (2)
To directly answer your question, wrap your value definition with originalTextFor, and this will give you back the string slice that the matching tokens came from, as a single string. You could also add a parse action that joins the matched tokens with spaces, but this would explicitly put a single space between each item, when there might have been no spaces (in the case of a ',' after a word), or more than one space.
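To see the difference between the two approaches, here is a minimal sketch in Python 3 syntax. The sample string with extra spaces after the comma, the simplified Word(alphanums) word definition, and the names make_words, joined and sliced are assumptions chosen for illustration; they are not part of the question's parser.

```python
import pyparsing as pp

# Hypothetical sample: note the three spaces after the comma.
text = "Lote Numero 1,   Marcelo T de Alvear 500"

def make_words():
    # Words and commas, tokenized the same way in both variants.
    return pp.OneOrMore(pp.Word(pp.alphanums) | ",")

# Variant 1: a parse action that joins the tokens with single spaces.
joined = make_words().setParseAction(lambda tokens: " ".join(tokens))

# Variant 2: originalTextFor returns the matched slice of the input unchanged.
sliced = pp.originalTextFor(make_words())

joined_value = joined.parseString(text)[0]
sliced_value = sliced.parseString(text)[0]

print(repr(joined_value))  # 'Lote Numero 1 , Marcelo T de Alvear 500'
print(repr(sliced_value))  # 'Lote Numero 1,   Marcelo T de Alvear 500'
```

The joined variant inserts a space before the comma and collapses the run of spaces after it, while the sliced variant keeps the input spacing as it was.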
originalTextFor will give you the exact input substring. But even simpler, if you are just reading everything after the ':', would be to use restOfLine. (Of course, the simplest would be just to use split(':'), but I assume you are specifically asking how to do this with pyparsing.)

A couple of other notes:
- xxx.setResultsName('yyy') can be shortened to just xxx('yyy'), improving the readability of your parser definition.
- Your definition of value as OneOrMore(Word(unicode_printables) | Literal(',')) has a couple of problems. For one thing, ',' will be included in the set of characters in unicode_printables, so ',' will be included in any parsed words. The best way to solve this is to use the excludeChars parameter to Word, so that your sentence words do not include commas: OneOrMore(Word(unicode_printables, excludeChars=',') | ','). Now you can also exclude other possible punctuation, like ';', '-', etc., just by adding them to the excludeChars string. (I just noticed that you are using '.' as a delimiter for a delimitedList - for this to work, you will have to include '.' as an excluded character too.) Pyparsing is not like a regular expression in this regard - it does not do any lookahead to try to match the next token in the parser if the next character continues to match the current token. That is why you have to do some extra work of your own to avoid reading too much. In general, something as open-ended as OneOrMore(Word(unicode_printables)) is very likely to eat up the entire rest of your input string.
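Putting these notes together, the question's parser might be reworked roughly as follows. This is a sketch in Python 3 syntax under stated assumptions: unicode_printables here is a reduced stand-in (non-space characters from U+0021 through U+024F) rather than the question's full 65536-character range, the accented sample sentence is made up for the demonstration, and the name word is introduced only for readability.

```python
import pyparsing as pp

# Assumption: a reduced stand-in for the question's unicode_printables,
# covering the non-space characters from U+0021 through U+024F.
unicode_printables = "".join(
    chr(c) for c in range(0x21, 0x250) if not chr(c).isspace()
)

def parse_test(text):
    # xxx('yyy') is shorthand for xxx.setResultsName('yyy').
    label = pp.Word(pp.alphas)("label") + pp.Suppress(":")
    # Words stop at ',' and '.', so the '.' delimiter is never swallowed.
    word = pp.Word(unicode_printables, excludeChars=",.")
    # originalTextFor turns the matched tokens into one slice of the input.
    value = pp.originalTextFor(pp.OneOrMore(word | ","))("value")
    group = pp.Group(label + value)
    exp = pp.delimitedList(group, delim=".")
    return exp.parseString(text)

# Hypothetical sample with an accented character.
sample = "Base: Lote Número 1, Marcelo T de Alvear 500. Demanda: otras palabras."
for item in parse_test(sample):
    print(item["label"], "->", item["value"])
# Base -> Lote Número 1, Marcelo T de Alvear 500
# Demanda -> otras palabras
```

Because '.' is excluded from the words, any value that itself contains a period (an abbreviation, a decimal number) would be cut short, so the excluded set is a trade-off to tune against the real input.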
You should look into PyICU, which gives access to ICU's rich Unicode text library, including the BreakIterator class that provides a sentence finder.