Splitting a string after punctuation while keeping the punctuation
I'm trying to split a string of words into a list of words via regex. I'm still a bit of a beginner with regular expressions.
I'm using nltk.regexp_tokenize, which is yielding results that are close, but not quite what I want.
This is what I have so far:
>>> import re, codecs, nltk
>>> sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"
>>> pattern = r"""(?x)
#words with internal hyphens
| \w+(-\w+)*
#ellipsis
| \.\.\.
#other punctuation tokens
| [][.,;!?"'():-_`]
"""
>>> nltk.regexp_tokenize(sentence.decode("utf8"), pattern)
[u'd\xe9test\xe9', u'Rochard', u'!', u'm', u"'", u'\xe9tais', u'\xe0', u'qu', u"'", u'on', u'...', u"'", u'C', u"'", u'est', u'hyper-cool', u'.', u"'", u':', u')', u':', u'P']
I would like to have the output as follows:
[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0', u"qu'", u'on', u'...', u"'", u"C'", u'est', u'hyper-cool', u'.', u"'", u':)', u':P']
I have a workaround for the "emoticons", so what I'm most concerned with are quotes.
Comments (1)
It seems that the desired output is not consistent with your input sentence:
- [u"qu'", u'on']: I can't figure out where these two matches were determined from in your sentence.
- u'.' was not part of u'hyper-cool' (assuming you want the punctuation as part of the word).
- u"'" was not part of u"C'" (assuming you want the punctuation as part of the word).
Also, if you just want a regex split, is there any reason why you are using nltk apart from splitting the lines? I have no experience with nltk, so I would propose just a regex solution. See if this works for you.
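A minimal sketch of a regex-only tokenizer (Python 2, like your session; the emoticon alternative and the exact punctuation set are placeholders rather than a tested drop-in): the alternative that keeps a trailing apostrophe has to come before the plain word alternative, the groups are non-capturing so findall returns whole matches, and the hyphen inside the character class is escaped so that :-_ is not read as a range.

# -*- coding: utf-8 -*-
import re

sentence = u"détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"

pattern = re.compile(ur"""
      :[)P]                 # emoticons :) and :P (placeholder set)
    | \w+(?:-\w+)*'         # word keeping a trailing apostrophe: m', qu', C'
    | \w+(?:-\w+)*          # word, possibly with internal hyphens
    | \.\.\.                # ellipsis
    | [][.,;!?"'():\-_`]    # other punctuation, one character at a time
""", re.VERBOSE | re.UNICODE)

print pattern.findall(sentence)
# [u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0', u'...',
#  u"'", u"C'", u'est', u'hyper-cool', u'.', u"'", u':)', u':P']

That is your desired list except for [u"qu'", u'on'], which, as noted above, do not correspond to anything in the input sentence.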
If you need more information on capturing groups, non-capturing groups, character classes, Unicode matching and findall, I would suggest you take a cursory glance at the documentation for Python's re module.
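For instance (this is why the sketch above sticks to (?:...)): with re.findall, a capturing group in the pattern makes findall return the group's text rather than the whole match.

import re

print re.findall(r"\w+(-\w+)*", "hyper-cool")    # ['-cool']       <- only the captured group
print re.findall(r"\w+(?:-\w+)*", "hyper-cool")  # ['hyper-cool']  <- the full match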
Also, I am not sure whether the way you are continuing the string across multiple lines is appropriate in this scenario. If you need more information on splitting a string across lines (as opposed to multi-line strings), please have a look at this.
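If the goal is just to keep a long pattern readable without re.VERBOSE, one option (a sketch of the idea, not necessarily what the linked post shows) is implicit concatenation of adjacent string literals, which splits the source across lines without putting newlines into the pattern itself:

pattern = (r":[)P]"             # emoticons
           r"|\w+(?:-\w+)*'"    # elided words: m', qu', C'
           r"|\w+(?:-\w+)*"     # plain / hyphenated words
           r"|\.\.\.")          # ellipsis
# The adjacent literals are joined at compile time into a single one-line pattern.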