Splitting a string after punctuation while including the punctuation

Posted on 2024-12-21 16:24:19

I'm trying to split a string of words into a list of words via regex. I'm still a bit of a beginner with regular expressions.

I'm using nltk.regexp_tokenize, which yields results that are close to, but not quite, what I want.

This is what I have so far:

>>> import re, codecs, nltk
>>> sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"    
>>> pattern = r"""(?x)
    #words with internal hyphens
    | \w+(-\w+)*
    #ellipsis
    | \.\.\.
    #other punctuation tokens
    | [][.,;!?"'():-_`]
    """ 
>>> nltk.regexp_tokenize(sentence.decode("utf8"), pattern)
[u'd\xe9test\xe9', u'Rochard', u'!', u'm', u"'", u'\xe9tais', u'\xe0', u'qu', u"'", u'on', u'...', u"'", u'C', u"'", u'est', u'hyper-cool', u'.', u"'", u':', u')', u':', u'P']

I would like to have the output as follows:

[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0', u"qu'", u'on', u'...', u"'", u"C'", u'est', u'hyper-cool', u'.', u"'", u':)', u':P']

I have a workaround for the "emoticons", so what I'm most concerned with are quotes.

秋千易 2024-12-28 16:24:19

The desired output does not seem to be consistent with your input sentence:

[u"qu'", u'on']: I can't figure out where these two matches come from in your sentence.
Why is u'.' not part of u'hyper-cool'? (Assuming you want the punctuation as part of the word.)
Why is u"'" not part of u"C'"? (Assuming you want the punctuation as part of the word.)

Also, if you just want a regex split, is there any reason to use nltk apart from splitting the lines? I have no experience with nltk, so I am proposing just a regex solution.

>>> sentence
u"d\xe9test\xe9 Rochard ! m'\xe9tais \xe0... 'C'est hyper-cool.' :) :P"
>>> pattern = re.compile(
    u"("  # capturing group
    "(?:"  # non-capturing group
    "[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]?"  # 0 or 1 punctuation character
    "[\w\-]+"                            # alphanumeric Unicode word, hyphens allowed
    "[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]?"  # 0 or 1 punctuation character
    ")"
    "|(?:[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]+)"  # 1 or more punctuation characters
    ")", re.UNICODE)
>>> pattern.findall(sentence)
[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0.', u'..', u"'C'", u'est', u'hyper-cool.', u"'", u':)', u':P']

See if this works for you.
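For comparison, here is a minimal sketch of a plain `re` tokenizer that aims at the output requested in the question, where an elided word keeps its apostrophe (`m'`, `C'`) but the period after `hyper-cool` stays a separate token. It assumes Python 3, where `str` is already Unicode and `\w` matches accented letters by default. The names `TOKEN_RE` and `tokenize` and the small emoticon set are illustrative assumptions, not part of the thread.

```python
import re

# Hypothetical tokenizer. Alternatives are tried left to right, so order matters.
TOKEN_RE = re.compile(r"""
    \w+'(?=\w)        # elided word keeps its apostrophe when a word follows: m', C'
  | \w+(?:-\w+)*      # word, optionally with internal hyphens: hyper-cool
  | \.\.\.            # ellipsis
  | [:;]-?[()DP]      # small, assumed set of emoticons
  | [^\w\s]           # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text (whitespace is skipped)."""
    return TOKEN_RE.findall(text)


sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"
tokens = tokenize(sentence)
print(tokens)

# Side note: inside a character class, ":-_" is a range from ":" to "_",
# which also covers the uppercase letters "A" to "Z".
assert re.fullmatch(r"[:-_]", "R") is not None
```

The lookahead `(?=\w)` is what separates an elision apostrophe from a closing quote: the apostrophe is attached only when another word character follows it. Since the sample sentence contains no `qu'on`, those two tokens from the desired output cannot appear here either.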

If you need more information on capturing groups, non-capturing groups, character classes, Unicode matching and findall, I suggest taking a quick look at the documentation of Python's re package.
Also, I am not sure whether the way you continue the string across multiple lines is appropriate in this scenario. If you need more information on splitting a string across lines (not multi-line strings), please have a look at this.
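The two ways of spreading a pattern over several lines can be sketched side by side. This is an illustration under assumptions (Python 3, a shortened pattern, and the hypothetical names `verbose`, `concatenated` and `sample`), not the code from the question or the answer: a triple-quoted string needs the verbose flag so that the regex engine ignores the embedded whitespace and comments, while adjacent string literals are simply joined by Python before the pattern is compiled.

```python
import re

# Style 1: one triple-quoted string. re.VERBOSE (or an inline "(?x)") tells the
# regex engine to ignore whitespace and "#" comments inside the pattern.
verbose = re.compile(r"""
    \w+(?:-\w+)*   # word with optional internal hyphens
  | \.\.\.         # ellipsis
""", re.VERBOSE)

# Style 2: adjacent string literals are concatenated by Python itself, so the
# comments here are ordinary Python comments and no flag is needed.
concatenated = re.compile(
    r"\w+(?:-\w+)*"  # word with optional internal hyphens
    r"|\.\.\."       # ellipsis
)

sample = "hyper-cool... non"
print(verbose.findall(sample))
print(concatenated.findall(sample))
```

Both objects describe the same pattern. The verbose form tends to be easier to read for long alternations, but any literal space or `#` in it has to be escaped or placed inside a character class.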
