Python regex NLTK website extraction
Hi, I have never had to deal with regexes before, and I'm trying to preprocess some raw text with Python and NLTK. When I tried to tokenize the document using:
tokens = nltk.regexp_tokenize(corpus, sentence_re)
sentence_re = r'''(?x) # set flag to allow verbose regexps
([A-Z])(\.[A-Z])+\.? # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \#?\w+|\@?\w+ # hashtags and @ signs
| \.\.\. # ellipsis
| [][.,;"'?()-_`] # these are separate tokens
| (?:http://|www.)[^"\' ]+ # websites
'''
It's not able to keep a whole website URL as one single token:
print toks[:50]
['on', '#Seamonkey', '(', 'SM', ')', '-', 'I', 'had', 'a', 'short', 'chirp', 'exchange', 'with', '@angie1234p', 'at', 'the', '18thDec', ';', 'btw', 'SM', 'is', 'faster', 'has', 'also', 'an', 'agile', '...', '1', '/', '2', "'", '...', 'user', 'community', '-', 'http', ':', '/', '/', 'bit', '.', 'ly', '/', 'XnF5', '+', 'ICR', 'http', ':', '/', '/']
Any help is greatly appreciated. Thanks so much!
-Florie
In this tokenizer, regular expressions are used to specify what the tokens you want to extract from the text should look like.
I'm a bit confused about which of the many regular expressions above you actually used, but for a very simple tokenization into non-whitespace tokens you could use:
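The snippet at this point was lost from the page; a minimal sketch of what was likely shown, calling `nltk.regexp_tokenize` with a pattern that matches maximal runs of non-whitespace (the sample text is made up from the question's output):

```python
import nltk

text = "btw SM is faster - http://bit.ly/XnF5"

# \S+ matches any maximal run of non-whitespace characters,
# so URLs survive as single tokens.
toks = nltk.regexp_tokenize(text, r'\S+')
print(toks)
# ['btw', 'SM', 'is', 'faster', '-', 'http://bit.ly/XnF5']
```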
which is equivalent to:
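This snippet is also missing; presumably the plain-`re` equivalent, which needs no NLTK at all and gives the same result as splitting on whitespace:

```python
import re

text = "btw SM is faster - http://bit.ly/XnF5"

# re.findall with \S+ returns every non-whitespace run,
# which is the same as text.split() for this pattern.
toks = re.findall(r'\S+', text)
```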
Another approach could be using the NLTK functions sent_tokenize() and word_tokenize():
Though if your text contains lots of website URLs, this might not be the best choice. Information about the different tokenizers in NLTK can be found here.
If you just want to extract URLs from the corpus, you could use a regular expression like this:
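The regex itself is missing from the page; a hypothetical pattern in the spirit of the last alternative in the question's regex (an `http://` or `www.` prefix followed by everything up to whitespace or a quote), applied with `re.findall`:

```python
import re

# Non-capturing group for the prefix, then greedily consume
# everything that is not whitespace or a quote character.
url_re = re.compile(r'(?:http://|www\.)[^"\'\s]+')

corpus = "user community - http://bit.ly/XnF5 + ICR http://example.com"
urls = url_re.findall(corpus)
print(urls)
# ['http://bit.ly/XnF5', 'http://example.com']
```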
Hope this helps. If this was not the answer you were looking for, please try to explain a bit more precisely what you want to do and how exactly you want your tokens to look (e.g. an example input/output you would like to have), and we can help find the right regular expression.