Python regular expressions: extracting websites with NLTK

Hi, I have never had to deal with regexes before, and I'm trying to preprocess some raw text with Python and NLTK.
When I tried to tokenize the document using:

sentence_re = r'''(?x)    # set flag to allow verbose regexps
  ([A-Z])(\.[A-Z])+\.?    # abbreviations, e.g. U.S.A.
| \w+(-\w+)*              # words with optional internal hyphens
| \$?\d+(\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
| \#?\w+|\@?\w+           # hashtags and @ signs
| \.\.\.                  # ellipsis
| [][.,;"'?()-_`]         # these are separate tokens
| (?:http://|www.)[^"\' ]+  # websites
'''
tokens = nltk.regexp_tokenize(corpus, sentence_re)

it's not able to keep the whole website URL as one single token:

print tokens[:50]
['on', '#Seamonkey', '(', 'SM', ')', '-', 'I', 'had', 'a', 'short', 'chirp',   'exchange', 'with', '@angie1234p', 'at', 'the', '18thDec', ';', 'btw', 'SM', 'is', 'faster', 'has', 'also', 'an', 'agile', '...', '1', '/', '2', "'", '...', 'user', 'community', '-', 'http', ':', '/', '/', 'bit', '.', 'ly', '/', 'XnF5', '+', 'ICR', 'http', ':', '/', '/']

Any help is greatly appreciated. Thanks so much!

-Florie


In this tokenizer, regular expressions are used to specify what the tokens you want to extract from the text should look like.
I'm a bit confused about which of the many regular expressions above you actually used, but for a very simple tokenization into non-whitespace tokens you could use:

>>> corpus = "this is a sentence. and another sentence. my homepage is http://test.com"
>>> nltk.regexp_tokenize(corpus, r"\S+")
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']

which is equivalent to:

>>> corpus.split()
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']
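
If you end up reusing the same pattern on many documents, the call above can also be written with NLTK's RegexpTokenizer class, which regexp_tokenize is a thin convenience wrapper around. A minimal sketch:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r"\S+")   # build the tokenizer once, reuse it on many documents
>>> tokenizer.tokenize(corpus)
['this', 'is', 'a', 'sentence.', 'and', 'another', 'sentence.', 'my', 'homepage', 'is', 'http://test.com']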

Another approach would be to use the NLTK functions nltk.sent_tokenize() and nltk.word_tokenize():

>>> sentences = nltk.sent_tokenize(corpus)
>>> sentences
['this is a sentence.', 'and another sentence.', 'my homepage is http://test.com']
>>> for sentence in sentences:
...     print nltk.word_tokenize(sentence)
['this', 'is', 'a', 'sentence', '.']
['and', 'another', 'sentence', '.']
['my', 'homepage', 'is', 'http', ':', '//test.com']

Though if your text contains lots of website URLs, this might not be the best choice, as you can see from how the URL gets split up above. Information about the different tokenizers in NLTK can be found in the nltk.tokenize documentation.
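
If you want word-level tokens but still need URLs to survive as single tokens, one option is to go back to regexp_tokenize with a pattern much like the one in your question, but with the URL alternative moved to the front. Regex alternation tries the branches left to right at each position, so with the original ordering \w+ grabs "http" before the URL branch is ever tried; putting the URL branch first, and using non-capturing groups so the findall-based tokenizer returns whole matches, avoids that. A rough sketch (the exact alternatives are only an example, adjust them to your data):

import nltk

pattern = r'''(?x)                 # verbose regexp
      (?:https?://|www\.)[^\s"']+  # URLs first, before \w+ can match "http"
    | [A-Z](?:\.[A-Z])+\.?         # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?           # currency and percentages, e.g. $12.40, 82%
    | [@\#]?\w+(?:-\w+)*           # words, hashtags and @mentions, optional hyphens
    | \.\.\.                       # ellipsis
    | [][.,;"'?():`_-]             # these are separate tokens
'''
print(nltk.regexp_tokenize("user community - http://bit.ly/XnF5", pattern))
# ['user', 'community', '-', 'http://bit.ly/XnF5']  -- the URL stays in one piece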

If you just want to extract the URLs from the corpus, you could use a regular expression like this:

nltk.regexp_tokenize(corpus, r'(http://|https://|www.)[^"\' ]+')
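
One detail to watch out for: that pattern uses a capturing group, and with findall-style matching (plain re.findall, and depending on the NLTK version regexp_tokenize as well) capturing parentheses make the result contain only the group contents, e.g. 'http://', rather than the whole URL. A non-capturing group (?: ... ) avoids this; a small sketch with the standard re module:

>>> import re
>>> re.findall(r'(?:https?://|www\.)[^\s"\']+', corpus)
['http://test.com']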

Hope this helps. If this was not the answer you were looking for, please try to explain a bit more precisely what you want to do and what you want your tokens to look like (e.g. an example input/output you would like to have), and we can help you find the right regular expression.
