nltk/python stop words problem

Posted on 2024-10-28 08:27:37

I have some code that processes a dataset for later use. The code I'm using for the stop words seems to be OK, but I think the problem lies in the rest of my code, because it only removes some of the stop words.

import re
import nltk

# Quran subset
filename = 'subsetQuran.txt'

# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)

word_list2 = [w for w in word_list if not w in nltk.corpus.stopwords.words('english')]



# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]') 
for word in word_list2:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try: 
        freq_dic[word] += 1
    except: 
        freq_dic[word] = 1


print '-'*30

print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result
for freq, word in freq_list2:
    print word, freq
f = open("wordfreq.txt", "w")
f.write( str(freq_list3) )
f.close()

The output looks like this:

[(71, 'allah'), (65, 'ye'), (46, 'day'), (21, 'lord'), (20, 'truth'), (20, 'say'), (20, 'and')

This is just a small sample; there are other words that should have been removed.
Any help is appreciated.


埖埖迣鎅 2024-11-04 08:27:37

Try stripping your words while making your word_list2:

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]
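
Note that strip() with no arguments only removes whitespace, which re.split('\s+', ...) has already taken care of. So another possible cause is the order of the steps: in the question the stop-word test runs before the punctuation is removed, so a token such as "and," would not match the stop word "and" and would only turn into "and" afterwards. Below is a minimal sketch of the reordered logic, written for Python 3 rather than the Python 2 of the question. STOP_WORDS is a tiny hypothetical stand-in so the example runs without the NLTK corpus, and word_frequencies and sample are illustrative names that do not appear in the original code.

```python
import re
from collections import Counter

# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')),
# kept tiny so the sketch runs without downloading the NLTK corpus.
STOP_WORDS = frozenset(["and", "the", "of", "to", "in", "is", "a"])

# Same punctuation/digit pattern as in the question.
PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count non-stop words, cleaning each token BEFORE the stop-word test."""
    counts = Counter()
    for token in text.lower().split():
        word = PUNCTUATION.sub("", token)
        # Skip tokens that were only punctuation or digits, and stop words.
        if word and word not in stop_words:
            counts[word] += 1
    return counts


# Made-up sample text in which stop words carry trailing punctuation.
sample = "Say the truth, and the truth is the day; and, say: truth!"
freqs = word_frequencies(sample)

# Highest frequency first, as in the question's output.
print(freqs.most_common())
```

With the real list, building the set once (for example stop_words = set(nltk.corpus.stopwords.words('english'))) also avoids calling words('english') again for every token, which the list comprehension in the question does.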


About the author: 最笨的告白
