Calculating the frequency of function words

Posted on 2024-11-03 20:54:33

I would like to calculate the frequency of function words in Python/NLTK. I see two ways to go about it:

  • Use a part-of-speech (POS) tagger and sum the counts over the POS tags that correspond to function words
  • Create a list of function words and perform a simple lookup

The catch in the first case is that my data is noisy and I don't know for sure which POS tags count as function words. The catch in the second case is that I don't have a list, and since my data is noisy the lookup won't be accurate.

I would prefer the first approach to the second, or any other approach that would give me more accurate results.

Comments (2)

你又不是我 2024-11-10 20:54:33

I just used the LIWC English 2007 dictionary (which I paid for) and performed a simple lookup for now. Any other answers are most welcome.

I must say I am a little surprised by the impulsiveness of a couple of answers here. Since someone asked for code, here's what I did:

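import nltk

def get_func_word_freq(words, funct_words):
    """Return a dict mapping each function word to its frequency in `words`."""
    funct_set = set(funct_words)  # set membership tests are fast
    # Iterate over the tokens (not the word list) so repeated tokens are counted.
    fdist = nltk.FreqDist(word for word in words if word in funct_set)
    return dict(fdist)

def load_liwc_funct(liwc_dict_file):
    """Read the LIWC 2007 English dictionary and extract the function words.

    `liwc_dict_file` is the path to the tab-separated dictionary file.
    """
    funct_words = set()
    # Open in text mode so each line is a str that can be split on tabs.
    with open(liwc_dict_file, 'r') as data_file:
        for line in data_file:
            row = line.rstrip().split("\t")
            # Rows listing category '1' are treated as function words.
            if '1' in row:
                if row[0][-1:] == '*':
                    # Drop the trailing wildcard; only the bare stem is kept.
                    funct_words.add(row[0][:-1])
                else:
                    funct_words.add(row[0])
    return list(funct_words)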
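
A note on the wildcard entries: the loader above strips the trailing `*`, so an entry of that kind only matches a token that equals the bare stem. A minimal sketch of one alternative, assuming the `*` is meant as a prefix wildcard, is to keep those stems as prefixes and match them with `str.startswith`. The function names and the sample entries here are hypothetical and are not taken from the LIWC file.

```python
from collections import Counter

def split_liwc_entries(entries):
    """Split dictionary entries into exact words and wildcard prefixes."""
    exact, prefixes = set(), []
    for entry in entries:
        if entry.endswith('*'):
            stem = entry[:-1]
            if stem:  # skip a bare '*', which would match every token
                prefixes.append(stem)
        else:
            exact.add(entry)
    return exact, tuple(prefixes)

def count_function_words(tokens, entries):
    """Count tokens that equal an exact entry or start with a wildcard prefix."""
    exact, prefixes = split_liwc_entries(entries)
    counts = Counter()
    for token in tokens:
        word = token.lower()  # light normalization for noisy input
        if word in exact or word.startswith(prefixes):
            counts[word] += 1
    return counts

# Hypothetical entries, for illustration only.
entries = ["the", "of", "anybod*"]
tokens = "The opinion of anybody matters to the rest of us".split()
print(count_function_words(tokens, entries))
```

Lowercasing before the lookup is a small concession to noisy data; heavier normalization (stripping punctuation, fixing spelling) would depend on the data at hand.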

Anyone who has written some code in Python would tell you that performing a lookup or extracting words with specific POS tags isn't rocket science. In addition, the question's tags, NLP (Natural Language Processing) and NLTK (Natural Language Toolkit), should be indication enough for the astute.

Anyway, I understand and respect the sentiments of people who reply here, since most of it is free, but I think the least we can do is show a bit of respect to question posters. As has been rightly pointed out, help is received when you help others; similarly, respect is received when one respects others.

野鹿林 2024-11-10 20:54:33

You don't know which approach will work until you try. I recommend the first approach though; I've used it with success on very noisy data, where the "sentences" were email subject headers (short texts, not proper sentences) and even the language was unknown (some 85% English; the Cavnar & Trenkle algorithm broke down quickly). Success was defined as increased retrieval performance in a search engine; if you just want to count frequencies, the problem may be easier.

Make sure you use a POS tagger that takes context into account (most do). Inspect the list of words and frequencies you get and maybe eliminate some words that you don't consider function words, or even filter out words that are too long; that will eliminate the false positives.

(Disclaimer: I was using the Stanford POS tagger, not NLTK, so YMMV. I used one of the default models for English, trained, I think, on the Penn Treebank.)
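
For illustration, the tag-based approach might be sketched as below. This is a rough sketch under stated assumptions, not the setup described in this answer: the set of Penn Treebank tags treated as function-word tags is my own guess at the closed-class tags and should be adjusted after inspecting the output, and the names `FUNCTION_TAGS`, `function_word_freq` and `max_len` are hypothetical. The input is a list of `(word, tag)` pairs, the shape returned by `nltk.pos_tag`; a hand-tagged sample is used here so the sketch runs without a tagger model.

```python
from collections import Counter

# Assumption: Penn Treebank tags treated as function-word (closed-class) tags.
# Review and adjust this set against your own data.
FUNCTION_TAGS = {
    "CC", "DT", "EX", "IN", "MD", "PDT", "POS", "PRP", "PRP$",
    "RP", "TO", "WDT", "WP", "WP$", "WRB",
}

def function_word_freq(tagged_tokens, function_tags=FUNCTION_TAGS, max_len=None):
    """Count words whose POS tag is in `function_tags`.

    If `max_len` is given, words longer than that are skipped, as a crude
    filter against false positives from the tagger.
    """
    counts = Counter()
    for word, tag in tagged_tokens:
        if tag not in function_tags:
            continue
        if max_len is not None and len(word) > max_len:
            continue
        counts[word.lower()] += 1
    return counts

# Hand-tagged sample in the (word, tag) shape a POS tagger would produce.
tagged = [
    ("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
    ("the", "DT"), ("mat", "NN"), ("and", "CC"), ("it", "PRP"),
    ("purred", "VBD"),
]
freq = function_word_freq(tagged)
print(freq.most_common())
```

Inspecting `freq.most_common()` on real data is the step where words that should not count as function words can be spotted and removed, or where a `max_len` cutoff can be chosen.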
