How to extract common / significant phrases from a series of text entries
I have a series of text items - raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching).
My example is any review on Yelp.com, which shows 3 snippets from hundreds of reviews of a given restaurant, in the format:
"Try the hamburger" (in 44 reviews)
e.g., the "Review Highlights" section of this page:
http://www.yelp.com/biz/sushi-gen-los-angeles/
I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here.
4 Answers
I suspect you don't just want the most common phrases, but rather you want the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases.
To do this, you'll essentially want to extract n-grams from your data and then find the ones that have the highest point wise mutual information (PMI). That is, you want to find the words that co-occur together much more than you would expect them to by chance.
The NLTK collocations how-to covers how to do this in about 7 lines of code, e.g.:
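The code itself does not appear on this page; below is a minimal sketch of that approach, using NLTK's bigram collocation finder over a small placeholder string (the frequency filter and the top-10 cutoff are arbitrary choices, not taken from the how-to):

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Placeholder text: in practice, use your HTML-stripped review text instead.
text = "try the hamburger here , you really should try the hamburger"
words = text.lower().split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)                    # ignore bigrams seen fewer than 2 times
print(finder.nbest(bigram_measures.pmi, 10))   # top 10 bigrams ranked by PMI
```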
I think what you're looking for is chunking. I recommend reading chapter 7 of the NLTK book, or maybe my own article on chunk extraction. Both of these assume knowledge of part-of-speech tagging, which is covered in chapter 5.
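As a small illustration of the kind of noun-phrase chunking those chapters describe (a sketch only: the chunk grammar and the example sentence are assumptions of mine, not taken from the linked article):

```python
import nltk

# A toy sentence standing in for one of your reviews (placeholder text).
sentence = "Try the spicy tuna hand roll at Sushi Gen."

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)              # part-of-speech tagging (NLTK book, ch. 5)

# A simple noun-phrase pattern: optional determiner, any adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Print the chunked noun phrases.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```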
If you just want to get at n-grams larger than 3, you can try this. I'm assuming you've stripped out all the junk like HTML, etc.
Probably not very Pythonic, as I've only been doing this for a month or so myself, but it might be of help!
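The original code does not seem to have survived on this page; here is a rough sketch of the approach under the same assumption (already-cleaned review strings), counting 4- to 6-grams with nltk.ngrams and collections.Counter (the cutoffs are arbitrary):

```python
from collections import Counter
import nltk

# Placeholder reviews: in practice, use your HTML-stripped review texts.
reviews = [
    "you should really try the hamburger here",
    "if you come here try the hamburger it is great",
]

counts = Counter()
for review in reviews:
    tokens = review.lower().split()
    for n in range(4, 7):                      # count 4-, 5- and 6-grams
        counts.update(nltk.ngrams(tokens, n))

# The most common longer phrases across all reviews.
for gram, freq in counts.most_common(10):
    print(freq, " ".join(gram))
```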
Well, for a start you would probably have to remove all HTML tags (search for "<[^>]*>" and replace it with ""). After that, you could try the naive approach of looking for the longest common substring between every pair of text items, but I don't think you'd get very good results.
You might do better by first normalizing the words (reducing them to their base form, removing all accents, setting everything to lower or upper case) and then analysing them. Again, depending on what you want to accomplish, you might be able to cluster the text items better if you allow for some word-order flexibility, i.e. treat the text items as bags of normalized words and measure bag-content similarity.
I've commented on a similar (although not identical) topic here.
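A minimal sketch of that idea, assuming plain English reviews (stemming and accent removal are omitted for brevity, and the helper names are hypothetical): strip the tags, turn each item into a bag of lowercase words, and compare bags with Jaccard similarity:

```python
import re

def normalize(text):
    """Strip HTML tags and return the text as a bag (set) of lowercase words."""
    text = re.sub(r"<[^>]*>", "", text)          # remove HTML tags, as suggested above
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Bag-content similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

review_a = normalize("<p>Try the hamburger, it's great!</p>")
review_b = normalize("<p>You really have to try the Hamburger here.</p>")
print(jaccard(review_a, review_b))
```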