How to extract common / significant phrases from a series of text entries
I have a series of text items - raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching).
My example is any review on Yelp.com, which shows 3 snippets from hundreds of reviews of a given restaurant, in the format:
"Try the hamburger" (in 44 reviews)
e.g., the "Review Highlights" section of this page:
http://www.yelp.com/biz/sushi-gen-los-angeles/
I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here.
4 Answers
I suspect you don't just want the most common phrases, but rather you want the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases.
To do this, you'll essentially want to extract n-grams from your data and then find the ones that have the highest point wise mutual information (PMI). That is, you want to find the words that co-occur together much more than you would expect them to by chance.
The NLTK collocations how-to covers how to do this in about 7 lines of code, e.g.:
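The code itself does not appear on this page; below is a minimal sketch of that approach, using NLTK's bigram collocation finder over a small placeholder string (the frequency filter and the top-10 cutoff are arbitrary choices, not taken from the how-to):

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Placeholder text: in practice, use your HTML-stripped review text instead.
text = "try the hamburger here , you really should try the hamburger"
words = text.lower().split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)                    # ignore bigrams seen fewer than 2 times
print(finder.nbest(bigram_measures.pmi, 10))   # top 10 bigrams ranked by PMI
```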
I think what you're looking for is chunking. I recommend reading chapter 7 of the NLTK book, or maybe my own article on chunk extraction. Both of these assume knowledge of part-of-speech tagging, which is covered in chapter 5.
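As a small illustration of the kind of noun-phrase chunking those chapters describe (a sketch only: the chunk grammar and the example sentence are assumptions of mine, not taken from the linked article):

```python
import nltk

# A toy sentence standing in for one of your reviews (placeholder text).
sentence = "Try the spicy tuna hand roll at Sushi Gen."

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)              # part-of-speech tagging (NLTK book, ch. 5)

# A simple noun-phrase pattern: optional determiner, any adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Print the chunked noun phrases.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```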
If you just want to get at n-grams larger than 3, you can try this. I'm assuming you've stripped out all the junk like HTML, etc.
Probably not very Pythonic, as I've only been doing this for a month or so myself, but it might be of help!
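The original code does not seem to have survived on this page; here is a rough sketch of the approach under the same assumption (already-cleaned review strings), counting 4- to 6-grams with nltk.ngrams and collections.Counter (the cutoffs are arbitrary):

```python
from collections import Counter
import nltk

# Placeholder reviews: in practice, use your HTML-stripped review texts.
reviews = [
    "you should really try the hamburger here",
    "if you come here try the hamburger it is great",
]

counts = Counter()
for review in reviews:
    tokens = review.lower().split()
    for n in range(4, 7):                      # count 4-, 5- and 6-grams
        counts.update(nltk.ngrams(tokens, n))

# The most common longer phrases across all reviews.
for gram, freq in counts.most_common(10):
    print(freq, " ".join(gram))
```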
Well, for a start you would probably have to remove all HTML tags (search for "<[^>]*>" and replace it with ""). After that, you could try the naive approach of looking for the longest common substring between every pair of text items, but I don't think you'd get very good results.
You might do better by first normalizing the words (reducing them to their base form, removing all accents, setting everything to lower or upper case) and then analysing them. Again, depending on what you want to accomplish, you might be able to cluster the text items better if you allow for some word-order flexibility, i.e. treat the text items as bags of normalized words and measure bag-content similarity.
I've commented on a similar (although not identical) topic here.
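A minimal sketch of that idea, assuming plain English reviews (stemming and accent removal are omitted for brevity, and the helper names are hypothetical): strip the tags, turn each item into a bag of lowercase words, and compare bags with Jaccard similarity:

```python
import re

def normalize(text):
    """Strip HTML tags and return the text as a bag (set) of lowercase words."""
    text = re.sub(r"<[^>]*>", "", text)          # remove HTML tags, as suggested above
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Bag-content similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

review_a = normalize("<p>Try the hamburger, it's great!</p>")
review_b = normalize("<p>You really have to try the Hamburger here.</p>")
print(jaccard(review_a, review_b))
```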