How to find common phrases in a large body of text
I'm working on a project at the moment where I need to pick out the most common phrases in a huge body of text. For example, say we have the following three sentences:
- The dog jumped over the woman.
- The dog jumped into the car.
- The dog jumped up the stairs.
From the above example I would want to extract "the dog jumped", as it is the most common phrase in the text. At first I thought, "Oh, let's use a directed graph [with repeated nodes]":
directed graph http://img.skitch.com/20091218-81ii2femnfgfipd9jtdg32m74f.png
EDIT: Apologies, I made a mistake while making this diagram: "over", "into", and "up" should all link back to "the".
I was going to maintain a count of how many times a word occurred in each node object ("the" would be 6; "dog" and "jumped", 3; etc.), but, among its many other problems, the main one shows up when we add a few more examples like the following (please ignore the bad grammar :-)):
- Dog jumped up and down.
- Dog jumped like no dog had ever jumped before.
- Dog jumped happily.
We now have a problem, since "dog" would start a new root node (at the same level as "the") and we would not identify "dog jumped" as the most common phrase. So now I am thinking maybe I could use an undirected graph to map the relationships between all the words and eventually pick out the common phrases, but I'm not sure how that would work either, as you lose the important ordering relationship between the words.
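One brute-force alternative that keeps the word order but does not tie every phrase to a root node is to slide a window over each sentence and count every n-gram in a plain frequency table. A minimal sketch in base R (the toy sentences and the choice of bigrams below are only assumptions for illustration):

```r
## Sketch only: count every sliding-window bigram instead of anchoring
## phrases at a root word. Sentences and n = 2 are illustrative assumptions.
sentences <- c("The dog jumped over the woman.",
               "The dog jumped into the car.",
               "The dog jumped up the stairs.",
               "Dog jumped up and down.",
               "Dog jumped like no dog had ever jumped before.",
               "Dog jumped happily.")

# lower-case, strip punctuation, split each sentence into words
tokens <- lapply(tolower(gsub("[[:punct:]]", "", sentences)),
                 function(s) strsplit(s, "\\s+")[[1]])

# all n-grams of a word vector, as space-joined strings
ngrams_of <- function(words, n) {
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

counts <- sort(table(unlist(lapply(tokens, ngrams_of, n = 2))),
               decreasing = TRUE)
head(counts)   # "dog jumped" comes out on top (6), ahead of "the dog" (3)
```

The same idea extends to longer phrases by also counting trigrams, 4-grams, and so on, and keeping whichever n-grams clear some frequency threshold.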
So does anyone have any general ideas on how to identify common phrases in a large body of text, and what data structure should I use?
Thanks,
Ben
Check out this related question: What techniques/tools are there for discovering common phrases in chunks of text? Also related to the longest common substring problem.
I've posted this before, but I use R for all of my data-mining tasks, and it's well suited to this kind of analysis. In particular, look at the tm package. Here are some relevant links: the mailing list (https://stat.ethz.ch/pipermail/r-devel/) and newsgroup postings from 2006.
More generally, there are a large number of text-mining packages listed in the Natural Language Processing task view on CRAN.
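For a rough idea of what that looks like in practice, here is a sketch along the lines of the bigram-tokenizer example in the tm FAQ; the sample sentences are placeholders, and the exact calls should be double-checked against the installed tm/NLP versions:

```r
## Sketch only: build a bigram term-document matrix with tm, following the
## pattern shown in the tm FAQ. Sample sentences are placeholders.
library(tm)   # tm depends on the NLP package, which supplies words() and ngrams()

sentences <- c("The dog jumped over the woman.",
               "The dog jumped into the car.",
               "The dog jumped up the stairs.")

corpus <- VCorpus(VectorSource(sentences))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# custom tokenizer that emits bigrams instead of single words
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
findFreqTerms(tdm, lowfreq = 3)   # e.g. "the dog", "dog jumped"
```

findFreqTerms() then returns every bigram whose total count clears the threshold, which is essentially the "most common phrase" list asked about above.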