Text mining on a large set of strings
I have list of strings. (pretty big list of ids and strings scattered in 4-5 big files. around a GB each). These strings are formatted like this:
1,Hi
2,Hi How r u?
2,How r u?
3,where r u?
3,what does this mean
3,what it means
Now I want to do text mining on these strings and prepare a dendrogram that displays the strings in the following way:
1-Hi
2-Hi How r u?
----How r u?
3-What does this mean?
----what it means?
3-Where are you?
This output groups the strings that follow the comma by their similarity, per id (assume the id identifies the person who wrote those strings). If another person used the same words, their strings should still be grouped under the strings that person used.
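To make the intended grouping concrete, here is a single-machine Python sketch (a toy, not the Hadoop/Mahout-scale solution being asked about; the whitespace tokenization and the Jaccard similarity threshold are assumptions, and pairs like "mean"/"means" would need stemming to match):

```python
from collections import defaultdict

def tokenize(s):
    return set(s.lower().split())

def jaccard(a, b):
    # Overlap ratio of two token sets: 0 = disjoint, 1 = identical.
    return len(a & b) / len(a | b)

def group_strings(lines, threshold=0.5):
    """Group each person's strings; similar strings are listed
    under the first string of their group, prefixed with ----."""
    by_id = defaultdict(list)
    for line in lines:
        pid, text = line.split(",", 1)
        by_id[pid].append(text)

    out = []
    for pid in sorted(by_id):
        groups = []  # each group is a list of mutually similar strings
        for text in by_id[pid]:
            toks = tokenize(text)
            for g in groups:
                if jaccard(toks, tokenize(g[0])) >= threshold:
                    g.append(text)
                    break
            else:
                groups.append([text])
        for g in groups:
            out.append(f"{pid}-{g[0]}")
            out.extend(f"----{t}" for t in g[1:])
    return out

lines = ["1,Hi", "2,Hi How r u?", "2,How r u?",
         "3,where r u?", "3,what does this mean", "3,what it means"]
print("\n".join(group_strings(lines)))
```

With this threshold, "Hi How r u?" and "How r u?" share 3 of 4 tokens and end up in one group; the scaling question (MapReduce over 4-5 GB of input) is separate from this grouping logic.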
Now, this seems like a simple task. But I want it done on Hadoop/Mahout, or on something else that can handle a huge data set on clustered Linux machines. How should I approach this problem? I have already tried different approaches in Mahout: I created sequence files and seq2sparse vectors and then attempted clustering, but it didn't work for me. Any help or pointers in the right direction would be greatly appreciated.
Thanks & Regards,
Atul
Comments (1)
I think what you really need is hierarchical clustering. There was one implementation proposed for Mahout, and another is implemented in the Shogun Toolbox (also designed for large-scale computation). But it's hard to guarantee that it will work, because your input looks hard to cluster.
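As a small illustration of hierarchical clustering, here is a pure-Python single-linkage agglomerative sketch over Jaccard distances (a toy, not the Mahout or Shogun implementation; the distance function and cut-off are assumptions):

```python
def jaccard_dist(a, b):
    # Distance = 1 - token overlap ratio; 0 means identical token sets.
    a, b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(a & b) / len(a | b)

def single_linkage(items, max_dist):
    """Agglomerative clustering: repeatedly merge the two closest
    clusters until the closest pair is farther apart than max_dist."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance = closest member pair.
                d = min(jaccard_dist(items[a], items[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break  # cutting the dendrogram at height max_dist
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [[items[k] for k in c] for c in clusters]

msgs = ["Hi How r u?", "How r u?", "where r u?",
        "what does this mean", "what it means"]
for cluster in single_linkage(msgs, max_dist=0.4):
    print(cluster)
```

At this cut-off, "Hi How r u?" and "How r u?" merge (distance 0.25) while the rest stay apart; note that "what does this mean" and "what it means" only share one raw token, which is why plain token overlap struggles on this kind of input and stemming or character n-grams may be needed.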