Text mining on a large number of strings

Posted on 2024-12-03 02:06:10

I have a list of strings (a pretty big list of ids and strings, spread across 4-5 large files of around 1 GB each). The strings are formatted like this:

1,Hi

2,Hi How r u?

2,How r u?

3,where r u?

3,what does this mean

3,what it means
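For illustration, records in this format can be grouped by id roughly as follows. This is only a sketch: the function name `group_by_id` is hypothetical, it assumes the id is everything before the first comma, and it keeps all groups in memory, which would not hold for the full 1 GB files (on Hadoop the same grouping would come from emitting the id as the key).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_id(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group the text after the first comma under the id before it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ident, sep, text = line.partition(",")
        if not sep:
            continue  # no comma, so not an "id,string" record
        groups[ident.strip()].append(text.strip())
    return dict(groups)


# The sample records from the question.
sample = [
    "1,Hi",
    "2,Hi How r u?",
    "2,How r u?",
    "3,where r u?",
    "3,what does this mean",
    "3,what it means",
]
grouped = group_by_id(sample)
for ident, texts in grouped.items():
    print(ident, texts)
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the `sample` list.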

Now I want to do text mining on these strings and build a dendrogram that displays them in the following way:

1-Hi

2-Hi How r u?

 ----How r u?

3-What does this mean?

 ----what it means?

3-Where are you?

This output is based on the similarity of the strings that follow the comma, grouped per id (assume the id identifies the person who used those strings). If another person used the same words, their strings should likewise be grouped according to the strings that person used.

This seems like a simple task, but I want to do it on Hadoop/Mahout, or on something else that can handle a huge data set on a cluster of Linux machines. I would also like to know how to approach the problem. I have already tried different approaches in Mahout: I created sequence files, converted them to sparse vectors with seq2sparse, and then tried to cluster them, but it didn't work for me. Any help or pointers would be greatly appreciated.

Thanks & Regards,
Atul

Comments (1)

陌伤浅笑 2024-12-10 02:06:10

I think what you really need is hierarchical clustering. An implementation was proposed for Mahout, and one is also available in the Shogun Toolbox (which is likewise designed for large-scale computation). It is hard to guarantee that it will work, though, because the input looks difficult to cluster.
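To make the idea concrete, here is a minimal single-machine sketch of agglomerative (single-linkage) clustering over the strings of each id. It is not the Mahout or Shogun API. The word-level Jaccard distance, the 0.85 cut-off, and the names `tokens`, `jaccard_distance`, `single_linkage` and `render` are all assumptions picked for the sample data, and the naive pairwise search would not scale to gigabytes as written.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens; punctuation is dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """One minus the share of tokens common to both sets."""
    union = a | b
    if not union:
        return 0.0  # two empty strings count as identical
    return 1.0 - len(a & b) / len(union)


def single_linkage(texts: List[str], threshold: float) -> List[List[str]]:
    """Repeatedly merge the two closest clusters until the closest
    remaining pair is farther apart than `threshold`."""
    clusters = [[t] for t in texts]
    toks = {t: tokens(t) for t in texts}

    def dist(c1: List[str], c2: List[str]) -> float:
        # Single linkage: distance between the closest members.
        return min(jaccard_distance(toks[x], toks[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def render(groups: Dict[str, List[str]], threshold: float = 0.85) -> List[str]:
    """Format each cluster as 'id-first string' followed by ' ----other' lines."""
    lines = []
    for ident, texts in groups.items():
        for cluster in single_linkage(texts, threshold):
            lines.append(f"{ident}-{cluster[0]}")
            lines.extend(f" ----{t}" for t in cluster[1:])
    return lines


# Strings from the question, already grouped by id.
groups = {
    "1": ["Hi"],
    "2": ["Hi How r u?", "How r u?"],
    "3": ["where r u?", "what does this mean", "what it means"],
}
print("\n".join(render(groups)))
```

The sketch only clusters strings within one id; grouping across people who use the same words would need a second pass over all ids, and the cut-off would have to be tuned on real data.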
