Text mining on a large set of strings
I have list of strings. (pretty big list of ids and strings scattered in 4-5 big files. around a GB each). These strings are formatted like this:
1,Hi
2,Hi How r u?
2,How r u?
3,where r u?
3,what does this mean
3,what it means
Now I want to do text mining on these strings and prepare a dendrogram that displays the strings in the following way:
1-Hi
2-Hi How r u?
----How r u?
3-What does this mean?
----what it means?
3-Where are you?
This output groups the strings that follow the comma by their similarity, per id (assume the id identifies the person who wrote those strings). If another person used the same words, their strings should still be grouped under the strings that person used.
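To make the intended grouping concrete, here is a single-machine Python sketch (a toy, not the Hadoop/Mahout-scale solution being asked about; the whitespace tokenization and the Jaccard similarity threshold are assumptions, and pairs like "mean"/"means" would need stemming to match):

```python
from collections import defaultdict

def tokenize(s):
    return set(s.lower().split())

def jaccard(a, b):
    # Overlap ratio of two token sets: 0 = disjoint, 1 = identical.
    return len(a & b) / len(a | b)

def group_strings(lines, threshold=0.5):
    """Group each person's strings; similar strings are listed
    under the first string of their group, prefixed with ----."""
    by_id = defaultdict(list)
    for line in lines:
        pid, text = line.split(",", 1)
        by_id[pid].append(text)

    out = []
    for pid in sorted(by_id):
        groups = []  # each group is a list of mutually similar strings
        for text in by_id[pid]:
            toks = tokenize(text)
            for g in groups:
                if jaccard(toks, tokenize(g[0])) >= threshold:
                    g.append(text)
                    break
            else:
                groups.append([text])
        for g in groups:
            out.append(f"{pid}-{g[0]}")
            out.extend(f"----{t}" for t in g[1:])
    return out

lines = ["1,Hi", "2,Hi How r u?", "2,How r u?",
         "3,where r u?", "3,what does this mean", "3,what it means"]
print("\n".join(group_strings(lines)))
```

With this threshold, "Hi How r u?" and "How r u?" share 3 of 4 tokens and end up in one group; the scaling question (MapReduce over 4-5 GB of input) is separate from this grouping logic.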
Now, this seems like a simple task. But I want it done on Hadoop/Mahout, or on something else that can handle a huge data set on clustered Linux machines. How should I approach this problem? I have already tried different approaches in Mahout: I created sequence files and seq2sparse vectors and then attempted clustering, but it didn't work for me. Any help or pointers in the right direction would be greatly appreciated.
Thanks & Regards,
Atul
Comments (1)
I think what you really need is hierarchical clustering. There was one implementation proposed for Mahout, and another is implemented in the Shogun Toolbox (also designed for large-scale computation). But it's hard to guarantee that it will work, because your input looks hard to cluster.
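As a small illustration of hierarchical clustering, here is a pure-Python single-linkage agglomerative sketch over Jaccard distances (a toy, not the Mahout or Shogun implementation; the distance function and cut-off are assumptions):

```python
def jaccard_dist(a, b):
    # Distance = 1 - token overlap ratio; 0 means identical token sets.
    a, b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(a & b) / len(a | b)

def single_linkage(items, max_dist):
    """Agglomerative clustering: repeatedly merge the two closest
    clusters until the closest pair is farther apart than max_dist."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance = closest member pair.
                d = min(jaccard_dist(items[a], items[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break  # cutting the dendrogram at height max_dist
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [[items[k] for k in c] for c in clusters]

msgs = ["Hi How r u?", "How r u?", "where r u?",
        "what does this mean", "what it means"]
for cluster in single_linkage(msgs, max_dist=0.4):
    print(cluster)
```

At this cut-off, "Hi How r u?" and "How r u?" merge (distance 0.25) while the rest stay apart; note that "what does this mean" and "what it means" only share one raw token, which is why plain token overlap struggles on this kind of input and stemming or character n-grams may be needed.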