Mahout - Naive Bayes

Published 2024-12-15


I tried deploying the 20 newsgroups example with Mahout, and it seems to work fine. Out of curiosity, I would like to dig deeper into the model statistics.

For example, the bayes-model directory contains the following subdirectories:

trainer-tfIdf trainer-thetaNormalizer trainer-weights

each of which contains part-0000 files. I would like to read the contents of these files for a better understanding, but the cat command doesn't seem to work; it prints garbage.

Any help is appreciated.

Thanks


Comments (3)

拧巴小姐 2024-12-22 02:39:55


You can read part-0000 files using Hadoop's filesystem `-text` option. Just go into the Hadoop directory and type the following:

`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`

part-m-00000 will be printed to STDOUT.

If it gives you an error, you might need to set the HADOOP_CLASSPATH variable. For example, if running it gives you

`text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable`

then add the jar containing the missing class to the HADOOP_CLASSPATH variable:

`export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar`

That worked for me ;)
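To see for yourself why `cat` shows garbage on these files, you can peek at their first bytes: Hadoop SequenceFiles start with the magic bytes `SEQ` plus a version byte, followed by serialized class names and binary records. The helper below is a hypothetical sketch, not part of Hadoop or Mahout:

```python
# Hypothetical helper (not part of Hadoop/Mahout): sniff whether a file
# header looks like a Hadoop SequenceFile. SequenceFiles begin with the
# magic bytes b"SEQ" and a version byte, then serialized key/value class
# names and binary records -- which is why `cat` prints garbage while
# `hadoop dfs -text` can decode them.
def is_sequence_file(header: bytes) -> bool:
    """Return True if the first bytes look like a Hadoop SequenceFile."""
    return header[:3] == b"SEQ"

# Illustrative header bytes, roughly as `cat` would see them:
sample = b"SEQ\x06\x19org.apache.hadoop.io.Text"
print(is_sequence_file(sample))         # True
print(is_sequence_file(b"plain text"))  # False
```

In practice you would read the first few bytes with `open(path, "rb").read(4)` before deciding which tool to use.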

灵芸 2024-12-22 02:39:55


In order to read part-00000 files (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments, run from MAHOUT_HOME:

`bin/mahout seqdumper -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 -o ~/vectors-v2-1010`

`-s` is the sequence file you want to convert to plain text

`-o` is the output file

深海里的那抹蓝 2024-12-22 02:39:54


The 'part-00000' files are created by Hadoop and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout, which will try to output the content as text to stdout.

As for what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what they are. The "tfidf" directory, for example, contains intermediate calculations related to term frequency.
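To make "calculations related to term frequency" concrete, here is a textbook TF-IDF sketch. This is illustrative only; Mahout's actual weighting and normalization in the trainer-tfIdf output may differ:

```python
import math

def tf_idf(term_count: int, doc_freq: int, num_docs: int) -> float:
    """Textbook TF-IDF: weight a term by how often it appears in a
    document, discounted by how many documents in the corpus contain it.
    (Illustrative only -- not necessarily Mahout's exact formula.)"""
    return term_count * math.log(num_docs / doc_freq)

# A term appearing 3 times in a document, found in 10 of 1000 documents:
print(round(tf_idf(3, 10, 1000), 3))  # 13.816

# A term that appears in every document carries no weight:
print(tf_idf(5, 1000, 1000))  # 0.0
```

Intermediate quantities like per-term document frequencies and per-document term counts are exactly the kind of data stored in those part files.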
