Mahout - Naive Bayes
I tried deploying the 20 Newsgroups example with Mahout, and it seems to work fine. Out of curiosity, I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
each of which contains part-0000 files. I would like to read the contents of these files to understand them better, but the cat command doesn't seem to work; it just prints garbage.
Any help is appreciated.
Thanks
Comments (3)
You can read the part-0000 files using the -text option of Hadoop's filesystem command. Go to the Hadoop directory and run hadoop fs -text on the file; the contents of part-m-00000 will be printed to STDOUT.
If it gives you an error, you may need to set the HADOOP_CLASSPATH environment variable. For example, if the error names a class that could not be found, add the corresponding class to HADOOP_CLASSPATH.
That worked for me ;)
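For anyone who wants to script the approach described above, here is a minimal Python sketch. It is not the answerer's exact command: the hadoop fs -text form follows the description in the answer, while the function names, the example file path, and the jar path are hypothetical, and the hadoop launcher is assumed to be on the PATH.

```python
import os
import shutil
import subprocess


def hadoop_text_command(path):
    """Build the `hadoop fs -text` invocation for one file."""
    return ["hadoop", "fs", "-text", path]


def env_with_classpath(jar_path, base_env=None):
    """Return a copy of the environment with jar_path appended to HADOOP_CLASSPATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("HADOOP_CLASSPATH", "")
    env["HADOOP_CLASSPATH"] = (
        existing + os.pathsep + jar_path if existing else jar_path
    )
    return env


def dump_as_text(path, jar_path=None):
    """Run the command and return its stdout. Requires a working Hadoop install."""
    if shutil.which("hadoop") is None:
        raise RuntimeError("hadoop executable not found on PATH")
    # Only override the environment when an extra jar was requested.
    env = env_with_classpath(jar_path) if jar_path else None
    result = subprocess.run(
        hadoop_text_command(path),
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A call such as dump_as_text("bayes-model/trainer-tfIdf/part-00000", jar_path="/path/to/mahout-core.jar") would then return the decoded records as a string; both arguments here are placeholders to replace with your own paths.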
To read part-00000 (a sequence file), you need to use the "seqdumper" utility. The example I used for my experiments took two options:
-s is the sequence file you want to convert to plain text
-o is the output file
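To dump all three model subdirectories from the question in one go, the seqdumper call can be generated per directory. This is only a sketch under assumptions: the -s and -o options come from the answer above, the mahout launcher name, the part file name, and the output directory are placeholders, and newer Mahout releases may use different option names.

```python
# Subdirectories listed in the question.
SUBDIRS = ["trainer-tfIdf", "trainer-thetaNormalizer", "trainer-weights"]


def seqdumper_commands(model_dir, part_name, out_dir):
    """Build one seqdumper command per model subdirectory."""
    commands = []
    for sub in SUBDIRS:
        seq_file = f"{model_dir}/{sub}/{part_name}"  # sequence file to convert
        out_file = f"{out_dir}/{sub}.txt"            # plain-text output file
        commands.append(["mahout", "seqdumper", "-s", seq_file, "-o", out_file])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them.
    for cmd in seqdumper_commands("bayes-model", "part-00000", "dump"):
        print(" ".join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run on a machine without Mahout; each list can be passed to subprocess.run once the paths are confirmed.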
The 'part-00000' files are created by Hadoop and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files. Mahout includes a utility class, SequenceFileDumper, that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what they are. The "tfidf" directory, for example, contains intermediate calculations related to term frequency.
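To see why cat prints garbage, it can help to look at the start of the file. A SequenceFile begins with the bytes "SEQ", a version byte, and then the key and value class names. The sketch below reads just that much from a local copy of the file. It assumes each class name is shorter than 128 bytes so its length fits in a single byte, and it does not attempt to decode the records themselves; the function names are made up for illustration.

```python
import io


def _read_exact(stream, n):
    """Read exactly n bytes or fail."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of file in header")
    return data


def _read_short_string(stream):
    """Read a length-prefixed UTF-8 string whose length fits in one byte."""
    length = _read_exact(stream, 1)[0]
    if length > 127:
        # Longer names use a multi-byte length, which this sketch skips.
        raise ValueError("class name longer than this sketch supports")
    return _read_exact(stream, length).decode("utf-8")


def read_sequence_file_header(stream):
    """Return (version, key_class, value_class) from a binary stream."""
    if _read_exact(stream, 3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing 'SEQ' magic bytes")
    version = _read_exact(stream, 1)[0]
    key_class = _read_short_string(stream)
    value_class = _read_short_string(stream)
    return version, key_class, value_class


if __name__ == "__main__":
    # Synthetic header for demonstration; real files carry Mahout's own classes.
    key = b"org.apache.hadoop.io.Text"
    demo = b"SEQ" + bytes([6, len(key)]) + key + bytes([len(key)]) + key
    print(read_sequence_file_header(io.BytesIO(demo)))
```

For a real file, copy it out of HDFS first and open it in binary mode, for example with open("part-00000", "rb") as f: print(read_sequence_file_header(f)). The class names it reports indicate which classes have to be available on the classpath before the records can be decoded.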