Generating a SequenceFile

Posted on 2024-11-29 11:10:55

Given data in the following format (tag_uri image_uri image_uri image_uri ...), I need to turn it into Hadoop SequenceFile format for further processing by Mahout (e.g. clustering):

http://flickr.com/photos/tags/100commentgroup http://flickr.com/photos/34254318@N06/4019040356 http://flickr.com/photos/46857830@N03/5651576112
http://flickr.com/photos/tags/100faves http://flickr.com/photos/21207178@N07/5441742937
...

Previously, I would first convert the input into CSV (or ARFF) as follows:

http://flickr.com/photos/tags/100commentgroup,http://flickr.com/photos/tags/100faves,...
0,1,...
1,1,...
...

with each row describing one tag. The ARFF file is then converted into a vector file used by Mahout for further processing. I am trying to skip the ARFF generation step and generate a SequenceFile directly. If I am not mistaken, to represent my data as a SequenceFile I would need to store each row of the data with $tag_uri as the key and $image_vector as the value. What is the proper way of doing this? If possible, can the tag_uri for each row be included somewhere in the SequenceFile?
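To make the intended key/value layout concrete, here is a minimal sketch of the parsing step only, written against the Java standard library. The class name `TagVectors`, its fields, and `addLine` are hypothetical names chosen for illustration. The sketch assigns each image URI a column index in order of first appearance and records, per tag URI, only the columns that are set, which is the sparse form of `$image_vector`. Writing the file is deliberately left out: presumably each row would be copied into a Mahout `RandomAccessSparseVector`, wrapped in a `VectorWritable`, and appended with a Hadoop `Text` key holding the tag URI, but that part is an assumption about the setup and is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns "tag_uri image_uri image_uri ..." lines
// into sparse rows keyed by tag URI. No Hadoop or Mahout dependency.
public class TagVectors {
    // Image URI -> column index, assigned in order of first appearance.
    public final Map<String, Integer> imageIndex = new LinkedHashMap<>();
    // Tag URI -> column indices of the images that carry this tag.
    public final Map<String, List<Integer>> rows = new LinkedHashMap<>();

    public void addLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            return; // blank line: nothing to record
        }
        List<Integer> cols = rows.computeIfAbsent(parts[0], k -> new ArrayList<>());
        for (int i = 1; i < parts.length; i++) {
            Integer col = imageIndex.get(parts[i]);
            if (col == null) {
                col = imageIndex.size(); // next unused column
                imageIndex.put(parts[i], col);
            }
            if (!cols.contains(col)) {
                cols.add(col); // ignore an image repeated on the same line
            }
        }
    }

    public static void main(String[] args) {
        TagVectors tv = new TagVectors();
        tv.addLine("http://flickr.com/photos/tags/100commentgroup "
                + "http://flickr.com/photos/34254318@N06/4019040356 "
                + "http://flickr.com/photos/46857830@N03/5651576112");
        tv.addLine("http://flickr.com/photos/tags/100faves "
                + "http://flickr.com/photos/21207178@N07/5441742937");
        // The vector cardinality is only known once every line has been read.
        System.out.println("cardinality = " + tv.imageIndex.size());
        for (Map.Entry<String, List<Integer>> row : tv.rows.entrySet()) {
            // key = tag URI, value = indices that would be set to 1.0
            System.out.println(row.getKey() + " -> " + row.getValue());
        }
    }
}
```

Because the total number of distinct images (the vector cardinality) is unknown until all input has been read, this approach needs the whole input parsed before any fixed-size vector can be created. Keeping the tag URI as the map key mirrors the idea of using `$tag_uri` as the SequenceFile key, so the tag stays attached to its row.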

Here are some references I found, though I am not sure whether they are relevant: