Generating a SequenceFile
Given data in the following format (tag_uri image_uri image_uri image_uri ...), I need to convert it into Hadoop SequenceFile format for further processing by Mahout (e.g. clustering):
http://flickr.com/photos/tags/100commentgroup http://flickr.com/photos/34254318@N06/4019040356 http://flickr.com/photos/46857830@N03/5651576112
http://flickr.com/photos/tags/100faves http://flickr.com/photos/21207178@N07/5441742937
...
Previously I would convert the input into CSV (or ARFF) as follows:
http://flickr.com/photos/tags/100commentgroup,http://flickr.com/photos/tags/100faves,...
0,1,...
1,1,...
...
where each row describes one tag. The ARFF file is then converted into a vector file that Mahout uses for further processing. I am trying to skip the ARFF generation step and generate a SequenceFile directly. If I am not mistaken, to represent my data as a SequenceFile I would need to store each row with $tag_uri as the key and $image_vector as the value. What is the proper way of doing this (and, if possible, can the tag_uri for each row be included somewhere in the SequenceFile)?
Some references that I found, though I am not sure whether they are relevant:
- Writing a SequenceFile
- Formatting input matrix for svd matrix factorization (can I store my matrix in this form?)
- RandomAccessSparseVector (considering that each line lists only the images assigned to a given tag rather than all images, can a row be represented with this vector?)
- SequenceFile write
- SequenceFile explanation
Comments (1)
You just need a SequenceFile.Writer, which is explained in your link #4. This lets you write key-value pairs to the file. What the key and value are depends on your use case, of course. It's not at all the same for clustering versus matrix decomposition versus collaborative filtering. There is no single SequenceFile format.
Chances are that the key or value will be a Mahout Vector. The thing that knows how to write a Vector is VectorWritable. This is the class you would use to wrap a Vector and write it with SequenceFile.Writer.
You would need to look at the job that will consume it to make sure you're passing what it expects. For clustering, for example, I think the key is ignored and the value is a Vector.