Hadoop custom Writables
I have more of a design question regarding the necessity of a CustomWritable for my use case:
So I have a document pair that I will process through a pipeline and write out intermediate and final data to HDFS. My key will be something like ObjectId - DocId - Pair - Lang. I do not see why/if I will need a CustomWritable for this use case. I guess if I did not have a key, I would need a CustomWritable? Also, when I write data out to HDFS in the Reducer, I use a Custom Partitioner. So, that would kind of eliminate my need for a Custom Writable?
I am not sure if I got the concept of the need for a Custom Writable right. Can someone point me in the right direction?
2 Answers
Writables can be used for de/serializing objects. For example, a log entry can contain a timestamp, a user IP, and the browser agent. So you would implement your own WritableComparable for a key that identifies this entry, and a value class that implements Writable to read and write the attributes of your log entry.
These serializations are just a handy way to get the data from a binary format into an object. Some frameworks, like HBase, still require byte arrays to persist the data. Transferring this data yourself would add a lot of overhead and mess up your code.
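A minimal sketch of such a value class, assuming the log entry carries exactly the three fields mentioned above (the class name and field types are illustrative, not from the original thread):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value class for the log-entry example: it serializes
// a timestamp, a user IP and a browser agent.
public class LogEntryWritable implements Writable {

    private long timestamp;
    private String userIp;
    private String browserAgent;

    // Hadoop instantiates Writables reflectively, so a no-arg
    // constructor is required for deserialization.
    public LogEntryWritable() {}

    public LogEntryWritable(long timestamp, String userIp, String browserAgent) {
        this.timestamp = timestamp;
        this.userIp = userIp;
        this.browserAgent = browserAgent;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);
        out.writeUTF(userIp);
        out.writeUTF(browserAgent);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read back in exactly the order they were written.
        timestamp = in.readLong();
        userIp = in.readUTF();
        browserAgent = in.readUTF();
    }
}
```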
Thomas' answer explains a bit. It's way too late, but I'd like to add the following for prospective readers:
The Partitioner only comes into play between the map and reduce phases; it plays no role in writing from the reducer to the output files.
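To make that distinction concrete, here is an illustrative custom Partitioner sketch. The class name and the assumption that the key is a Text in the "ObjectId-DocId-Pair-Lang" layout from the question are mine, not from the thread; all it does is decide which reducer a record goes to.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: routes records to reducers by the DocId
// component of an assumed "ObjectId-DocId-Pair-Lang" text key.
// Adjust the index to match your actual key format.
public class DocIdPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String docId = key.toString().split("-")[1];
        // Mask the sign bit so the partition number is non-negative.
        return (docId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Note that this only groups records onto reducers; it does not sort them and it does not affect how the reducer's output is written.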
I don't believe writing INTERMEDIATE data to HDFS is a requirement in most cases, although there are some hacks that can be applied to do the same.
When you write from a reducer to HDFS, the keys will automatically be sorted, and each reducer will write to ONE SEPARATE file. Keys are sorted based on their compareTo method. So if you want to sort based on multiple variables, go for a custom key class that extends WritableComparable, and implement the write, readFields and compareTo methods. You can now control the way the keys are sorted through the compareTo implementation.
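A minimal sketch of such a composite key for the ObjectId - DocId - Pair - Lang case from the question, assuming all four components are plain strings (the class name and field types are illustrative):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key for the ObjectId-DocId-Pair-Lang case;
// all four fields are assumed to be strings.
public class DocPairKey implements WritableComparable<DocPairKey> {

    private String objectId;
    private String docId;
    private String pair;
    private String lang;

    public DocPairKey() {}  // no-arg constructor required by Hadoop

    public DocPairKey(String objectId, String docId, String pair, String lang) {
        this.objectId = objectId;
        this.docId = docId;
        this.pair = pair;
        this.lang = lang;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(objectId);
        out.writeUTF(docId);
        out.writeUTF(pair);
        out.writeUTF(lang);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        objectId = in.readUTF();
        docId = in.readUTF();
        pair = in.readUTF();
        lang = in.readUTF();
    }

    @Override
    public int compareTo(DocPairKey other) {
        // Sort order: ObjectId first, then DocId, then Pair, then Lang.
        int cmp = objectId.compareTo(other.objectId);
        if (cmp != 0) return cmp;
        cmp = docId.compareTo(other.docId);
        if (cmp != 0) return cmp;
        cmp = pair.compareTo(other.pair);
        if (cmp != 0) return cmp;
        return lang.compareTo(other.lang);
    }

    @Override
    public int hashCode() {
        // Used by the default HashPartitioner; keep it consistent with equals.
        return (objectId + docId + pair + lang).hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DocPairKey)) return false;
        DocPairKey k = (DocPairKey) o;
        return objectId.equals(k.objectId) && docId.equals(k.docId)
                && pair.equals(k.pair) && lang.equals(k.lang);
    }
}
```

Because the sort happens on compareTo, reordering the comparisons in that method is all it takes to change which field dominates the sort order.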