Does Hive instantiate a new UDF object for each record?
Say I'm building a UDF class called StaticLookupUDF that has to load some static data from a local file during construction.
In this case I want to make sure I'm not repeating work unnecessarily: I don't want to reload the static data on every call to the evaluate() method.
Clearly each mapper uses its own instantiation of the UDF, but does a new instance get created for each record processed?
For example, a mapper is going to process 3 rows. Does it create a single StaticLookupUDF and call evaluate() 3 times, or does it create a new StaticLookupUDF for each record, and call evaluate only once per instance?
If the second example is true, in what alternate way should I structure this?
I couldn't find this anywhere in the docs. I'm going to look through the code, but figured I'd ask the smart people here at the same time.
Comments (1)
I'm still not totally sure about this, but I got around it by using a static lazy value that loads the data on first use.
This way you have one instance of the static value per mapper. So if you're reading in a dataset and you have 6 map tasks, you'll read in the data 6 times. Not ideal, but better than once per record.
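
For anyone wanting to see what that workaround might look like, here is a minimal Java sketch. It is not the code I actually used, and it says nothing about how Hive itself instantiates UDFs. It uses Java's initialization-on-demand holder idiom as a stand-in for the "static lazy value". The `lookup.file` system property, the `lookup.tsv` default path, the tab-separated file format, and the `LOAD_COUNT` counter are all assumptions made up for illustration. To keep the sketch compilable without Hive on the classpath, the class does not extend Hive's `UDF` base class; a real UDF would be public and would.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a real Hive UDF would be public and extend
// org.apache.hadoop.hive.ql.exec.UDF. That is omitted here so the
// example compiles on its own.
final class StaticLookupUDF {

    // Hypothetical counter, present only to show how often the file is read.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // The JVM initializes Holder the first time Holder.TABLE is touched,
    // and does so at most once per class loader, in a thread-safe way.
    private static final class Holder {
        static final Map<String, String> TABLE = load();
    }

    // Reads "key<TAB>value" lines from a local file (assumed format).
    private static Map<String, String> load() {
        LOAD_COUNT.incrementAndGet();
        // Hypothetical location: system property "lookup.file", else ./lookup.tsv.
        Path file = Paths.get(System.getProperty("lookup.file", "lookup.tsv"));
        Map<String, String> table = new HashMap<>();
        try {
            for (String line : Files.readAllLines(file)) {
                String[] kv = line.split("\t", 2);
                if (kv.length == 2) {
                    table.put(kv[0], kv[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    // Called once per record; the lookup table is shared by every instance.
    public String evaluate(String key) {
        return key == null ? null : Holder.TABLE.get(key);
    }
}
```

Because the table lives in a static field, it does not matter whether Hive creates one UDF object per mapper or one per record: the file is read at most once per JVM. That matches the "once per map task" behaviour described above only under the assumption that each map task runs in its own JVM.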