HBase MapReduce on multiple scan objects
I am just trying to evaluate HBase for some of the data analysis we are doing.
HBase would contain our event data. The key would be eventId + time. We want to run analysis on a few event types (4-5) within a date range. The total number of event types is around 1000.
The problem with running a MapReduce job on the HBase table is that initTableMapperJob (see below) takes only one Scan object. For performance reasons, we want to scan the data for only 4-5 event types in a given date range, not all 1000 event types. If we use the method below, I guess we don't have that choice, because it takes only one Scan object.
public static void initTableMapperJob(String table,
Scan scan,
Class mapper,
Class outputKeyClass,
Class outputValueClass,
org.apache.hadoop.mapreduce.Job job)
throws IOException
Is it possible to run MapReduce on a list of Scan objects? Is there any workaround?
Thanks
Comments (3)
TableMapReduceUtil.initTableMapperJob configures your job to use TableInputFormat which, as you note, takes a single Scan. It sounds like you want to scan multiple segments of a table. To do so, you'll have to create your own InputFormat, something like MultiSegmentTableInputFormat. Extend TableInputFormatBase and override the getSplits method so that it calls super.getSplits once for each start/stop row segment of the table. (The easiest way would be to call TableInputFormatBase.scan.setStartRow() each time.) Aggregate the InputSplit instances returned into a single list. Then configure the job yourself to use your custom MultiSegmentTableInputFormat.
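To make the "one getSplits call per segment, then aggregate" idea concrete, here is a minimal plain-Java model of it. It does not use the HBase classes at all: RowRange, segmentFor, splitsFor and the key layout (event type, a "|" separator, then a fixed-width yyyyMMdd date) are all hypothetical names and assumptions chosen for illustration, and splitsFor only stands in for what a single super.getSplits call would do, namely cut one start/stop row segment along region boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MultiSegmentSplits {
    /** Half-open row range [start, stop), the same shape as a scan's start/stop row. */
    static final class RowRange {
        final String start;
        final String stop;
        RowRange(String start, String stop) { this.start = start; this.stop = stop; }
        @Override public String toString() { return "[" + start + ", " + stop + ")"; }
    }

    // Assumed key layout (hypothetical): eventType + "|" + yyyyMMdd, so keys of one type sort by date.
    static RowRange segmentFor(String eventType, String fromDate, String toDateExclusive) {
        return new RowRange(eventType + "|" + fromDate, eventType + "|" + toDateExclusive);
    }

    // Stand-in for one super.getSplits() call: intersect a single segment with every region.
    static List<RowRange> splitsFor(RowRange segment, List<RowRange> regions) {
        List<RowRange> splits = new ArrayList<>();
        for (RowRange region : regions) {
            String start = segment.start.compareTo(region.start) > 0 ? segment.start : region.start;
            String stop = segment.stop.compareTo(region.stop) < 0 ? segment.stop : region.stop;
            if (start.compareTo(stop) < 0) {
                splits.add(new RowRange(start, stop)); // non-empty overlap becomes one split
            }
        }
        return splits;
    }

    // The overridden getSplits in this model: call splitsFor once per segment and aggregate.
    static List<RowRange> getSplits(List<RowRange> segments, List<RowRange> regions) {
        List<RowRange> all = new ArrayList<>();
        for (RowRange segment : segments) {
            all.addAll(splitsFor(segment, regions));
        }
        return all;
    }

    public static void main(String[] args) {
        // Three made-up regions; "\uffff" is used here as an "end of table" sentinel.
        List<RowRange> regions = Arrays.asList(
                new RowRange("", "m"),
                new RowRange("m", "purchase|20120115"),
                new RowRange("purchase|20120115", "\uffff"));
        List<RowRange> segments = Arrays.asList(
                segmentFor("click", "20120101", "20120201"),
                segmentFor("purchase", "20120101", "20120201"));
        for (RowRange split : getSplits(segments, regions)) {
            System.out.println(split);
        }
    }
}
```

Only the regions that overlap one of the requested segments produce a split, so mappers are started for the 4-5 event types of interest rather than for the whole table. In a real MultiSegmentTableInputFormat the RowRange objects would be InputSplit instances and the intersection logic would come from TableInputFormatBase.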
You are looking for the class:
Each scan can take a filter. A filter can be quite complex. The FilterList allows you to specify multiple single filters and then do an AND or an OR between all of the component filters. You can use this to build up an arbitrary boolean query over the rows.
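As a rough illustration of the boolean composition described here, the sketch below models it in plain Java with java.util.function.Predicate rather than the HBase filter classes. In HBase itself the AND and OR modes of FilterList are FilterList.Operator.MUST_PASS_ALL and FilterList.Operator.MUST_PASS_ONE; everything else in the sketch (RowFilterSketch, hasEventType, inDateRange, and the eventType + "|" + yyyyMMdd key layout) is a hypothetical stand-in.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class RowFilterSketch {
    // Matches row keys of one event type, assuming keys look like eventType + "|" + yyyyMMdd.
    static Predicate<String> hasEventType(String type) {
        return key -> key.startsWith(type + "|");
    }

    // Matches row keys whose date part falls in [from, toExclusive).
    static Predicate<String> inDateRange(String from, String toExclusive) {
        return key -> {
            String date = key.substring(key.indexOf('|') + 1);
            return date.compareTo(from) >= 0 && date.compareTo(toExclusive) < 0;
        };
    }

    // OR over all component filters (the MUST_PASS_ONE idea).
    static Predicate<String> anyOf(List<Predicate<String>> filters) {
        Predicate<String> combined = key -> false;
        for (Predicate<String> filter : filters) {
            combined = combined.or(filter);
        }
        return combined;
    }

    // AND over all component filters (the MUST_PASS_ALL idea).
    static Predicate<String> allOf(List<Predicate<String>> filters) {
        Predicate<String> combined = key -> true;
        for (Predicate<String> filter : filters) {
            combined = combined.and(filter);
        }
        return combined;
    }

    public static void main(String[] args) {
        // (type is click OR type is purchase) AND date is in January 2012
        Predicate<String> query = allOf(Arrays.asList(
                anyOf(Arrays.asList(hasEventType("click"), hasEventType("purchase"))),
                inDateRange("20120101", "20120201")));
        for (String key : Arrays.asList("click|20120115", "view|20120115", "purchase|20120301")) {
            System.out.println(key + " -> " + query.test(key));
        }
    }
}
```

Keep in mind that a filter is evaluated against the rows the scan visits, so unless the scan also has start and stop rows, filtering alone narrows what is returned rather than what is read.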
I've tried Dave L's approach and it works beautifully.
To configure the map job, you can use the function
where inputFormatClass refers to the MultiSegmentTableInputFormat mentioned in Dave L's comments.