Passing the results of multiple consecutive HBase queries to a MapReduce job



I have an HBase database that stores adjacency lists for a directed graph, with the edges in each direction stored in a pair of column families and each row denoting a vertex. I am writing a mapreduce job which takes as its input all nodes that receive an edge from the same vertices that have an edge pointing at some other vertex (nominated as the subject of the query). This is a little difficult to explain, but in the following diagram, the set of nodes taken as input when querying on vertex 'A' would be {A, B, C}, by virtue of their all having edges from vertex '1':

[Image: Example graph]

To perform this query in HBase, I first look up the vertices with edges to 'A' in the reverse-edges column family, yielding {1}, and then, for every element of that set, look up the vertices with edges from that element in the forward-edges column family.

This should yield a set of key-value pairs: {1: {A,B,C}}.
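For concreteness, a rough sketch of those two lookups with the HBase Java client is below. The table name 'graph', the family names 'rev'/'fwd', and the assumption that edge endpoints are stored as column qualifiers are placeholders for my actual schema:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoHopLookup {
    private static final byte[] REV = Bytes.toBytes("rev"); // reverse-edge family (placeholder name)
    private static final byte[] FWD = Bytes.toBytes("fwd"); // forward-edge family (placeholder name)

    /** Returns e.g. {1: {A, B, C}} when subject = "A". */
    public static Map<String, Set<String>> twoHop(Table graph, String subject) throws IOException {
        Map<String, Set<String>> result = new HashMap<>();
        // Step 1: vertices with an edge *to* the subject, read from the reverse-edge family.
        Result rev = graph.get(new Get(Bytes.toBytes(subject)).addFamily(REV));
        NavigableMap<byte[], byte[]> sources = rev.getFamilyMap(REV);
        if (sources == null) {
            return result; // subject has no incoming edges
        }
        for (byte[] source : sources.keySet()) {
            // Step 2: vertices that 'source' points at, read from the forward-edge family.
            Result fwd = graph.get(new Get(source).addFamily(FWD));
            Set<String> targets = new TreeSet<>();
            NavigableMap<byte[], byte[]> out = fwd.getFamilyMap(FWD);
            if (out != null) {
                for (byte[] target : out.keySet()) {
                    targets.add(Bytes.toString(target));
                }
            }
            result.put(Bytes.toString(source), targets);
        }
        return result;
    }
}
```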

Now, I would like to take the output of this set of queries and pass it to a Hadoop mapreduce job; however, I can't find a way of 'chaining' HBase queries together to provide the input to a TableMapper in the HBase mapreduce API. So far, my only idea has been to add another initial mapper which takes the results of the first query (on the reverse-edges table), performs the forward-edges query for each result, and yields the results to be passed to a second map job. However, performing IO from within a map job makes me uneasy, as it seems rather counter to the mapreduce paradigm (and could lead to a bottleneck if several mappers all try to access HBase at once). Can anyone suggest an alternative strategy for performing this sort of query, or offer any advice about best practices for working with HBase and mapreduce in this way? I'd also be interested to know whether there are any improvements to my database schema that could mitigate this problem.

Thanks,

Tim


Comments (1)

我做我的改变 2024-12-25 17:06:52


Your problem does not fit the Map/Reduce paradigm very well. I've seen the shortest-path problem solved with many M/R jobs chained together; that approach is not very efficient, but it is needed to get a global view at the reducer level.

In your case, it seems that you could perform all the requests within your mapper by following the edges and keeping a list of seen nodes.

"However, performing IO from within a map job makes me uneasy"

You should not worry about that. Your data access is essentially random, and trying to exploit data locality would be extremely hard, so you don't have much choice but to query all of this data across the network. HBase is designed to handle large parallel query loads, and having multiple mappers query disjoint data will give you a good distribution of requests and high throughput.
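As a minimal sketch of what that could look like (names here are placeholders: a single table 'graph' with a reverse-edge family 'rev' and a forward-edge family 'fwd', with edge endpoints as column qualifiers), the mapper below is fed the reverse-edge family by a TableMapper scan and issues the forward-edge Gets directly from map():

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TwoHopMapper extends TableMapper<Text, Text> {
    private static final byte[] REV = Bytes.toBytes("rev"); // placeholder family name
    private static final byte[] FWD = Bytes.toBytes("fwd"); // placeholder family name

    private Connection connection;
    private Table graph;

    @Override
    protected void setup(Context context) throws IOException {
        // One client connection per mapper, reused for every forward-edge lookup.
        connection = ConnectionFactory.createConnection(context.getConfiguration());
        graph = connection.getTable(TableName.valueOf("graph"));
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // Each qualifier in the "rev" family is a vertex with an edge into this row's vertex.
        if (value.getFamilyMap(REV) == null) {
            return;
        }
        for (byte[] source : value.getFamilyMap(REV).keySet()) {
            // Second hop: vertices that 'source' points at, read directly from HBase.
            Result fwd = graph.get(new Get(source).addFamily(FWD));
            if (fwd.getFamilyMap(FWD) == null) {
                continue;
            }
            for (byte[] target : fwd.getFamilyMap(FWD).keySet()) {
                context.write(new Text(Bytes.toString(source)), new Text(Bytes.toString(target)));
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        graph.close();
        connection.close();
    }

    public static Job createJob(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "two-hop neighbourhood");
        job.setJarByClass(TwoHopMapper.class);
        Scan scan = new Scan();
        scan.addFamily(REV);          // only the reverse-edge family is shipped to the mappers
        // For a single subject vertex, the scan could instead be restricted to that row,
        // e.g. with setStartRow/setStopRow.
        scan.setCaching(500);
        scan.setCacheBlocks(false);   // recommended for MapReduce scans over HBase
        TableMapReduceUtil.initTableMapperJob(
                "graph", scan, TwoHopMapper.class, Text.class, Text.class, job);
        job.setNumReduceTasks(0);     // output format / reducer would be configured as needed
        return job;
    }
}
```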

Make sure to keep a small block size in your HBase tables to optimize reads, and keep as few HFiles per region as possible. I'm assuming your data is fairly static here, so running a major compaction will merge the HFiles together and reduce the number of files to read.
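If you create the table from code, something along these lines would set a smaller block size and trigger the major compaction. Again, this is only a sketch: 'graph', 'rev' and 'fwd' are placeholder names, and 8 KB is just an example value (the default block size is 64 KB):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class GraphTableSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName graph = TableName.valueOf("graph");
            if (!admin.tableExists(graph)) {
                // Smaller blocks favour the point reads this access pattern generates.
                HTableDescriptor desc = new HTableDescriptor(graph);
                desc.addFamily(new HColumnDescriptor("rev").setBlocksize(8 * 1024));
                desc.addFamily(new HColumnDescriptor("fwd").setBlocksize(8 * 1024));
                admin.createTable(desc);
            }
            // With mostly static data, a major compaction merges each region's HFiles
            // into a single file, so reads touch as few files as possible.
            admin.majorCompact(graph);
        }
    }
}
```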
