Is there anything wrong with using a distributed computing cluster to solve a "small data" problem?
I'm learning about Hadoop + MapReduce and Big Data, and from my understanding it seems that the Hadoop ecosystem was mainly designed to analyze large amounts of data that are distributed across many servers. My problem is a bit different.
I have a relatively small amount of data (a file consisting of 1-10 million lines of numbers) which needs to be analyzed in millions of different ways. For example, consider the following dataset:
[1, 6, 7, 8, 10, 17, 19, 23, 27, 28, 28, 29, 29, 29, 29, 30, 32, 35, 36, 38]
[1, 3, 3, 4, 4, 5, 5, 10, 11, 12, 14, 16, 17, 18, 18, 20, 27, 28, 39, 40]
[2, 3, 7, 8, 10, 10, 12, 13, 14, 15, 15, 16, 17, 19, 27, 30, 32, 33, 34, 40]
[1, 9, 11, 13, 14, 15, 17, 17, 18, 18, 18, 19, 19, 23, 25, 26, 27, 31, 37, 39]
[5, 8, 8, 10, 14, 16, 16, 17, 20, 21, 22, 22, 23, 28, 29, 30, 32, 32, 33, 38]
[1, 1, 3, 3, 13, 17, 21, 24, 24, 25, 26, 26, 30, 31, 32, 35, 38, 39, 39, 39]
[1, 2, 4, 4, 5, 5, 10, 13, 14, 14, 14, 14, 15, 17, 28, 29, 29, 35, 37, 40]
[1, 2, 6, 8, 12, 13, 14, 15, 15, 15, 16, 22, 23, 24, 26, 30, 31, 36, 36, 40]
[3, 6, 7, 8, 8, 10, 10, 12, 13, 17, 17, 20, 21, 22, 33, 35, 35, 36, 39, 40]
[1, 3, 8, 8, 11, 11, 13, 18, 19, 19, 19, 23, 24, 25, 27, 33, 35, 37, 38, 40]
I need to analyze how frequently a number in each column (Column N) repeats itself a certain number of rows later (L rows later). For example, if we were analyzing Column A with 1L (1-Row-Later), the result would be as follows:
Note: The position does not need to match - so the number can appear anywhere in the next row.
Column: A N-Later: 1 Result: YES, NO, NO, NO, NO, YES, YES, NO, YES -> 4/9.
We would repeat the above analysis for each column separately, and for up to the maximum N-later. In the above dataset, which only consists of 10 lines, that means a maximum of 9 N-later; but in a dataset of 1 million lines, the analysis (for each column) would be repeated 999,999 times.
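To make the core computation concrete, here is a rough sketch of what a single (column, offset) analysis could look like in plain Java, assuming the whole dataset fits in memory as an int[][] of rows; the class and method names are just illustrative:

    import java.util.Arrays;

    public class RepeatAnalysis {

        // Fraction of rows whose value in column `col` appears anywhere in the
        // row `offset` rows later ("position does not need to match").
        public static double repeatFrequency(int[][] rows, int col, int offset) {
            int comparisons = rows.length - offset; // e.g. 9 comparisons for 10 rows at 1L
            int hits = 0;
            for (int i = 0; i < comparisons; i++) {
                int value = rows[i][col];
                if (Arrays.stream(rows[i + offset]).anyMatch(v -> v == value)) {
                    hits++;
                }
            }
            return (double) hits / comparisons;
        }

        public static void main(String[] args) {
            // Toy data: three short rows, not the full dataset above.
            int[][] rows = { {1, 6, 7}, {1, 3, 3}, {2, 3, 7} };
            System.out.println(repeatFrequency(rows, 0, 1)); // column A, 1L -> 0.5
        }
    }

Run against the 10-row example above with col = 0 and offset = 1, this reproduces the 4/9 result shown earlier.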
I looked into the MapReduce framework but it doesn't seem to cut it; it doesn't seem like an efficient solution for this problem, and it requires a great deal of work to convert the core code into a MapReduce-friendly structure.
As you can see in the above example, each analysis is independent of the others. For example, it is possible to analyze Column A separately from Column B. It is also possible to perform the 1L analysis separately from 2L, and so on. However, unlike Hadoop, where the data lives on separate machines, in our scenario each server needs access to all of the data to perform its "share" of the analysis.
I looked into possible solutions for this problem and it seems there are very few options: Ray, or building a custom application on top of YARN using Apache Twill. Apache Twill was moved to the Attic in 2020, which means that Ray is the only available option.
Is Ray the best way to tackle this problem, or are there other, better options? Ideally, the solution should automatically handle failover and distribute the processing load intelligently. For example, if we wanted to distribute the load to 20 machines, one way of doing so would be to divide the 999,999 N-later analyses by 20 and let Machine A analyze 1L-49999L, Machine B 50000L-100000L, and so on. However, when you think about it, the load isn't being distributed equally, as it takes much longer to analyze 1L than 500000L: the latter involves only about half the number of rows (for 500000L the first row we analyze is row 500001, so we essentially omit the first 500K rows from the analysis).
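A more balanced split could, for example, deal out the offsets by their cost, since offset L costs roughly totalRows - L comparisons per column. A rough sketch of such a greedy assignment (class and method names are just illustrative, not part of any framework):

    import java.util.ArrayList;
    import java.util.List;

    public class OffsetPartitioner {

        // For each machine, the list of L offsets it should analyze, chosen so
        // that the total number of comparisons (totalRows - L summed over its
        // offsets) is roughly equal across machines.
        public static List<List<Integer>> partition(int totalRows, int machines) {
            List<List<Integer>> assignment = new ArrayList<>();
            for (int m = 0; m < machines; m++) {
                assignment.add(new ArrayList<>());
            }
            long[] load = new long[machines];
            // Hand each offset, from the most expensive (1L) to the cheapest,
            // to the machine that currently has the least work.
            for (int offset = 1; offset < totalRows; offset++) {
                int target = 0;
                for (int m = 1; m < machines; m++) {
                    if (load[m] < load[target]) {
                        target = m;
                    }
                }
                assignment.get(target).add(offset);
                load[target] += totalRows - offset;
            }
            return assignment;
        }

        public static void main(String[] args) {
            // 1,000,000 rows split across 20 machines: each machine ends up with
            // roughly the same number of comparisons, not the same number of offsets.
            List<List<Integer>> plan = partition(1_000_000, 20);
            System.out.println(plan.get(0).size() + " offsets on machine 0");
        }
    }

This is only a static heuristic; whichever framework is chosen would still need to handle scheduling and failover on top of it.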
It should also not require a great deal of modification to the core code (like MapReduce does).
I'm working with Java.
Thanks
Well, you are right - your scenario and your technology stack are not that well suited to each other. Which raises the question - why not (add) something more relevant to your current technology stack? For instance - Redis DB.
It seems that your common operation is mainly counting values, and you want to prevent over-calculation and make it more performant (e.g., by properly indexing your data). Given that this is one of the main features of Redis, it sounds logical to use it as the processor.
My suggestion:
Create a hashmap that uses the numeric value as the key and its count as the value. This way you will be able to run different calculations over those metrics while iterating your data set only once. Afterwards, you just need to pull the data from Redis by different criteria (per calculation or metric). From this point it's easy to save the calculated data back to your database and make it ready for direct querying. The overall process may be similar to this: create the hashmap, then follow the docs for both populating and retrieving the data.
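For illustration, a minimal sketch of that counting pass using the Jedis client might look like this; the key names and the local Redis connection are assumptions, not something from the answer:

    import redis.clients.jedis.Jedis;

    public class RedisCountSketch {

        public static void main(String[] args) {
            // Assumes a Redis server is running on localhost:6379.
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                int[][] rows = { {1, 6, 7}, {1, 3, 3}, {2, 3, 7} }; // toy data

                // Single pass over the dataset: one Redis hash per column,
                // field = numeric value, value = how often it occurred.
                for (int[] row : rows) {
                    for (int col = 0; col < row.length; col++) {
                        jedis.hincrBy("counts:col" + col, String.valueOf(row[col]), 1);
                    }
                }

                // Later the counts can be queried per criterion, e.g. how many
                // times the value 1 appeared in column A (index 0):
                System.out.println(jedis.hget("counts:col0", "1"));
            }
        }
    }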