Is there anything wrong with using a distributed computing cluster to solve a "small data" problem?
I'm learning about Hadoop + MapReduce and Big Data, and from my understanding it seems that the Hadoop ecosystem was mainly designed to analyze large amounts of data that are distributed across many servers. My problem is a bit different.
I have a relatively small amount of data (a file consisting of 1-10 million lines of numbers) which needs to be analyzed in millions of different ways. For example, consider the following dataset:
[1, 6, 7, 8, 10, 17, 19, 23, 27, 28, 28, 29, 29, 29, 29, 30, 32, 35, 36, 38]
[1, 3, 3, 4, 4, 5, 5, 10, 11, 12, 14, 16, 17, 18, 18, 20, 27, 28, 39, 40]
[2, 3, 7, 8, 10, 10, 12, 13, 14, 15, 15, 16, 17, 19, 27, 30, 32, 33, 34, 40]
[1, 9, 11, 13, 14, 15, 17, 17, 18, 18, 18, 19, 19, 23, 25, 26, 27, 31, 37, 39]
[5, 8, 8, 10, 14, 16, 16, 17, 20, 21, 22, 22, 23, 28, 29, 30, 32, 32, 33, 38]
[1, 1, 3, 3, 13, 17, 21, 24, 24, 25, 26, 26, 30, 31, 32, 35, 38, 39, 39, 39]
[1, 2, 4, 4, 5, 5, 10, 13, 14, 14, 14, 14, 15, 17, 28, 29, 29, 35, 37, 40]
[1, 2, 6, 8, 12, 13, 14, 15, 15, 15, 16, 22, 23, 24, 26, 30, 31, 36, 36, 40]
[3, 6, 7, 8, 8, 10, 10, 12, 13, 17, 17, 20, 21, 22, 33, 35, 35, 36, 39, 40]
[1, 3, 8, 8, 11, 11, 13, 18, 19, 19, 19, 23, 24, 25, 27, 33, 35, 37, 38, 40]
I need to analyze how frequently a number in each column (Column N) repeats itself a certain number of rows later (L rows later). For example, if we were analyzing Column A with 1L (1-Row-Later), the result would be as follows:
Note: The position does not need to match - so the number can appear anywhere in the next row.
Column: A N-Later: 1 Result: YES, NO, NO, NO, NO, YES, YES, NO, YES -> 4/9.
We would repeat the above analysis for each column separately, and for up to the maximum N-later. In the above dataset, which only consists of 10 lines, that means a maximum of 9 N-later; but in a dataset of 1 million lines, the analysis (for each column) would be repeated 999,999 times.
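To make the core computation concrete, here is a rough sketch of what a single (column, offset) analysis could look like in plain Java, assuming the whole dataset fits in memory as an int[][] of rows; the class and method names are just illustrative:

    import java.util.Arrays;

    public class RepeatAnalysis {

        // Fraction of rows whose value in column `col` appears anywhere in the
        // row `offset` rows later ("position does not need to match").
        public static double repeatFrequency(int[][] rows, int col, int offset) {
            int comparisons = rows.length - offset; // e.g. 9 comparisons for 10 rows at 1L
            int hits = 0;
            for (int i = 0; i < comparisons; i++) {
                int value = rows[i][col];
                if (Arrays.stream(rows[i + offset]).anyMatch(v -> v == value)) {
                    hits++;
                }
            }
            return (double) hits / comparisons;
        }

        public static void main(String[] args) {
            // Toy data: three short rows, not the full dataset above.
            int[][] rows = { {1, 6, 7}, {1, 3, 3}, {2, 3, 7} };
            System.out.println(repeatFrequency(rows, 0, 1)); // column A, 1L -> 0.5
        }
    }

Run against the 10-row example above with col = 0 and offset = 1, this reproduces the 4/9 result shown earlier.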
I looked into the MapReduce framework but it doesn't seem to cut it; it doesn't seem like an efficient solution for this problem, and it requires a great deal of work to convert the core code into a MapReduce-friendly structure.
As you can see in the above example, each analysis is independent of the others. For example, it is possible to analyze Column A separately from Column B. It is also possible to perform the 1L analysis separately from 2L, and so on. However, unlike Hadoop, where the data lives on separate machines, in our scenario each server needs access to all of the data to perform its "share" of the analysis.
I looked into possible solutions for this problem and it seems there are very few options: Ray, or building a custom application on top of YARN using Apache Twill. Apache Twill was moved to the Attic in 2020, which means that Ray is the only available option.
Is Ray the best way to tackle this problem, or are there other, better options? Ideally, the solution should automatically handle failover and distribute the processing load intelligently. For example, if we wanted to distribute the load to 20 machines, one way of doing so would be to divide the 999,999 N-later analyses by 20 and let Machine A analyze 1L-49999L, Machine B 50000L-100000L, and so on. However, when you think about it, the load isn't being distributed equally, as it takes much longer to analyze 1L than 500000L: the latter involves only about half the number of rows (for 500000L the first row we analyze is row 500001, so we essentially omit the first 500K rows from the analysis).
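A more balanced split could, for example, deal out the offsets by their cost, since offset L costs roughly totalRows - L comparisons per column. A rough sketch of such a greedy assignment (class and method names are just illustrative, not part of any framework):

    import java.util.ArrayList;
    import java.util.List;

    public class OffsetPartitioner {

        // For each machine, the list of L offsets it should analyze, chosen so
        // that the total number of comparisons (totalRows - L summed over its
        // offsets) is roughly equal across machines.
        public static List<List<Integer>> partition(int totalRows, int machines) {
            List<List<Integer>> assignment = new ArrayList<>();
            for (int m = 0; m < machines; m++) {
                assignment.add(new ArrayList<>());
            }
            long[] load = new long[machines];
            // Hand each offset, from the most expensive (1L) to the cheapest,
            // to the machine that currently has the least work.
            for (int offset = 1; offset < totalRows; offset++) {
                int target = 0;
                for (int m = 1; m < machines; m++) {
                    if (load[m] < load[target]) {
                        target = m;
                    }
                }
                assignment.get(target).add(offset);
                load[target] += totalRows - offset;
            }
            return assignment;
        }

        public static void main(String[] args) {
            // 1,000,000 rows split across 20 machines: each machine ends up with
            // roughly the same number of comparisons, not the same number of offsets.
            List<List<Integer>> plan = partition(1_000_000, 20);
            System.out.println(plan.get(0).size() + " offsets on machine 0");
        }
    }

This is only a static heuristic; whichever framework is chosen would still need to handle scheduling and failover on top of it.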
It should also not require a great deal of modification to the core code (like MapReduce does).
I'm working with Java.
Thanks
Well, you are right - your scenario and your technology stack are not that well suited to each other. Which raises the question - why not (add) something more relevant to your current technology stack? For instance - Redis DB.
It seems that your common operation is mainly counting values, and you want to prevent over-calculation and make it more performant (e.g., by properly indexing your data). Given that this is one of the main features of Redis, it sounds logical to use it as the processor.
My suggestion:
Create a hashmap that uses the numeric value as the key and its count as the value. This way you will be able to run different calculations over those metrics while iterating your data set only once. Afterwards, you just need to pull the data from Redis by different criteria (per calculation or metric). From this point it's easy to save the calculated data back to your database and make it ready for direct querying. The overall process may be similar to this: create the hashmap, then follow the docs for both populating and retrieving the data.
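For illustration, a minimal sketch of that counting pass using the Jedis client might look like this; the key names and the local Redis connection are assumptions, not something from the answer:

    import redis.clients.jedis.Jedis;

    public class RedisCountSketch {

        public static void main(String[] args) {
            // Assumes a Redis server is running on localhost:6379.
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                int[][] rows = { {1, 6, 7}, {1, 3, 3}, {2, 3, 7} }; // toy data

                // Single pass over the dataset: one Redis hash per column,
                // field = numeric value, value = how often it occurred.
                for (int[] row : rows) {
                    for (int col = 0; col < row.length; col++) {
                        jedis.hincrBy("counts:col" + col, String.valueOf(row[col]), 1);
                    }
                }

                // Later the counts can be queried per criterion, e.g. how many
                // times the value 1 appeared in column A (index 0):
                System.out.println(jedis.hget("counts:col0", "1"));
            }
        }
    }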