MapReduce with SQL Server as the data source

Posted on 2024-12-11 21:50:40

I'm currently investigating the possibility of using MapReduce to maintain incremental view builds in SQL Server.

Basically, use MapReduce to create materialized views.

I'm a bit stuck at the moment thinking about how to partition my map outputs. Now, I don't really have a big-data situation, with roughly 50 GB being the maximum, but I have a lot of complexity and some implied performance problems. I want to see if this MapReduce/NoSQL approach of mine might pan out.

The part of MapReduce I'm currently having issues with is the partitioning. Since I'm using SQL Server as the data source, data locality isn't really a problem for me, so I don't need to ship data all over the place; rather, each worker should be able to retrieve its partition of the data based on the map definition.
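
For illustration, here is a minimal T-SQL sketch of that idea, under my own assumptions: each worker derives its slice of the source rows purely from the primary key and its worker id, so no coordinating query is needed. The object names (dbo.SourceOrders, OrderId, CustomerId, Amount) are placeholders, not from the original setup.

```sql
-- Hypothetical sketch: stateless modulo partitioning on the primary key.
-- Each worker knows only its id and the worker count, and can fetch its own
-- partition independently of every other worker.
DECLARE @WorkerCount int = 8;   -- total number of workers (assumed)
DECLARE @WorkerId    int = 3;   -- this worker's id, 0..@WorkerCount-1 (assumed)

SELECT s.OrderId, s.CustomerId, s.Amount      -- columns the map definition needs (assumed)
FROM dbo.SourceOrders AS s                    -- assumed source table
WHERE s.OrderId % @WorkerCount = @WorkerId;   -- for a non-numeric key, ABS(CHECKSUM(key)) % @WorkerCount works similarly
```

The appeal of this over a central numbering query is that the partition assignment is a pure function of the key, which also helps later when deciding which partitions an incremental change touches.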

I intend to fully map the data through LINQ, and maybe something like Entity Framework, just to provide a familiar interface. This is somewhat beside the point, but it's the route I'm currently exploring.

Now, how do I split my data? I have a primary key, and I have map and reduce definitions in terms of expression trees (ASTs, if you're unfamiliar with LINQ).

  • Firstly, how do I devise a way to split the entire input and partition the initial problem? (I'm thinking I should be able to leverage window functions in SQL Server such as ROW_NUMBER and NTILE; a sketch follows after this list.)

  • Secondly, and more importantly, how do I make sure that I do this incrementally? That is, if I add to or change the original problem, how do I ensure that I minimize the amount of re-computation that needs to take place? (One possible approach is sketched further below.)
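
On the first point, a rough sketch of the ROW_NUMBER/NTILE idea might look like the following; the object names are placeholders I'm assuming. NTILE simply deals the ordered key range into @Workers roughly equal buckets that a worker can then filter on.

```sql
-- Hypothetical sketch: deal the key range into @Workers buckets with NTILE;
-- each worker then filters on its own bucket number.
DECLARE @Workers int = 8;   -- number of partitions/workers (assumed)
DECLARE @Bucket  int = 3;   -- the bucket this worker is responsible for (assumed)

WITH Numbered AS
(
    SELECT
        s.OrderId,                                           -- assumed primary key
        NTILE(@Workers) OVER (ORDER BY s.OrderId) AS Bucket  -- 1..@Workers, near-equal sizes
    FROM dbo.SourceOrders AS s                               -- assumed source table
)
SELECT n.OrderId
FROM Numbered AS n
WHERE n.Bucket = @Bucket;
```

Note that this does make every worker depend on a single numbering query (the objection raised in the answer below), whereas the modulo scheme sketched earlier keeps the workers fully independent.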

I've been looking at CouchDB for inspiration and they seem to have a way to do this, but how do I leverage some of that goodness using SQL Server?
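
On the second point, one way to approximate CouchDB-style incremental view updates on SQL Server (this is my own assumed approach, not something from the original post) is to give the source table a rowversion column and keep a per-view watermark, so each run only re-maps rows whose version is newer than the last processed value. All object names below are placeholders.

```sql
-- Hypothetical sketch: rowversion + watermark for incremental re-mapping.
-- dbo.SourceOrders is assumed to have a rowversion column named RowVer.
CREATE TABLE dbo.ViewWatermark
(
    ViewName       sysname   NOT NULL PRIMARY KEY,
    LastRowVersion binary(8) NOT NULL
);

DECLARE @last binary(8) =
    (SELECT LastRowVersion FROM dbo.ViewWatermark WHERE ViewName = N'OrderTotals');

-- Only the rows changed since the last run need to be re-mapped.
SELECT s.OrderId, s.CustomerId, s.Amount
FROM dbo.SourceOrders AS s
WHERE s.RowVer > ISNULL(@last, 0x0);

-- After the reduce results have been merged into the materialized view,
-- advance the watermark. (A production version would use MIN_ACTIVE_ROWVERSION()
-- instead of @@DBTS to avoid skipping rows from in-flight transactions.)
UPDATE dbo.ViewWatermark
SET LastRowVersion = @@DBTS
WHERE ViewName = N'OrderTotals';
```

Deleted rows would still need separate handling (for example a tombstone table or SQL Server Change Tracking), which is roughly the part CouchDB gets for free from its append-only design.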


Comments (1)

左岸枫 2024-12-18 21:50:40

I am facing something similar. I think you should forget windowing functions, as they make your process serialized. In other words, all workers will be waiting for the query.

What we have tested, and what is 'working', is to partition the data into more tables (every month gets its own x tables) and run separate analytical threads on those partitions, marking data as processed/unprocessed/possibly bad/etc. after the Reduce step.

Tests with a single partitioned table ran into lock escalation issues.

You'll definitely add a little bit more complexity to your current solution.
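
To make the suggestion concrete, a sketch of the per-month table plus status-flag idea could look like this; the table layout and status codes are my assumptions, not the poster's actual schema.

```sql
-- Hypothetical sketch: one staging table per month, with a status flag that the
-- Reduce step updates so re-runs can skip already-processed rows.
CREATE TABLE dbo.Orders_2024_12
(
    OrderId    bigint        NOT NULL PRIMARY KEY,
    CustomerId int           NOT NULL,
    Amount     decimal(18,2) NOT NULL,
    Status     tinyint       NOT NULL DEFAULT (0)  -- 0 = unprocessed, 1 = processed, 2 = possibly bad
);

-- Each analytical thread owns one monthly table; after its Reduce step it marks
-- the rows it handled. #ReducedKeys is an assumed temp table of the reduced keys.
UPDATE o
SET o.Status = 1
FROM dbo.Orders_2024_12 AS o
JOIN #ReducedKeys AS r
  ON r.OrderId = o.OrderId;

-- If a single partitioned table is preferred instead, lock escalation can be
-- steered to the partition level to soften the issue described above:
-- ALTER TABLE dbo.Orders SET (LOCK_ESCALATION = AUTO);
```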
