Is SQL Azure suitable for ad-hoc, intensive SQL processing?

Posted 2024-11-25 15:26:27 · 857 characters · 8 views · 0 comments

I'm looking for some guidance on whether SQL Azure is a suitable platform for one-off, short-term intensive processing of a batch of SQL data. (i.e. "crunching" of data)

My scenario is as follows:

I have a 32Gb database containing a single table of data. The table contains spatial data defined using the geometry datatype, together with various columns of associated attributes. I need to perform some one-off processing of this data, which involves executing a series of computationally-expensive queries (as most spatial queries seem to be!)

When I tested these queries on a subset of the data on my own server, they took several hours to complete. I expect that, if I tried to execute them against the whole dataset locally, it would lock up my SQL Server for days (or perhaps it would die trying), which is a situation I'm trying to avoid.

So I'm looking for a short-term alternative where I can set those queries to be executed somewhere else, and just retrieve the processed table when they are complete.

I understand that the SQL Azure platform is designed to offer flexible capacity (in terms of storage), and also scales to accommodate e.g. an increased number of transactions. The typical example application cited seems to be to provide a DB backend for a web application/store that experiences rapidly increasing or fluctuating demand.
However, what I haven't been able to find many details of is whether SQL Azure is suitable to accommodate individual long-running queries, executed serially.

Just to be clear -

  • I expect this to be a one-off operation, or perhaps one performed once annually.
  • Once the processing is complete, I have no intentions to continue to host the data "in the cloud" - I'd like to retrieve the processed dataset and host it on-site again.
  • The ease of getting data to/back from the platform is obviously important, since I'm not looking to permanently "migrate" anything. If I understand correctly, you can't back up/restore databases to Azure, and scripting the data is going to be very painful.
  • I am comfortable with Management Studio, and any platform that allowed me to use this as an interface to run the queries and perform spot checks of the results would be a benefit.

If anybody has any experiences with using SQL Azure for this kind of activity, or could perhaps suggest an alternative, I'd be very grateful!

Answers (3)

本宫微胖 2024-12-02 15:26:27

I'm genuinely not sure if SQL Azure is suitable for this task - there are no problems in terms of storage, but I don't know how good its architecture is for long-running tasks. In particular, see:

SQL Azure Database provides a large-scale multi-tenant database
service on shared resources. In order to provide a good experience to
all SQL Azure Database customers, your connection to the service may
be closed due to the following conditions:

  • Excessive resource usage
  • Long-running queries
  • Long-running single transactions, between the BEGIN TRAN and END TRAN
    statements
  • Idle connections

This is different from how an on-premises instance of SQL Server works.

from: http://msdn.microsoft.com/en-us/library/ee730903.aspx

So I would be worried that SQL Azure might not work for your long queries - unless you can break them up into lots of short queries.
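The "lots of short queries" idea amounts to walking the table in small, keyset-paginated batches, each in its own short-lived transaction, with retry in case the service drops the connection. The sketch below is only an illustration: it uses Python with an in-memory SQLite database standing in for SQL Azure, and the `parcels` table, its columns, and the trivial doubling "calculation" are all hypothetical stand-ins for the real spatial work.

```python
import sqlite3

BATCH_SIZE = 2    # tiny here for illustration; thousands of rows in practice
MAX_RETRIES = 3

def process_in_batches(conn):
    """Walk the table in keyset-paginated batches so every statement
    (and every transaction) stays short-lived, as the throttling rules require."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, value FROM parcels WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH_SIZE),
        ).fetchall()
        if not rows:
            break
        for attempt in range(MAX_RETRIES):
            try:
                with conn:  # one short transaction per batch
                    for row_id, value in rows:
                        conn.execute(
                            "UPDATE parcels SET area = ? WHERE id = ?",
                            (value * 2, row_id),  # placeholder for the real spatial calculation
                        )
                break
            except sqlite3.OperationalError:
                if attempt == MAX_RETRIES - 1:
                    raise  # give up after repeated throttled/closed connections
        last_id = rows[-1][0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (id INTEGER PRIMARY KEY, value REAL, area REAL)")
conn.executemany("INSERT INTO parcels (id, value) VALUES (?, ?)",
                 [(i, float(i)) for i in range(1, 6)])
process_in_batches(conn)
print(conn.execute("SELECT id, area FROM parcels ORDER BY id").fetchall())
```

Because each batch commits independently, a dropped connection only costs you the current batch, not the whole run - which is the property that matters under SQL Azure's throttling.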

If SQL Azure can't work for you, then it might be better for you to deploy a separate SQL instance somewhere (perhaps to an AWS instance?) to do these one-off calculations.

驱逐舰岛风号 2024-12-02 15:26:27

It depends on the nature of the workload. You mention "executing a series of computationally-expensive queries"; however, it is not clear to me whether you have a lot of small but repeated commands, or one big job that needs to work for the entire duration of the batch. The former might work in SQL Azure with a form of connection retry logic, while the latter may not. In either case, you can also consider refactoring the processing logic in .NET.

Indeed, most batch processing activities are redesigned in the cloud as worker processes due to the SQL Azure throttling mechanism; basically, .NET code would run in Windows Azure, read the data it needs from SQL Azure, perform the needed calculations in memory, and save the results back to SQL Azure. Depending on the type of workload, this is probably the best approach, since you may be able to design it in a way that scales out nicely, potentially reducing the total execution time significantly (assuming you can break the data-crunching logic into smaller pieces and execute it in .NET instead of SQL Azure).
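The worker-process pattern described above - partition the keyspace, have each worker read its slice, compute in memory, and write results back in short operations - might be sketched roughly as follows. This is Python rather than .NET purely for brevity, plain dictionaries stand in for the SQL Azure reads/writes, and the squaring "calculation", chunk size, and worker count are arbitrary placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the database; a real worker would use a DB driver instead.
source = {i: float(i) for i in range(1, 101)}   # id -> raw value ("read from SQL Azure")
results = {}

def crunch(chunk):
    """One worker: read its slice, compute entirely in memory, and return
    the results to be written back in a single short operation."""
    return {row_id: source[row_id] ** 2 for row_id in chunk}  # placeholder calculation

ids = sorted(source)
chunks = [ids[i:i + 25] for i in range(0, len(ids), 25)]  # partition the keyspace

with ThreadPoolExecutor(max_workers=4) as pool:
    for partial in pool.map(crunch, chunks):
        results.update(partial)  # "save the results back to SQL Azure"

print(len(results), results[10])
```

The point of the structure is that no single database statement ever runs for more than a moment; all the long-running work happens in the workers, outside the database's throttling rules, and adding workers shortens the wall-clock time.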

Regarding backing up/restoring the data to an on-premises server, you have a few options that do not involve data scripting. If you decide to try to refactor in .NET, we can discuss these options further.

无人问我粥可暖 2024-12-02 15:26:27

A few points/questions:

  1. Is the code you're executing written in T-SQL or in some other programming language?
  2. Can the processing be executed in parallel, or must it be sequential?
  3. Where are the current bottlenecks? Is it in computation or data retrieval/storage?

Given what you've said so far and problems I've seen in the past with large databases, I would question whether SQL Server is an appropriate storage technology at all. True, it's intended for transaction-based queries, but you only have a single database table. That means the entire "relational database" aspect sort of goes out the window unless the table is self-referencing (which would create a world of other problems, so I'll ignore that for now and assume it's not the case). There are certainly ways to make sure that you don't run into race conditions when processing the data using NoSQL storage, and I can't imagine that transactions would be absolutely necessary. When doing computations, if storage of the result fails, you retry. Worst case scenario, you redo the computation.

32 GB of data for a SQL Server in a single table is a LOT of data and I'm guessing that there are probably indices of some kind in there. If you don't have your SQL Server configured properly (using a lot of physical spindles and splicing the data across them), you can easily run into major performance issues in SQL due to disk I/O.

Chances are really good that Microsoft will be able to scale SQL Azure quite a bit better than the average SQL developer because they know how it's supposed to be done. However, that doesn't mean that there aren't limitations to throughput or how quickly you can query/add data because there are.

My advice would be to look into using Azure Tables (which is basically a NoSQL table) because it would allow you to partition the data across multiple nodes. This partitioning allows you to scale the amount of data they hold to 100TB while not compromising the speed of the queries.

In addition, a 32GB SQL Azure database will cost you $400/month while 40GB of Azure Table storage with 5 million storage transactions would only cost you $11/month. You'll have to add in the "cost" of the worker nodes, but they should theoretically be equivalent. So the Tables option is cheaper per month, but if it's a business backing the project, then the cost is probably far less than the development time going into it.

You'll need to account for the time to pipe 32GB of data into the cloud. Loading a SQL database can take quite a while, and you're somehow going to need to get the data there. Depends on how quickly you can pipe the data into the cloud, and whether you can start processing before it's all there.
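As a back-of-envelope illustration of that transfer time (the sustained uplink figure below is an assumption for the sake of the arithmetic, not a measurement):

```python
# Rough upload time for the 32 GB dataset at an assumed sustained uplink.
size_bits = 32 * 8 * 10**9    # 32 GB expressed in bits (decimal GB)
uplink_bps = 20 * 10**6       # assumed 20 Mbit/s sustained upload rate
hours = size_bits / uplink_bps / 3600
print(f"~{hours:.1f} hours")  # roughly 3.6 hours at this assumed rate
```

At slower residential uplinks the same arithmetic stretches to a day or more, which is why this step belongs in the plan rather than as an afterthought.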

The problem I think you'll run into is that in order to use Azure Tables instead of SQL Azure, you're going to need to make some tradeoffs. You'll likely need to convert the data into Azure Tables, then write the processing code, etc. At the end of the day, it may not be worth it.

However, I don't think there's enough information here to make that call just yet. The really big question is whether or not there's an opportunity to parallelize the processing and how long you estimate processing would take on a single machine. The next questions to answer are how long it will take to build vs how much time you have to do it.

Judging from your comments about locking the database for days, I don't think it's out of line to assume that you're probably experiencing database issues right now. Depending on the additional processing you expect in the future, you might not have a choice but to evaluate a NoSQL option.

I hate to give an "it depends" answer here, but if you give some additional details, I'd be happy to update this and give you a better sense of where to go and what to do.
