Is storage and compute decoupled in modern cloud data warehouses?

Published 2025-02-03 09:34:35 · 240 characters · 3 views · 0 comments


  • In Redshift, Snowflake, and Azure SQL DW, do we have storage and compute decoupled?

    • If they are decoupled, is there still any use for "External Tables", or are they gone?
  • When compute and storage were tightly coupled and we wanted to scale, we scaled both compute and storage. But under the hood, was it a virtual machine, and did we scale the compute and the VMs' disks? Do you guys maybe have some reading on this?

Massive thanks, I am confused now and it would be a blessing if someone could jump in to explain!


Comments (2)

各空 2025-02-10 09:34:35


You have reason to be confused as there is a heavy layer of marketing being applied in a lot of places. Let's start with some facts:

All databases need local disk to operate. This disk can store permanent versions of the tables (classic locally stored tables) and is needed to store the local working set of data for the database to operate. Even in cases where no tables are permanently stored on local disk, the size of the local disks is critical, as it allows data fetched from remote storage to be worked on and cached.

Remote storage of permanent tables comes in two "flavors" - defined external tables and transparent remote tables. While there are lots of differences in how these flavors work and how each database optimizes them, they all store the permanent version of the table on disks that are remote from the database compute system(s).

Remote permanent storage comes with pros and cons. "Decoupling" is the most often cited advantage for remote permanent storage. This just means that you cannot fill up the local disks with the storage of "cold" data as only "in use" data is stored on the local disks in this case. To be clear you can fill up (or brown out) the local disks even with remote permanent storage if the working set of data is too large. The downside of remote permanent storage is that the data is remote. Being across a network to some flexible storage solution means that getting to the data takes more time (with all the database systems having their own methods to hide this in as many cases as possible). This also means that the coherency control for the data is also across the network (in some aspect) and also comes with impacts.
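
To make the working-set point concrete, here is a toy Python sketch (not any real warehouse's caching code) where the local disk is modeled as an LRU cache over remotely stored blocks. Once the working set exceeds the local cache, every read goes back across the network:

```python
from collections import OrderedDict

class BlockCache:
    """Toy LRU cache standing in for a compute node's local disk.

    remote_fetches counts trips to remote storage; hits are served locally.
    """
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.cache = OrderedDict()
        self.remote_fetches = 0

    def read(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # hit: mark as recently used
            return
        self.remote_fetches += 1              # miss: go across the network
        self.cache[block_id] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used

# Working set fits on local disk: only the first pass misses.
small = BlockCache(capacity_blocks=10)
for _ in range(3):
    for b in range(10):
        small.read(b)
print(small.remote_fetches)  # 10

# Working set larger than local disk: every read misses (thrashing),
# even though the permanent data is safely in remote storage.
large = BlockCache(capacity_blocks=10)
for _ in range(3):
    for b in range(20):
        large.read(b)
print(large.remote_fetches)  # 60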

External tables and transparent remote tables are both permanently stored remotely but there are differences. An external table isn't under the same coherency structure that a fully-owned table is under (whether local or remote). Transparent remote just implies that the database is working with the remote table "as if" it is locally owned.
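
A toy sketch of that coherency difference (purely illustrative, not any vendor's implementation): the engine versions and controls every write to a fully-owned table, while an external table is just a pointer to data that can change behind the engine's back:

```python
class Engine:
    """Toy engine: owned tables are versioned by the engine (coherency);
    external tables are outside data the engine reads but does not own."""
    def __init__(self):
        self.owned = {}      # name -> (version, rows)
        self.external = {}   # name -> reference to outside storage

    def write_owned(self, name, rows):
        version = self.owned.get(name, (0, []))[0] + 1
        self.owned[name] = (version, list(rows))  # engine is the only write path

    def register_external(self, name, storage):
        self.external[name] = storage             # engine only holds a pointer

    def read(self, name):
        if name in self.owned:
            return self.owned[name][1]
        return list(self.external[name])          # re-read whatever is there now

engine = Engine()
engine.write_owned("sales", [1, 2])

files = [10, 20]                 # data living outside the engine
engine.register_external("ext_sales", files)

files.append(30)                 # external data mutated behind the engine's back
print(engine.read("ext_sales"))  # [10, 20, 30] -- no coherency guarantee
```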

VMs don't change the local disk situation. The physical disks in the box are apportioned among its VMs, with some amount of local disk allocated to each. The disks are still local; it's just that only a portion of the physical disks is addressable by any one VM.

So, leaving fact and moving to opinion: while marketing will tell you that one type of database storage is better than the other in all cases, this just isn't true. Each has advantages and disadvantages, and which is best for you will depend on what your needs are. The database providers that offer only one data organization will tell you that it's the best option - and it is, for some.

Local table storage will always be faster for those applications where speed of access to data is critical and caching doesn't work. However, this means that DBAs will need to do the work of keeping the on-disk data optimized and fitting in the available local storage (for the compute size needed). This is real work and takes time and energy. What you gain in moving remote is the reduction of this work, but it comes at the cost of some combination of database cost, hardware cost, and/or performance. Sometimes the tradeoff is worth it, sometimes not.

萌辣 2025-02-10 09:34:35


When it comes to the concept of separating (or de-coupling) Cloud Compute vs. Cloud Storage, the concepts can become a little confusing. In short, true decoupling generally requires object level storage vs. faster traditional block storage (traditionally on-premises and also called local storage). The main reason for this is that object storage is flat, without a hierarchy and therefore scales linearly with the amount of data you add. It therefore winds up also being cheaper as it is extremely distributed, redundant, and easily re-distributed and duplicated.

This is all important because, in order to decouple storage from compute in the cloud or any large distributed computing paradigm, you need to shard (split) your data (storage) amongst your compute nodes. So as your storage grows linearly, object storage - which is flat - allows that to happen without any performance penalty, while you can (practically) instantly "remaster" your compute nodes so that they evenly redistribute the workload again as you scale compute up or down, or to withstand network/node failures.
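
A minimal Python sketch of that "remastering" idea (illustrative only; real systems use consistent hashing or range partitioning rather than this naive modulo scheme): scaling compute only changes the key-to-node mapping, while the flat object store underneath is untouched:

```python
import hashlib

def shard_for(key, nodes):
    """Toy shard placement: hash the key and pick a node by modulo."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node-0", "node-1", "node-2"]
keys = [f"block-{i}" for i in range(12)]

before = {k: shard_for(k, nodes) for k in keys}

# "Remaster": scale compute from 3 to 4 nodes. The object store is not
# touched -- only the key -> node mapping changes, and each node then
# pulls the blocks it now owns from remote storage.
nodes.append("node-3")
after = {k: shard_for(k, nodes) for k in keys}

moved = sum(before[k] != after[k] for k in keys)
print(f"{moved} of {len(keys)} blocks reassigned")
```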
