What is sharding and why is it important?

Posted on 2024-07-24 01:21:05

I think I understand sharding to be putting your sliced-up data (the shards) back into an easy-to-deal-with aggregate that makes sense in the context. Is this correct?

Update: I guess I am struggling here. In my opinion, the application tier should have no business determining where data should be stored. At best it should be a shard client of some sort. Both responses answered the "what" but not the "why is it important" aspect. What implications does it have beyond the obvious performance gains? Are these gains sufficient to offset the MVC violation? Is sharding mostly important in very large-scale applications, or does it apply to smaller-scale ones?

Comments (8)

伏妖词 2024-07-31 01:21:05

Sharding is just another name for "horizontal partitioning" of a database. You might want to search for that term to get a clearer picture.

From Wikipedia:

Horizontal partitioning is a design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location. The advantage is the number of rows in each table is reduced (this reduces index size, thus improves search performance). If the sharding is based on some real-world aspect of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.

Some more information about sharding:

Firstly, each database server is identical, having the same table structure. Secondly, the data records are logically split up in a sharded database. Unlike the partitioned database, each complete data record exists in only one shard (unless there's mirroring for backup/redundancy) with all CRUD operations performed just in that database. You may not like the terminology used, but this does represent a different way of organizing a logical database into smaller parts.

Update: You won't break MVC. The work of determining the correct shard in which to store the data would be done transparently by your data access layer. There you would have to determine the correct shard based on the criteria you used to shard your database. (You have to shard the database manually into different shards based on some concrete aspects of your application.) Then, when loading and storing data, you have to take care to use the correct shard.
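
To illustrate what "done transparently by your data access layer" could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `ShardedUserRepository` class is hypothetical, the shards are plain dictionaries standing in for separate database servers, and hashing the key is just one possible sharding criterion.

```python
import hashlib


class ShardedUserRepository:
    """Hypothetical data access layer that hides shard selection from callers."""

    def __init__(self, shard_count):
        # Each dict stands in for a connection to one database server.
        self._shards = [{} for _ in range(shard_count)]

    def _shard_for(self, user_id):
        # A stable hash keeps a given key on the same shard across processes.
        digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]

    def save(self, user_id, record):
        # Callers never say where the record goes; the repository decides.
        self._shard_for(user_id)[user_id] = record

    def load(self, user_id):
        # Reads use the same rule, so they look in the shard that was written.
        return self._shard_for(user_id).get(user_id)


repo = ShardedUserRepository(shard_count=4)
repo.save(42, {"name": "alice"})
print(repo.load(42))
```

The calling code only sees `save` and `load`, so the controller and view layers stay unaware that more than one database exists.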

Maybe this example with Java code (it's about the Hibernate Shards project) makes it somewhat clearer how this would work in a real-world scenario.

To address the "why sharding": It's mainly for very large-scale applications with lots of data. First, it helps minimize response times for database queries. Second, you can use cheaper, "lower-end" machines to host your data instead of one big server, which might not suffice anymore.

做个ˇ局外人 2024-07-31 01:21:05

If you have queries to a DBMS for which the locality is quite restricted (say, a user only fires selects with a 'where username = $my_username'), it makes sense to put all the usernames starting with A-M on one server and all those starting with N-Z on the other. This way you get near-linear scaling for some queries.
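
A rough sketch of that A-M / N-Z split, assuming two servers with the made-up names `server_a_m` and `server_n_z`:

```python
def shard_for_username(username):
    """Return the name of the (hypothetical) server that holds this username."""
    if not username:
        raise ValueError("username must not be empty")
    first = username[0].upper()
    # Names starting with A-M go to one server; everything else to the other.
    return "server_a_m" if "A" <= first <= "M" else "server_n_z"


for name in ("alice", "mallory", "nina", "zed"):
    print(name, "->", shard_for_username(name))
```

Each lookup touches exactly one server, which is where the near-linear scaling comes from. The catch is that a range split like this only balances load if usernames are spread evenly across the alphabet.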

Long story short: Sharding is basically the process of distributing a table's data onto different servers in order to balance the load across them equally.

Of course, it's so much more complicated in reality. :)

世界如花海般美丽 2024-07-31 01:21:05

Sharding is horizontal (row-wise) database partitioning, as opposed to vertical (column-wise) partitioning, which is normalization. It separates very large databases into smaller, faster and more easily managed parts called data shards. It is a mechanism for building distributed systems.

Why do we need distributed systems?

  • Increased availability.
  • Easier expansion.
  • Economics: It costs less to create a network of smaller computers with the power of a single large computer.

You can read more here: Advantages of Distributed database

How does sharding help achieve a distributed system?

You can partition a search index into N partitions and load each partition on a separate server. If you query one server, you will get 1/Nth of the results. So to get the complete result set, a typical distributed search system uses an aggregator that accumulates the results from each server and combines them. The aggregator also distributes the query to each server. This aggregator program is called MapReduce in big data terminology. In other words, Distributed Systems = Sharding + MapReduce (although there are other things too).
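
The aggregator described above can be sketched as a simple scatter-gather loop. This is only an illustration under assumptions, not a MapReduce framework: the partitions are in-memory lists standing in for separate index servers, and `search_partition` and `search_all` are names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, term):
    # "Map" step: each server searches only its own slice of the index.
    return [doc for doc in partition if term in doc]


def search_all(partitions, term):
    # The aggregator sends the query to every partition in parallel...
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = pool.map(lambda p: search_partition(p, term), partitions)
        # ...and then combines the partial results ("reduce" step).
        merged = [doc for part in partial_results for doc in part]
    return sorted(merged)


# Three lists stand in for three index servers.
partitions = [
    ["sharding basics", "index tuning"],
    ["sharding counters", "replication"],
    ["query planning"],
]
print(search_all(partitions, "sharding"))
```

No single partition can answer the query on its own, so the aggregator has to wait for all of them before returning the combined result.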

[Figure: visual representation of a distributed system]

奈何桥上唱咆哮 2024-07-31 01:21:05

Is sharding mostly important in very
large scale applications or does it
apply to smaller scale ones?

Sharding is a concern if and only if your needs scale past what can be served by a single database server. It's a swell tool if you have shardable data and you have incredibly high scalability and performance requirements. I would guess that in the entire 12 years I've been a software professional, I've encountered one situation that could have benefited from sharding. It's an advanced technique with very limited applicability.

Besides, the future is probably going to be something fun and exciting like a massive object "cloud" that erases all potential performance limitations, right? :)

海之角 2024-07-31 01:21:05

Sharding was originally coined by Google engineers, and you can see it used pretty heavily when writing applications on Google App Engine. Since there are hard limits on the amount of resources your queries can use, and because queries themselves have strict limitations, sharding is not only encouraged but almost enforced by the architecture.

Another place sharding can be used is to reduce contention on data entities. When building scalable systems, it is especially important to watch out for those pieces of data that are written often, because they are always the bottleneck. A good solution is to shard off that specific entity and write to multiple copies, then read the total. An example of this is the "sharded counter" on GAE: http://code.google.com/appengine/articles/sharding_counters.html
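
The idea behind a sharded counter can be sketched in a few lines. This is not the code from the linked article and it does not use any App Engine API: `ShardedCounter` is a hypothetical class, and the list of integers stands in for separately stored counter entities.

```python
import random


class ShardedCounter:
    """Hypothetical counter split into several independently written slots."""

    def __init__(self, num_shards=8):
        # Each slot stands in for a separately stored counter entity.
        self._shards = [0] * num_shards

    def increment(self, amount=1):
        # Writes pick a random slot, so concurrent writers rarely collide.
        index = random.randrange(len(self._shards))
        self._shards[index] += amount

    def total(self):
        # Reads pay the price: they have to sum every slot.
        return sum(self._shards)


counter = ShardedCounter(num_shards=8)
for _ in range(1000):
    counter.increment()
print(counter.total())
```

The trade-off is cheaper, less contended writes in exchange for more expensive reads, which suits counters that are written far more often than they are read.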

半寸时光 2024-07-31 01:21:05

Sharding does more than just horizontal partitioning.
According to the Wikipedia article,

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which partition a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.

Sharding goes beyond this: it partitions the problematic table(s) in
the same way, but it does this across potentially multiple instances
of the schema. The obvious advantage would be that search load for the
large partitioned table can now be split across multiple servers
(logical or physical), not just multiple indexes on the same logical
server.

Also,

Splitting shards across multiple isolated instances requires more than
simple horizontal partitioning. The hoped-for gains in efficiency
would be lost, if querying the database required both instances to be
queried, just to retrieve a simple dimension table. Beyond
partitioning, sharding thus splits large partitionable tables across
the servers, while smaller tables are replicated as complete units.
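
A small sketch of that last point, under made-up assumptions: an `orders` table is split across two shards by `order_id`, while a small `countries` dimension table is copied in full to every shard. The `Shard` class and all table names are invented for the example.

```python
class Shard:
    """Stands in for one database instance."""

    def __init__(self, countries):
        self.orders = {}                  # this shard's slice of the large table
        self.countries = dict(countries)  # full local copy of the small table


COUNTRIES = {"DE": "Germany", "US": "United States"}
shards = [Shard(COUNTRIES) for _ in range(2)]


def shard_for(order_id):
    # Simple modulo rule, chosen only for illustration.
    return shards[order_id % len(shards)]


def add_order(order_id, country_code):
    shard_for(order_id).orders[order_id] = country_code


def order_with_country(order_id):
    # The lookup against the dimension table is answered by the same shard,
    # so no second instance has to be queried.
    shard = shard_for(order_id)
    code = shard.orders[order_id]
    return order_id, shard.countries[code]


add_order(1, "DE")
add_order(2, "US")
print(order_with_country(1))
```

Replicating the small table costs some storage and means updates to it must reach every shard, but it keeps each query local to one instance.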

风追烟花雨 2024-07-31 01:21:05

In my opinion the application tier
should have no business determining
where data should be stored

This is a good rule but, like most things, not always correct.

When you do your architecture you start with responsibilities and collaborations. Once you determine your functional architecture, you have to balance the non-functional forces.

If one of these non-functional forces is massive scalability, you have to adapt your architecture to cater for this force even if it means that your data storage abstraction now leaks into your application tier.

冬天旳寂寞 2024-07-31 01:21:05

Sorry for not going into detail here, but these two articles are the best I've found on sharding and the different strategies as to how this pattern can be implemented.

https://learn.microsoft.com/en-us/azure/architecture/patterns/sharding

Divide the data store into horizontal partitions or shards. Each shard
has the same schema, but holds its own distinct subset of the data. A
shard is a data store in its own right (it can contain the data for
many entities of different types), running on a server acting as a
storage node.

https://www.mongodb.com/features/database-sharding-explained

Sharding is a form of scaling known as horizontal scaling or
scale-out, as additional nodes are brought on to share the load.
Horizontal scaling allows for near-limitless scalability to handle big
data and intense workloads. In contrast, vertical scaling refers to
increasing the power of a single machine or single server through a
more powerful CPU, increased RAM, or increased storage capacity.
