What is sharding and why is it important?
I think I understand sharding to be putting your sliced-up data (the shards) back into an easy-to-handle aggregate that makes sense in context. Is this correct?
Update: I guess I am struggling here. In my opinion, the application tier should have no business determining where data is stored. At best, it should be a shard client of some sort. Both responses answered the "what" but not the "why is it important" aspect. What implications does sharding have beyond the obvious performance gains? Are these gains sufficient to offset the MVC violation? Is sharding mostly important in very large-scale applications, or does it also apply to smaller ones?
Comments (8)
Sharding is just another name for "horizontal partitioning" of a database. You might want to search for that term to get a clearer picture.
From Wikipedia:
Some more information about sharding:
Update: You won't break MVC. The work of determining the correct shard in which to store the data would be done transparently by your data access layer. There you would determine the correct shard based on the criteria you used to shard your database. (You have to shard the database manually into several shards based on some concrete aspect of your application.) You then have to take care to use the correct shard when loading data from and storing data into the database.
Maybe this example with Java code (it's about the Hibernate Shards project) makes it somewhat clearer how this would work in a real-world scenario.
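To make the "data access layer picks the shard" idea concrete, here is a minimal Python sketch. It is not the Hibernate Shards API; the class name `ShardedUserStore`, the hash-based routing rule, and the in-memory dictionaries standing in for database servers are all assumptions made for illustration.

```python
import hashlib


class ShardedUserStore:
    """Hypothetical data access layer that hides shard selection from callers."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate database server (an assumption).
        self._shards = [{} for _ in range(shard_count)]

    def _shard_for(self, user_id):
        # Use a stable hash so the same key always maps to the same shard.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]

    def save(self, user_id, record):
        # The caller never says which shard to use; the store decides.
        self._shard_for(user_id)[user_id] = record

    def load(self, user_id):
        # Reads go through the same routing rule as writes.
        return self._shard_for(user_id).get(user_id)
```

Application code would only call `save` and `load`, so the sharding criterion stays inside the data access layer and can change without touching controllers or views.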
To address the "why sharding" question: it's mainly for very large-scale applications with lots of data. First, it helps minimize response times for database queries. Second, you can host your data on several cheaper, "lower-end" machines instead of one big server, which might not suffice anymore.
If you have queries to a DBMS for which the locality is quite restricted (say, a user only fires selects with a 'where username = $my_username'), it makes sense to put all the usernames starting with A-M on one server and all those starting with N-Z on another. This gives you near-linear scaling for some queries.
Long story short: Sharding is basically the process of distributing tables onto different servers in order to balance the load evenly across them.
Of course, it's so much more complicated in reality. :)
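The A-M / N-Z split described above could be sketched roughly like this in Python. The function name and the server labels `server_a_to_m` and `server_n_to_z` are hypothetical, and sending non-alphabetic usernames to the second server is an arbitrary choice made only to keep the sketch total.

```python
def shard_for_username(username):
    """Return the (hypothetical) server label that holds this username."""
    if not username:
        raise ValueError("username must not be empty")
    first = username[0].upper()
    # A-M goes to the first server; N-Z and anything else goes to the second.
    if "A" <= first <= "M":
        return "server_a_to_m"
    return "server_n_to_z"
```

A query such as `where username = $my_username` then only ever touches the one server returned for that username, which is where the near-linear scaling comes from.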
Sharding is horizontal (row-wise) database partitioning, as opposed to vertical (column-wise) partitioning, which is closer to what normalization does. It separates very large databases into smaller, faster, and more easily managed parts called data shards. It is one mechanism for building distributed systems.
Why do we need distributed systems?
You can read more here: Advantages of Distributed database
How does sharding help achieve a distributed system?
You can partition a search index into N partitions and load each partition onto a separate server. If you query one server, you will get roughly 1/Nth of the results. So to get the complete result set, a typical distributed search system uses an aggregator that distributes the query to each server, accumulates the results from each one, and combines them. This scatter-and-combine pattern resembles what big data terminology calls MapReduce. In other words, loosely speaking, Distributed Systems = Sharding + MapReduce (although there are other things too).
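The aggregator described above might look something like the following Python sketch. Everything here is an assumption for illustration: partitions are plain dictionaries mapping a document id to its text, the "servers" are threads in one process, and matching is a simple substring test.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, term):
    """Search one index partition; a partition maps doc id -> text."""
    return [doc_id for doc_id, text in partition.items() if term in text]


def search_all(partitions, term):
    """Send the query to every partition and merge the partial results."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Scatter: each partition is searched independently.
        partials = pool.map(lambda p: search_partition(p, term), partitions)
        # Gather: combine the per-partition hits into one sorted list.
        return sorted(doc_id for partial in partials for doc_id in partial)
```

Calling `search_partition` on a single partition returns only that partition's share of the hits, while `search_all` returns the complete result set.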
[Figure: visual representation of sharding plus aggregation]
Sharding is a concern if and only if your needs scale past what a single database server can serve. It's a swell tool if you have shardable data and incredibly high scalability and performance requirements. I would guess that in the 12 years I've been a software professional, I've encountered one situation that could have benefited from sharding. It's an advanced technique with very limited applicability.
Besides, the future is probably going to be something fun and exciting like a massive object "cloud" that erases all potential performance limitations, right? :)
The term "sharding" was originally coined by Google engineers, and you can see it used pretty heavily when writing applications on Google App Engine. Since there are hard limits on the amount of resources your queries can use, and because queries themselves have strict limitations, sharding is not only encouraged but almost enforced by the architecture.
Another place sharding can be used is to reduce contention on data entities. When building scalable systems, it is especially important to watch out for pieces of data that are written often, because they tend to become the bottleneck. A good solution is to shard off that specific entity and write to multiple copies, then read the total. An example of this "sharded counter" with respect to GAE: http://code.google.com/appengine/articles/sharding_counters.html
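As a rough illustration of the "write to multiple copies, then read the total" idea, here is an in-memory Python sketch. It does not use the App Engine datastore API; the class name `ShardedCounter`, the default of 8 shards, and the random shard choice are assumptions, and in a real datastore each shard would be a separate entity.

```python
import random


class ShardedCounter:
    """Hypothetical counter split across several independently written shards."""

    def __init__(self, num_shards=8):
        # One slot per shard; each would be its own entity in a real datastore.
        self._shards = [0] * num_shards

    def increment(self, amount=1):
        # Pick a shard at random so concurrent writers rarely hit the same one.
        index = random.randrange(len(self._shards))
        self._shards[index] += amount

    def total(self):
        # Reading the counter means summing every shard.
        return sum(self._shards)
```

The trade-off is that writes get cheaper and less contended while reads get more expensive, since a read has to visit every shard.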
Sharding does more than just horizontal partitioning.
According to the Wikipedia article,
Also,
This is a good rule but, like most things, not always correct.
When you do your architecture you start with responsibilities and collaborations. Once you determine your functional architecture, you have to balance the non-functional forces.
If one of these non-functional forces is massive scalability, you have to adapt your architecture to cater for this force even if it means that your data storage abstraction now leaks into your application tier.
Sorry for not going into detail here, but these two articles are the best I've found on sharding and the different strategies as to how this pattern can be implemented.
https://learn.microsoft.com/en-us/azure/architecture/patterns/sharding
https://www.mongodb.com/features/database-sharding-explained