社交应用程序的数据库设计和优化注意事项
通常的情况。我有一个简单的应用程序,允许人们上传照片并关注其他人。因此,每个用户都会有类似“墙”或“活动源”的东西,他或她可以在其中看到他/她的朋友(他或她关注的人)上传的最新照片。
大多数功能都很容易实现。然而,当涉及到这个历史活动源时,由于纯粹的性能原因,事情很容易变得一团糟。
我在这里遇到了以下困境: 我可以轻松地将活动源设计为数据库的规范化部分,这将节省我的编写周期,但会极大地增加为每个用户选择这些结果时的复杂性(对于在特定时间段内上传的每张照片,选择特定数量,我关注的上传者/对于我关注的每个人,选择他的照片)
优化选项可以是引入一系列阈值约束,例如,允许我根据日期对我关注的人进行排序他们最后上传的照片,甚至排除一些,以节省周期,并且对于每个用户,仅选择 5 张(例如)最后上传的照片。
第二种方法是为活动提要引入完全非规范化的模式,其中每一行代表我的一位关注者的通知。这意味着我每次上传照片时,DB都会在这个“drop Bucket”中放入n行,n表示我关注的人数,即大量的写入周期。不过,如果我有这样一个表,我可以轻松地应用一些优化技术,例如巧妙的索引,以及修剪早于特定时间段(队列)的条目。
然而,我想到的第三种方法是一种不太非规范化的模式,其中服务器端应用程序将消除数据库的部分复杂性。我发现一些社交应用程序(例如friendfeed)严重依赖于数据库中序列化对象(例如JSON对象)的存储。
我肯定仍然掌握可扩展数据库设计的技能,所以我确信我错过了很多东西,或者仍然需要学习。如果有人至少能为我指明正确的方向,我将不胜感激。
The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).
Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.
I have come to the following dilemma here:
i can easily design the activity feed as a normalized part of the database, which will save me writing cycles, but will enormously increase the complexity when selecting those results for each user (for each photo uploaded within a certain time period, select a certain number, whose uploaders I am following / for each person I follow, select his photos )
An optimization option could be the introduction of a series of threshold constraints which, for instance would allow me to order the people I follow on the basis of the date of their last upload, even exclude some, to save cycles, and for each user, select only the 5 (for example) last uploaded photos.
The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n meaning the number of people I follow, i.e. lots of writing cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (queue).
Yet, a third approach that comes to mind, is even a less denormalized schema where the server side application will take some part of the complexity off the DB. I saw that some social apps such as friendfeed, heavily rely on the storage of serialized objects such as JSON objects in the DB.
I am definitely still mastering the skill of scalable DB design, so I am sure that there are many things I've missed, or still to learn. I would highly appreciate it if someone could give me at least a light in the right direction.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
如果您的申请成功,那么很有可能您的读取次数将多于写入次数 - 我只上传一次照片(写入),但我的每个朋友在刷新提要时都会读取它。因此,您应该优化快速读取,而不是快速写入,这指向非规范化模式的方向。
这里的问题是,如果您拥有大量用户,您创建的数据量可能很快就会失控。非常大的表很难在数据库上查询,因此再次存在潜在的性能问题。 (还有一个问题是是否有足够的存储空间,但这更容易解决)。
如果按照您的建议,您可以在一段时间后删除行,那么这可能是一个很好的解决方案。随着您的成长和遇到性能问题,您可以减少该时间(最多一定程度)。
关于存储序列化对象,如果这些对象是不可变的(写入后不会更改它们)并且不需要对它们进行索引或查询,那么这是一个不错的选择。请注意,如果您对数据进行非规范化,则可能意味着您有一个用于活动源的表。在这种情况下,我认为存储 blob 几乎没有什么好处。
如果您要采用序列化对象的方式,请考虑使用一些 NoSQL 解决方案,例如 CouchDB - 它们针对处理此类数据进行了更好的优化,因此原则上您应该在相同的硬件设置下获得更好的性能。
请注意,我并不是建议您将所有数据移至 NoSQL - 仅针对它是更好的解决方案的部分。
最后,从经验来看,需要注意的是:构建可扩展的应用程序非常困难,并且需要将时间花在其他地方。在担心如何为数百万用户提供服务之前,您应该花时间考虑如何让数百万用户使用您的应用程序 - 第一个是更困难的问题。当您取得巨大成功时,您可以重新架构并重建您的应用程序。
If your application is successful, then it's a good bet that you'll have more reads than writes - I only upload a photo once (write), but each of my friends reads it whenever they refresh their feed. Therefore you should optimize for fast reads, not fast writes, which points in the direction of a denormalized schema.
The problem here is that the amount of data you create could quickly get out of hand if you have a large number of users. Very large tables are hard on the db to query, so again there's a potential performance issue. (There's also the question of having enough storage, but that's much more easily solved).
If, as you suggest, you can delete rows after a certain amount of time, then this could be a good solution. You can reduce that amount of time (up to a point) as you grow and run into performance issues.
Regarding storing serialized objects, it's a good option if these objects are immutable (you won't change them after writing) and you don't need to index them or query on them. Note that if you denormalize your data, it probably means that you have a single table for the activity feed. In that case I see little gain in storing blobs.
If you're going the serialized objects way, consider using some NoSQL solution, such as CouchDB - they're better optimized for handling that kind of data, so in principle you should get better performance for the same hardware setup.
Note that I'm not suggesting that you move all your data to NoSQL - only for that part where it's a better solution.
Finally, a word of caution, spoken from experience: building an application that can scale is hard and takes time better spent elsewhere. You should spend your times worrying about how to get millions of users to your app before you worry about how you're going to serve those millions - the first is the more difficult problem. When you get to the point that you're hugely successful, you can re-architect and rebuild your application.
您可以采取多种选择
id, user_id, event_name, date, event_parameters'
- 例如:1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, < ;图片的id>
和最重要的是,这个表将在内存中。不再需要担心写入性能。记录过去(即 3 天)后,如果用户选择返回那么远的时间,则可以将它们清除到另一个表(在非内存中)并包含在查询结果中。通过将所有这些都放在一个表中,您无需执行多个查询和 SELECT 来构建此信息。值得阅读的好资源
There are many options you can take
id, user_id, event_name, date, event_parameters'
-- an example would be:1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, <id of picture>
and most important of all, this table will be in memory. No longer need to worry about write performance. After the records go past i.e. 3 days they can be purged into another table (in non-memory) and included into the query results, if the user chooses to go back that far. By having all this in one table you remove having to do multiple queries and SELECTs to build up this information.Good Resources to read
我可能会从使用规范化模式开始,以便您可以快速而紧凑地编写。然后使用非事务性(无锁定)读取将信息拉回,确保使用游标,以便您可以在结果返回时对其进行处理,而不是等待整个结果集。由于听起来这些信息没有任何特定的关键含义,因此您实际上不需要担心通常会阻止您进行事务性读取的问题锁定。
I would probably start with using a normalized schema so that you can write quickly and compactly. Then use non transactional (no locking) reads to pull the information back out making sure to use a cursor so that you can process the results as they're coming back as opposed to waiting for the entire result set. Since it doesn't sound like the information has any particular critical implications you don't really need to worry about a lock of the concerns that would normally push you away from transactional reads.
这些问题就是目前使用 NOSql 解决方案的原因。我在之前的项目中所做的事情非常简单。我不会在内存存储中保留仅包含 feed'id 的 user->wall user->history(我最喜欢的是 redis)。因此,在每次插入中,我都会对数据库执行 1 次插入操作,并在内存存储中执行(n*读取优化)插入操作。我设计内存存储来优化我的读取。如果我想过滤视频的用户历史记录(或墙),我会将推送 feedid 放入诸如 user::{userid}::wall::videos 之类的列表中。
当然,您也可以纯粹在内存存储中构建系统,但是有两个系统做它们最擅长的事情是很好的。
编辑 :
查看这些应用程序以获得一个想法:
http://retwis.antirez.com/
http://twissandra.com/
These kind of problems are why currently NOSql solutions used these days. What I did in my previos projecs is really simple. I don't keep user->wall user->history which contains purely feed'ids in memory stores(my favorite is redis). so in every insert I do 1 insert operation on database and (n*read optimization) insert operation in memory store. I design memory store to optimize my reads. if I want to filter user history (or wall) for videos I put a push feedid to a list like user::{userid}::wall::videos.
Well ofcourse you can purely build the system in memstores aswell but its nice to have 2 systems doing what they are doing the best.
edit :
checkout these applications to get an idea:
http://retwis.antirez.com/
http://twissandra.com/
我阅读了越来越多的有关 NoSQL 解决方案的文章以及人们提出的建议,但是没有人提到这种选择的缺点。
对我来说最明显的是缺乏交易 - 想象一下,如果您时不时地丢失一些记录(有案例报告这种情况经常发生)。
但是,令我惊讶的是,没有人提到 MySQL 被用作 NoSQL - 这是一些阅读的链接。
最后,无论您选择哪种解决方案(关系数据库或 NoSQL 存储),它们都以类似的方式进行扩展 - 通过跨网络分片数据(当然,还有更多选择,但这是最明显的一种)。由于 NoSQL 的工作量较少(没有 SQL 层,因此 CPU 周期不会浪费在解释 SQL 上),因此速度更快,但也可能达到顶峰。
正如 Elad 已经指出的那样 - 构建一个从一开始就可扩展的应用程序是一个痛苦的过程。最好花时间专注于让它流行起来,然后再扩大规模。
I'm reading more and more about NoSQL solutions and people suggesting them, however no one ever mentions drawbacks of such choice.
Most obvious for me is lack of transactions - imagine if you lost a few records every now and then (there are cases reporting this happens often).
But, what I'm surprised with is that no one mentions MySQL being used as NoSQL - here's a link for some reading.
In the end, no matter what solution you choose (relational database or NoSQL storage), they scale in similar manner - by sharding data across network (naturally, there are more choices but this is the most obvious one). Since NoSQL does less work (no SQL layer so CPU cycles aren't wasted on interpreting SQL), it's faster, but it can hit the roof too.
As Elad already pointed out - building an app that's scalable from the get go is a painful process. It's better that you spend time focusing on making it popular and then scale it out.