Event feed implementation: will it scale?
Situation:
I am currently designing a feed system for a social website in which each user has a feed of their friends' activities. I have two possible methods for generating the feeds, and I would like to ask which is better in terms of ability to scale.
Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.
Standard method:
When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to time out after 5 minutes. Scaling is achieved by varying this timeout.
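To make the standard method concrete, here is a minimal sketch, not the actual implementation. It assumes column names the question never gives (event_log.user_id, friends.user_id, friends.friend_id), uses an in-memory SQLite database as a stand-in for MySQL, and uses a plain dict as a stand-in for the cache.

```python
import sqlite3
import time

CACHE_TTL_SECONDS = 300  # the 5-minute timeout from the question


def make_db():
    # In-memory SQLite stands in for MySQL; all column names are assumed.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE event_log (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
        CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    """)
    return db


_feed_cache = {}  # user_id -> (expires_at, rows); stand-in for a real cache


def get_feed(db, user_id, now=None):
    """Return the cached feed if it is still fresh, else rebuild it with a join."""
    now = time.time() if now is None else now
    cached = _feed_cache.get(user_id)
    if cached and cached[0] > now:
        return cached[1]
    rows = db.execute(
        """
        SELECT e.id, e.user_id, e.body
        FROM event_log AS e
        INNER JOIN friends AS f ON f.friend_id = e.user_id
        WHERE f.user_id = ?
        ORDER BY e.id DESC
        """,
        (user_id,),
    ).fetchall()
    _feed_cache[user_id] = (now + CACHE_TTL_SECONDS, rows)
    return rows
```

Every cache miss pays for the full join, which is why many caches expiring at once is the worry here.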
Hypothesised method:
A task runs in the background and, for each new, unprocessed item in event_log, it creates entries in the database table user_feed, pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
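A rough sketch of that background task, under stated assumptions: the processed flag on event_log and all column names are invented for illustration, and SQLite again stands in for MySQL (where INSERT OR IGNORE would be spelled INSERT IGNORE).

```python
import sqlite3


def make_db():
    # Assumed schema; only the table names come from the question.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE event_log (id INTEGER PRIMARY KEY, user_id INTEGER,
                                body TEXT, processed INTEGER DEFAULT 0);
        CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
        CREATE TABLE user_feed (user_id INTEGER, event_id INTEGER,
                                PRIMARY KEY (user_id, event_id));
    """)
    return db


def fan_out_pending(db):
    """Pair each unprocessed event with every friend of its author.

    Returns the number of events handled in this pass.
    """
    pending = db.execute(
        "SELECT id, user_id FROM event_log WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for event_id, author_id in pending:
        # One user_feed row per (friend, event) pair.
        db.execute(
            """
            INSERT OR IGNORE INTO user_feed (user_id, event_id)
            SELECT f.user_id, ? FROM friends AS f WHERE f.friend_id = ?
            """,
            (event_id, author_id),
        )
        db.execute("UPDATE event_log SET processed = 1 WHERE id = ?", (event_id,))
    db.commit()
    return len(pending)


def read_feed(db, user_id):
    # Reading needs no join: a single-table lookup keyed on user_id.
    return [row[0] for row in db.execute(
        "SELECT event_id FROM user_feed WHERE user_id = ? ORDER BY event_id DESC",
        (user_id,),
    )]
```

The read path touches only user_feed, which is what would make sharding by user_id plausible.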
The problems with the standard method are well known: what if a lot of people's caches expire at the same time? The solution also does not scale well, because the brief is for feeds to update as close to real time as possible, which leaves little room to lengthen the timeout.
The hypothesised solution in my eyes seems much better; all processing is done offline so no user waits for a page to generate and there are no joins so database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, then that results in inserting 2,000,000 rows into the database.
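To put numbers on that worst case: 100,000 friends × 20 events = 2,000,000 rows. The sketch below does not show whether MySQL copes with that. It only generates the (friend, event) pairs lazily and groups them into batches, so that a worker could issue multi-row inserts rather than one statement per row. The helper names and the batch size are hypothetical.

```python
from itertools import islice


def feed_rows(friend_ids, event_ids):
    """Yield one (friend_id, event_id) pair per user_feed row, lazily."""
    for event_id in event_ids:
        for friend_id in friend_ids:
            yield (friend_id, event_id)


def chunked(rows, size):
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def count_batches(friend_count, event_count, batch_size):
    """Return (total_rows, batch_count) for one user's session."""
    rows = feed_rows(range(friend_count), range(event_count))
    total = batches = 0
    for batch in chunked(rows, batch_size):
        total += len(batch)
        batches += 1
    return total, batches
```

With an assumed batch size of 10,000 rows, the session in the question becomes 200 multi-row inserts instead of 2,000,000 single-row ones.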
Question:
The question boils down to two points:
- Is the worst-case scenario mentioned above problematic, i.e. does table size have an impact on MySQL performance, and are there any issues with mass-inserting this much data for each event?
- Is there anything else I have missed?
Comments (2)
I think your hypothesised system generates too much data. Firstly, on the global scale, the storage and indexing requirements on user_feed seem to grow faster than linearly as your user base becomes larger and more interconnected (both presumably desirable for a social network). Secondly, consider what happens if, in the course of a minute, 1,000 users each enter a new message and each has 100 friends: your background thread then has 100,000 inserts to do and might quickly fall behind.
I wonder if a compromise might be made between your two proposed solutions, where a background thread updates a table last_user_feed_update, which contains a single row for each user and a timestamp for the last time that user's feed was changed.
Then, although the full join and query would still be required to refresh the feed, a quick query to the last_user_feed_update table will tell you whether a refresh is required or not. This seems to mitigate the biggest problems with your standard method as well as avoid the storage size difficulties, but that background thread still has a lot of work to do.
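One way this compromise might look, as a sketch only: the updated_at column, the record_event helper and the dict cache are assumptions, and SQLite stands in for MySQL (where INSERT OR REPLACE would be REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE).

```python
import sqlite3


def make_db():
    # Assumed schema; only the table names come from the thread.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE event_log (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
        CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
        CREATE TABLE last_user_feed_update (user_id INTEGER PRIMARY KEY,
                                            updated_at REAL);
    """)
    return db


def record_event(db, author_id, body, now):
    """Log an event and touch one timestamp row per friend of the author."""
    db.execute("INSERT INTO event_log (user_id, body) VALUES (?, ?)",
               (author_id, body))
    # Overwrites an existing row, so the table stays at one row per user.
    db.execute(
        """
        INSERT OR REPLACE INTO last_user_feed_update (user_id, updated_at)
        SELECT f.user_id, ? FROM friends AS f WHERE f.friend_id = ?
        """,
        (now, author_id),
    )


_cache = {}  # user_id -> (built_at, rows)


def get_feed(db, user_id, now):
    """Serve the cached feed unless the timestamp says it is stale."""
    row = db.execute(
        "SELECT updated_at FROM last_user_feed_update WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    last_change = row[0] if row else None
    cached = _cache.get(user_id)
    if cached and (last_change is None or last_change < cached[0]):
        return cached[1]
    rows = db.execute(
        """
        SELECT e.id, e.user_id, e.body
        FROM event_log AS e
        INNER JOIN friends AS f ON f.friend_id = e.user_id
        WHERE f.user_id = ?
        ORDER BY e.id DESC
        """,
        (user_id,),
    ).fetchall()
    _cache[user_id] = (now, rows)
    return rows
```

The write side still touches one row per friend for every event, but the table never grows beyond one row per user.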
The hypothesized method works better when you limit the maximum number of friends. A lot of sites set a safe upper bound, including Facebook, IIRC. That limits the "hiccups" when your user with 100K friends generates activity.
Another problem with the hypothesized model is that some of the friends you are essentially pre-generating cache for may sign up and hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the burden that these inactive users will cost you.
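Both ideas, a friend cap and skipping inactive users, could be applied when choosing fan-out targets. The sketch below is only an illustration: the cap, the 30-day window and both function names are hypothetical. Users who are skipped would need some fallback, such as the join-on-read method, when they do log in.

```python
from datetime import datetime, timedelta

MAX_FRIENDS = 5000                  # hypothetical cap on friends per user
ACTIVE_WINDOW = timedelta(days=30)  # hypothetical definition of "active"


def can_add_friend(current_friend_count, max_friends=MAX_FRIENDS):
    """Refuse new friendships once a user reaches the cap."""
    return current_friend_count < max_friends


def fan_out_targets(friend_last_login, now, window=ACTIVE_WINDOW):
    """Return ids of friends worth pre-generating feed rows for.

    friend_last_login maps friend id -> last login datetime, or None
    if the friend has never logged in.
    """
    active = [
        friend_id
        for friend_id, last_login in friend_last_login.items()
        if last_login is not None and now - last_login <= window
    ]
    return sorted(active)
```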
I've thought about this problem many times, and it's not a problem MySQL is going to be good at solving. I've thought of ways I could use memcached, where each user pushes their latest few status items to "their key" (and in a feed-reading activity you fetch and aggregate all your friends' keys), but I haven't tested this. I'm not sure of all the pros and cons yet.