慢更新与慢选择
这是一个关于权衡的问题。
想象一个社交网络。每个用户都有一个状态消息,他可以随时更改。每当他改变它时,他的所有朋友都会通过墙得到通知(就像在 Facebook 中一样)。
为了使这项工作成功。我们有 3 个表:Users(id, name)、FriendLists(userId、friendUserId)、Notifications(?)。
现在我们假设每个用户的朋友列表中有大约 50 个朋友。我面临着困境 - 如何实现通知表。
第一个选项
CREATE TABLE Notifications
(
toUserId bigint NOT NULL,
[identity] bigint IDENTITY(1,1) NOT NULL,
fromUserId bigint NOT NULL,
data varchar(256) NOT NULL,
CONSTRAINT [PK_Notifications] PRIMARY KEY CLUSTERED (toUserId, [identity])
)
发送通知:
-- Get all friends of @fromUserId.
WITH Friends AS
(SELECT FriendLists.friendUserId
FROM FriendLists
WHERE userId = @fromUserId)
-- Send updates to all friends.
SELECT
friendUserId as toUserId,
@fromUserId as fromUserId,
@data as data
INTO Notifications
FROM Friends
在这种情况下,对于每个状态更改,我们都会创建 50 条记录(假设有 50 个朋友)。这很糟糕。然而,好处是检索特定用户的通知非常快,因为我们在 toUserId 上有聚集索引。
第二个选项
CREATE TABLE Notifications
(
toUserId bigint NOT NULL,
[identity] bigint IDENTITY(1,1) NOT NULL,
fromUserId bigint NOT NULL,
data varchar(256) NOT NULL,
CONSTRAINT [PK_Notifications] PRIMARY KEY CLUSTERED ([identity])
)
CREATE NONCLUSTERED INDEX [IX_toUserId] ON Notifications (toUserId ASC)
发送通知:
-- Get all friends of @fromUserId.
WITH Friends AS
(SELECT FriendLists.friendUserId
FROM FriendLists
WHERE userId = @fromUserId)
-- Send updates to all friends.
INSERT INTO Notifications(toUserId, fromUserId, data)
VALUES(friendUserId, @fromUserId, @data)
这里我们只为每个状态更新插入一条记录。这很好。缺点是通知的检索会变慢,因为记录不是按 toUserId 聚集的。
两种方法的获取通知是相同的:
SELECT TOP(50) fromUserId, [identity], data
FROM Notifications
WHERE toUserId = @toUserId
那么您对此有何看法?
This is a question about tradeoffs.
Imagine a social network. Each user has a status message, that he can change anytime. Whenever he does change it, all his friends are notified through a wall (like in Facebook).
To make this work. We have 3 tables Users(id, name), FriendLists(userId, friendUserId), Notifications(?).
Now let's assume each user has approximately 50 friends in his friend list. I am faced with the dilemma - how to implement Notifications table.
1st option
CREATE TABLE Notifications
(
toUserId bigint NOT NULL,
[identity] bigint IDENTITY(1,1) NOT NULL,
fromUserId bigint NOT NULL,
data varchar(256) NOT NULL,
CONSTRAINT [PK_Notifications] PRIMARY KEY CLUSTERED (toUserId, [identity])
)
Send notifications:
-- Get all friends of @fromUserId.
WITH Friends AS
(SELECT FriendLists.friendUserId
FROM FriendLists
WHERE userId = @fromUserId)
-- Send updates to all friends.
SELECT
friendUserId as toUserId,
@fromUserId as fromUserId,
@data as data
INTO Notifications
FROM Friends
In this case, for each status change we create 50 records (assuming 50 friends). This is bad. However the good point is that to retrieve notifications for a specific user it is really fast, since we have a clustered index on the toUserId.
2nd option
CREATE TABLE Notifications
(
toUserId bigint NOT NULL,
[identity] bigint IDENTITY(1,1) NOT NULL,
fromUserId bigint NOT NULL,
data varchar(256) NOT NULL,
CONSTRAINT [PK_Notifications] PRIMARY KEY CLUSTERED ([identity])
)
CREATE NONCLUSTERED INDEX [IX_toUserId] ON Notifications (toUserId ASC)
Send notifications:
-- Get all friends of @fromUserId.
WITH Friends AS
(SELECT FriendLists.friendUserId
FROM FriendLists
WHERE userId = @fromUserId)
-- Send updates to all friends.
INSERT INTO Notifications(toUserId, fromUserId, data)
VALUES(friendUserId, @fromUserId, @data)
Here we only insert a single record per status update. This is good. The bad point is that the retrieval of the notifications is going to be slower, since records are not clustered by toUserId.
Getting notifications is same for both methods:
SELECT TOP(50) fromUserId, [identity], data
FROM Notifications
WHERE toUserId = @toUserId
So what is your take on this?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
首先,与写入相比,读取总是压倒性的,因为每堵“墙”被看到的次数比它被更新的次数要多得多。所以你最好加快阅读速度。
其次,这类大型社交网站固有的问题之一是数据的分布(分片、分区,没有一个数据库能够存储所有帐户、所有朋友、所有通知),这意味着当有新通知时如果被挂在墙上,则必须在其他服务器上通知好友。这意味着更新是异步的并且是基于消息传递的。
所以我肯定会选择一种针对阅读而优化的结构。
我建议您仔细阅读参与 Facebook 和 MySpace 等网站架构的各种人员所做的公开演示,例如 这个是 Christa Stelzmuller 的。他们解释了设计中的许多思考和推理。
First, reads are always going to be overwhelming in comparison with writes, because each 'wall' will be seen many more times than it will be updated. So you better make reads darn fast.
Second, one of the problem inherent in these kind of big social networking sites is the distribution of data (sharding, partitioning, no single database will ever be capable of storing all accounts, all friends, all notifications) which means that when a new notification is put on a wall, the friends have to be notified on other servers. This implies updates are asynchronous and messaging based anyway.
So I would definitely go with a structure optimized for reading.
I'd recommend you go over the public presentations done by various people involved in the architecture of sites like Facebook and MySpace, like this Christa Stelzmuller's one. They explain a lot of the thinking and reasoning that goes into their design.
与 SELECT 相比,更新非常慢……慢了几个数量级。另外,随着站点的扩展,您将在内存中缓存所有提取,因此选择的速度将微不足道。
Updates are very slow compared to SELECTs ... a few orders of magnitude. Plus, as your site scales you'll be caching all your fetches in memory, so the speed of selects will be trivial.
在这种情况下,在 (toUser,identity) 上创建聚集索引似乎是一个坏主意,因为聚集索引确实应该按升序插入。当然,SQL 会负责保持表排序,但这会带来很高的性能成本(这就是你问题的重点)。但一般来说,不建议提前知道不按特定顺序进行插入聚集索引。这是一个非常好的三个 部分 有关聚集索引建议的文章。
话虽如此,我还是坚持使用标识列作为聚集索引,并在 toUserId 上创建一个非聚集索引,也许还有一个日期时间列。通过包含日期时间列,您可以更有效地查询最近的数据。
关于更新缓慢,社交网站上的状态更新对于消息队列来说是一个完美的情况。这样,您就可以根据需要调整数据库以加快读取速度,并且如果它对写入性能产生影响,用户也不必遭受损失。从他们的角度来看,更新是即时的,尽管可能需要一些时间才能“坚持”。
对于非常大的数据库,我会听从 SQL 专家的意见,他们可以讨论分区策略(针对较新数据的更小、更易于管理的表,针对旧数据的更大/重度索引的表)和复制解决方案。
In this situation, it seems like a bad idea to create a clustered index on (toUser,identity) because a clustered index really ought to be inserted in ascending order. Of course SQL will take care of keeping the table sorted but this comes at a high performance cost (which is the point of your question.) But in general, inserts that are known ahead of time to be in no particular order are not recommended for clustered indexes. Here is a very good three part article about clustered index recommendations.
Having said that, I'd stick with the identity column as your clustered index and create a nonclustered index on toUserId and maybe a datetime column. By including a datetime column, you can more efficiently query for recent data.
Regarding slow updates, status updates on social networking sites are a perfect situation for message queues. That way you can tune the database as needed to make reads fast and if it has an impact on write performance, the user won't have to suffer. From their perspective the update was instantaneous even though it might take a few moments to "stick".
For very large databases I'll defer to the SQL gurus who can talk about partitioning strategies (smaller more manageable tables for newer data, larger/heavily indexed tables for older data) and replication solutions.