Facebook database design?
I have always wondered how Facebook designed the friend <-> user relation.
I figure the user table is something like this:
user_email PK
user_id PK
password
I figure there is a table with the user's data (sex, age, etc.), connected via the user email, I would assume.
How does it connect all the friends to this user?
Something like this?
user_id
friend_id_1
friend_id_2
friend_id_3
friend_id_N
Probably not, because the number of users is unknown and will expand.
Keep a friend table that holds the UserID and then the UserID of the friend (we will call it FriendID). Both columns would be foreign keys back to the Users table.
Somewhat useful example:
Example Usage:
This will show that Bob is friends with both Jon and Joe, and that Jon is also friends with Joe. In this example we assume that friendship is always two-way, so you do not need rows such as (2,1) or (3,2) in the table because they are already represented in the other direction. Where friendship or another relation is not inherently two-way, you would also need those rows to indicate a two-way relationship.
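A minimal sketch of this layout, using Python's built-in sqlite3 module. The table and column names (`users`, `friends`) and the ID assignment Bob = 1, Jon = 2, Joe = 3 are assumptions for illustration. The `CHECK` constraint is just one possible way to enforce the "store each pair once" convention; it is not claimed to be what any real site does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE friends (
    user_id   INTEGER NOT NULL REFERENCES users(user_id),
    friend_id INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (user_id, friend_id),
    CHECK (user_id < friend_id)  -- assumption: each pair stored once, lower ID first
);
INSERT INTO users VALUES (1, 'Bob'), (2, 'Jon'), (3, 'Joe');
INSERT INTO friends VALUES (1, 2), (1, 3), (2, 3);
""")


def friends_of(conn, user_id):
    """Return friend names; both columns are searched because each pair is stored once."""
    rows = conn.execute(
        """
        SELECT u.name FROM friends f JOIN users u ON u.user_id = f.friend_id
        WHERE f.user_id = ?
        UNION
        SELECT u.name FROM friends f JOIN users u ON u.user_id = f.user_id
        WHERE f.friend_id = ?
        ORDER BY 1
        """,
        (user_id, user_id),
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `friends_of(conn, 2)` looks up Jon's friends in both directions, which is the price of storing each friendship only once.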
TL;DR:
They use a stack architecture with cached graphs for everything above the MySQL bottom of their stack.
Long Answer:
I did some research on this myself because I was curious how they handle their huge amount of data and search it quickly. I've seen people complaining about custom-made social network scripts becoming slow when the user base grows. After I did some benchmarking myself with just 10k users and 2.5 million friend connections - not even bothering with group permissions, likes, and wall posts - it quickly turned out that this approach is flawed. So I spent some time searching the web for how to do it better and came across this official Facebook article:
I really recommend that you watch the presentation in the first link above before you continue reading. It's probably the best explanation of how FB works behind the scenes that you can find.
The video and article tell you a few things:
Let's take a look at this, friend connections are top left:
Well, this is a graph. :) It doesn't tell you how to build it in SQL. There are several ways to do it, but this site has a good number of different approaches. Attention: a relational DB is what it is. It is designed to store normalised data, not a graph structure, so it won't perform as well as a specialised graph database.
Also consider that you have to do more complex queries than just friends of friends, for example when you want to filter all locations around a given coordinate that you and your friends of friends like. A graph is the perfect solution here.
I can't tell you how to build it so that it will perform well but it clearly requires some trial and error and benchmarking.
Here is my disappointing test for just finding friends of friends:
DB Schema:
Friends of Friends Query:
I really recommend that you create some sample data with at least 10k user records, each having at least 250 friend connections, and then run this query. On my machine (i7 4770k, SSD, 16 GB RAM) the result was ~0.18 seconds for that query. Maybe it can be optimized; I'm not a DB genius (suggestions are welcome). However, if this scales linearly, you're already at 1.8 seconds for just 100k users and 18 seconds for 1 million users.
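For anyone who wants to try a test like this, here is a minimal sketch using Python's built-in sqlite3 module. The `friendships` table, its columns, and the query are assumptions for illustration. They are not the schema or query that produced the ~0.18 seconds figure, and the six sample friendships are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friendships (
    user_id   INTEGER NOT NULL,
    friend_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, friend_id)
);
""")


def add_friendship(conn, a, b):
    # Assumption: both directions are stored, so the self-join only looks one way.
    conn.executemany("INSERT INTO friendships VALUES (?, ?)", [(a, b), (b, a)])


for a, b in [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5), (2, 3)]:
    add_friendship(conn, a, b)


def friends_of_friends(conn, user_id):
    """IDs two hops away, excluding the user and their direct friends."""
    rows = conn.execute(
        """
        SELECT DISTINCT f2.friend_id
        FROM friendships f1
        JOIN friendships f2 ON f2.user_id = f1.friend_id
        WHERE f1.user_id = ?
          AND f2.friend_id <> ?
          AND f2.friend_id NOT IN (
              SELECT friend_id FROM friendships WHERE user_id = ?
          )
        ORDER BY f2.friend_id
        """,
        (user_id, user_id, user_id),
    ).fetchall()
    return [fid for (fid,) in rows]
```

To benchmark, you would generate the 10k users and 250 connections each with a loop around `add_friendship` and time `friends_of_friends` for a few user IDs.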
This might still sound OK-ish for ~100k users, but consider that you just fetched friends of friends and didn't run any more complex query like "display only posts from friends of friends + do the permission check for whether I'm allowed or NOT allowed to see some of them + do a subquery to check if I liked any of them". You want to let the DB check whether you already liked a post, or you'll have to do it in code. Also consider that this is not the only query you run and that you have more than one active user at the same time on a more or less popular site.
I think my answer covers the question of how Facebook designed their friends relationship very well, but I'm sorry that I can't tell you how to implement it in a way that will work fast. Implementing a social network is easy, but making sure it performs well is clearly not - IMHO.
I've started experimenting with OrientDB to do the graph-queries and mapping my edges to the underlying SQL DB. If I ever get it done I'll write an article about it.
How can I create a well performing social network site?
Update 2021-04-10: I'll probably never write the article ;) but here are a few bullet points on how you could try to scale it:
If you combine the points from the above list in a smart way, you can build a very well performing system. The list is not a "todo" list; you'll still have to understand, think, and adapt it! https://microservices.io/ is a nice site that covers a few of the topics I mentioned before.
What I do is store events that are generated by aggregates and use projections and handlers to write to different DBs as mentioned above. The cool thing about this is that I can rebuild my data as needed at any time.
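A minimal sketch of that idea in plain Python, to show why rebuilding is cheap once the events are the source of truth. The event classes (`FriendAdded`, `FriendRemoved`), the `FriendListProjection` read model, and the in-memory event log are all hypothetical names made up for this illustration; a real system would persist the log and write the projection to a separate store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FriendAdded:
    user_id: int
    friend_id: int


@dataclass(frozen=True)
class FriendRemoved:
    user_id: int
    friend_id: int


class FriendListProjection:
    """Read model: user_id -> set of friend IDs, derived purely from events."""

    def __init__(self):
        self.friends = {}

    def handle(self, event):
        a, b = event.user_id, event.friend_id
        if isinstance(event, FriendAdded):
            self.friends.setdefault(a, set()).add(b)
            self.friends.setdefault(b, set()).add(a)
        elif isinstance(event, FriendRemoved):
            self.friends.get(a, set()).discard(b)
            self.friends.get(b, set()).discard(a)


def rebuild(events):
    """Replay the full event log into a fresh projection."""
    projection = FriendListProjection()
    for event in events:
        projection.handle(event)
    return projection


event_log = [
    FriendAdded(1, 2),
    FriendAdded(1, 3),
    FriendRemoved(1, 2),
    FriendAdded(2, 3),
]
projection = rebuild(event_log)
```

Because the projection is derived only from `event_log`, it can be thrown away and replayed into a differently shaped store whenever the read side needs to change.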
Have a look at the following database schema, reverse engineered by Anatoly Lubarsky:
My best bet is that they created a graph structure. The nodes are users and "friendships" are edges.
Keep one table of users, keep another table of edges. Then you can keep data about the edges, like "day they became friends" and "approved status," etc.
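A minimal sketch of a users table plus an edges table that carries data about each friendship, using Python's built-in sqlite3 module. The column names (`from_user`, `to_user`, `status`, `created_on`), the two status values, and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE edges (
    from_user  INTEGER NOT NULL REFERENCES users(user_id),
    to_user    INTEGER NOT NULL REFERENCES users(user_id),
    status     TEXT NOT NULL CHECK (status IN ('pending', 'approved')),
    created_on TEXT NOT NULL,  -- day they became friends, or day of the request
    PRIMARY KEY (from_user, to_user)
);
INSERT INTO users VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cat');
INSERT INTO edges VALUES
    (1, 2, 'approved', '2009-06-01'),
    (1, 3, 'pending',  '2009-06-17');
""")


def approved_friend_ids(conn, user_id):
    """IDs at the other end of every approved edge touching user_id."""
    rows = conn.execute(
        """
        SELECT CASE WHEN from_user = ? THEN to_user ELSE from_user END
        FROM edges
        WHERE status = 'approved' AND ? IN (from_user, to_user)
        ORDER BY 1
        """,
        (user_id, user_id),
    ).fetchall()
    return [fid for (fid,) in rows]
```

Keeping the status on the edge means a pending request and an accepted friendship are the same row at different stages, so approving is a single `UPDATE`.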
It's most likely a many-to-many relationship:
FriendList (table)
EDIT
The user table probably doesn't have user_email as a PK, possibly as a unique key though.
users (table)
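A minimal sketch of the point about keys, using Python's built-in sqlite3 module: `user_id` is the primary key, and `user_email` is only a unique key. The column list follows the question; the `register` helper and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    user_email TEXT NOT NULL UNIQUE,  -- unique key, not the primary key
    password   TEXT NOT NULL
);
""")


def register(conn, email, password_hash):
    """Insert a user; return the new user_id, or None if the email already exists."""
    try:
        cur = conn.execute(
            "INSERT INTO users (user_email, password) VALUES (?, ?)",
            (email, password_hash),
        )
    except sqlite3.IntegrityError:
        return None
    return cur.lastrowid


first = register(conn, "bob@example.com", "hash-1")
duplicate = register(conn, "bob@example.com", "hash-2")
```

With this split, a user can change their email address without touching any row in the friend list, because those rows reference `user_id`.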
Take a look at these articles describing how LinkedIn and Digg are built:
There's also "Big Data: Viewpoints from the Facebook Data Team" that might be helpful:
http://developer.yahoo.net/blogs/theater/archives/2008/01/nextyahoonet_big_data_viewpoints_from_the_fac.html
Also, there's this article that talks about non-relational databases and how they're used by some companies:
http://www.readwriteweb.com/archives/is_the_relational_database_doomed.php
You'll see that these companies are dealing with data warehouses, partitioned databases, data caching and other higher-level concepts that most of us never deal with on a daily basis. Or at least, maybe we don't know that we do.
There are a lot of links on the first two articles that should give you some more insight.
UPDATE 10/20/2014
Murat Demirbas wrote a summary on TAO:
http://muratbuffalo.blogspot.com/2014/10/facebooks-software-architecture.html
HTH
It's not possible to retrieve user friends data from an RDBMS in constant time once the data crosses half a billion records,
so Facebook implemented this using a hash database (NoSQL), and they open-sourced the database, called Cassandra.
So every user has their own key and the friends' details in a queue; to learn how Cassandra works, look at this:
http://prasath.posterous.com/cassandra-55
It's a type of graph database:
http://components.neo4j.org/neo4j-examples/1.2-SNAPSHOT/social-network.html
It's not related to relational databases.
Google for graph databases.
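To give a feel for what "graph" means here, a minimal in-memory sketch in plain Python: users are nodes, friendships are edges, and a breadth-first search answers "who is within N hops". This is not Neo4j's API or any real graph database, and the user names and the `within_hops` helper are made up for illustration.

```python
from collections import deque

# Adjacency sets: each user maps to the set of their direct friends.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable user to their hop distance from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_hops:
            continue  # do not expand past the hop limit
        for friend in graph.get(user, ()):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    del distance[start]
    return distance


reachable = within_hops(graph, "alice", 2)
friends_of_friends = sorted(u for u, d in reachable.items() if d == 2)
```

The traversal only touches the neighbours it actually visits, which is the property graph databases are built around.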
You're looking for foreign keys. Basically, you can't have an array in a database unless it has its own table.
Example schema:
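A minimal sketch of such a schema, using Python's built-in sqlite3 module, in which the "array" of friends becomes rows in its own table with foreign keys back to the users. The table names (`users`, `user_friends`) and the `try_add_friend` helper are assumptions for illustration. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE user_friends (
    user_id   INTEGER NOT NULL REFERENCES users(user_id),
    friend_id INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (user_id, friend_id)
);
INSERT INTO users VALUES (1, 'Ann'), (2, 'Ben');
INSERT INTO user_friends VALUES (1, 2);
""")


def try_add_friend(conn, user_id, friend_id):
    """Insert one friend row; return False if a key constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO user_friends VALUES (?, ?)", (user_id, friend_id)
            )
    except sqlite3.IntegrityError:
        return False
    return True


ok = try_add_friend(conn, 2, 1)         # both users exist
rejected = try_add_friend(conn, 1, 99)  # user 99 does not exist
```

One row per friend means the list can grow without ever adding a `friend_id_N` column.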
There is probably a table that stores the friend <-> user relation, say "frnd_list", with fields 'user_id' and 'frnd_id'.
Whenever a user adds another user as a friend, two new rows are created.
For instance, suppose my id is 'deep9c' and I add a user having id 'akash3b' as my friend, then two new rows are created in table "frnd_list" with values ('deep9c','akash3b') and ('akash3b','deep9c').
Now, when showing the friends list to a particular user, a simple SQL query would do: "select frnd_id from frnd_list where user_id = ?"
where ? is the id of the logged-in user (stored as a session attribute).
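A minimal sketch of the two-row approach, using Python's built-in sqlite3 module. The table and column names come from the description above; the helper functions `add_friend` and `friends_list` and the composite primary key are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE frnd_list (
    user_id TEXT NOT NULL,
    frnd_id TEXT NOT NULL,
    PRIMARY KEY (user_id, frnd_id)
)
""")


def add_friend(conn, user_id, frnd_id):
    # Both rows go into one transaction so the relation never ends up one-sided.
    with conn:
        conn.executemany(
            "INSERT INTO frnd_list (user_id, frnd_id) VALUES (?, ?)",
            [(user_id, frnd_id), (frnd_id, user_id)],
        )


def friends_list(conn, user_id):
    # The ID is passed as a bound parameter, never concatenated into the SQL.
    rows = conn.execute(
        "SELECT frnd_id FROM frnd_list WHERE user_id = ? ORDER BY frnd_id",
        (user_id,),
    ).fetchall()
    return [frnd_id for (frnd_id,) in rows]


add_friend(conn, "deep9c", "akash3b")
```

Storing both directions doubles the row count but keeps the lookup a single indexed equality match on `user_id`.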
Regarding the performance of a many-to-many table: if you have two 32-bit ints linking user IDs, your basic data storage for 200,000,000 users averaging 200 friends apiece is just under 300GB.
Obviously, you would need some partitioning and indexing and you're not going to keep that in memory for all users.
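The arithmetic behind that estimate can be checked in a few lines of Python. This assumes one row per (user, friend) direction and counts only the two integers, with no index, row header, or padding overhead; the constant names are made up.

```python
USERS = 200_000_000
AVG_FRIENDS = 200
BYTES_PER_ROW = 2 * 4  # two 32-bit integers

rows = USERS * AVG_FRIENDS          # one row per (user, friend) direction
total_bytes = rows * BYTES_PER_ROW
total_gb = total_bytes / 10**9      # decimal gigabytes
total_gib = total_bytes / 2**30     # binary gibibytes

print(f"{rows:,} rows, {total_gb:.0f} GB, {total_gib:.1f} GiB")
```

So the raw pairs come to 320 GB in decimal units, which is just under 300 when counted in binary gigabytes.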