Graph data structure with millions of nodes (social network)
In the context of designing a social network using a graph data structure, where you can perform a BFS to find a connection from one person to another, I have some questions about it. If there are millions of users, the topology will be much more complicated and interconnected than the graphs we normally design, and I am trying to understand how you would solve these problems (a minimal sketch of the kind of BFS I mean follows the list):

- In the real world, servers fail. How does this affect you?
- How could you take advantage of caching?
- Do you search until the end of the graph (is it infinite)? How do you decide when to give up?
- In real life, some people have more friends of friends than others, and are therefore more likely to provide a path between you and someone else. How could you use this data to pick where you start the traversal?
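For reference, this is roughly the BFS I have in mind: a minimal sketch that finds a chain of people between two users, assuming the friend graph is reachable through some lookup function (`friends_of` is just a placeholder name, not any real API):

```python
from collections import deque

def bfs_path(start, goal, friends_of):
    """Return a chain of people linking start to goal, or None.

    friends_of(person) -> iterable of that person's direct friends
    (a placeholder for however the adjacency data is actually fetched).
    """
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in friends_of(person):
            if friend in parent:
                continue            # already visited
            parent[friend] = person
            if friend == goal:      # reconstruct the path and stop
                path = [friend]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(friend)
    return None                     # no connection found
```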
Comments (1)
Your question seems interesting and curious :)

1) Well... of course, the data is stored on disk, not in RAM.
Disks sit behind systems designed to survive failure, RAID-5 for example.
Redundancy is the key: if one system fails, another system is ready to take its place.
There is also redundancy combined with workload sharing: two computers work in parallel and split the load, but if one stops, the remaining one takes on the full workload by itself.
In places like Google or Facebook the redundancy factor is not 2, it is more like 1200000000 :)
Also consider that the data does not live in a single server farm; Google, for example, has several datacenters connected together, so if one building explodes, another one takes its place.
2) Not an easy question at all, but these systems usually have big caches in front of the disk arrays too, so reading and writing data on disk is faster than it is on our laptops :)
Data can also be processed in parallel by several concurrent systems, and this is key to the speed of services like Facebook.
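As a rough illustration of the caching idea (only a sketch: an in-process LRU cache in front of a slow storage read; `load_friends_from_storage` and the fake data are made up for the example, real systems would use a distributed cache):

```python
import time
from functools import lru_cache

def load_friends_from_storage(person_id):
    # Placeholder for the slow path (disk array, SQL query, remote service, ...).
    time.sleep(0.01)                        # simulate I/O latency
    return [person_id + 1, person_id + 2]   # fake data for the sketch

@lru_cache(maxsize=1_000_000)               # keep the hottest friend lists in RAM
def cached_friends(person_id):
    return tuple(load_friends_from_storage(person_id))

# The first call hits "storage"; repeated calls for the same person are served from RAM.
cached_friends(42)
cached_friends(42)
```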
3) The graph is not infinite, so exploring it is indeed possible with current technology.
The computational complexity of exploring all connections and all nodes of a graph is O(n + m), where n is the number of vertices and m is the number of edges.
This means the cost is linear in the number of registered users and in the number of connections between users. And RAM these days is very cheap.
Because the growth is linear, it is easy to add resources when needed: as you get richer, you just add more computers :)
Also consider that nobody performs a real search over every node; everything on Facebook is quite "local". You look at a person's direct friends, not the friends of the friends of their friends... that would not be useful.
Getting the vertices directly connected to a given vertex is very easy and fast if the data structure is well designed. In SQL it would be a simple SELECT, and if the tables are well indexed it will be very fast and not very dependent on the total number of users (see the concept of hash tables).
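To make the "local" point concrete, one common way to decide when to give up is a depth-limited BFS: stop expanding once you are a few hops away from the start. A small sketch, again assuming a placeholder `friends_of` lookup (the SQL table and column names in the comment are made up for illustration):

```python
from collections import deque

def bfs_within_depth(start, goal, friends_of, max_depth=3):
    """Breadth-first search that gives up beyond max_depth hops.

    friends_of(person) -> iterable of direct friends; in practice this could
    be an indexed lookup such as
    SELECT friend_id FROM friendships WHERE user_id = ?  (hypothetical schema).
    Returns the distance in hops if goal is reachable within the limit, else None.
    """
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if person == goal:
            return depth
        if depth == max_depth:      # this is where we "give up"
            continue
        for friend in friends_of(person):
            if friend not in visited:
                visited.add(friend)
                frontier.append((friend, depth + 1))
    return None
```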