How do sites like LinkedIn efficiently display 1st/2nd/3rd-level relationships next to each person's name?
I recently botched a job interview by poorly answering a straightforward question: how do sites like LinkedIn efficiently show the relationship distance (1st/2nd/3rd) from you to every person displayed on a page (e.g. in people search results, list of people working in a company, etc.)?
<EDIT> I got the essential "trick" of the solution: finding "distance from me" is a common operation (e.g. 20+ times on a single page, hundreds of times per login session), so you can compute part of the "distance from me to X" once, cache it, and then re-use that cached partial result many times to make later lookups much cheaper. I also guessed that the partial result was likely to be my second-level connections, because caching all 3rd-level connections would be too costly in RAM and CPU.</EDIT>
But when trying to convert this insight into a solution, I came up with a bumbling answer involving creating persistent caches of 2nd-level connections of everyone on the site (which would have been hugely expensive in perf and complex to maintain), and I took an inexplicable detour into using Bloom filters in a way that made little technical sense. I wouldn't have hired myself after an answer like that!
Later, as I thought about the problem without the pressure of an interview hanging over my head, I came up with a more reasonable answer.
Build a very fast way to get the first-level connections for each of a batch of user IDs (batch size up to ~1000?). This probably means a dedicated cluster of lots-of-RAM servers which can cache the entire network's 1st-level connections in memory. Luckily, 50M members x avg. 100 connections per member x 4 bytes per member ID = <25GB to cache in RAM, which is doable with reasonably-priced hardware. And the number of changes per day is going to be under 1%, so keeping the cache up-to-date is not too hard. (Note that a relational database would probably be a bad choice to implement this cache because the "lots of random I/O" access pattern kills relational DB performance.)
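The memory estimate above can be checked with quick arithmetic. This is only a sketch using the figures quoted in the post (50M members, 100 connections on average, 4-byte IDs); it counts raw ID payload and ignores any per-entry overhead of a real cache.

```python
# Back-of-the-envelope estimate using the figures quoted in the post.
members = 50_000_000       # 50M members
avg_connections = 100      # average connections per member
bytes_per_id = 4           # 32-bit member ID

total_bytes = members * avg_connections * bytes_per_id
total_gb = total_bytes / 10**9   # decimal gigabytes

print(f"{total_gb:.0f} GB of raw IDs")   # prints "20 GB of raw IDs"
```

The raw payload comes to 20 GB, which is consistent with the "<25GB" bound.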
When a user logs in, cache his or her 2nd-level connections by fetching the 1st-level connections of each 1st-level connection, and store them in a hashtable (key = 2nd-level ID, value = array of 1st-level connections which connect you). Cache your first-level connections as well, so you can pull back both 1st- and 2nd-level connections via a single call to your remote cache server. User IDs are easily partitionable, so a distributed cache like memcached may work well for this.
For any user ID, to find whether it's in your "network" and what relationship it is to you (1st, 2nd, 3rd), do the following:
1. If the ID is in your first-level connections, stop: it is a 1st-level connection.
2. Otherwise, try looking up the ID in your cached 2nd-level connections hashtable. If found, return the array of connections which connect you.
3. Otherwise, fetch the ID's first-level connections and repeat step #2 for each of them. Aggregate all results into a single array and return them.
4. <EDIT> Refactor into a batch implementation ("look up distance from me to N different users") so you can get all the remote results from step #3 with a single batched call instead of up to N remote calls.</EDIT>
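The steps above can be sketched roughly as follows. This is a minimal illustration, not a production design: `FIRST_LEVEL` and `get_first_level_batch` are hypothetical stand-ins for the in-memory first-level service, and the lookup returns only the level (1, 2, 3, or `None`) rather than the connecting paths.

```python
from collections import defaultdict

# Hypothetical stand-in for the in-memory first-level connection service.
FIRST_LEVEL = {
    1: {2, 3},
    2: {1, 4},
    3: {1, 4},
    4: {2, 3, 5},
    5: {4, 6},
    6: {5},
}

def get_first_level_batch(user_ids):
    """Return {user_id: set of 1st-level connections} for a batch of IDs."""
    return {uid: FIRST_LEVEL.get(uid, set()) for uid in user_ids}

def build_session_cache(me):
    """At login: cache 1st-level IDs and a 2nd-level -> connectors table."""
    first = get_first_level_batch([me])[me]
    second = defaultdict(list)
    for friend, their_friends in get_first_level_batch(first).items():
        for candidate in their_friends:
            if candidate != me and candidate not in first:
                second[candidate].append(friend)
    return first, dict(second)

def distances(me, first, second, targets):
    """Batch lookup: map each target to 1, 2, 3, or None (outside network)."""
    result = {}
    unresolved = []
    for t in targets:
        if t == me:
            result[t] = 0
        elif t in first:
            result[t] = 1
        elif t in second:
            result[t] = 2
        else:
            unresolved.append(t)
    # One batched remote call covers every target not resolved from the cache.
    for t, their_friends in get_first_level_batch(unresolved).items():
        # A target is 3rd-level if any of its connections is my 2nd-level.
        result[t] = 3 if any(f in second for f in their_friends) else None
    return result

first, second = build_session_cache(1)
print(distances(1, first, second, [2, 4, 5, 6]))
```

In this toy graph, user 6 is four hops from user 1, so it falls outside the network and maps to `None`.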
But I'm sure there are better answers to this. What's yours? If you want an extra challenge, try simulating an interview situation (can't look up solutions on the Web).
Note that the question was about an optimal solution, regardless of how LinkedIn actually does it today, which I looked up after I wrote my own answer above.
Comments (6)
You may be able to leverage axioms about small world networks to optimize this type of traversal.
Small world networks are characterized by "hubs" that represent very dense interconnections of other nodes. Most nodes in the network will generally either connect within a few hops to a topologically nearby node (1-4 hops away) or will route through one or more such hubs. This is one of the main reasons that small world networks behave the way they do.
Interestingly, 1970s technology would do a fair job of modeling this. The Network Database Model efficiently manages this type of relationship.
It's not efficient in terms of ad hoc queries or data model maintenance, so it fell out of favor with the rise of relational data models.
If you think about it, doing this in SQL could be very processor intensive.
Given that, the fact that it will ultimately be used all over the place, and that space is relatively cheap, I would suggest creating an index using Lucene (or Lucene.NET), depending on your language preference. You could do a couple of things this way.
You can either create a tree-type data structure and recursively crawl your index, looking for all the parent nodes or child nodes and their parent or child nodes, depending on your needs at the time.
Or you could write out all the relationships as they are created (the "space is cheap" concept). This would be a write-once process (which you wouldn't be updating all that often anyway). When a relationship is created or revoked, you would queue an update to your index (queue because you wouldn't want to open the index for writing for single requests; batch the index updates instead). Then you could read this really flat structure to get the IDs in question.
With the IDs in hand (from whichever search type you perform), you can then go to the DB to get the surrounding required information. Then cache your output to avoid repeating even what would be a very fast search, DB query, and data-building step: it is faster still if it just comes from cache.
Use something like Velocity, MemCached, or MemCached Win32 for your centralized caching across a web farm.
I'm not sure of the table structure, or complexity of the system, but here is a simple SQL Server example using a recursive CTE:
OUTPUT:
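The answer's query and output are not reproduced in this copy. As a hedged stand-in (not the poster's code), here is a minimal sketch of the recursive-CTE idea. It assumes a hypothetical `connections(user_id, friend_id)` table that stores each connection in both directions, and it runs on SQLite through Python's `sqlite3` module rather than on SQL Server, whose CTE syntax differs slightly (for example, SQL Server omits the `RECURSIVE` keyword).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE connections (user_id INTEGER, friend_id INTEGER);
    INSERT INTO connections VALUES
        (1,2),(2,1),(1,3),(3,1),(2,4),(4,2),
        (3,4),(4,3),(4,5),(5,4),(5,6),(6,5);
""")

# Walk outward from :me for at most three hops, then keep the
# shortest depth at which each user was reached.
QUERY = """
WITH RECURSIVE reach(user_id, depth) AS (
    SELECT friend_id, 1 FROM connections WHERE user_id = :me
    UNION
    SELECT c.friend_id, r.depth + 1
    FROM reach AS r
    JOIN connections AS c ON c.user_id = r.user_id
    WHERE r.depth < 3
)
SELECT user_id, MIN(depth) AS distance
FROM reach
WHERE user_id <> :me
GROUP BY user_id
ORDER BY distance, user_id;
"""

rows = conn.execute(QUERY, {"me": 1}).fetchall()
for user_id, distance in rows:
    print(user_id, distance)
```

The `depth < 3` guard bounds the recursion, and `UNION` (rather than `UNION ALL`) discards repeated `(user_id, depth)` rows, so cycles in the graph do not cause unbounded work.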
Isn't LinkedIn's data represented as one giant graph? When a person logs in, the system would have a handle to that person's node; then, by doing a breadth-first traversal for 3 levels, the system would keep the reached nodes as a set (along with each node's level), and when a person appears on a web page, the system does a lookup in this node set and gives out the relationship distance.
This is my guess. Please feel free to point out what makes it impractical.
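The traversal described here could look roughly like the sketch below, which assumes the graph is available as a hypothetical adjacency dictionary `GRAPH`. One practical concern, raised in the question itself, is size: with about 100 connections per member, the third level can approach 100^3 = 1,000,000 nodes per user as a rough upper bound.

```python
from collections import deque

# Hypothetical adjacency map: user ID -> set of 1st-level connections.
GRAPH = {
    1: {2, 3},
    2: {1, 4},
    3: {1, 4},
    4: {2, 3, 5},
    5: {4, 6},
    6: {5},
}

def levels_within(graph, start, max_depth=3):
    """Breadth-first traversal returning {node: depth} for depth 1..max_depth."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the last level of interest
        for neighbour in graph.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    del depth[start]  # the starting user is not part of the result
    return depth

network = levels_within(GRAPH, 1)
print(network.get(5))   # relationship distance of user 5 from user 1
print(network.get(6))   # None: user 6 is more than three hops away
```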
To implement this:
Use the fact that connections are bidirectional.
Store 1st-level connections as a sorted list in some KV store:
Pseudocode:
Complexity: O(C1+C2), where C1 and C2 are the numbers of connections of the two users.
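The poster's pseudocode is not included in this copy. One way the idea might be realised (an assumption, not the original) is a two-pointer scan over the two sorted lists: because connections are bidirectional, two users are 2nd-level connections exactly when their 1st-level lists share an ID. The dictionary `store` below is a hypothetical stand-in for the KV store.

```python
from bisect import bisect_left

def contains(sorted_ids, x):
    """Binary search for x in a sorted list."""
    i = bisect_left(sorted_ids, x)
    return i < len(sorted_ids) and sorted_ids[i] == x

def have_common(a, b):
    """Two-pointer scan over two sorted lists; O(len(a) + len(b))."""
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            return True
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return False

def distance(store, u, v):
    """Return 1 or 2, or None when u and v are further apart than that."""
    if contains(store[u], v):
        return 1
    if have_common(store[u], store[v]):
        return 2
    return None

# Hypothetical KV store: user ID -> sorted list of 1st-level connections.
store = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4, 6], 6: [5]}
print(distance(store, 1, 4))
```

This sketch stops at the second level, which is the part the O(C1+C2) bound covers; deciding the third level would require fetching further lists.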