Graph database: counting direct relationships
I'm trying to graph the linking structure of a web site so I can model how pages on a given domain link to each other. Note that I'm not graphing links to sites outside the root domain.
Obviously this graph could be considerable in size. One of the main queries I want to perform is to count how many pages link directly to a given URL. I want to run this against the whole graph (shudder) so that I end up with a list of URLs and the count of incoming links for each one.
I know one popular way of doing this would be via some kind of MapReduce job - and I may still end up going that way - however I have a requirement to be able to view this report in (near) real time, which generally isn't MapReduce friendly.
I've had a quick look at Neo4j and OrientDB. While both of these could model the relationship I want, it's not clear whether I could query them to generate the report I want. At this point I'm not committed to any particular technology.
Any help would be greatly appreciated.
Thanks,
Paul
Comments (4)
Both OrientDB and Neo4j support Blueprints as a common API for graph operations such as traversal, counting, etc.
If I've understood your use case correctly, your graph seems pretty simple: you have a "URL" vertex type, and URL vertices link to each other through a single edge type, "Links".
To run operations against the graph, take a look at Gremlin.
You might have a look at structr. It is an open source CMS running on top of Neo4j and has exactly those types of inter-page links.
To get the number of links pointing to a page, you just have to iterate over the incoming LINKS_TO relationships of that page node.
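As a rough illustration of that per-node counting, here is a minimal in-memory sketch in plain Python. It does not use Neo4j, structr or any graph database API. The function name `incoming_link_counts` and the sample paths are made up for this example, and it assumes the crawl can be expressed as (source, target) URL pairs.

```python
from collections import Counter
from typing import Iterable, Tuple


def incoming_link_counts(links: Iterable[Tuple[str, str]]) -> Counter:
    """Count how many distinct pages link directly to each URL.

    `links` is an iterable of (source_url, target_url) pairs. This is a
    hypothetical stand-in for iterating LINKS_TO relationships in a graph store.
    """
    counts: Counter = Counter()
    # set() collapses repeated links between the same pair of pages,
    # so each linking page is counted once per target.
    for source, target in set(links):
        counts[source] += 0  # make sure pages with no inbound links still appear
        counts[target] += 1
    return counts


if __name__ == "__main__":
    sample_links = [
        ("/", "/about"),
        ("/", "/blog"),
        ("/blog", "/about"),
        ("/blog", "/about"),  # duplicate link, counted once
        ("/about", "/"),
    ]
    for url, count in sorted(incoming_link_counts(sample_links).items()):
        print(url, count)
```

In a real graph database the same idea would be a degree lookup or an aggregate over incoming relationships rather than a pass over an edge list.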
What is the use case for your query? A popular-pages list, so that it would only contain the top n pages? In that case you might start at random places in the graph, traverse the incoming LINKS_TO relationships of your current node(s) in parallel, and put the nodes into a sorted structure, so that you always start or continue with the first 20 or so page nodes that already have the highest number of incoming links (until they're finished).
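If the report really is a top-n list, the "sorted structure" part can be sketched on its own. This is a deliberately simpler technique than the parallel traversal described above: it assumes the inbound counts are already available as a mapping from URL to count, and uses the standard library's `heapq.nlargest`. The function name `top_linked_pages` and the sample data are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def top_linked_pages(inbound_counts: Dict[str, int], n: int = 20) -> List[Tuple[str, int]]:
    """Return up to `n` (url, count) pairs with the highest inbound link counts.

    `inbound_counts` is assumed to be precomputed; this sketch does not
    traverse a graph itself.
    """
    return heapq.nlargest(n, inbound_counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    counts = {"/": 1, "/about": 5, "/blog": 3, "/contact": 0}
    for url, count in top_linked_pages(counts, n=2):
        print(url, count)
```

`heapq.nlargest` only keeps n candidates at a time, so it avoids sorting the full list of pages when n is small relative to the site.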
Marko Rodriguez has some similar "page-rank" examples in the Gremlin documentation. He's also got several blog posts where he talks about this.
Well, with Neo4j you won't be able to split the graph across servers to distribute the load. You could replicate the database to distribute the computation, but then updating will be slow (as you have to replicate the updates). I would attack the problem by keeping a count of inbound links as a property of each node and updating it as new relationships are added; Neo4j has excellent write performance. Of course, you don't strictly need to persist this information, because direct relationships are cheap to retrieve (you don't get a collection of all related nodes, just an iterator).
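The "update a count as relationships are added" idea can be sketched without any database at all. The class below is an in-memory illustration only, not Neo4j's API. The names `LinkGraph`, `add_link` and `inbound_count` are invented for this example, and it assumes a link between the same two pages should be counted once.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Set


class LinkGraph:
    """Hypothetical in-memory link graph that maintains inbound counts on write."""

    def __init__(self) -> None:
        # source URL -> set of target URLs it links to
        self._outgoing: DefaultDict[str, Set[str]] = defaultdict(set)
        # URL -> number of distinct pages linking to it
        self.inbound_count: Dict[str, int] = {}

    def add_link(self, source: str, target: str) -> bool:
        """Record a link; return False if it was already known."""
        self.inbound_count.setdefault(source, 0)
        self.inbound_count.setdefault(target, 0)
        if target in self._outgoing[source]:
            return False  # duplicate link, counter left unchanged
        self._outgoing[source].add(target)
        self.inbound_count[target] += 1  # counter updated at write time
        return True


if __name__ == "__main__":
    graph = LinkGraph()
    graph.add_link("/", "/about")
    graph.add_link("/blog", "/about")
    graph.add_link("/blog", "/about")  # ignored as a duplicate
    print(graph.inbound_count)
```

With the counter kept up to date on every write, the report becomes a read of one stored number per page rather than a traversal, which is what makes the near-real-time view plausible. The trade-off is a little extra work on each insert.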
You should also take a look at a highly scalable graph database product, such as InfiniteGraph. If you email their technical support I think they will be able to point you at some sample code that does a large part of what you've described here.