Finding the shortest path between two web pages

Posted 2024-08-14 11:01:43

I need to find the shortest distance between two Wikipedia pages, measured in "hops".

I have a method that extracts all the internal wiki links on a page.

I know the start page and the end page, but I'm coming up blank on how to extract the hops from the data.

So far I've been using the link-extraction method to populate a dictionary, with the key being a link on the page and the value being the page it was taken from.

If anyone has any ideas on what a good data structure would be to hold the info, and then how to search through it, I would appreciate it very much.

Comments (5)

ペ泪落弦音 2024-08-21 11:01:43

Do you know anything about graph theory? You have the necessary data to build a graph, but you'll need to use Dijkstra's algorithm to traverse it to find the shortest path between your two points.
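
To make that concrete: the dictionary described in the question maps each link to the page it was taken from, which is the reverse of what a traversal needs. A minimal sketch of the usual alternative, an adjacency list keyed by page (ExtractLinks is a stand-in for the link-extraction method the asker already has):

using System.Collections.Generic;

// Adjacency list: page title -> the set of pages it links to.
var adjacency = new Dictionary<string, HashSet<string>>();

void AddPage(string page, IEnumerable<string> links)
{
    if (!adjacency.TryGetValue(page, out var set))
        adjacency[page] = set = new HashSet<string>();
    set.UnionWith(links);                      // record each outgoing link once
}

// Usage: AddPage("Start page", ExtractLinks("Start page"));

Any shortest-path traversal, Dijkstra or the plain breadth-first search suggested below, can then walk adjacency[page] directly.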

め七分饶幸 2024-08-21 11:01:43

Maybe it's a bit stupid since I'm not really a C# programmer, but a multidimensional array containing all the links would, depending on the depth of the dimensions, let you know which way contains fewer hops.

It's just a thought. While this is certainly doable in theory, since there's no language limit on the number of dimensions an array can have, I'm pretty sure it'll be really memory hungry!

Something like this:

[source] -> [source link] -> ['source link' link] -> etc
         -> [source link] -> ['source link' link] -> etc
         -> [source link] -> ['source link' link] -> etc
         -> [source link] -> ['source link' link] -> [target]
         -> [source link] -> ['source link' link] -> etc
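
The same idea fits more naturally as a list of levels than as a multidimensional array: level 0 holds the source, level n holds every page first reached after n hops, and the index of the first level containing the target is the hop count. A rough C# sketch, with PageLinks standing in for the asker's link extractor:

using System;
using System.Collections.Generic;

// Stand-in for the asker's link-extraction method.
IEnumerable<string> PageLinks(string page) => throw new NotImplementedException();

var levels = new List<HashSet<string>> { new HashSet<string> { "source" } };
var seen = new HashSet<string> { "source" };
while (!levels[levels.Count - 1].Contains("target"))
{
    var next = new HashSet<string>();
    foreach (var page in levels[levels.Count - 1])
        foreach (var link in PageLinks(page))
            if (seen.Add(link))            // keep only pages not seen at a shallower level
                next.Add(link);
    if (next.Count == 0) break;            // ran out of new pages: target unreachable
    levels.Add(next);
}
int hops = levels[levels.Count - 1].Contains("target")
    ? levels.Count - 1                     // depth of the level holding the target
    : -1;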

故人的歌 2024-08-21 11:01:43

I think the graph is sparse in this case, so it might be a good idea to use something like a HashSet for each Wikipedia page, holding the pages it links to.

You don't really need to implement Dijkstra's shortest-path algorithm in this case, since this is just the shortest-path problem with every edge weight equal to 1. You can do a breadth-first search and take the depth at which the destination page is found.
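
A queue-based version of that breadth-first search, carrying the depth along with each page, might look like this sketch (the pageLinks delegate stands in for the asker's link extractor):

using System;
using System.Collections.Generic;

// Returns the number of hops from start to target, or -1 if no path exists.
int ShortestHops(string start, string target, Func<string, IEnumerable<string>> pageLinks)
{
    var visited = new HashSet<string> { start };
    var queue = new Queue<(string page, int depth)>();
    queue.Enqueue((start, 0));
    while (queue.Count > 0)
    {
        var (page, depth) = queue.Dequeue();
        if (page == target) return depth;      // first visit is the shortest, by BFS order
        foreach (var link in pageLinks(page))
            if (visited.Add(link))             // Add returns false for pages already seen
                queue.Enqueue((link, depth + 1));
    }
    return -1;                                 // start and target aren't connected
}

Each page is enqueued at most once, so the search touches every page and link only once, which matters on a graph the size of Wikipedia.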

梅窗月明清似水 2024-08-21 11:01:43

Here's an implementation of Dijkstra's algorithm in Python: http://code.activestate.com/recipes/119466/
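
For anyone who'd rather stay in C#, a rough equivalent using .NET's PriorityQueue<TElement, TPriority> (available since .NET 6) could look like the sketch below; as the answer above notes, with every edge weighing 1, a plain breadth-first search gives the same result:

using System.Collections.Generic;

// Dijkstra over an adjacency list with unit edge weights (a sketch;
// graph maps each page to the pages it links to).
Dictionary<string, int> Dijkstra(Dictionary<string, List<string>> graph, string source)
{
    var dist = new Dictionary<string, int> { [source] = 0 };
    var pq = new PriorityQueue<string, int>();
    pq.Enqueue(source, 0);
    while (pq.TryDequeue(out var page, out var d))
    {
        if (d > dist[page]) continue;                    // stale entry, already improved
        if (!graph.TryGetValue(page, out var links)) continue;
        foreach (var next in links)
        {
            if (!dist.TryGetValue(next, out var old) || d + 1 < old)
            {
                dist[next] = d + 1;                      // every hop costs exactly 1
                pq.Enqueue(next, d + 1);
            }
        }
    }
    return dist;                                         // shortest hops to every reachable page
}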

夜空下最亮的亮点 2024-08-21 11:01:43

Assuming you have an IEnumerable<Link> PageLinks(Link link) method, the number of hops can be found as follows:

int Hops(Link currentPage, Link destinationPage)
{
    if (currentPage == destinationPage) return 0;
    int hops = 1;
    var visited = new HashSet<Link> { currentPage };           // pages already expanded
    List<Link> currentLinks = PageLinks(currentPage).ToList(); // current search frontier
    while (!currentLinks.Contains(destinationPage))
    {
        visited.UnionWith(currentLinks);                       // mark frontier before expanding it
        currentLinks = currentLinks
            .SelectMany(l => PageLinks(l))
            .Where(f => !visited.Contains(f))                  // skip pages already seen
            .Distinct()
            .ToList();                                         // materialize: no lazy re-enumeration
        if (currentLinks.Count == 0) return -1;                // pages aren't connected
        hops++;
    }
    return hops;
}

// e.g. int hops = Hops("somepage", "otherpage");

Edited to make it faster in the presence of cycles, although the algorithm would have worked without that. It could have run until StackOverflow or so if the pages weren't linked; the empty-frontier check above returns -1 instead.
