How to generate a graphical sitemap of a large website

Posted 2024-08-09 20:25:12

Comments (3)

寄意 2024-08-16 20:25:12

The only automatic way to create a sitemap is to know the structure of your site and write a program that builds on that knowledge. Just crawling the links usually won't work: links can connect any two pages, so what you get is a graph (i.e. arbitrary connections between nodes). In the general case there is no unambiguous way to convert such a graph into a tree, because the same set of links admits many different trees and nothing in the links says which one reflects the site's real hierarchy.

So you must identify the structure of your tree yourself and then crawl the relevant pages to get their titles.

As for "but it only works for 3 levels": three levels is more than enough. If you try to create more levels, your sitemap will become unusable (too big, too wide). No one will want to download a 1 MB sitemap and then scroll through 100,000 pages of links. If your site grows that big, you must implement some kind of search instead.
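To make the idea above concrete, here is a minimal sketch. It rests on one assumption that is not stated in the answer: that the site's hierarchy happens to mirror its URL paths. The names `build_tree`, `render`, and the `example.com` pages are hypothetical, and the titles are hard-coded here where a real program would fetch them from the pages.

```python
from urllib.parse import urlparse


def build_tree(pages, max_depth=3):
    """Nest {url: title} pairs into a tree keyed by URL path segment.

    Assumption: the URL path mirrors the site's real hierarchy.
    """
    root = {"title": None, "children": {}}
    for url, title in pages.items():
        segments = [s for s in urlparse(url).path.split("/") if s]
        if len(segments) > max_depth:
            continue  # deeper pages are left out of the sitemap
        node = root
        for segment in segments:
            node = node["children"].setdefault(
                segment, {"title": None, "children": {}}
            )
        node["title"] = title
    return root


def render(node, depth=0):
    """Return the tree as indented text lines, children sorted by segment."""
    lines = []
    for segment, child in sorted(node["children"].items()):
        label = child["title"] or segment  # fall back to the path segment
        lines.append("  " * depth + label)
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical pages; in practice the titles would come from a crawl.
pages = {
    "https://example.com/products/": "Products",
    "https://example.com/products/widgets/": "Widgets",
    "https://example.com/products/widgets/blue/": "Blue widgets",
    "https://example.com/products/widgets/blue/specs/": "Specs",
    "https://example.com/about/": "About us",
}
print("\n".join(render(build_tree(pages))))
```

The fourth-level "Specs" page is dropped by the `max_depth=3` cutoff, which matches the three-level limit argued for above. If your site's structure lives somewhere else (a CMS database, a navigation config), that source would replace the URL-path assumption.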

何其悲哀 2024-08-16 20:25:12

Here is a Python web crawler, which should make a good starting point. Your general strategy is this:

  • Take care that outbound links are never followed, including links on the same domain but higher up than your starting point.
  • As you spider the site, collect a hash of page URLs, each mapped to a list of all the internal URLs included in that page.
  • Take a pass over this hash, assigning a token to each unique URL.
  • Use your hash of {token => [tokens]} to generate a Graphviz file that will lay out a graph for you.
  • Convert the Graphviz output into an imagemap where each node links to its corresponding web page.
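The first four steps can be sketched roughly as follows. This is not the crawler the answer links to; it is an illustration that assumes the crawl has already produced a dictionary of page URL to raw `href` values. The function names (`in_scope`, `link_graph`, `assign_tokens`, `to_dot`) and the `example.com` URLs are made up for the example, and URLs are written into the DOT text without escaping, which a real tool would need to handle.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def in_scope(url, start_url):
    """True if url is on the start host and at or below the start directory."""
    start, target = urlparse(start_url), urlparse(url)
    base = start.path.rsplit("/", 1)[0] + "/"  # directory of the start URL
    return target.netloc == start.netloc and (target.path or "/").startswith(base)


def link_graph(pages, start_url):
    """Map each page URL to the sorted in-scope URLs it links to.

    pages is {page_url: [raw href, ...]}, as a crawler might collect it.
    """
    graph = {}
    for page, hrefs in pages.items():
        targets = set()
        for href in hrefs:
            absolute = urldefrag(urljoin(page, href)).url  # resolve, drop #fragment
            if absolute != page and in_scope(absolute, start_url):
                targets.add(absolute)
        graph[page] = sorted(targets)
    return graph


def assign_tokens(graph):
    """Give every unique URL a short node name: n0, n1, ..."""
    urls = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    return {url: f"n{i}" for i, url in enumerate(urls)}


def to_dot(graph, tokens):
    """Emit Graphviz DOT text; each node carries a URL attribute for the imagemap."""
    lines = ["digraph sitemap {"]
    for url, token in tokens.items():
        lines.append(f'  {token} [label="{urlparse(url).path}", URL="{url}"];')
    for page, targets in graph.items():
        for target in targets:
            lines.append(f"  {tokens[page]} -> {tokens[target]};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical crawl result for a site rooted at /docs/.
START = "https://example.com/docs/"
pages = {
    "https://example.com/docs/": ["a.html", "/about", "https://other.example/"],
    "https://example.com/docs/a.html": ["./", "b.html#top"],
}
graph = link_graph(pages, START)
tokens = assign_tokens(graph)
print(to_dot(graph, tokens))
```

In this sample, `/about` (same domain, but above the starting point) and the link to `other.example` are both filtered out, so only the three pages under `/docs/` become nodes.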

The reason you need to do all this is, as leonm noted, that websites are graphs, not trees, and laying out graphs is a harder problem than you can solve in a simple piece of JavaScript and CSS. Graphviz is good at what it does.
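For the final step, Graphviz's `dot` tool can write both an image and a client-side image map (the `cmapx` format) from one DOT file, as long as the nodes carry a `URL` attribute. The sketch below assumes the `dot` executable is installed and on the `PATH`, and that the graph in the DOT file is named `sitemap`, since the generated `<map>` element takes its name from the graph. The helper names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def dot_command(dot_file, image_file, map_file):
    """Build a dot invocation that writes a PNG and a cmapx image map."""
    return ["dot", "-Tcmapx", f"-o{map_file}", "-Tpng", f"-o{image_file}", dot_file]


def imagemap_page(image_file, map_markup, map_name):
    """Wrap the image and the <map> markup from dot in a minimal HTML page."""
    return (
        "<!DOCTYPE html>\n<html><body>\n"
        f'<img src="{image_file}" usemap="#{map_name}" alt="Site map">\n'
        f"{map_markup}\n"
        "</body></html>\n"
    )


def build_page(dot_file, image_file, map_file, html_file, map_name="sitemap"):
    """Run dot, then combine its two outputs into one clickable page.

    Requires Graphviz to be installed; raises CalledProcessError if dot fails.
    """
    subprocess.run(dot_command(dot_file, image_file, map_file), check=True)
    markup = Path(map_file).read_text(encoding="utf-8")
    Path(html_file).write_text(
        imagemap_page(image_file, markup, map_name), encoding="utf-8"
    )


# Show the command without running it, so this works without Graphviz.
print(" ".join(dot_command("site.dot", "site.png", "site.map")))
```

Calling `build_page("site.dot", "site.png", "site.map", "sitemap.html")` would then produce a page where clicking a node opens the corresponding URL.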

莫言歌 2024-08-16 20:25:12

Please see http://aaron.oirt.rutgers.edu/myapp/docs/W1100_2200.TreeView
for how to format tree views. You can probably also modify the example application at
http://aaron.oirt.rutgers.edu/myapp/DirectoryTree/index to scrape your
pages if they are organized as directories of HTML files.
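If your pages really are laid out as directories of HTML files, the tree can be read straight off the file system. The following is a standard-library sketch of that idea, not the linked example application; `page_title` and `directory_tree` are hypothetical names, and the regular expression is a deliberately crude way to pull out a `<title>`.

```python
import re
import tempfile
from pathlib import Path

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(path):
    """Return the page's <title> text, or the file name if none is found."""
    match = TITLE_RE.search(path.read_text(encoding="utf-8", errors="replace"))
    return match.group(1).strip() if match else path.name


def directory_tree(root, depth=0):
    """List directories and HTML page titles under root as indented lines."""
    lines = []
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            lines.append("  " * depth + entry.name + "/")
            lines.extend(directory_tree(entry, depth + 1))
        elif entry.suffix.lower() in {".html", ".htm"}:
            lines.append("  " * depth + page_title(entry))
    return lines


# Build a tiny throwaway site so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    (site / "guide").mkdir()
    (site / "index.html").write_text("<title>Home</title>", encoding="utf-8")
    (site / "guide" / "intro.html").write_text("<title>Intro</title>", encoding="utf-8")
    demo_lines = directory_tree(site)
print("\n".join(demo_lines))
```

The indented lines can then be fed to whatever tree-view widget you choose to display them.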
