We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
The only automatic way to create a sitemap is to know the structure of your site and write a program that builds on that knowledge. Just crawling the links usually won't work, because links can run between any two pages, so what you get is a graph (i.e. connections between nodes). There is no way to convert a graph into a tree in the general case.
So you must identify the structure of your tree yourself and then crawl the relevant pages to get their titles.
As for "but it only works for 3 levels": three levels is more than enough. If you try to create more levels, your sitemap will become unusable (too big, too wide). No one will want to download a 1 MB sitemap and then scroll through 100,000 pages of links. If your site grows that big, then you must implement some kind of search.
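As a rough illustration of "identify the structure yourself", here is a minimal sketch that assumes the hierarchy is encoded in the URL path, which may not hold for your site. The names `build_tree`, `render`, and the example.com URLs are hypothetical, and fetching page titles is left out. Anything deeper than three levels is folded into its third-level ancestor.

```python
from urllib.parse import urlparse


def build_tree(urls, max_depth=3):
    """Nest URLs by path segment, ignoring segments deeper than max_depth."""
    tree = {}
    for url in urls:
        # Assumption: each path segment corresponds to one sitemap level.
        segments = [s for s in urlparse(url).path.split("/") if s]
        node = tree
        for segment in segments[:max_depth]:
            node = node.setdefault(segment, {})
    return tree


def render(tree, indent=0):
    """Return the tree as a list of indented lines, sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines


# Hypothetical page list; in practice this comes from your own knowledge of the site.
pages = [
    "http://example.com/products/widgets/blue",
    "http://example.com/products/widgets/blue/specs/details",
    "http://example.com/about",
]
sitemap = build_tree(pages)
print("\n".join(render(sitemap)))
```

The depth cap is what keeps the result a small tree rather than a dump of every page.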
Here is a Python web crawler, which should make a good starting point. Your general strategy is this:
The reason you need to do all this is, as leonm noted, that websites are graphs, not trees, and laying out graphs is a harder problem than you can solve with a simple piece of JavaScript and CSS. Graphviz is good at what it does.
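One possible way to hand a crawled link graph to Graphviz is to write it out as DOT text. This is only a sketch: the `links` mapping, the `to_dot` and `quote` helpers, and the page paths are made up for illustration, and the crawl that would produce the mapping is not shown.

```python
def quote(name):
    """Quote a node name for DOT, escaping embedded double quotes."""
    return '"' + name.replace('"', '\\"') + '"'


def to_dot(links):
    """Render a {page: [linked pages]} mapping as Graphviz DOT text."""
    lines = ["digraph site {"]
    for source in sorted(links):
        # set() drops repeated links between the same two pages.
        for target in sorted(set(links[source])):
            lines.append(f"    {quote(source)} -> {quote(target)};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical crawl result: each page mapped to the pages it links to.
links = {
    "/": ["/about", "/products"],
    "/products": ["/", "/products/widgets"],
    "/about": ["/"],
}
dot_text = to_dot(links)
print(dot_text)
```

Saving the output as `site.dot` and running `dot -Tsvg site.dot -o site.svg` lets Graphviz do the layout. Note the cycles in the sample data (`/` and `/about` link to each other), which is exactly why this is a graph and not a tree.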
Please see http://aaron.oirt.rutgers.edu/myapp/docs/W1100_2200.TreeView
for how to format tree views. You can probably also modify the example application
http://aaron.oirt.rutgers.edu/myapp/DirectoryTree/index to scrape your
pages if they are organized as directories of HTML files.
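If the pages really are organized as directories of HTML files, the tree can be read straight off the filesystem. The sketch below is not the linked example application, just a standard-library approximation of the same idea; `html_tree` and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path


def html_tree(root):
    """Return nested dicts mirroring the directories under root.

    Directories map to sub-dicts and HTML files map to None.
    Other files are skipped.
    """
    tree = {}
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            tree[entry.name] = html_tree(entry)
        elif entry.suffix.lower() in (".html", ".htm"):
            tree[entry.name] = None
    return tree


# Build a throwaway sample site so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp)
    (site / "docs").mkdir()
    (site / "index.html").write_text("<title>Home</title>")
    (site / "docs" / "intro.html").write_text("<title>Intro</title>")
    (site / "docs" / "notes.txt").write_text("not a page")
    sample_tree = html_tree(site)

print(sample_tree)
```

The nested dicts can then be fed to whatever tree view widget you use.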