Website hierarchy
I'm not sure whether this question has a single, concise answer, but I thought I would ask nonetheless. The problem isn't language-specific either, though the answer may take the form of some sort of pseudo-algorithm.
Basically, I'm trying to learn how spiders work, and from what I can tell, no spider I've found manages hierarchy. They just list the content or the links, with no ordering.
My question is this: when we look at a site, we can easily determine visually which links are navigational, content-related, or external to the site. How could we automate this? How could we programmatically help a spider determine parent and child pages?
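For the internal-vs-external part at least, a simple host comparison gets most of the way there. Here's a rough Ruby sketch (assuming the nokogiri gem for HTML parsing; classify_links is a made-up name, not anything standard):

```ruby
require 'uri'
require 'nokogiri' # assumption: the nokogiri gem is available

# Classify every <a href> on a page as internal or external by
# comparing its host with the host of the page it was found on.
def classify_links(page_url, html)
  base_host = URI(page_url).host
  Nokogiri::HTML(html).css('a[href]').map do |a|
    begin
      target = URI.join(page_url, a['href'])
      { href: target.to_s,
        kind: target.host == base_host ? :internal : :external }
    rescue URI::InvalidURIError
      nil # skip hrefs that aren't valid URIs
    end
  end.compact
end
```

Telling navigational links apart from content links is the harder part, which is what the two ideas below try to address.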
Of course, the first answer would be to use the URL's directory structure. E.g., in www.stackoverflow.com/questions/spiders, spiders is a child of questions, questions is a child of the base site, and so on. But nowadays hierarchy is usually flat, with IDs referenced in the URL.
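For what it's worth, the directory idea itself is easy to code. A rough Ruby sketch (parent_url is a made-up helper, and as noted above it breaks down for flat, ID-based URLs):

```ruby
require 'uri'

# Derive a parent URL from the path segments, treating each
# directory level as one level of hierarchy. Returns nil at the root.
def parent_url(url)
  uri = URI(url)
  segments = uri.path.split('/').reject(&:empty?)
  return nil if segments.empty?
  uri.dup.tap { |u| u.path = '/' + segments[0..-2].join('/') }.to_s
end

parent_url('http://www.stackoverflow.com/questions/spiders')
# => "http://www.stackoverflow.com/questions"
```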
So far I have two candidate answers to this question and would love some feedback.
1: Occurrence.
The links that occur most often across all pages would be dubbed navigational. This seems like the most promising design; I can see issues popping up with dynamic links and a few other cases, but they seem minor.
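Something like this is what I have in mind (a rough Ruby sketch; navigational_links and the 80% threshold are just placeholder choices, and it assumes the pages have already been fetched):

```ruby
# Count how many of the crawled pages each link appears on; links that
# show up on nearly every page are likely site-wide navigation.
# `pages` maps each page URL to the array of absolute link URLs on it.
def navigational_links(pages, threshold: 0.8)
  counts = Hash.new(0)
  pages.each_value { |links| links.uniq.each { |l| counts[l] += 1 } }
  cutoff = pages.size * threshold
  counts.select { |_link, n| n >= cutoff }.keys
end
```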
2: Depth.
For example: how many times do I need to click through a site to reach a certain page? This seems doable, but if some information is advertised on the home page that actually lives at the bottom level, it would be classified as a top-level page or node.
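This one boils down to a breadth-first search over the link graph. A rough Ruby sketch (click_depths is a made-up name, and it assumes the link graph has already been crawled):

```ruby
# Breadth-first walk over an already-crawled link graph, recording the
# minimum number of clicks needed to reach each page from the root.
# `graph` maps each page URL to the array of URLs it links to.
def click_depths(graph, root)
  depths = { root => 0 }
  queue = [root]
  until queue.empty?
    page = queue.shift
    (graph[page] || []).each do |link|
      next if depths.key?(link) # keep the shallowest depth only
      depths[link] = depths[page] + 1
      queue << link
    end
  end
  depths
end
```

The home-page-promotion problem shows up here as a depth of 1 for the promoted pages; combining depth with the occurrence idea above might help filter those out.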
So, has anyone got any thoughts or constructive criticism on how to make a spider judge hierarchy from links?
(If anyone is really curious, the back-end part of the spider will most likely be Ruby on Rails.)
1 Answer
What is your goal? If you want to crawl a smaller number of websites and extract useful data for some kind of aggregator, it's best to build focused crawlers (write a crawler for each site).
If you want to crawl millions of pages... well, then you must be very familiar with some advanced concepts from AI.
You can start with this article: http://www-ai.ijs.si/SasoDzeroski/ECEMEAML04/presentations/076-Znidarsic.pdf