Interpreting relative paths in URLs
I'm writing a web crawler in Python that takes a URL and does a depth-first search, following links down to some limited depth. The problem I'm having is interpreting relative paths in URLs.
On the page http://learnyouahaskell.com/introduction/ have a look at the "Starting Out" link; it looks like <a href="starting-out" class="nxtlink">Starting Out</a>. How can I determine whether this link refers to "http://learnyouahaskell.com/introduction/starting-out" or "http://learnyouahaskell.com/starting-out"? The second one is correct according to my browser.
Yet on the page http://math.colgate.edu/~mionescu/math399s11/ there is a link <a href="Finalprojects.pdf">here</a> which resolves to "http://math.colgate.edu/~mionescu/math399s11/Finalprojects.pdf".
Can someone explain this inconsistency to me? How can I determine how these paths should be resolved in my crawler?
Comments (1)
The reason for this apparent inconsistency is that the learnyouahaskell site uses a <base href=""> tag in its source. This tells the browser to resolve every relative href (one with no scheme or domain) against the base URL rather than against the URL of the page itself. Without the base tag, the link would have resolved as you expected (the first URL you posted) and behaved just like the math.colgate.edu link.
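For the crawler, one possible approach is to look for a <base> tag while parsing each page and resolve every href against it, falling back to the page's own URL when no base is present. The following is only a minimal sketch in Python 3 using the standard library's html.parser and urllib.parse.urljoin. The names LinkExtractor and resolve_links are made up for illustration, and the base value "http://learnyouahaskell.com/" in the usage example is an assumption chosen to match the behaviour the browser showed, not something copied from the site's source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects <a href> values and the first <base href>."""

    def __init__(self):
        super().__init__()
        self.base_href = None  # value of the first <base href="...">, if any
        self.hrefs = []        # raw href values of <a> tags, in document order

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def resolve_links(page_url, html):
    """Return absolute URLs for every <a href> in html, honouring <base>."""
    parser = LinkExtractor()
    parser.feed(html)
    # The base href may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base, href) for href in parser.hrefs]


if __name__ == "__main__":
    # Assumed base value; the real page's <base> tag may differ.
    with_base = (
        '<head><base href="http://learnyouahaskell.com/"></head>'
        '<a href="starting-out" class="nxtlink">Starting Out</a>'
    )
    print(resolve_links("http://learnyouahaskell.com/introduction/", with_base))

    without_base = '<a href="Finalprojects.pdf">here</a>'
    print(resolve_links("http://math.colgate.edu/~mionescu/math399s11/", without_base))
```

Note that urljoin treats the trailing slash as significant: a page URL ending in "/introduction/" keeps that directory when resolving "starting-out", while one ending in "/introduction" does not.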