How to get the full URL from an <a href> or <frame src> attribute when crawling a page

I am actually using PHP, but this kind of crawling can be done in any programming language. It is a bit difficult to handle every situation. Please help me look through the problem and tell me whether I am heading in the right direction.

What I know is the current URL, from which I can get a list of links taken from <a href="..."> or <frame src="..."> attributes.

What I am doing is this: from the current URL, I first get the root URL. For example, from http://www.abc.com/def I get http://www.abc.com. This is to handle links such as <a href="/fff.html">, so I have to know the root URL first.

Secondly, I need to get the URL directory from the current URL. This is a little difficult, and I still have no idea how to do it perfectly. For example, for http://www.abc.com/def/xyz.htm, its URL directory is http://www.abc.com/def. This is to handle links such as <a href="../../xyz.html">.

The problem I am facing is how to get the current URL directory. For example, if the current URL is http://www.abc.com/def, how can I actually know whether def is a directory or a file? If def is a file, then the URL directory would be http://www.abc.com. But if def is a directory, then the URL directory would be http://www.abc.com/def.

You could say that if there is a "/" at the end, then it is a directory. But from my point of view, when I am crawling a webpage, I can't be sure that the site's author will add a "/" at the end of a directory URL. A directory URL without the slash is perfectly valid: for example, if def is a directory, then http://www.abc.com/def would probably stand for http://www.abc.com/def/index.html.

Since it is hard to know whether http://www.abc.com/def is a directory or a script file, it is hard to build a full URL from a relative href such as <a href="xyz.html">.

Am I overcomplicating the problem? Is there any solution to this?

There are other situations. For example, href="#..." means an anchor, so I'll just append it to the end of the current URL. Is that correct and valid for any current URL? That is, when the current URL is http://www.abc.com/def (where def is a directory), will http://www.abc.com/def#xyz be converted to http://www.abc.com/def/index.html#xyz?
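As a rough illustration of how standard URL resolution treats a fragment-only reference, here is a minimal Python sketch using urllib.parse (the variable names are only illustrative). The fragment replaces nothing but the fragment part of the base URL, and because a fragment is never sent to the server, a crawler can usually strip it before checking whether a page has already been visited.

```python
from urllib.parse import urljoin, urldefrag

# The current page URL from the question.
base = "http://www.abc.com/def"

# A fragment-only href keeps the base URL and only sets the fragment.
with_anchor = urljoin(base, "#xyz")

# urldefrag splits the fragment off again, which is handy for de-duplication.
page_url, fragment = urldefrag(with_anchor)

print(with_anchor)
print(page_url, fragment)
```

Whether the server then serves index.html for that path is decided on the server side; the client does not rewrite the URL itself.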

And for href="javascript:..." or href="vbscript:..." and so on, I'll just ignore them.
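The scheme check described here could be sketched as follows in Python. This is only one possible approach: the function name resolve_if_crawlable and the set ALLOWED_SCHEMES are hypothetical, and the sketch assumes the crawler only wants http and https links.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: only these schemes are worth fetching.
ALLOWED_SCHEMES = {"http", "https"}

def resolve_if_crawlable(base_url, href):
    """Return the absolute URL for href, or None if its scheme is not allowed."""
    # Relative hrefs inherit the scheme of the base URL when resolved.
    absolute = urljoin(base_url, href.strip())
    if urlsplit(absolute).scheme not in ALLOWED_SCHEMES:
        return None
    return absolute

page = "http://www.abc.com/def/xyz.htm"
print(resolve_if_crawlable(page, "/fff.html"))
print(resolve_if_crawlable(page, "javascript:void(0)"))
```

Checking the scheme after resolution means relative links and absolute links go through the same test.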

And for href="xyz.???", if ??? indicates an image file, an exe file, or anything else that is not valid HTML, should I just ignore it?

Thanks.

The question might be a little messy; I hope I explained it clearly.


Comments (2)

过去的过去 2024-12-12 12:44:13

Anything after the domain name can map to whatever the person configuring the domain wants.

There is no guarantee that a URL ending in .html refers to a real file on a filesystem somewhere, or that it will return valid HTML, or anything else.

You can arbitrarily decide to count def/ as a directory or part of a filename, whatever floats your boat, as any choice is equally correct.
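Since the URL suffix guarantees nothing about what comes back, one option is to decide from the response's Content-Type header instead of the extension. A minimal Python sketch, assuming the header value has already been read from the response; the name looks_like_html and the set HTML_TYPES are hypothetical:

```python
# Assumption: these media types are the ones the crawler treats as HTML.
HTML_TYPES = {"text/html", "application/xhtml+xml"}

def looks_like_html(content_type):
    """Return True if a Content-Type header value names an HTML media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=UTF-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES

print(looks_like_html("text/html; charset=UTF-8"))
print(looks_like_html("image/png"))
```

A header can still be wrong, so this is a heuristic rather than a guarantee.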

你没皮卡萌 2024-12-12 12:44:13

If http://www.abc.com/def is a directory, then the web server will usually redirect to http://www.abc.com/def/ in order to avoid confusing the client. You simply need to notice the redirect and use urlparse.urljoin(), or the appropriate function in <language-of-choice>, to fuse the two components together in either case, as a browser would.
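To make the effect of the trailing slash concrete, here is a small Python 3 sketch, where urljoin lives in urllib.parse rather than in the urlparse module. The helper name resolve_links is hypothetical, and the sketch assumes the HTTP client has already reported the final URL after following any redirect; how to obtain it depends on the client.

```python
from urllib.parse import urljoin

def resolve_links(final_url, hrefs):
    """Resolve each href against the URL the server finally answered from."""
    return [urljoin(final_url, href) for href in hrefs]

# Without a trailing slash, the last path segment is treated like a file name.
no_slash = resolve_links("http://www.abc.com/def", ["xyz.html", "/fff.html"])

# After a redirect to the slash-terminated URL, "def" acts as a directory.
with_slash = resolve_links("http://www.abc.com/def/", ["xyz.html", "/fff.html"])

# A parent reference is resolved relative to the page's directory.
parent = urljoin("http://www.abc.com/def/xyz.htm", "../xyz.html")

print(no_slash)
print(with_slash)
print(parent)
```

The crawler never has to guess whether def is a directory: it resolves against whatever URL the response actually came from, and the root-relative link comes out the same in both cases.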
