Fetching a web page's title quickly and accurately

Posted on 2024-10-18 14:13:32

I'm looking to get the title of a web page, a common feature of many IRC bots that I want to incorporate into an IRC client I'm writing for fun.

The method I currently have working basically connects, sends a GET request for the entire web page, then seeks out the <title> tags and reads what is in between them. For larger web pages this can be slower than I'd like. An additional problem I've noticed is that web pages with dynamic titles (such as some phpBB forums) will not return the accurate title as it would show in a browser, because I don't execute any JavaScript, etc.

It seems one way to get an accurate title is to dump the HTML into a browser control (such as the IE COM control) and pull the title from it, but this is just going to make it even more time consuming.

Is there a simple method I am unaware of?


Comments (1)

不一样的天空 2024-10-25 14:13:32

In a word, no, not really.

I guess rather than downloading the whole document you could stream the HTTP file into your application and just stop downloading when you reach </title> - that would save you waiting for the whole HTML document to download.
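The streaming idea above could be sketched roughly as follows in Python. This is only an illustration under some assumptions, not a definitive implementation: the names `title_from_chunks` and `fetch_title` are hypothetical, the page is assumed to be UTF-8, and a 64 KiB cap is an arbitrary guess at how far into a document it is worth looking for a title.

```python
import re
import urllib.request
from html import unescape
from typing import Iterable, Optional

# Matches the first <title>...</title> pair, case-insensitively, across newlines.
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def title_from_chunks(chunks: Iterable[bytes], max_bytes: int = 65536) -> Optional[str]:
    """Consume byte chunks until a complete <title> element has been seen.

    Stops pulling chunks as soon as the closing tag arrives, or gives up
    once max_bytes (an assumed cap) have been buffered without a match.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        match = TITLE_RE.search(buf)
        if match:
            # Assumption: the page is UTF-8; undecodable bytes are replaced.
            raw = match.group(1).decode("utf-8", errors="replace")
            # Decode entities such as &amp; and collapse runs of whitespace.
            return " ".join(unescape(raw).split())
        if len(buf) >= max_bytes:
            break
    return None


def fetch_title(url: str, chunk_size: int = 1024, timeout: float = 10.0) -> Optional[str]:
    """Read the response in small chunks and stop once the title is found."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # iter(callable, sentinel) keeps calling resp.read() until it returns b"".
        chunks = iter(lambda: resp.read(chunk_size), b"")
        return title_from_chunks(chunks)
    # Leaving the with-block closes the connection, so the rest of the
    # body is never downloaded.
```

Calling `fetch_title("https://example.com/")` would then read only as much of the response as is needed to see the closing tag. Splitting the parsing into `title_from_chunks` keeps it independent of the network, so it can be fed from a socket, a file, or a test list of byte strings.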

However, that doesn't help if you need to read the title after it has been changed by some client-side JavaScript. As you say, the only way I can think of doing that is by using a browser control.
