How to get a website title from C#

Posted 2024-07-09 17:59:34

I'm revisiting some old code of mine and have stumbled upon a method for getting the title of a website based on its URL. It's not what you would call a stable method: it often fails to produce a result and sometimes even produces incorrect results. It also sometimes fails to show some of the characters in the title, because they are in a different encoding.

Does anyone have suggestions for improvements over this old version?

public static string SuggestTitle(string url, int timeout)
{
    WebResponse response = null;
    string line = string.Empty;

    try
    {
        WebRequest request = WebRequest.Create(url);
        request.Timeout = timeout;

        response = request.GetResponse();
        Stream streamReceive = response.GetResponseStream();
        Encoding encoding = System.Text.Encoding.GetEncoding("utf-8");
        StreamReader streamRead = new System.IO.StreamReader(streamReceive, encoding);

        while(streamRead.EndOfStream != true)
        {
            line = streamRead.ReadLine();
            if (line.Contains("<title>"))
            {
                line = line.Split(new char[] { '<', '>' })[2];
                break;
            }
        }
    }
    catch (Exception) { }
    finally
    {
        if (response != null)
        {
            response.Close();
        }
    }

    return line;
}

One final note: I would also like the code to run faster. It blocks until the page has been fetched, so it would be great if I could fetch only the site header rather than the entire page.

Comments (3)

╭⌒浅淡时光〆 2024-07-16 17:59:34

A simpler way to get the content:

WebClient x = new WebClient();
string source = x.DownloadString("http://www.singingeels.com/");

A simpler, more reliable way to get the title:

string title = Regex.Match(source, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>",
    RegexOptions.IgnoreCase).Groups["Title"].Value;
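
As a minimal sketch of how the regex above might be wrapped for reuse, the hypothetical helper below (the names `TitleExtractor` and `ExtractTitle` are illustrative, not part of the original answer) returns an empty string when no title is found, decodes HTML entities with `WebUtility.HtmlDecode`, and collapses runs of whitespace. It assumes the page source has already been downloaded and decoded to a string with the right encoding.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are assumptions.
public static class TitleExtractor
{
    // Same pattern as in the answer, without the redundant escapes.
    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>\s*(?<Title>[\s\S]*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string ExtractTitle(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        Match match = TitleRegex.Match(html);
        if (!match.Success)
        {
            // No <title> element: return empty instead of a random line.
            return string.Empty;
        }

        // Turn entities such as &amp; into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(match.Groups["Title"].Value);

        // Titles can span several lines; collapse whitespace to single spaces.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

With this in place, the two snippets above could be combined as `TitleExtractor.ExtractTitle(source)`.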

爱要勇敢去追 2024-07-16 17:59:34

Perhaps this suggestion will open up a new world for you.
I had the same question and arrived at this approach.

Download "Html Agility Pack" from http://html-agility-pack.net/?z=codeplex

Or get it from NuGet: https://www.nuget.org/packages/HtmlAgilityPack/
and add a reference to it.

Add the following using directive to the code file:

using HtmlAgilityPack;

Write the following code in your method:

var webGet = new HtmlWeb();
var document = webGet.Load(url);
// SelectSingleNode returns null when the page has no <title>, so guard against it.
var title = document.DocumentNode.SelectSingleNode("html/head/title")?.InnerText;

Sources:

https://codeshare.co.uk/blog/how-to-scrape-meta-data-from-a-url-using-htmlagilitypack-in-c/
HtmlAgilityPack obtain Title and meta

不知所踪 2024-07-16 17:59:34

To accomplish this, you will need to do a few things.

  • Make your app multithreaded, so that you can process multiple requests at a time and maximize the number of HTTP requests in flight.
  • During the async request, download only as much data as you need; you can probably parse the data as it arrives, looking for the title.
  • You will probably want to use a regex to pull out the title.

I have done this before with SEO bots and have been able to handle almost 10,000 requests at a time. You just need to make sure that each web request is self-contained in its own thread.
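
One possible reading of these points is sketched below. It is not the code from the SEO bots mentioned above: the class and method names, the 64K character cap, and the UTF-8 default are all assumptions. `HttpCompletionOption.ResponseHeadersRead` lets `HttpClient` return as soon as the headers arrive, so the body can be read in small chunks and abandoned once a complete title element has been seen.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper; all names here are assumptions.
public static class StreamingTitleReader
{
    // One shared client for all requests.
    private static readonly HttpClient Client = new HttpClient();

    private static readonly Regex TitleRegex = new Regex(
        @"<title\b[^>]*>\s*(?<Title>[\s\S]*?)</title>",
        RegexOptions.IgnoreCase);

    // Reads decoded text chunk by chunk and stops as soon as a complete
    // <title> element has been seen or about maxChars characters were read.
    public static async Task<string> ReadTitleAsync(Stream stream, int maxChars = 64 * 1024)
    {
        // Assumption: the page is UTF-8 unless a byte order mark says otherwise.
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            var seen = new StringBuilder();
            var buffer = new char[2048];

            while (seen.Length < maxChars)
            {
                int read = await reader.ReadAsync(buffer, 0, buffer.Length);
                if (read == 0)
                {
                    break; // end of stream, no title found
                }

                seen.Append(buffer, 0, read);

                Match match = TitleRegex.Match(seen.ToString());
                if (match.Success)
                {
                    return match.Groups["Title"].Value.Trim();
                }
            }

            return string.Empty;
        }
    }

    // Requests the page but only reads as much of the body as needed.
    public static async Task<string> FetchTitleAsync(string url)
    {
        using (HttpResponseMessage response =
            await Client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();
            using (Stream body = await response.Content.ReadAsStreamAsync())
            {
                return await ReadTitleAsync(body);
            }
        }
    }
}
```

Calling `await StreamingTitleReader.FetchTitleAsync(url)` for many URLs and awaiting the tasks with `Task.WhenAll` runs the requests concurrently without dedicating a thread to each one. A fuller version would pick the encoding from the response's Content-Type header instead of assuming UTF-8.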
