Regular expression to extract the Favicon url from a web page

Posted on 2024-11-18 04:35:39

Please help me find the favicon URL in the sample HTML below using a regular expression. It should also check for the file extension ".ico". I am developing a personal bookmarking site and I want to save the favicons of the links I bookmark. I have already written the C# code to convert the icon to GIF and save it, but my knowledge of regex is very limited, so I am unable to select this tag because the ending tags differ from site to site. Examples of ending tags: "/>" and "</link>".

My programming language is C#.

<meta name="description" content="Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service." />
<meta name="robots" content="index, follow" />
<meta name="verify-v1" content="x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=" />
<link rel="shortcut icon" href="http://3dbin.com/favicon.ico" type="image/x-icon" />
<link rel="stylesheet" type="text/css" href="http://3dbin.com/css/1261391049/style.min.css" />
<!--[if lt IE 8]>
    <script src="http://3dbin.com/js/1261039165/IE8.js" type="text/javascript"></script>
<![endif]-->

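Solution: one more way to do this.
Download HtmlAgilityPack and add a reference to its DLL. Thanks for helping me. I really love this site :)

    // readcontent holds the page HTML; faviconurl is a string declared elsewhere.
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(readcontent);

    // SelectNodes returns null (not an empty collection) when nothing matches,
    // so check before looping.
    HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//link[@href]");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            HtmlAttribute att = link.Attributes["href"];
            // Compare case-insensitively so that "FAVICON.ICO" is accepted too.
            if (att.Value.EndsWith(".ico", StringComparison.OrdinalIgnoreCase))
            {
                faviconurl = att.Value;
            }
        }
    }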

Comments (4)

身边 2024-11-25 04:35:39

This should match the whole link tag that contains href="http://3dbin.com/favicon.ico":

 <link .*? href="http://3dbin\.com/favicon\.ico" [^>]* />

Correction based on your comment:

I see you already have a C# solution. Excellent! But in case you are still wondering whether it can be done with regular expressions, the following expression does what you want. Group 1 of each match holds only the URL. (The dot before ico is escaped so that it matches a literal ".", and [^>] and [^"] keep the match inside a single tag and a single attribute value.)

 <link [^>]*?href="([^"]*?\.ico)"

Simple C# snippet that makes use of it:

// This is the snippet from your example with an extra link item in the form <link ... href="...ico" > ... </link>,
// just to make sure it is picked up properly.
String htmlText = "<meta name=\"description\" content=\"Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service.\" /><meta name=\"robots\" content=\"index, follow\" /><meta name=\"verify-v1\" content=\"x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=\" /><link rel=\"shortcut icon\" href=\"http://3dbin.com/favicon.ico\" type=\"image/x-icon\" /><link rel=\"shortcut icon\" href=\"http://anotherURL/someicofile.ico\" type=\"image/x-icon\">just to make sure it works with different link ending</link><link rel=\"stylesheet\" type=\"text/css\" href=\"http://3dbin.com/css/1261391049/style.min.css\" /><!--[if lt IE 8]>    <script src=\"http://3dbin.com/js/1261039165/IE8.js\" type=\"text/javascript\"></script><![endif]-->";

foreach (Match match in Regex.Matches(htmlText, "<link [^>]*?href=\"([^\"]*?\\.ico)\""))
{
    String url = match.Groups[1].Value;

    Console.WriteLine(url);
}

which prints the following to the console:

http://3dbin.com/favicon.ico
http://anotherURL/someicofile.ico
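
Real pages also write tags in upper case and quote attributes with single quotes. A minimal sketch of a regex variant that tolerates both is shown below; the class name `IcoLinkRegex`, the method `FindIcoUrls` and the second test tag are hypothetical, and the pattern is an illustration under those assumptions rather than a complete HTML matcher. It will still miss unquoted attributes and hrefs such as `favicon.ico?v=2`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IcoLinkRegex
{
    // Group 1 captures the opening quote (' or ") and \1 requires the same quote to close.
    // Group 2 is the URL, which must end in ".ico" (any letter case, via IgnoreCase).
    private static readonly Regex Pattern = new Regex(
        @"<link\b[^>]*?\bhref\s*=\s*(['""])([^'""]*?\.ico)\1",
        RegexOptions.IgnoreCase);

    // Returns every href ending in .ico found inside a <link ...> tag.
    public static List<string> FindIcoUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            urls.Add(m.Groups[2].Value);
        }
        return urls;
    }

    public static void Main()
    {
        string html =
            "<link rel=\"shortcut icon\" href=\"http://3dbin.com/favicon.ico\" />" +
            "<LINK REL='icon' HREF='/img/site.ICO'></LINK>" +
            "<link rel=\"stylesheet\" href=\"http://3dbin.com/css/1261391049/style.min.css\" />";

        // Prints the first two hrefs; the stylesheet link is skipped.
        foreach (string url in FindIcoUrls(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because `[^>]` cannot cross the end of a tag, a `<link>` without a matching href is simply skipped instead of borrowing the href of a later tag.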
晨光如昨 2024-11-25 04:35:39
<link\s+[^>]*(?:href\s*=\s*"([^"]+)"\s+)?rel\s*=\s*"shortcut icon"(?:\s+href\s*=\s*"([^"]+)")?

Maybe... it is not robust, but it could work. (I used Perl regex syntax.)
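
The pattern above has two capture groups, one for an href placed before rel and one for an href placed after it, so calling code has to check which of them matched. A minimal C# sketch of that check follows. It is an adaptation, not the answer's exact pattern: `[^>]*` is made lazy (`[^>]*?`), because as far as I can tell the greedy form can consume a leading href before group 1 gets a chance to capture it. The names `ShortcutIconRegex` and `FindHref` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShortcutIconRegex
{
    // Same pattern as above except that [^>]* is lazy ([^>]*?); see the note in the lead-in.
    private static readonly Regex Pattern = new Regex(
        @"<link\s+[^>]*?(?:href\s*=\s*""([^""]+)""\s+)?rel\s*=\s*""shortcut icon""(?:\s+href\s*=\s*""([^""]+)"")?",
        RegexOptions.IgnoreCase);

    // Returns the href of the first rel="shortcut icon" link, or null if none was captured.
    public static string FindHref(string html)
    {
        Match m = Pattern.Match(html);
        if (!m.Success) return null;
        if (m.Groups[1].Success) return m.Groups[1].Value; // href directly before rel
        if (m.Groups[2].Success) return m.Groups[2].Value; // href directly after rel
        return null; // rel matched, but href was not adjacent to it
    }

    public static void Main()
    {
        // href after rel, as in the question's sample: prints http://3dbin.com/favicon.ico
        Console.WriteLine(FindHref(
            "<link rel=\"shortcut icon\" href=\"http://3dbin.com/favicon.ico\" type=\"image/x-icon\" />"));

        // href before rel: prints /favicon.ico
        Console.WriteLine(FindHref("<link href=\"/favicon.ico\" rel=\"shortcut icon\" />"));
    }
}
```

The last `return null` covers tags where another attribute sits between rel and href, which this pattern does not handle; that is part of why the answer calls it not robust.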

庆幸我还是我 2024-11-25 04:35:39

This is not a job for a regular expression, as you'll see if you spend 2 minutes on StackOverflow looking for how to parse HTML.

Use an HTML parser instead!

Here's a trivial example in Python (I'm sure this is equally do-able in C#):

% python
Python 2.7.1 (r271:86832, May 16 2011, 19:49:41) 
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen('https://stackoverflow.com/')
>>> soup = BeautifulSoup(page)
>>> link = soup.html.head.find(lambda x: x.name == 'link' and x['rel'] == 'shortcut icon')
>>> link['href']
u'http://cdn.sstatic.net/stackoverflow/img/favicon.ico'
>>> link['href'].endswith('.ico')
True
走走停停 2024-11-25 04:35:39

I had a go at this a while back, so here is something pretty simple. First it tries to find a /favicon.ico file at the site root. If that fails, I load the page with Html Agility Pack and use XPath to find any <link> tags. I loop through the link tags to see whether they have a rel attribute that mentions an icon (for example rel='icon'). If they do, I grab the href attribute and, if it exists, expand it into an absolute URL for that site.

Please feel free to play around with this and offer any improvements.

// Requires: using System; using System.Linq; using System.Net; using HtmlAgilityPack;
private static Uri GetFaviconUrl(string siteUrl)
{
    // try looking for a /favicon.ico first
    var url = new Uri(siteUrl);
    // Resolve against the site root; unlike url.Host this keeps a non-default port.
    var faviconUrl = new Uri(url, "/favicon.ico");
    try
    {
        using (var httpWebResponse = WebRequest.Create(faviconUrl).GetResponse() as HttpWebResponse)
        {
            if (httpWebResponse != null && httpWebResponse.StatusCode == HttpStatusCode.OK)
            {
                // Log("Found a /favicon.ico file for {0}", url);
                return faviconUrl;
            }
        }
    }
    catch (WebException)
    {
    }

    // otherwise parse the html and look for <link rel='icon' href='' /> using html agility pack
    var htmlDocument = new HtmlWeb().Load(url.ToString());
    var links = htmlDocument.DocumentNode.SelectNodes("//link");
    if (links != null)
    {
        foreach (var linkTag in links)
        {
            var rel = GetAttr(linkTag, "rel");
            if (rel == null)
                continue;

            // >= 0 (not > 0) so that rel="icon", where the match starts at index 0, is accepted.
            if (rel.Value.IndexOf("icon", StringComparison.InvariantCultureIgnoreCase) >= 0)
            {
                var href = GetAttr(linkTag, "href");
                if (href == null)
                    continue;

                Uri absoluteUrl;
                if (Uri.TryCreate(href.Value, UriKind.Absolute, out absoluteUrl))
                {
                    // Log("Found an absolute favicon url {0}", absoluteUrl);
                    return absoluteUrl;
                }

                // Resolve against the page URL so hrefs without a leading slash also work.
                var expandedUrl = new Uri(url, href.Value);
                //Log("Found a relative favicon url for {0} and expanded it to {1}", url, expandedUrl);
                return expandedUrl;
            }
        }
    }

    // Log("Could not find a favicon for {0}", url);
    return null;
}

public static HtmlAttribute GetAttr(HtmlNode linkTag, string attr)
{
    return linkTag.Attributes.FirstOrDefault(x => x.Name.Equals(attr, StringComparison.InvariantCultureIgnoreCase));
}
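
Relative href values come in more than one shape: root-relative ("/favicon.ico") and document-relative ("img/favicon.ico"). A minimal sketch of expanding them with the `Uri(Uri, string)` constructor is shown below, assuming the page URL is known. The class name `FaviconUrlResolver`, the method `Resolve` and the example URLs are hypothetical.

```csharp
using System;

public static class FaviconUrlResolver
{
    // Resolves an href taken from a <link> tag against the URL of the page it came from.
    public static Uri Resolve(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);  // pageUrl must be absolute, otherwise this throws
        return new Uri(baseUri, href);   // an href that is already absolute is kept as-is
    }

    public static void Main()
    {
        const string page = "http://3dbin.com/products/index.html"; // hypothetical page URL

        Console.WriteLine(Resolve(page, "/favicon.ico"));                       // http://3dbin.com/favicon.ico
        Console.WriteLine(Resolve(page, "img/favicon.ico"));                    // http://3dbin.com/products/img/favicon.ico
        Console.WriteLine(Resolve(page, "http://cdn.example.com/favicon.ico")); // http://cdn.example.com/favicon.ico
    }
}
```

Letting `Uri` do the resolution avoids hand-building the string from scheme and host, which only produces a valid URL when the href happens to start with a slash.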