Retrieving anchors with a regular expression

Posted 2024-09-11 22:30:40

I have a bunch of HTML and I need to get all the anchors and their values using a regular expression.

This is a sample of the HTML I need to process:

<P align=center><SPAN style="FONT-FAMILY: Arial; FONT-SIZE: 10px"><SPAN style="COLOR: #666666">View the </SPAN><A href="http://www.google.com"><SPAN style="COLOR: #666666">online version</SPAN></A><SPAN style="COLOR: #666666"> if you are having trouble <A name=hi>displaying </A>this <a name="msg">message</A></SPAN></SPAN></P>

So I need to be able to retrieve every anchor, including named ones such as <A name="blah">.

Any help is greatly appreciated.

Comments (3)

戏蝶舞 2024-09-18 22:30:40

As hundreds of other answers on Stack Overflow suggest, it's a bad idea to use regex for processing HTML. Use an HTML parser instead.

If you still need a regex to find the href URLs, here is one you can use to match each href and extract its value:

\b(?<=(href="))[^"]*?(?=")

If you want the content between <A> and </A>, regex is a really poor fit: an anchor can contain nested markup (as in the sample), and lookbehind in many regex engines cannot contain variable-length patterns.
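
As a minimal sketch of how the href pattern above could be applied in C#: the class and method names (`HrefLookaround`, `Extract`) and the `RegexOptions.IgnoreCase` flag are assumptions added for illustration, not part of the answer. Because the lookbehind and lookahead are zero-width, the match itself is the attribute value. Note that the leading `\b` means a value starting with a non-word character (for example `/path` or `#top`) is not matched, and unquoted values are not handled at all.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefLookaround
{
    // The lookbehind/lookahead pattern from the answer.
    // IgnoreCase is an added assumption so that HREF= is also accepted.
    private static readonly Regex HrefValue =
        new Regex("\\b(?<=(href=\"))[^\"]*?(?=\")", RegexOptions.IgnoreCase);

    // Returns every double-quoted href value found in the input.
    public static List<string> Extract(string html)
    {
        var values = new List<string>();
        foreach (Match m in HrefValue.Matches(html))
        {
            values.Add(m.Value); // the whole match is the attribute value
        }
        return values;
    }

    public static void Main()
    {
        string html = "<A href=\"http://www.google.com\">online version</A> " +
                      "this <a name=\"msg\">message</A>";
        foreach (string url in Extract(html))
        {
            Console.WriteLine(url); // prints http://www.google.com
        }
    }
}
```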

猫九 2024-09-18 22:30:40

The pattern is <a.*?(?<attribute>href|name)="(?<value>.*?)".*?>

So your C# code will be:

Regex expression = new Regex("<a.*?(?<attribute>href|name)=\"(?<value>.*?)\".*?>", RegexOptions.IgnoreCase);
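
One caveat with the pattern above is that it requires double quotes, so on the sample HTML it skips `<A name=hi>` (the match that starts there runs on into the next anchor and reports `msg`). The following is a hedged sketch of a variant that also accepts single-quoted and unquoted values and reads the named groups. The pattern, the class name `AnchorAttributes`, and the method `Find` are hypothetical, it reports only the first href or name attribute of each tag, and it is still no substitute for an HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorAttributes
{
    // Hypothetical variant of the answer's pattern: [^>]* keeps the search
    // inside one tag, and the alternation accepts "value", 'value' or a bare value.
    // .NET allows the same group name to be reused across alternatives.
    private static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*?\b(?<attribute>href|name)\s*=\s*(?:""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^\s>]+))",
        RegexOptions.IgnoreCase);

    // Returns (attribute, value) pairs in document order.
    public static List<KeyValuePair<string, string>> Find(string html)
    {
        var found = new List<KeyValuePair<string, string>>();
        foreach (Match m in Anchor.Matches(html))
        {
            found.Add(new KeyValuePair<string, string>(
                m.Groups["attribute"].Value,
                m.Groups["value"].Value));
        }
        return found;
    }

    public static void Main()
    {
        string html = "<A href=\"http://www.google.com\">online version</A> trouble " +
                      "<A name=hi>displaying </A>this <a name=\"msg\">message</A>";
        foreach (var pair in Find(html))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
        // prints:
        // href = http://www.google.com
        // name = hi
        // name = msg
    }
}
```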

神仙妹妹 2024-09-18 22:30:40

Don't forget to add a reference to Microsoft.mshtml.dll

using System;
using System.IO;
using System.Linq;
using System.Windows.Forms;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();

            string html = "<P align=center><SPAN style=\"FONT-FAMILY: Arial; FONT-SIZE: 10px\"><SPAN style=\"COLOR: #666666\">View the </SPAN><A href=\"http://www.google.com\"><SPAN style=\"COLOR: #666666\">online version</SPAN></A><SPAN style=\"COLOR: #666666\"> if you are having trouble <A name=hi>displaying </A>this <a name=\"msg\">message</A></SPAN></SPAN></P>";
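            // Path.GetTempFileName() already returns a full path (ending in .tmp), so combining it
            // with the temp directory is redundant; use an .html name so the browser treats it as HTML.
            string fileName = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".html");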
            System.IO.File.WriteAllText(fileName, html);

            var browser = new WebBrowser();
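            // DocumentCompleted fires once the page has finished loading;
            // Navigated can fire before Document.Links is populated.
            browser.DocumentCompleted += browser_DocumentCompleted;
            browser.Navigate(new Uri(fileName));
        }

        private void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)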
        {
            var browser = (WebBrowser)sender;
            var links = browser
                        .Document
                        .Links
                        .OfType<HtmlElement>()
                        .Select(l => ((mshtml.HTMLAnchorElement)l.DomElement).href)
                        .ToList(); // materialize the lazy LINQ query
                        //result: { "http://www.google.com", .. }
        }
    }
}