I need help with a regular expression to extract a link from a string in C#

Posted on 2024-10-21 09:23:15

I need to extract a link from a string using a regular expression in C#. I cannot use a substring method, since both the letters in the string and the link itself may vary.
This is the link with surrounding letters:

-sv"><a href="http://sv.wikipedia.org/wiki/%C3%84pple" title="

The -sv"><a href=" part must be included in the regex or it won't be specific enough.
The regex may end at the quotation mark at the end of the link, or wherever is easiest.
I've had another suggestion as well. However, it does not include the sv part at the beginning, and the submitter couldn't make it compile:

@"]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?[^>]*?>";

Now I'm turning to you guys on Stack Overflow.
Thanks in advance!
Max

Comments (3)

就是爱搞怪 2024-10-28 09:23:15

Parsing stuff out of html with regex is fraught with danger. Please see this classic answer which explains this with force and humour.

The problem with your question is that we don't know the context.

  • Are you sure the same substring won't appear twice?
  • Are you sure there won't be extra whitespace?
  • Are you sure the html will be valid? (e.g., they could forget to use "", or use '' instead)
  • Are you sure they won't put the title before the href?

There are lots of ways to get it wrong...
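
To make those failure modes concrete, here is a minimal sketch that runs two patterns against variations of the markup. The class name PatternFragility, the helper methods, and the looser pattern are assumptions made up for this illustration; the strict pattern simply hard-codes the -sv"><a href=" prefix from the question. The looser pattern tolerates extra whitespace, single quotes, and attributes placed before href, but it is still only a pattern and not an HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternFragility
{
    // Strict: expects exactly  -sv"><a href="..."  with no variation at all.
    // In a verbatim string (@"...") every literal " is written as "".
    public static readonly Regex Strict =
        new Regex(@"-sv""><a href=""([^""]+)""");

    // Looser (hypothetical): allows whitespace around > and =, either quote
    // style, and other attributes between <a and href.
    public static readonly Regex Loose = new Regex(
        @"-sv[""']\s*>\s*<a\s+(?:[^>]*?\s)?href\s*=\s*[""'](?<url>[^""'>]+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the URL found by the strict pattern, or null if it does not match.
    public static string StrictUrl(string html)
    {
        Match m = Strict.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns the URL found by the looser pattern, or null if it does not match.
    public static string LooseUrl(string html)
    {
        Match m = Loose.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }
}
```

With the exact string from the question both methods return the URL. With an extra space after >, with single quotes, or with title placed before href, StrictUrl returns null while LooseUrl still finds the link. Even the looser pattern gives up on unquoted attribute values, so the questions above still need answering for real input.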


However, to answer your question, this regex pattern will work for the exact string you have pasted:

 -sv"><a href="([^"]+)"

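However, you won't be able to do a replace directly with that, because the pattern matches the surrounding text as well as the link. Note the (): this is a regex capture group, and the link itself is the part it captures. I'd recommend looking that up yourself; that way you won't be a newbie forever :)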
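
For what it's worth, here is a rough sketch of how the capture group could be read in C#. The class and method names (LinkExtractor, ExtractUrl, ReplaceUrl) are invented for this example. The replace variant is an assumption about what "replace" would mean here: it uses a lookbehind so that only the URL is swapped and the -sv"><a href=" prefix stays in place.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // The pattern from the answer. Inside a verbatim string (@"...")
    // every literal " has to be doubled.
    private const string Pattern = @"-sv""><a href=""([^""]+)""";

    // Returns the text captured by the first ( ) group, or null if no match.
    public static string ExtractUrl(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Replaces only the URL. The lookbehind (?<=...) requires the prefix to be
    // present without making it part of the matched (and replaced) text.
    public static string ReplaceUrl(string html, string newUrl)
    {
        // A MatchEvaluator is used so that any $ in newUrl is taken literally.
        return Regex.Replace(html, @"(?<=-sv""><a href="")[^""]+", _ => newUrl);
    }
}
```

Groups[0] is always the whole match, prefix included, which is why the link has to be read from Groups[1].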

帥小哥 2024-10-28 09:23:15

Try using an HTML parser. Its source code is also quite readable if you want to learn from it.

Download the library and add a reference to HtmlAgilityPack.dll. Then get all your links with:

    // Needs: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    List<string> listOfUrls = new List<string>();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(@"c:\ht.html");
    HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//li[@class='interwiki-sv']");
    // SelectNodes returns null rather than an empty collection when nothing matches.
    if (coll != null)
    {
        foreach (HtmlNode li in coll)
        {
            if (li.ChildNodes.Count < 1) continue;
            HtmlNode node = li.ChildNodes.First(); // assumes the <a> is the first child
            if (null == node) continue;
            HtmlAttribute att = node.Attributes["href"];
            if (null == att) continue;
            listOfUrls.Add(att.Value);
        }
    }
    // listOfUrls now holds the links to process.