How do I write a correct regular expression for URLs on a page without touching anchors?

Posted on 2024-07-19 10:53:27

I want to find all URLs (http://...) in the page and replace them with anchors (<a></a>), with one requirement: do not touch existing anchors or the page definition (doctype), for example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

So I need to match only URLs that appear as plain text.

I'm trying to override rendering for my pages, so I made a browser adapter:

<browser refID="default">
    <controlAdapters>
        <adapter controlType="System.Web.Mvc.ViewPage"
                 adapterType="Facad.Adapters.AnchorAdapter" />
    </controlAdapters>
</browser>

The adapter looks like this:

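using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Web.UI;
using System.Web.UI.Adapters;

public class AnchorAdapter : PageAdapter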
{
    protected override void Render(HtmlTextWriter writer)
    {
        /* Get page output into string */
        var sb = new StringBuilder();
        TextWriter tw = new StringWriter(sb);
        var htw = new HtmlTextWriter(tw);

        // Render into my writer
        base.Render(htw);

        string page = sb.ToString();
        //regular expression 
        Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase); 

        //get the first match 
        Match match = regx.Match(page); 

        //loop through matches 
        while (match.Success)
        {

            //output the match info 
            System.Web.HttpContext.Current.Response.Write("<p>url match: " + match.Groups[0].Value+"</p>");

            //get next match 
            match = match.NextMatch();
        }

        writer.Write(page);
    }
}

凑诗 2024-07-26 10:53:27

You just need to look at the character right before and right after the URL to see whether it is inside quotes. It's unlikely that someone would paste a quoted URL as plain text, whereas URLs in tags and doctypes are almost always quoted. So your regex becomes:

(^|[^'"])(http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?)([^'"]|$)

(^|[^'"]) means start of string or a character that is NOT a quote
([^'"]|$) means end of string or a character that is NOT a quote

The extra parentheses around the old regex make it a capture group, so you can retrieve the actual URL from group 2 (\2, or match.Groups[2].Value in .NET) instead of getting the extra characters it might have matched on the edges of the URL.

BTW, your URL regex could be much better; there are more compact and accurate forms. For example, [\\w+?\\.\\w+] is a character class, so it matches a single character rather than a sequence like "word.word", and you really don't need to escape everything inside [...].
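
As a rough illustration of the quote check described above, here is a minimal sketch that replaces plain-text URLs with anchors in one pass. The class name UrlLinker, the method name Linkify, and the shortened URL pattern are assumptions made for this example; they are not from the original post. Only the leading quote check is kept, because the URL pattern already stops at a quote. The sketch still has known gaps: a URL used as the link text of an existing anchor (preceded by ">") would be wrapped again, and trailing punctuation such as a final period is treated as part of the URL. An HTML parser would likely be more robust for real pages.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original adapter.
public static class UrlLinker
{
    // Group 1: start of input, or one character that is not a quote.
    // Group 2: the URL itself, ending at whitespace, a quote, or an angle bracket.
    private static readonly Regex PlainUrl = new Regex(
        @"(^|[^'""])(https?://[^\s'""<>]+)",
        RegexOptions.IgnoreCase);

    public static string Linkify(string html)
    {
        // $1 puts back the character consumed before the URL; $2 is the URL.
        return PlainUrl.Replace(html, "$1<a href=\"$2\">$2</a>");
    }
}
```

Inside the adapter's Render method, this could be used as writer.Write(UrlLinker.Linkify(page)) in place of writer.Write(page). Because the URL in the doctype is directly preceded by a double quote, it is left untouched.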
