如何在没有锚点的页面上为 url 编写正确的正则表达式?
我想剪切所有网址,例如 (http://....) 并将它们替换为锚点 < ;a>
但我的要求: 不要触摸锚点和页面定义(文档类型),例如:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
所以我需要找到带有 url 的纯文本...
我试图覆盖我的页面内渲染并制作了 BrowserAdapter:
<browser refID="default">
<controlAdapters>
<adapter controlType="System.Web.Mvc.ViewPage"
adapterType="Facad.Adapters.AnchorAdapter" />
</controlAdapters>
</browser>
它看起来像这样:
public class AnchorAdapter : PageAdapter
{
protected override void Render(HtmlTextWriter writer)
{
/* Get page output into string */
var sb = new StringBuilder();
TextWriter tw = new StringWriter(sb);
var htw = new HtmlTextWriter(tw);
// Render into my writer
base.Render(htw);
string page = sb.ToString();
//regular expression
Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
//get the first match
Match match = regx.Match(page);
//loop through matches
while (match.Success)
{
//output the match info
System.Web.HttpContext.Current.Response.Write("<p>url match: " + match.Groups[0].Value+"</p>");
//get next match
match = match.NextMatch();
}
writer.Write(page);
}
}
I want to cut all url's like (http://....) and replace them on anchors <a></a>
but my requirement:
Do not touch anchors and page definition(Doc type) like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
So I need to find just plain text with url's...
I'm trying to override my render inside page and I made BrowserAdapter:
<browser refID="default">
<controlAdapters>
<adapter controlType="System.Web.Mvc.ViewPage"
adapterType="Facad.Adapters.AnchorAdapter" />
</controlAdapters>
</browser>
it looks like this:
public class AnchorAdapter : PageAdapter
{
protected override void Render(HtmlTextWriter writer)
{
/* Get page output into string */
var sb = new StringBuilder();
TextWriter tw = new StringWriter(sb);
var htw = new HtmlTextWriter(tw);
// Render into my writer
base.Render(htw);
string page = sb.ToString();
//regular expression
Regex regx = new Regex("http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
//get the first match
Match match = regx.Match(page);
//loop through matches
while (match.Success)
{
//output the match info
System.Web.HttpContext.Current.Response.Write("<p>url match: " + match.Groups[0].Value+"</p>");
//get next match
match = match.NextMatch();
}
writer.Write(page);
}
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您只需要在 url 的前后搜索一下,看看它是否在引号中,不太可能有人将带引号的 url 粘贴为纯文本,但 url 总是在标签和文档类型中引用。 所以你的正则表达式变成:
(^|[^'"]+) 表示字符串的开头或不是引号的字符
([^'"]|$) 表示字符串结尾或不是引号
旧正则表达式周围的额外括号确保它是一个捕获组,因此您可以使用 \2 (组 2)检索实际 URL,而不是得到额外的废话可能在 url 的边缘匹配
,顺便说一句,你的 URL 正则表达式看起来很糟糕,有更紧凑和准确的形式,你真的不需要转义所有内容。
You just need to search a bit ahead and behind the url to see if it's in quotes, it's unlikely someone would paste a quoted url as plaintext but urls are always quoted in tags and doctypes. So your regex becomes:
(^|[^'"]+) means start of string or a character that is NOT a quote
([^'"]|$) means end of string or not a quote
The extra brackets around the old regex ensure it's a capture group so you can retrieve the actual URL with \2 (group 2) instead of getting the extra crap it might have matched on the edges of the url
BTW, your URL regex looks pretty bad, there are more compact and accurate forms. You really don't need to escape EVERYTHING.