How to solve a string replacement problem

Posted on 2024-11-19 23:43:53

NOTE: My problem is NOT that my links are not being replaced. The problem is that the replacements are being NESTED.
For example, take this comment:

some string with www.google.com/blah/blah also something else www.google.com

By the time the second string replace runs, its search text also matches part of the first link (www.google.com/blah/blah), so that link gets replaced twice.

I have a web app which lets users comment.
I process the input string and convert all links to HTML link format when I display it on the page. The original user input stays in the DB untouched, so it is never corrupted by processing. I only run my function on it when I show it on the page.

Now, this is the logic I am using to replace all links with their HTML versions:

  1. Regex all links.
  2. For each match, replace the link with its HTML version in the original string.
  3. Finally, display the string.

ex: www.google.com becomes <a href="http://www.google.com" target="_blank">www.google.com</a> just before it's displayed on page.

This was working great until recently, when one of my customers posted content with two links from the same domain.

The links were, say,

  1. www.google.com/images/blahblah
  2. www.google.com

My problem is that when the second string replace is done (I am using StringBuilder.Replace), the first link gets replaced as well!

So, first,

www.google.com/images/blahblah

becomes

<a href="http://www.google.com/images/blahblah" target="_blank">www.google.com/images/blahblah</a>

which is fine. But the problem arises on the second string replace: since the replace is global, it also runs on the already processed link, so the link above gets mangled because www.google.com appears inside it as well.
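
To make the failure concrete, here is a minimal sketch of the behaviour described above. It assumes each matched link is applied in turn with StringBuilder.Replace; the class and method names (NestedReplaceDemo, ToAnchor, NaiveLinkify) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text;

static class NestedReplaceDemo
{
    // Hypothetical helper for this sketch: wraps one link in the anchor format described above.
    private static string ToAnchor(string link)
    {
        return "<a href=\"http://" + link + "\" target=\"_blank\">" + link + "</a>";
    }

    public static string NaiveLinkify(string input, string[] links)
    {
        var sb = new StringBuilder(input);
        foreach (string link in links)
        {
            // StringBuilder.Replace substitutes EVERY occurrence of the text,
            // including occurrences inside anchors produced by earlier iterations.
            sb.Replace(link, ToAnchor(link));
        }
        return sb.ToString();
    }
}
```

Calling NaiveLinkify with "www.google.com/images/blahblah and www.google.com" and the two links in that order produces an anchor nested inside the href of the first anchor, which is the mangled output in question.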

This messes things up so badly that I end up with a mutilated abomination of a string.

How do I avoid this?

Does Regex.Matches provide the index of each matched element for me to work with? I couldn't find it anywhere.

What's the best way to deal with this? Any suggestions?

Sorry for the lengthy question.

I can probably do this by manually traversing the string, but that is long and painful. There has to be a good way to do it...

Edit: adding extra info, as someone asked:

My regex:

    string rPattern = @"(((http|ftp|https):\/\/)|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#!]*[\w\-\@?^=%&/~\+#])?";

    Regex rLinks = new Regex(rPattern, RegexOptions.IgnoreCase);
    MatchCollection matches = rLinks.Matches(inputString);

Then I am using:

    foreach (Match match in matches)
    {
        if (match.Value.StartsWith("www.youtube.com/watch"))
        {
            // logic to embed youtube video - this works fine.
        }
    }

    // Here I replace all hyperlinks with their <a href> versions


Comments (4)

月下凄凉 2024-11-26 23:43:53

Regex.Matches returns a MatchCollection. Match.Index is what you're looking for.

string pattern = @"(https?://)?(?:www(?:\.\w+)+|(?:\w+\.)+(?:com|org|us|net|...))(/\w*)*"; // your pattern here.
foreach (Match match in Regex.Matches (input, pattern))
{
   // Use match.Index and match.Length;
}
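
As a hedged illustration of how Match.Index and Match.Length could be used, the sketch below replaces each match in place, walking the matches from last to first so that earlier indexes stay valid. The URL pattern is a simplified stand-in and the IndexLinkifier class name is hypothetical; neither comes from the question.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class IndexLinkifier
{
    // Simplified, hypothetical URL pattern used only for this sketch.
    private static readonly Regex UrlPattern = new Regex(
        @"(?:https?://|www\.)[\w\-]+(?:\.[\w\-]+)+(?:/[\w\-./?%&=]*)?",
        RegexOptions.IgnoreCase);

    public static string Linkify(string input)
    {
        var sb = new StringBuilder(input);
        MatchCollection matches = UrlPattern.Matches(input);

        // Walk backwards so an edit never shifts the index of a match still to be processed.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];
            string href = m.Value.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                ? m.Value
                : "http://" + m.Value;
            string anchor = "<a href=\"" + href + "\" target=\"_blank\">" + m.Value + "</a>";

            // Replace exactly the matched span, not every occurrence of its text.
            sb.Remove(m.Index, m.Length).Insert(m.Index, anchor);
        }
        return sb.ToString();
    }
}
```

Because each edit touches only the span reported by the match, a shorter link such as www.google.com can no longer clobber a longer one that contains it.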

But really, you're probably looking for something more like this:

string originalPost = 
   @"Ooh shiney: www.google.com/images/blahblah
   Look here: www.google.com";

// $0 is the whole match; $1 would be only the optional scheme group of the pattern above.
string html = Regex.Replace (
   originalPost, pattern, 
   "<a href='http://$0' target='_blank'>$0</a>");

Or, you can use a MatchEvaluator to do more advanced work (like ensuring we don't add a double http://).

string html = Regex.Replace (
   originalPost, pattern, 
   m => 
      string.Format (
         "<a href='{0}{1}' target='_blank'>{1}</a>",
          m.Value.StartsWith ("http", StringComparison.IgnoreCase) ? "" : "http://",
          m.Value));
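
The same idea can absorb the per-match branching from the question. The following is a sketch only: the URL pattern is simplified, and RenderYouTubeEmbed is a hypothetical placeholder for the asker's existing embed logic, not real embed markup.

```csharp
using System;
using System.Text.RegularExpressions;

static class EvaluatorLinkifier
{
    // Simplified, hypothetical pattern for this sketch; substitute your own.
    private static readonly Regex UrlPattern = new Regex(
        @"(?:https?://|www\.)[\w\-]+(?:\.[\w\-]+)+(?:/[\w\-./?%&=]*)?",
        RegexOptions.IgnoreCase);

    // Hypothetical placeholder standing in for the existing YouTube embed logic.
    private static string RenderYouTubeEmbed(string url)
    {
        return "[video: " + url + "]";
    }

    public static string Linkify(string input)
    {
        // Regex.Replace scans the input once, so text produced by the
        // evaluator is never scanned again and cannot be replaced twice.
        return UrlPattern.Replace(input, m =>
        {
            string url = m.Value;
            if (url.StartsWith("www.youtube.com/watch", StringComparison.OrdinalIgnoreCase))
                return RenderYouTubeEmbed(url);

            string href = url.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                ? url
                : "http://" + url;
            return "<a href=\"" + href + "\" target=\"_blank\">" + url + "</a>";
        });
    }
}
```

Keeping every decision inside one evaluator means there is a single pass over the comment, rather than one loop for videos and a second round of global replaces for links.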

神妖 2024-11-26 23:43:53

I had the same need, and this is what I've been using for the past couple of years:

public static string MakeCommentSafe(string strComment)
{
    // Normalize CR/LF pairs and lone CRs to line feeds.  Then HtmlEncode.  Then collapse runs of consecutive line feeds down to two.
    strComment = Regex.Replace(System.Web.HttpContext.Current.Server.HtmlEncode(Regex.Replace(strComment, "\r\n", "\n").Replace((char)13, (char)10)), "\n(\n)+", "$1\n");

    // Find all links and make them active
    return Regex.Replace(Regex.Replace(strComment, @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", "<a href=\"$1\" target=\"_blank\" rel=\"nofollow\">$1</a>"), "\n", "<br />");
}

And here's a tip. If you really want this to perform well with lots of comments on the page, then store both the unsafe and safe versions in the database when the comment is posted. That way you don't have to call this function repeatedly when displaying every comment on a page.

晒暮凉 2024-11-26 23:43:53

Use the Regex.Replace method, e.g.:

var result = Regex.Replace(input, pattern, "<a href=\"$0\" target=\"_blank\">$0</a>");

纵性 2024-11-26 23:43:53

To play devil's advocate:

So, you want to convert strings that look like:

www.example.com
www.example.com/foo/bar
www.example.co.tw/baz.moo?foo=1

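but not strings that look like:

<a href="http://www.example.com" target="_blank">www.example.com</a>
<a href="http://www.example.com/foo/bar" target="_blank">www.example.com/foo/bar</a>
<a href="http://www.example.co.tw/baz.moo?foo=1" target="_blank">www.example.co.tw/baz.moo?foo=1</a>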

I would guess that I am correct. A simple solution: expand your regex to look at either side of the thing that looks like a URL, and ignore it if it:

  1. Is between an href=" and a " target="_blank">
  2. Is between a " target="_blank"> and a </a>
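
One way the two conditions above might be expressed is with a negative lookbehind and a negative lookahead, which .NET regular expressions support. This is a rough sketch under simplifying assumptions: the URL pattern is a stand-in, the LookaroundLinkifier name is hypothetical, and the lookahead only recognizes anchor text that runs straight up to </a> with no other tags in between.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaroundLinkifier
{
    // (?<!href="[^"]*)  : skip URLs that sit inside an href="..." attribute value.
    // (?![^<]*</a>)     : skip URLs that are followed by </a> before any other tag,
    //                     i.e. URLs that are already anchor text.
    // The URL part in the middle is a simplified, hypothetical pattern.
    private static readonly Regex UnlinkedUrl = new Regex(
        @"(?<!href=""[^""]*)(?:https?://|www\.)[\w\-]+(?:\.[\w\-]+)+(?:/[\w\-./?%&=]*)?(?![^<]*</a>)",
        RegexOptions.IgnoreCase);

    public static string Linkify(string input)
    {
        return UnlinkedUrl.Replace(input, m =>
        {
            string href = m.Value.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                ? m.Value
                : "http://" + m.Value;
            return "<a href=\"" + href + "\" target=\"_blank\">" + m.Value + "</a>";
        });
    }
}
```

With these guards, running Linkify on text that already contains generated anchors should leave those anchors alone and wrap only the bare URLs. A single-pass Regex.Replace on the original text avoids the problem entirely, so this approach is mainly useful when the input may already be partly processed.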