Regular expression for encoded HTML

Posted 2024-07-27 05:16:37

I'd like to create a regex that will match an opening <a> tag containing only an href attribute:

<a href="doesntmatter.com">

It should match the above, but not match when other attributes are added:

<a href="doesntmatter.com" onmouseover="alert('Do something evil with Javascript')">

Normally that would be pretty easy, but the HTML is encoded. So, encoding both of the above, I need the regex to match this:

&#60;a href&#61;&#34;doesntmatter.com&#34; &#62;

But not match this:

&#60;a href&#61;&#34;doesntmatter.com&#34; onmouseover&#61;&#34;alert&#40;&#39;do something evil with javascript.&#39;&#41;&#34; &#62;

Assume all encoded HTML is "valid" (no weird malformed XSS trickery), and assume that we don't need to follow any HTML sanitization best practices. I just need the simplest regex that will match the first encoded string above (A) but not the second (B).

Thanks!

信仰 2024-08-03 05:16:37

The first regular expression that comes to mind is /<a href=".*?">/; the lazy expression (.*?) matches the string between the quotes. However, as pointed out in the comments, because the pattern must end with >, the engine backtracks and lets .*? grow past the first closing quote, so the tag with extra attributes still matches.

To get around this problem, you can use atomic grouping. An atomic group tells the regular expression engine, "once you have found a match for this group, accept it." This stops the regex from going back and extending the match through the second quoted string after it fails to find a > at the end of the href. The regular expression with an atomic group looks like this:

/<a (?>href=".*?")>/

With the characters replaced by their HTML entities, it looks like this:

/&#60;a (?>href&#61;&#34;.*?&#34;)&#62;/

Note that the encoded samples in the question have a space before &#62;; to match them exactly, add an optional space ( ?) before &#62;.
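
The difference between the lazy and the atomic pattern can be checked with a small Python sketch. It is only an illustration, not the answerer's code: the names A, B, LAZY, ATOMIC and is_match are made up here, the optional space before &#62; is an addition to fit the question's samples, and atomic groups in Python's re module are assumed to require Python 3.11 or later (older versions fall back to a lookahead-plus-backreference emulation).

```python
import re

# Entity-encoded samples from the question: A should match, B should not.
A = "&#60;a href&#61;&#34;doesntmatter.com&#34; &#62;"
B = (
    "&#60;a href&#61;&#34;doesntmatter.com&#34; "
    "onmouseover&#61;&#34;alert&#40;&#39;do something evil with javascript.&#39;&#41;&#34; &#62;"
)

# Lazy version: on B, .*? is allowed to grow past the first &#34; until the
# pattern finally reaches the closing &#62;, so B matches as well.
LAZY = re.compile(r"&#60;a href&#61;&#34;.*?&#34; ?&#62;")

try:
    # Atomic group (Python 3.11+): once the group has matched up to the first
    # &#34;, the engine may not go back into it to extend .*? any further.
    ATOMIC = re.compile(r"&#60;a (?>href&#61;&#34;.*?&#34;) ?&#62;")
except re.error:
    # Older Python: emulate the atomic group with a lookahead plus a
    # backreference; a lookahead is not re-entered on backtracking.
    ATOMIC = re.compile(r"&#60;a (?=(href&#61;&#34;.*?&#34;))\1 ?&#62;")


def is_match(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None
```

Both patterns accept A; only the lazy one also accepts B, which is the behaviour the atomic group is meant to rule out.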

转角预定愛 2024-08-03 05:16:37

Hey! I had to do a similar thing recently. I recommend decoding the HTML first and then grabbing the info you want. Here's my solution in C#:

// Requires: using System.Text; using System.Text.RegularExpressions;

// Returns "href,text" for every anchor found in already-decoded HTML.
// Note: no separator is added between consecutive anchors.
private string getAnchor(string data)
{
    // Named groups capture the href value and the link text.
    string pattern = @"<a.*?href=[""'](?<href>.*?)[""'].*?>(?<text>.*?)</a>";
    Regex myRegex = new Regex(pattern, RegexOptions.Multiline);
    var anchor = new StringBuilder();

    foreach (Match match in myRegex.Matches(data))
    {
        anchor.Append(match.Groups["href"].Value.Trim())
              .Append(',')
              .Append(match.Groups["text"].Value.Trim());
    }

    return anchor.ToString();
}

I hope that helps!
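
For the question's narrower goal (an opening tag with href and nothing else), the decode-first idea can be sketched in Python as follows. This is a rough illustration under stated assumptions rather than a port of the C# above: HREF_ONLY and href_if_only_attribute are hypothetical names, and the input is assumed to be a single encoded tag.

```python
import html
import re

# Matches an opening <a> tag whose only attribute is href.
# [^"]* cannot cross a double quote, so no further attribute can slip in.
HREF_ONLY = re.compile(r'<a\s+href="([^"]*)"\s*>', re.IGNORECASE)


def href_if_only_attribute(encoded):
    """Decode entities, then return the href value if the tag carries
    no other attributes, or None otherwise."""
    decoded = html.unescape(encoded).strip()
    match = HREF_ONLY.fullmatch(decoded)
    return match.group(1) if match else None
```

Decoding with html.unescape first means the pattern is written against plain HTML, so a negated character class is enough and no atomic group or look-ahead is needed.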

夏九 2024-08-03 05:16:37

I don't see how matching one is different from the other. You're just looking for exactly what you wrote, with the doesntmatter.com portion as the part you capture. I guess matching anything up to &#34; (not &quot;?) can present a problem, but you can do it like this in regex:

(?:(?!&#34;).)*

It essentially means:

Match the following group 0 or more times
- Fail the match if the following string is "&#34;"
- Match any character (except a newline, unless DOTALL is specified)

The complete regular expression would be:

/&#60;a href&#61;&#34;(?>(?:[^&]+|(?!&#34;).)*)&#34;&#62;/s

This is more efficient than using a non-greedy expression.

Credit to Daniel Vandersluis for reminding me of the atomic group! It fits nicely here as an optimization (if the engine had to backtrack into the group, the pattern could never match anyway).

I also threw in an additional [^&]+ alternative to avoid repeating the negative look-ahead so many times.

Alternatively, one could use a possessive quantifier, which essentially does the same thing (your regex engine might not support it):

/&#60;a href&#61;&#34;(?:[^&]+|(?!&#34;).)*+&#34;&#62;/s

As you can see, it's slightly shorter.
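
The look-ahead construct on its own can be tried out with a minimal Python sketch. Several things here are assumptions rather than part of the answer: the names HREF_ONLY and encoded_href, the added capture group, and the optional space before &#62; (the question's samples have one). The [^&]+ shortcut is left out on purpose, because without an atomic group or possessive quantifier the nested quantifiers can backtrack heavily on input that does not match.

```python
import re

# (?:(?!&#34;).)* consumes one character at a time, but only while the text
# ahead is not the encoded quote, so the match cannot run past the href value.
HREF_ONLY = re.compile(
    r"&#60;a href&#61;&#34;((?:(?!&#34;).)*)&#34; ?&#62;",
    re.DOTALL,  # let . match newlines, like the /s flag in the answer
)


def encoded_href(text):
    """Return the href value of an encoded href-only <a> tag, or None."""
    match = HREF_ONLY.search(text)
    return match.group(1) if match else None
```

Because the repeated group can never step over &#34;, backtracking only shortens the candidate href value, and every shorter candidate fails immediately at the next literal &#34;.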