Replacing text not contained in a tag using Regex or XmlParser

Posted on 2024-10-05 07:56:28

I know that using regular expressions to parse or manipulate HTML/XML is a bad idea, and I usually would never do it. But I am considering it for lack of alternatives.

I need to replace text inside a string that is not already part of a tag (ideally a span tag with specific id) using C#.

For example, let's say I want to replace all instances of ABC in the following text that are not inside a span with alternate text (another span in my case):

ABC at start of line or ABC here must be replaced but, <span id="__publishingReusableFragment" >ABC inside span must not be replaced with anything. Another ABC here </span> this ABC must also be replaced

I tried using regex with both look ahead and look behind assertion. Various combinations along the lines of

string regexPattern = "(?<!id=\"__publishingReusableFragment\").*?" + stringToMatch + ".*?(?!span)";

but gave up on that.

I tried loading it into an XElement and trying to create a writer from there and getting text not inside of a node. But couldn't figure that out either.

XElement xel = XElement.Parse("<payload>" + inputString + @"</payload>");
XmlWriter requiredWriter = xel.CreateWriter();

I am hoping to somehow use the writer to get the strings that are not part of a node and replace them.
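
The XElement idea above can be sketched without a writer, assuming the input is well-formed XML once wrapped in a root element and that the replacement is plain text. The class and method names (SpanAwareReplacer, ReplaceOutsideElements) are hypothetical, chosen only for illustration. The sketch touches only text nodes that are direct children of the wrapper, so text inside any child element, including the span, is left alone.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SpanAwareReplacer
{
    // Hypothetical helper: replaces `target` only in text that sits directly
    // under the wrapper element, i.e. outside any child element such as <span>.
    public static string ReplaceOutsideElements(string input, string target, string replacement)
    {
        // Wrap the fragment so it has a single root element.
        // Assumption: the fragment is well-formed XML; Parse throws otherwise.
        XElement root = XElement.Parse(
            "<payload>" + input + "</payload>", LoadOptions.PreserveWhitespace);

        // Nodes() yields direct children only, so text inside <span> is skipped.
        foreach (XText text in root.Nodes().OfType<XText>().ToList())
        {
            text.Value = text.Value.Replace(target, replacement);
        }

        // Serialize the children back to a string, leaving out the wrapper.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Two things to keep in mind: the serializer may normalize insignificant whitespace inside tags, and because the replacement is assigned to a text node, any markup in it would be escaped. Inserting a new span instead would mean splitting the text node and adding an XElement between the pieces.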

Basically I am open to any suggestions/solutions to solve this problem.

Thanks in advance for the help.

Comments (2)

隔岸观火 2024-10-12 07:56:28
resultString = Regex.Replace(subjectString, 
    @"(?<!              # assert that we can't match the following 
                        # before the current position: 
                        # An opening span tag with specified id
     <\s*span\s*id=""__publishingReusableFragment""\s*>
     (?:                # if it is not followed by...
      (?!<\s*/\s*span)  # a closing span tag
      .                 # at any position between the opening tag
     )*                 # and our text
    )                   # End of lookbehind assertion
    ABC                 # Match ABC", 
    "XYZ", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

will work, though all the caveats about HTML parsing (which you seem to know, so I won't repeat them here) still apply.

The regex matches ABC if it's not preceded by an opening <span id="__publishingReusableFragment"> tag, or if there is a closing </span> tag between that opening tag and ABC. It will obviously fail if there can be nested <span> tags.
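
To make the nesting caveat concrete, here is a small sketch that condenses the pattern above onto one line (same pattern, with the inline comments and RegexOptions.IgnorePatternWhitespace dropped). The class name NestedSpanDemo is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedSpanDemo
{
    // The pattern from the answer above, written without inline comments.
    private const string Pattern =
        @"(?<!<\s*span\s*id=""__publishingReusableFragment""\s*>(?:(?!<\s*/\s*span).)*)ABC";

    // Replaces every ABC the pattern considers to be outside the span with XYZ.
    public static string Replace(string input)
    {
        return Regex.Replace(input, Pattern, "XYZ", RegexOptions.Singleline);
    }
}
```

With flat markup the ABC inside the span is left alone. With a nested span, the inner </span> stops the lookbehind from reaching back to the outer opening tag, so an ABC that follows the nested span but is still inside the outer one gets replaced.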

同尘 2024-10-12 07:56:28

I know it's slightly ugly, but this will work:

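// Requires: using System; using System.Linq;
var s =
    @"ABC at start of line or ABC here must be replaced but, <span id=""__publishingReusableFragment"" >ABC inside span must not be replaced with anything. Another ABC here </span> this ABC must also be replaced";
// Split on the closing tag; assuming spans are not nested, each piece
// then contains at most one "<span".
var newS = string.Join("</span>", s.Split(new[] { "</span>" }, StringSplitOptions.None)
    .Select(t =>
        {
            // bits[0] is the text before any "<span" in this piece, i.e. outside a span.
            var bits = t.Split(new[] { "<span" }, StringSplitOptions.None);
            bits[0] = bits[0].Replace("ABC", "DEF");
            return string.Join("<span", bits);
        }));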