Replace <p> and <div> tags inside <td> tags?

Posted 2024-07-27 15:45:03

I'm working on a specialized HTML stripper. The current stripper replaces <td> tags with tabs, then replaces <p> and <div> tags with double carriage returns. However, when stripping code like this:

<td>First Text</td><td style="background:#330000"><p style="color:#660000;text-align:center">Some Text</p></td>

It (obviously) produces

First Text

Some Text

We'd like to have the <p> replaced with nothing in this case, so it produces:

First Text (tab) Some Text

However, we'd like to keep the double carriage-return replacement for other code where the <p> tag is not surrounded by <td> tags.

Basically, we're trying to replace <td> tags with \t always and <p> and <div> tags with \r\r ONLY when they're not surrounded by <td> tags.

Current code: (C#)

  // insert tabs in places of <TD> tags
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\t",
           System.Text.RegularExpressions.RegexOptions.IgnoreCase);  

  // insert line paragraphs (double line breaks) in place
  // of <P>, <DIV> and <TR> tags
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\r\r",
           System.Text.RegularExpressions.RegexOptions.IgnoreCase);

(there's more code to the stripper; this is the relevant part)

Any ideas on how to do this without completely rewriting the entire stripper?

EDIT:
I'd prefer to not use a library due to the headaches of getting it signed off on and included with the project (which itself is a library to be included in another project), not to mention the legal issues. If there is no other solution, though, I'll probably use the HTML Agility Pack.

Mostly, the stripper just strips out anything it finds that looks like a tag (done with a large regex based on a regex in Regular Expressions Cookbook). That, replacing line break tags with \r, and dealing with multiple tabs make up the bulk of the custom stripping code.

羁绊已千年 2024-08-03 15:45:03

Have you considered looking into the HTML Agility Pack? It has a lot of parsing options built in for manipulating tags.

伪心 2024-08-03 15:45:03

Found the answer:

  // remove p/div/tr inside of td's (run before the <td> and <p>/<div>/<tr> replacements)
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>.*?</td\b(?:[^>""']|""[^""]*""|'[^']*')*>",
           new MatchEvaluator(RemoveTagsWithinTD),
           RegexOptions.IgnoreCase | RegexOptions.Singleline); // Singleline: .*? may span line breaks

This code calls this separate method for each match:

  //a separate method
  private static string RemoveTagsWithinTD(Match matchResult) {
      return Regex.Replace(matchResult.Value, @"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "",
                           RegexOptions.IgnoreCase);
  }

This code was (again) based on another recipe from the Regular Expressions Cookbook (which was sitting in front of me the whole time, d'oh!). It's really a great book.
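
For anyone wiring this into the replacements from the question, here is a minimal sketch of how the three passes might fit together. The class name `TdAwareStripper`, the method `Strip`, and the constant `TagTail` are made up for illustration, and the final `<[^>]+>` pass is only a crude stand-in for the larger tag-stripping regex mentioned in the question, which isn't shown in the thread. The ordering matters: the pass over `<td>...</td>` spans has to run before the `<td>` tags are turned into tabs, otherwise there is nothing left for it to match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; not part of the original stripper.
public static class TdAwareStripper
{
    // Attribute-tolerant remainder of a tag, as used by every pattern in the thread.
    private const string TagTail = @"\b(?:[^>""']|""[^""]*""|'[^']*')*>";

    // IgnoreCase as in the question's code; Singleline lets .*? cross line breaks.
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.Singleline;

    public static string Strip(string html)
    {
        // 1. Inside each <td>...</td> span, drop opening <p>/<div>/<tr> tags.
        string result = Regex.Replace(
            html,
            "<td" + TagTail + ".*?</td" + TagTail,
            m => Regex.Replace(m.Value, "<(div|tr|p)" + TagTail, "", Options),
            Options);

        // 2. Every opening <td> becomes a tab.
        result = Regex.Replace(result, "<td" + TagTail, "\t", Options);

        // 3. Any <p>/<div>/<tr> still present was outside a cell: double carriage return.
        result = Regex.Replace(result, "<(div|tr|p)" + TagTail, "\r\r", Options);

        // 4. ASSUMPTION: crude stand-in for the real stripper's generic tag removal.
        return Regex.Replace(result, "<[^>]+>", "");
    }

    public static void Main()
    {
        string cells = "<td>First Text</td><td style=\"background:#330000\">"
                     + "<p style=\"color:#660000;text-align:center\">Some Text</p></td>";
        // Escape control characters so the result is visible on the console.
        Console.WriteLine(Strip(cells).Replace("\t", "\\t").Replace("\r", "\\r"));
    }
}
```

Each `<td>` becomes a tab, so the first cell gets a leading tab as well; presumably the existing tab handling in the stripper takes care of that. Because `.*?` stops at the first `</td>`, this approach will likely misbehave on nested tables, which is where a real parser such as the HTML Agility Pack would be the safer choice.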
