Replacing <p> tags inside <td> tags?

Posted 2024-07-27 15:45:03

I'm working on a specialized HTML stripper. The current stripper replaces <td> tags with tabs then <p> and <div> tags with double carriage-returns. However, when stripping code like this:

<td>First Text</td><td style="background:#330000"><p style="color:#660000;text-align:center">Some Text</p></td>

It (obviously) produces

First Text

Some Text

We'd like to have the <p> replaced with nothing in this case, so it produces:

First Text (tab) Some Text

However, we'd like to keep the double carriage-return replacement for other code where the <p> tag is not surrounded by <td> tags.

Basically, we're trying to replace <td> tags with \t always and <p> and <div> tags with \r\r ONLY when they're not surrounded by <td> tags.

Current code: (C#)

  // insert tabs in places of <TD> tags
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\t",
           System.Text.RegularExpressions.RegexOptions.IgnoreCase);  

  // insert line paragraphs (double line breaks) in place
  // of <P>, <DIV> and <TR> tags
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\r\r",
           System.Text.RegularExpressions.RegexOptions.IgnoreCase);

(there's more code to the stripper; this is the relevant part)

Any ideas on how to do this without completely rewriting the entire stripper?

EDIT:
I'd prefer to not use a library due to the headaches of getting it signed off on and included with the project (which itself is a library to be included in another project), not to mention the legal issues. If there is no other solution, though, I'll probably use the HTML Agility Pack.

Mostly, the stripper just strips out anything it finds that looks like a tag (done with a large regex based on one from the Regular Expressions Cookbook). That, replacing line break tags with \r, and dealing with multiple tabs make up the bulk of the custom stripping code.

Comments (4)

羁绊已千年 2024-08-03 15:45:03

Have you thought about looking into the HTML Agility Pack? It has a lot of parsing options built in that you could use to manipulate tags.

伪心 2024-08-03 15:45:03

Found the answer:

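  // remove opening p/div/tr tags inside of td's (run before the <td> -> tab step);
  // Singleline lets .*? span line breaks, IgnoreCase matches the other replacements
  result = System.Text.RegularExpressions.Regex.Replace(result,
           @"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>.*?</td\b(?:[^>""']|""[^""]*""|'[^']*')*>",
           new MatchEvaluator(RemoveTagsWithinTD),
           RegexOptions.IgnoreCase | RegexOptions.Singleline);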

This code calls this separate method for each match:

  // a separate method: strips opening p/div/tr tags from one matched <td>...</td> block
  private static string RemoveTagsWithinTD(Match matchResult) {
      return Regex.Replace(matchResult.Value,
          @"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "",
          RegexOptions.IgnoreCase);
  }

This code was (again) based on another recipe from the Regular Expressions Cookbook (which was sitting in front of me the whole time, d'oh!). It's really a great book.
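
For anyone landing here later, this is a minimal sketch of how the pieces from the question and this answer might fit together in order. The class name HtmlStripSketch, the Strip method, the shared TagTail constant, and the final catch-all tag removal are my own stand-ins (the real stripper uses a larger regex that isn't shown in the thread), so treat it as an illustration under those assumptions rather than the actual code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripSketch
{
    // Attribute-tolerant tail used by every tag pattern in the thread.
    private const string TagTail = @"\b(?:[^>""']|""[^""]*""|'[^']*')*>";

    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.Singleline;

    public static string Strip(string html)
    {
        // 1. Inside each <td>...</td> block, drop opening <p>/<div>/<tr> tags.
        //    This has to run while the <td> tags are still present.
        string result = Regex.Replace(
            html,
            "<td" + TagTail + ".*?</td" + TagTail,
            m => Regex.Replace(m.Value, "<(div|tr|p)" + TagTail, "", Options),
            Options);

        // 2. Replace opening <td> tags with tabs.
        result = Regex.Replace(result, "<td" + TagTail, "\t", Options);

        // 3. Replace the remaining opening <p>/<div>/<tr> tags with "\r\r".
        result = Regex.Replace(result, "<(div|tr|p)" + TagTail, "\r\r", Options);

        // 4. Hypothetical stand-in for the rest of the stripper:
        //    remove anything else that looks like a tag.
        result = Regex.Replace(result, "</?[a-z][a-z0-9]*" + TagTail, "", Options);

        return result;
    }

    public static void Main()
    {
        string html = "<td>First Text</td><td style=\"background:#330000\">" +
                      "<p style=\"color:#660000;text-align:center\">Some Text</p></td>";

        // Make the control characters visible when printing.
        Console.WriteLine(Strip(html).Replace("\t", "\\t").Replace("\r", "\\r"));
        // prints: \tFirst Text\tSome Text
    }
}
```

Note that the non-greedy `.*?</td>` stops at the first closing `</td>`, so nested tables inside a cell would not be handled correctly by this approach.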

美羊羊 2024-08-03 15:45:03

I don't have an answer as far as writing it with Regular Expressions, but I'd highly recommend the HTML Agility Pack for something like this. You should be able to find the nodes easily with a simple selector and just replace them with whatever you want.

狂之美人 2024-08-03 15:45:03

So, if you can't use the Agility Pack: what if you created a simple match that checks for the existence of the <td> block? If it exists, you can do all the proper replacements for tags within the block; otherwise, use a second set of replacements that works for tags not within the block.

There's no need to rewrite the existing replacements; just create one more simple one for your other condition. I guess this would depend on how much text is getting parsed in one "unit" of HTML stripping.
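
One way to read this suggestion is a single pass that decides per tag whether it is currently inside a <td>. The sketch below is a variation I'm assuming, not code from the thread: the TdAwareReplacer class and its Replace method are hypothetical names, and it tracks <td> nesting with a counter instead of matching whole <td>...</td> blocks. Unlike the code in the question, it also removes the closing td/p/div/tr tags itself and leaves every other tag for the rest of the stripper.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TdAwareReplacer
{
    // Matches opening or closing td/div/tr/p tags, tolerating quoted attributes.
    private static readonly Regex TagPattern = new Regex(
        @"<(?<close>/?)(?<name>td|div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>",
        RegexOptions.IgnoreCase);

    public static string Replace(string html)
    {
        int tdDepth = 0; // number of <td> elements currently open

        return TagPattern.Replace(html, m =>
        {
            bool isClose = m.Groups["close"].Length > 0;
            bool isTd = m.Groups["name"].Value
                .Equals("td", StringComparison.OrdinalIgnoreCase);

            if (isTd)
            {
                if (isClose)
                {
                    if (tdDepth > 0) tdDepth--;
                    return "";
                }
                tdDepth++;
                return "\t"; // <td> always becomes a tab
            }

            if (isClose) return ""; // drop </p>, </div>, </tr>

            // Opening <p>/<div>/<tr>: nothing inside a cell, "\r\r" elsewhere.
            return tdDepth > 0 ? "" : "\r\r";
        });
    }

    public static void Main()
    {
        string html = "<p>Intro</p><td>First Text</td><td><p>Some Text</p></td>";

        // Make the control characters visible when printing.
        Console.WriteLine(Replace(html).Replace("\t", "\\t").Replace("\r", "\\r"));
        // prints: \r\rIntro\tFirst Text\tSome Text
    }
}
```

Because the counter follows nesting, a <p> that comes after a nested table but still inside the outer cell is treated as inside a <td>, which the block-matching regex would miss. It does assume the <td> tags are reasonably balanced.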
