I'm working on a specialized HTML stripper. The current stripper replaces <td> tags with tabs, then replaces <p> and <div> tags with double carriage returns. However, when stripping code like this:
<td>First Text</td><td style="background:#330000"><p style="color:#660000;text-align:center">Some Text</p></td>
It (obviously) produces
First Text
Some Text
We'd like to have the <p> replaced with nothing in this case, so it produces:
First Text (tab) Some Text
However, we'd like to keep the double carriage-return replacement for other code where the <p> tag is not surrounded by <td> tags.
Basically, we're trying to replace <td> tags with \t always and <p> and <div> tags with \r\r ONLY when they're not surrounded by <td> tags.
Current code: (C#)
// insert tabs in places of <TD> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<td\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\t",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// insert line paragraphs (double line breaks) in place
// of <P>, <DIV> and <TR> tags
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<(div|tr|p)\b(?:[^>""']|""[^""]*""|'[^']*')*>", "\r\r",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
(there's more code to the stripper; this is the relevant part)
Any ideas on how to do this without completely rewriting the entire stripper?
EDIT:
I'd prefer to not use a library due to the headaches of getting it signed off on and included with the project (which itself is a library to be included in another project), not to mention the legal issues. If there is no other solution, though, I'll probably use the HTML Agility Pack.
Mostly, the stripper just strips out anything it finds that looks like a tag (done with a large regex based on a regex in Regular Expressions Cookbook). That, replacing line break tags with \r, and dealing with multiple tabs make up the bulk of the custom stripping code.
Comments (4)
Have you thought about looking into the HTML Agility Pack? It has a lot of parsing options built in that you could use to manipulate tags.
Found the answer:
This code calls this separate method for each match:
This code was (again) based on another recipe from the Regular Expressions Cookbook (which was sitting in front of me the whole time, d'oh!). It's really a great book.
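The code from this answer did not survive in this copy of the thread. What follows is only a minimal sketch of the approach the answer describes (a Regex.Replace call that invokes a separate method for each match), not the poster's original code. The names TdAwareStripper, TagTail, RemoveBlockTagsInCell and Strip are hypothetical, and the sketch assumes table cells are not nested.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; a sketch only, not the code from the original answer.
public static class TdAwareStripper
{
    // Attribute-tolerant tag tail, same shape as the patterns in the question.
    private const string TagTail = @"\b(?:[^>""']|""[^""]*""|'[^']*')*>";

    // One whole cell: opening <td ...>, lazily matched content, closing </td>.
    // Assumption: cells are not nested (a nested table would end the match early).
    private static readonly Regex TdBlock = new Regex(
        @"<td" + TagTail + @".*?</td\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    private static readonly Regex TdOpen = new Regex(
        @"<td" + TagTail, RegexOptions.IgnoreCase);

    private static readonly Regex BlockTag = new Regex(
        @"<(div|tr|p)" + TagTail, RegexOptions.IgnoreCase);

    // Called once per matched cell: drop opening <p>/<div>/<tr> tags inside it.
    private static string RemoveBlockTagsInCell(Match cell)
    {
        return BlockTag.Replace(cell.Value, string.Empty);
    }

    public static string Strip(string html)
    {
        // Step 1: neutralise block tags that sit inside a <td>...</td> pair.
        string result = TdBlock.Replace(html, new MatchEvaluator(RemoveBlockTagsInCell));
        // Step 2: the two replacements from the question, unchanged in spirit.
        result = TdOpen.Replace(result, "\t");
        result = BlockTag.Replace(result, "\r\r");
        return result;
    }
}
```

Closing tags such as </td> and </p> are left in place here on the assumption that the general tag-stripping regex mentioned in the question removes them afterwards.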
I don't have an answer as far as writing it with Regular Expressions, but I'd highly recommend the HTML Agility Pack for something like this. You should be able to find the nodes easily with a simple selector and just replace them with whatever you want.
So, if you can't use the Agility Pack, what if you created a simple match that checks for the existence of the block? If it exists, you can do all the proper replacements for tags within the block; otherwise, use a second set of replacements that works for tags not within the block.
There is no need to rewrite the existing replacements; just create one more simple one for your other condition. I guess this would depend on how much text is getting parsed in one "unit" of HTML stripping.
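One way to read this suggestion is sketched below, under the assumption that each "unit" passed to the stripper is either table markup or ordinary markup, not a mix of both. The names UnitStripper and StripUnit are hypothetical. The sketch checks whether the unit contains any <td> tag and then picks the replacement for <p> and <div> accordingly; <tr> keeps its double carriage return in both cases.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the "two sets of replacements" idea.
public static class UnitStripper
{
    // Attribute-tolerant tag tail, same shape as the patterns in the question.
    private const string TagTail = @"\b(?:[^>""']|""[^""]*""|'[^']*')*>";

    private static readonly Regex TdOpen = new Regex(
        @"<td" + TagTail, RegexOptions.IgnoreCase);
    private static readonly Regex ParagraphTag = new Regex(
        @"<(div|p)" + TagTail, RegexOptions.IgnoreCase);
    private static readonly Regex RowTag = new Regex(
        @"<tr" + TagTail, RegexOptions.IgnoreCase);

    public static string StripUnit(string unit)
    {
        // Assumption: a unit that contains any <td> is treated as table markup
        // throughout, so <p>/<div> become nothing instead of "\r\r".
        bool hasCells = TdOpen.IsMatch(unit);
        string paragraphReplacement = hasCells ? string.Empty : "\r\r";

        string result = TdOpen.Replace(unit, "\t");
        result = ParagraphTag.Replace(result, paragraphReplacement);
        return RowTag.Replace(result, "\r\r");
    }
}
```

If a single unit can mix table cells with free-standing paragraphs, this per-unit switch is too coarse, and matching each <td>...</td> block individually would likely be the safer choice.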