I have a bunch of HTML files (5000).
My business requirements define a reference format, let's say it's XXX-YY(Year)-ZZZ.
I want to replace, in all HTML files, every occurrence of that format with a link like this:
<a href='~/app/document/XXX-YY(Year)-ZZZ'>XXX-YY(Year)-ZZZ</a>
While this sounds "simple" with a standard regex replace, it's actually more difficult than I thought, because the process can run multiple times over the same files.
My current process "nests" the replacements and produces something like this:
<a href='~/app/document/<a href='~/app/document/XXX-YY(Year)-ZZZ>XXX-YY(Year)-ZZZ</a>><a href='~/app/document/XXX-YY(Year)-ZZZ>XXX-YY(Year)-ZZZ</a></a>
How can I reach my goal?
PS: performance is not an issue (at least as long as it stays reasonable)
All you need is the HTML Agility Pack.
Check this one: c# html agility pack, and plenty of other questions about it here on SO ;-)
This is because you are better off using a parser with a solid understanding of the HTML tree, not just regex or text parsing, which may fail depending on the specific markup...
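
To make the idea concrete, here is a minimal sketch of the parser-based approach. It is not HTML Agility Pack code: it uses Python's standard `html.parser` purely to illustrate the technique, and the reference pattern (`ABC-12(2011)-345` style) and the names `REFERENCE`, `ReferenceLinker` and `link_references` are assumptions made up for this example. The point is that only text outside an existing `<a>` element gets rewritten, so running the process again should leave already-linked references alone.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern standing in for XXX-YY(Year)-ZZZ, e.g. ABC-12(2011)-345.
REFERENCE = re.compile(r"[A-Z]{3}-\d{2}\(\d{4}\)-\d{3}")
# Text inside these elements is never rewritten.
SKIP_TAGS = {"a", "script", "style"}


class ReferenceLinker(HTMLParser):
    """Re-emits the markup as parsed, linking references found in plain text."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <a>, <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        # Emit the raw tag text, so attribute values (href included) are untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Only text that is not already inside a link is rewritten.
        if self.skip_depth == 0:
            data = REFERENCE.sub(
                lambda m: f"<a href='~/app/document/{m.group(0)}'>{m.group(0)}</a>",
                data,
            )
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def link_references(html):
    """Return html with every unlinked reference wrapped in an <a> element."""
    parser = ReferenceLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `link_references` on its own output should return the same string, which is the property the question is after. With HTML Agility Pack the equivalent would presumably be to select only the text nodes that have no `a` ancestor (an XPath along the lines of `//text()[not(ancestor::a)]`) and replace the matches inside those nodes; treat that as a direction to explore rather than tested code.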