Parsing Dreamweaver templates with regular expressions
I have a requirement to parse the content out of Dreamweaver templates.
I'm using C#.
Here is some example content that I will need to parse.
<div id="myDiv">
<h1><!-- InstanceBeginEditable name="PageHeading" -->
The Heading<!-- InstanceEndEditable --></h1>
<!-- InstanceBeginEditable name="PageContent" -->
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed nibh turpis,
sagittis vitae convallis at, fringilla nec augue.</p>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Sed nibh turpis, sagittis vitae convallis at, fringilla nec augue.</p>
<!-- InstanceEndEditable -->
</div><!-- END #myDiv-->
Dreamweaver templates are based around HTML comments with specific strings denoting their purpose.
The key ones for me are as follows, as they denote the start and end of editable regions in the page.
<!-- InstanceBeginEditable name="xxxxxx" -->
<!-- InstanceEndEditable -->
As you can see from my example HTML, there may be other comments in the source code.
So starting simple, I have the following, which matches all the opening Editable region tags.
<!-- InstanceBeginEditable(.*)?-->
So next I want to get everything between there and the next closing marker. This is what I tried:
<!-- InstanceBeginEditable(.*)?-->(?<content>(.*)?)<!-- InstanceEnd
It does not match what I need. Can you tell me why? I would have thought a non-greedy capture (.*)? between my already working code and the literal
<!—InstanceEnd
would have matched what I need...
Comments (2)
You don't want to put parentheses around .* here.
(.*)? means: grab everything greedily, or grab nothing.
.*? means: grab everything lazily.
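To see the difference between the two forms, here is a minimal C# sketch. The class name QuantifierDemo, the one-line sample string and the two pattern constants are made up for illustration; only the quantifier behaviour comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical sample: two editable regions on a single line.
    public const string Sample =
        "<!-- InstanceBeginEditable name=\"A\" -->one<!-- InstanceEndEditable -->" +
        "<!-- InstanceBeginEditable name=\"B\" -->two<!-- InstanceEndEditable -->";

    // (.*)? is a greedy .* inside an optional group: it runs to the LAST end marker.
    public const string OptionalGreedy =
        "<!-- InstanceBeginEditable[^-]*-->(?<content>(.*)?)<!-- InstanceEnd";

    // .*? is a lazy quantifier: it stops at the FIRST end marker.
    public const string Lazy =
        "<!-- InstanceBeginEditable[^-]*-->(?<content>.*?)<!-- InstanceEnd";

    // Returns the "content" group of the first match, or an empty string if there is none.
    public static string FirstContent(string pattern)
    {
        Match m = Regex.Match(Sample, pattern);
        return m.Success ? m.Groups["content"].Value : "";
    }
}
```

Calling QuantifierDemo.FirstContent(QuantifierDemo.Lazy) should give just the first region's text, while the OptionalGreedy pattern swallows everything up to the last end marker, including the second begin marker.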
Also, in your regex, the ending token has only one - where it needs two. Change it to use two.
By the way, it's dangerous to have two .*s in a regex without an atomic group. On unexpected data, you can get catastrophic backtracking. I'd recommend changing the first .*? to [^-]*.
And, while I'm at it, I'd suggest handling whitespace inside the comment markers more forgivingly.
You probably already know this, but let me add that with .NET, you'll need to use RegexOptions.Singleline so that . also matches newlines.
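Putting those suggestions together, something along these lines might work in C#. This is only a sketch, not the answerer's original pattern: EditableRegionParser, Extract and RegionPattern are hypothetical names, the "name" capture group is added for illustration, and it assumes the name attribute is always wrapped in double quotes. It uses [^"]* for the name, which plays the same role as the [^-]* suggested above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EditableRegionParser
{
    // A negated character class for the name avoids a second .* in the pattern.
    // \s* and \s+ tolerate variable whitespace inside the comment markers.
    // The lazy .*? stops at the first closing marker.
    // Singleline lets . match newlines, so a region can span several lines.
    private static readonly Regex RegionPattern = new Regex(
        @"<!--\s*InstanceBeginEditable\s+name=""(?<name>[^""]*)""\s*-->" +
        @"(?<content>.*?)" +
        @"<!--\s*InstanceEndEditable\s*-->",
        RegexOptions.Singleline);

    // Returns a map from region name to the trimmed markup inside that region.
    public static Dictionary<string, string> Extract(string html)
    {
        var regions = new Dictionary<string, string>();
        foreach (Match m in RegionPattern.Matches(html))
        {
            regions[m.Groups["name"].Value] = m.Groups["content"].Value.Trim();
        }
        return regions;
    }
}
```

On the sample from the question, EditableRegionParser.Extract(html) should return two entries, PageHeading and PageContent, and skip unrelated comments such as the END #myDiv one. If a region name is repeated, the later match overwrites the earlier one in this sketch.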
Use the HTML Agility Pack; see my answer to "How do I parse HTML using regular expressions in C#?"