Using regular expressions to remove one set of tags from within another
I've got a big XML file I'm editing with BBEdit.
Within the XML file, which is a digital recreation of an old diary, is text that is enclosed in note tags.
<note>Example of a note.</note>
Some note tags, however, have quotations enclosed in quote tags nested in them.
<note>Example of a note, but <quote>"here is a quotation within the note"</quote></note>
I need to remove all instances of quote from the note tags, whilst keeping the actual content of the quote tags. So the example would become:
<note>Example of a note, but "here is a quotation within the note"</note>
I've used GREP in BBEdit to successfully remove some of these, but I'm beginning to get stuck with the more complicated note tags that go over several lines or have text between the two different sets of tags. For example:
<note>Example of a note, <quote>"with a quotation"</quote> and a <quote>"second quotation"</quote> along with some text outside of the quotation before the end of the note.</note>
Some quotations can go on for over 10 lines. Using \r in my regex doesn't seem to help.
I should also say that quote tags can exist outside of note tags, which rules out the possibility of just bulk finding /?quote and deleting it. I still need to use the quote tags within the document, just not within note tags.
Many thanks for any help.
Comments (2)
This is really easy with XSLT:
Apply this stylesheet to your XML file with an XSLT processor of your choice. There are tools that operate on the command line, for example.
Without restrictions on how the XML is formed, I'm pretty sure that this goes beyond the scope of regular languages and into context-free ones, which means regular expressions are not going to help you. If the structure of the XML is simple (no nodes nested in nodes or quotes nested in quotes), you might be able to do something along the lines of a global replace of
<node>(!</node>)<quote>(!</quote>)</quote>(!</node>)</node>
with
<node>\1\2\3</node>
(with (!X) standing for "any text that does not contain X"; it is not literal regex syntax), but you're probably using the wrong tool for the job. As one of the other answers notes, XSLT could help you, or you could use an XML parsing library to write a simple program to strip out the tags you're looking for.
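To make the two routes in this answer concrete, here is a minimal Python sketch. It is not anything from the question or from BBEdit. It uses the note and quote tag names from the question, and the function names (strip_quotes_regex, strip_quotes_tree, unwrap) are made up for illustration. The regex version only holds under the "simple structure" assumption above (no note nested inside another note). The parser version uses the standard library's xml.etree.ElementTree and assumes the file is well-formed with a single root element.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match of one <note>...</note>; DOTALL lets "." cross line breaks,
# so notes and quotations that span many lines are still matched.
NOTE_SPAN = re.compile(r"<note>.*?</note>", re.DOTALL)
QUOTE_TAG = re.compile(r"</?quote>")


def strip_quotes_regex(text):
    """Drop <quote> tags inside each <note>. Assumes notes are never nested."""
    return NOTE_SPAN.sub(lambda m: QUOTE_TAG.sub("", m.group(0)), text)


def _append_text(parent, index, text):
    """Attach text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap(parent, tag):
    """Replace every descendant `tag` element of `parent` with its content."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag != tag:
            unwrap(child, tag)
            i += 1
            continue
        grandchildren = list(child)
        _append_text(parent, i, child.text)
        if grandchildren:
            # Text after the removed element now follows its last child.
            last = grandchildren[-1]
            last.tail = (last.tail or "") + (child.tail or "")
        else:
            _append_text(parent, i, child.tail)
        del parent[i]
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(i + offset, grandchild)
        # No increment: position i now holds unexamined content.


def strip_quotes_tree(xml_text):
    """Parse the document and unwrap <quote> only where it sits inside a <note>."""
    root = ET.fromstring(xml_text)
    for note in list(root.iter("note")):
        unwrap(note, "quote")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<diary><note>Example of a note, <quote>\"with a quotation\"</quote> "
        "and a <quote>\"second quotation\"</quote> at the end.</note>"
        "<quote>\"a quotation outside any note\"</quote></diary>"
    )
    print(strip_quotes_regex(sample))
    print(strip_quotes_tree(sample))
```

The regex version leaves every byte outside the note elements untouched, which matters if you want a minimal diff on a hand-edited file. The parser version copes with nesting, but ElementTree's default parser discards comments and re-serializes the whole document, so check the output before overwriting the original.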