How can I wrap a text pattern in an XML element unless it is already inside an XML element?
I have several thousand XML files generated from Java properties files and prepared for translation in the TTX format. They contain quite a few variables that I need to protect from the translators, who often break such things. The variables are numbers, or occasionally text, between a pair of curly braces, e.g. {0}, {this}.
I need to surround these variables with an XML element if they are not already inside an attribute value and not already part of the inner text of a ut element, like so:
<ut DisplayText="{0}"><{0}></ut>
My input looks like this:
<ut Type="start"DisplayText="string"><string></ut> text string {0}
<ut DisplayText="{1}"><{1}></ut> in:
<ut DisplayText="\n"><\n/></ut> {2}.
<ut Type="end" DisplayText="resource"></resource></ut>
The correct output should be this:
<ut Type="start"DisplayText="string"><string></ut> text string <ut DisplayText="{0}">{0}</ut>
<ut DisplayText="{1}"><{1}></ut> in:
<ut DisplayText="\n"><\n/></ut> <ut DisplayText="{2}">{2}</ut>.
<ut Type="end" DisplayText="resource"></resource></ut>
My initial approach was to use a regular expression to match the term in the braces and build the XML element around it with pattern substitution. This approach fails when the pattern already appears inside a ut element, as in the first code block above.
Previous find and replace patterns (in Notepad++):
Find
({[A-Za-z0-9]*})
Replace
<ut DisplayText="\1">\1</ut>
It is beginning to look like regex is not the right tool for the job, so I would like suggestions on better approaches, different tools, or even just a more complete regex that would let me solve this quickly and repeatably.
Update: The problem turned out to be a little more complex than previously envisioned. It seems there are also a couple more things that needed protecting, involving some rather obscure syntax, mixing variables with text in what appears to be some kind of conditional statement. From memory:
{o,choice|1#1 error|1<{0,number,integer} errors}
Here "error" and "errors" are translatable and should not be protected. The simplest solution we have at present is to run the above regex, fix the few errors it creates, and then run a couple more ordinary find-and-replace passes for the more complex items. This could be abstracted into a regex, but right now there is not much point in doing that.
I appreciate the pointers to XSLT and to other editors with better regex support, in addition to the improved expressions offered. I will try some of the options when time allows.
Comments (3)
Let me know if my assumption is wrong, but from your example it seems you want to change text that is in {} and not in a <ut> element. To me this seems like an easy use of XSLT. Simply output UT elements as they are and process any text in between.
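For anyone who would rather script this than write a stylesheet, here is a minimal Python sketch of the same idea: copy every ut element through unchanged and only touch the text between them. It is not XSLT and not a tested solution for TTX files; it assumes ut elements are never nested, that attribute values contain no ">" character, and that variables look like {0} or {this}. The names protect_variables, UT_ELEMENT and VARIABLE are made up for this example. It also uses + instead of the * in the question's pattern so that an empty {} is left alone, and it does not attempt the choice syntax from the update.

```python
import re

# A whole <ut ...>...</ut> element. Assumption: ut elements are never nested
# and no attribute value contains ">".
UT_ELEMENT = re.compile(r"(<ut\b[^>]*>.*?</ut>)", re.DOTALL)
# A simple variable such as {0} or {this}.
VARIABLE = re.compile(r"\{[A-Za-z0-9]+\}")


def protect_variables(text: str) -> str:
    """Wrap bare {variables} in <ut> elements, leaving existing <ut> elements alone."""
    parts = UT_ELEMENT.split(text)
    # Because the pattern has one capture group, split() returns the text
    # between elements at even indexes and the <ut> elements at odd indexes.
    for i in range(0, len(parts), 2):
        parts[i] = VARIABLE.sub(
            lambda m: f'<ut DisplayText="{m.group(0)}">{m.group(0)}</ut>',
            parts[i],
        )
    return "".join(parts)
```

Because already wrapped variables are skipped, running the function a second time over its own output should change nothing, which makes it reasonably safe to rerun over a batch of files.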
Why not try using the expression
(?<=.){[A-Za-z0-9]+}(?=.$)
This would match the {, one or more letters or digits, and the } when the pattern follows the tag and any number of spaces and is followed by any number of spaces and a line break.
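A related trick that does not depend on where the variable sits in the line is to let a single expression match either a complete ut element or a bare variable, and to rewrite only the second kind. The sketch below is a different technique from the lookaround expression above (alternation plus a replacement callback), written in Python for illustration. It assumes ut elements are not nested and that attribute values contain no ">"; TOKEN and wrap_bare_variables are hypothetical names.

```python
import re

# Either a complete <ut ...>...</ut> element (group "ut")
# or a bare variable like {0} or {this} (group "var").
# Assumption: ut elements are not nested and attributes contain no ">".
TOKEN = re.compile(
    r"(?P<ut><ut\b[^>]*>.*?</ut>)|(?P<var>\{[A-Za-z0-9]+\})",
    re.DOTALL,
)


def wrap_bare_variables(text: str) -> str:
    def replace(match: re.Match) -> str:
        if match.group("ut") is not None:
            # Already protected: emit the element unchanged.
            return match.group("ut")
        var = match.group("var")
        return f'<ut DisplayText="{var}">{var}</ut>'

    return TOKEN.sub(replace, text)
```

Since the regex engine scans left to right, any variable inside an existing ut element is consumed as part of that element before the second alternative gets a chance to match it.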
I ended up using the regex from the question and manually fixing the odd error it caused. It wasn't ideal, but it was quicker than trying to find the perfect solution.