How can I speed up my 'divide and conquer' XSLT template which replaces certain characters in a string?
UPDATE: I added an answer to this question which incorporates almost all the suggestions which have been given. The original template given in the code below needed 45605ms to process a real-world input document (English text about script programming). The revised template in the community wiki answer brought the runtime down to 605ms!
I'm using the following XSLT template for replacing a few special characters in a string with their escaped variants; it calls itself recursively using a divide-and-conquer strategy, eventually looking at every single character in a given string. It then decides whether the character should be printed as it is, or whether any form of escaping is necessary:
<xsl:template name="escape-text">
  <xsl:param name="s" select="."/>
  <xsl:param name="len" select="string-length($s)"/>
  <xsl:choose>
    <!-- Two or more characters: split in half and recurse on each half -->
    <xsl:when test="$len &gt;= 2">
      <xsl:variable name="halflen" select="round($len div 2)"/>
      <xsl:variable name="left">
        <xsl:call-template name="escape-text">
          <xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
          <xsl:with-param name="len" select="$halflen"/>
        </xsl:call-template>
      </xsl:variable>
      <xsl:variable name="right">
        <xsl:call-template name="escape-text">
          <xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
          <!-- the right half is one character shorter when $len is odd -->
          <xsl:with-param name="len" select="$len - $halflen"/>
        </xsl:call-template>
      </xsl:variable>
      <xsl:value-of select="concat($left, $right)"/>
    </xsl:when>
    <!-- At most one character: escape it if it is special -->
    <xsl:otherwise>
      <xsl:choose>
        <xsl:when test="$s = '&quot;'">
          <xsl:text>"\""</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '@'">
          <xsl:text>"@"</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '|'">
          <xsl:text>"|"</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '#'">
          <xsl:text>"#"</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '\'">
          <xsl:text>"\\"</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '}'">
          <xsl:text>"}"</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '&amp;'">
          <xsl:text>"&amp;"</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '^'">
          <xsl:text>"^"</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '~'">
          <xsl:text>"~"</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '/'">
          <xsl:text>"/"</xsl:text>
        </xsl:when>
        <xsl:when test="$s = '{'">
          <xsl:text>"{"</xsl:text>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$s"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
This template accounts for the majority of the runtime which my XSLT script needs. Replacing the above escape-text template with just
<xsl:template name="escape-text">
<xsl:param name="s" select="."/>
<xsl:value-of select="$s"/>
</xsl:template>
makes the runtime of my XSLT script go from 45 seconds to less than one second on one of my documents.
Hence my question: how can I speed up my escape-text template? I'm using xsltproc and I'd prefer a pure XSLT 1.0 solution. XSLT 2.0 solutions would be welcome too. External libraries might not be usable in this project, but I'd still be interested in any solutions using them.
Another (complementary) strategy would be to terminate the recursion early, before the string length is down to 1, if the condition
translate($s, $vChars, '') = $s
is true. This should give much faster processing of strings that contain no special characters at all, which is probably the majority of them. Of course, the results will depend on how efficient xsltproc's implementation of translate() is.
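The early-termination check can be sketched as follows. This is a minimal illustration, not the code from any of the answers: the definition of `$vChars` is an assumption (the special characters from the question collected into one string), and the single-character branch is condensed on the observation that the question's template wraps every special character in double quotes and additionally puts a backslash in front of `"` and `\`.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed definition: all characters that need escaping. -->
  <xsl:variable name="vChars">"@|#\}&amp;^~/{</xsl:variable>

  <xsl:template name="escape-text">
    <xsl:param name="s" select="."/>
    <xsl:param name="len" select="string-length($s)"/>
    <xsl:choose>
      <!-- Prune: deleting the special characters changes nothing,
           so this substring can be copied through unchanged. -->
      <xsl:when test="translate($s, $vChars, '') = $s">
        <xsl:value-of select="$s"/>
      </xsl:when>
      <!-- Still contains a special character: split and recurse. -->
      <xsl:when test="$len &gt;= 2">
        <xsl:variable name="halflen" select="floor($len div 2)"/>
        <xsl:call-template name="escape-text">
          <xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
          <xsl:with-param name="len" select="$halflen"/>
        </xsl:call-template>
        <xsl:call-template name="escape-text">
          <xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
          <xsl:with-param name="len" select="$len - $halflen"/>
        </xsl:call-template>
      </xsl:when>
      <!-- Exactly one special character is left. -->
      <xsl:otherwise>
        <xsl:text>"</xsl:text>
        <xsl:if test="$s = '&quot;' or $s = '\'">
          <xsl:text>\</xsl:text>
        </xsl:if>
        <xsl:value-of select="$s"/>
        <xsl:text>"</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:call-template name="escape-text"/>
  </xsl:template>
</xsl:stylesheet>
```

With this check in place, a long run of ordinary text costs one translate() call per recursion level instead of one template call per character.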
A very small correction improved the speed in my tests about 17 times.
There are additional improvements, but I guess this will suffice for now ... :)
Here is a further improved version, based on @Dimitre's answer:
Should be another tiny bit faster, but I have not benchmarked it. In any case, it's shorter. ;-)
For what it's worth, here's my current version of the escape-text template which incorporates most of the (excellent!) suggestions which people have given in response to my question. For the record, my original version took about 45605ms on average on my sample DocBook document. After that, the runtime was decreased in multiple steps:
Removing the left and right variables together with the concat() call brought the runtime down to 13052ms; this optimization was taken from Tomalak's answer.
A change to the <xsl:choose> element brought the runtime further down to 5812ms. This optimization was first suggested by Dimitre.
Replacing the <xsl:value-of select="concat('x', $s, 'y')"/> calls with <xsl:text>x</xsl:text><xsl:value-of select="$s"/><xsl:text>y</xsl:text> brought the runtime to about 606ms (so about 1% improvement).
In the end, the function took 606ms instead of 45605ms. Impressive!
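To make the concat() micro-optimization concrete, here is a small sketch of a helper that wraps a value in double quotes. The template name `quote-value` is hypothetical and chosen only for illustration; it is not taken from the community wiki answer.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical helper: emit $s surrounded by double quotes. -->
  <xsl:template name="quote-value">
    <xsl:param name="s" select="."/>
    <!-- Variant with concat(): builds a new string, then copies it out.
         <xsl:value-of select="concat('&quot;', $s, '&quot;')"/> -->
    <!-- Variant with xsl:text: writes the three parts straight to the
         output without building an intermediate string. -->
    <xsl:text>"</xsl:text>
    <xsl:value-of select="$s"/>
    <xsl:text>"</xsl:text>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:call-template name="quote-value"/>
  </xsl:template>
</xsl:stylesheet>
```

Both variants produce the same output; whether the second is measurably faster depends on the processor, and the answer above reports only about 1% with xsltproc.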
How about using EXSLT? The String functions in EXSLT have a function called replace. I think it is something that is supported by quite a few XSLT implementations.
Update: I fixed this to actually work; now, it is not a speedup!
Building off @Wilfred's answer...
After fiddling with the EXSLT replace() function, I decided it was interesting enough to post another answer, even if it's not useful to the OP. It may well be useful to others.
It's interesting because of the algorithm: instead of the main algorithm worked on here (doing a binary recursive search, dividing in half at each recursion, pruned whenever a 2^nth substring has no special characters in it, and iterating over a choice of special characters when a length=1 string does contain a special character), Jeni Tennison's EXSLT algorithm puts the iteration over a set of search strings on the outside loop. Therefore on the inside of the loop, it is only searching for one string at a time, and can use substring-before()/substring-after() to divide the string, instead of blindly dividing in half.
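The idea can be sketched in plain XSLT 1.0 as follows. This is an illustration of the approach rather than the EXSLT template itself: the `my:map` element, its `from`/`to` attributes, and the template name `replace-all` are all hypothetical, and only three of the rules from the question are listed. The text before a match is handed to the remaining rules only, so replacement text that has already been emitted is never escaped a second time.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="urn:example:replacements"
    exclude-result-prefixes="my">
  <xsl:output method="text"/>

  <!-- Hypothetical rule table embedded in the stylesheet. -->
  <my:map>
    <my:r from="\" to="&quot;\\&quot;"/>
    <my:r from="&quot;" to="&quot;\&quot;&quot;"/>
    <my:r from="@" to="&quot;@&quot;"/>
  </my:map>
  <xsl:variable name="all-rules" select="document('')/*/my:map/my:r"/>

  <xsl:template name="replace-all">
    <xsl:param name="s" select="."/>
    <xsl:param name="rules" select="$all-rules"/>
    <xsl:choose>
      <!-- No rules left: copy the text through. -->
      <xsl:when test="not($rules)">
        <xsl:value-of select="$s"/>
      </xsl:when>
      <!-- First rule matches: split around the first occurrence. -->
      <xsl:when test="contains($s, $rules[1]/@from)">
        <!-- Text before the match can only contain later search strings. -->
        <xsl:call-template name="replace-all">
          <xsl:with-param name="s" select="substring-before($s, $rules[1]/@from)"/>
          <xsl:with-param name="rules" select="$rules[position() &gt; 1]"/>
        </xsl:call-template>
        <xsl:value-of select="$rules[1]/@to"/>
        <!-- Text after the match may contain any search string again. -->
        <xsl:call-template name="replace-all">
          <xsl:with-param name="s" select="substring-after($s, $rules[1]/@from)"/>
          <xsl:with-param name="rules" select="$rules"/>
        </xsl:call-template>
      </xsl:when>
      <!-- First rule does not match: move on to the next rule. -->
      <xsl:otherwise>
        <xsl:call-template name="replace-all">
          <xsl:with-param name="s" select="$s"/>
          <xsl:with-param name="rules" select="$rules[position() &gt; 1]"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:call-template name="replace-all"/>
  </xsl:template>
</xsl:stylesheet>
```

Note that the recursion depth grows with the number of occurrences of a search string, so very long inputs with many matches may need a processor that handles deep recursion well.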
[Deprecated: I guess that's enough to speed it up significantly. My tests show a speedup of 2.94x over @Dimitre's most recent one (avg. 230ms vs. 676ms).] I was testing using Saxon 6.5.5 in the Oxygen XML profiler. As input I used a 7MB XML document that was mostly a single text node, created from web pages about javascript, repeated. It sounds to me like that is representative of the task that the OP was trying to optimize. I'd be interested to hear what results others get with their test data and environments.
Dependencies
This uses an XSLT implementation of replace which relies on exsl:node-set(). It looks like xsltproc supports this extension function (possibly an early version of it). So this may work out-of-the-box for you, @Frerich; and for other processors, as it did with Saxon.
However if we want 100% pure XSLT 1.0, I think it would not be too hard to modify this replace template to work without exsl:node-set(), as long as the 2nd and 3rd params are passed in as nodesets, not RTFs.
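For readers unfamiliar with the distinction, here is a minimal sketch of the RTF versus node-set issue. The variable name `search-rtf` and the `c` elements are made up for the example; `exsl:node-set()` is the EXSLT Common function and is only available on processors that implement it.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    extension-element-prefixes="exsl">
  <xsl:output method="text"/>

  <!-- A variable with element content is a result tree fragment (RTF)
       in XSLT 1.0; path expressions such as $search-rtf/c are not
       allowed on it. -->
  <xsl:variable name="search-rtf">
    <c>@</c>
    <c>#</c>
  </xsl:variable>

  <xsl:template match="/">
    <!-- exsl:node-set() converts the RTF into a node-set that can be
         navigated like any other tree. -->
    <xsl:for-each select="exsl:node-set($search-rtf)/c">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

A pure XSLT 1.0 alternative is to store the data as a top-level element in its own namespace and read it with document(''), which yields a real node-set without any extension function.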
Here is the code I used, which calls the replace template. Most of the length is taken up with the verbose way I created search/replace nodesets... that could probably be shortened. (But you can't make the search or replace nodes attributes, as the replace template is currently written. You'll get an error about trying to put attributes under the document element.)
The imported stylesheet was originally this one.
However, as @Frerich pointed out, that never gave the correct output!
That ought to teach me not to post performance figures without checking for correctness!
I can see in a debugger where it's going wrong, but I don't know whether the EXSLT template never worked, or if it just doesn't work in Saxon 6.5.5... either option would be surprising.
In any case, EXSLT's str:replace() is specified to do more than we need, so I modified it so as to
Here is the modified replace template:
One of the side benefits of this simpler template is that you could now use attributes for the nodes of your search and replace parameters. This would make the <foo:replacements> data more compact and easier to read IMO.
Performance: With this revised template, the job gets done in about 2.5s, vs. my 0.68s for my recent tests of the leading competitor, @Dimitre's XSLT 1.0 stylesheet. So it's not a speedup. But again, others have had very different test results than I have, so I'd like to hear what others get with this stylesheet.
@Frerich-Raabe published a community wiki answer which combines the suggestions so far and achieves (on his data) a speedup of 76 times. Big congratulations to everybody!
I couldn't resist going further:
This transformation achieves (on my data) a further speedup of 1.5 times. So the total speedup should be more than 100 times.
OK, I'll chip in. Though not as interesting as optimizing the XSLT 1.0 version, you did say that XSLT 2.0 solutions are welcome, so here's mine.
This just uses a regexp replace() to replace \ or " with "\\" or "\"" respectively, composed with another regexp replace() to surround any of the other escapable characters with quotes.
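A minimal sketch of such a two-step replace() in XSLT 2.0 might look like the following. The function name `f:escape-text` and its namespace are assumptions made for the example, and the regular expressions are my own reconstruction from the character list in the question, not the answer's original code.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:escape"
    exclude-result-prefixes="xs f">
  <xsl:output method="text"/>

  <!-- Hypothetical function name; any non-null namespace works. -->
  <xsl:function name="f:escape-text" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <!-- Step 1: \ and " become "\\" and "\"" (quote, backslash,
         the character itself, quote). In the replacement string,
         \\ stands for one literal backslash and $1 for the match. -->
    <xsl:variable name="step1"
        select="replace($s, '([\\&quot;])', '&quot;\\$1&quot;')"/>
    <!-- Step 2: wrap the remaining special characters in quotes.
         The class deliberately excludes " and \ so that the output
         of step 1 is not escaped again. -->
    <xsl:sequence
        select="replace($step1, '([@#&amp;~/\|\^\{\}])', '&quot;$1&quot;')"/>
  </xsl:function>

  <xsl:template match="text()">
    <xsl:value-of select="f:escape-text(.)"/>
  </xsl:template>
</xsl:stylesheet>
```

For example, the input a@b\c comes out as a"@"b"\\"c, which matches what the XSLT 1.0 template in the question produces.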
In my tests, this performs worse than Dimitre's most recent XSLT 1.0 offering, by a factor of more than 2. (But I made up my own test data, and other conditions may be idiosyncratic, so I'd like to know what results others get.)
Why the slower performance? I can only guess it's because searching for regular expressions is slower than searching for fixed strings.
Update: using analyze-string
As per @Alejandro's suggestion, here it is using analyze-string:
While this seems like a good idea, unfortunately it does not give us a performance win: In my setup, it consistently takes about 14 seconds to complete, versus 1 - 1.4 sec for the replace() template above. Call that a 10-14x slowdown. :-( This suggests to me that breaking and concatenating lots of big strings at the XSLT level is a lot more expensive than traversing a big string twice in a built-in function.
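For comparison, a single-pass version with analyze-string could be sketched like this. It is an illustration under my own assumptions rather than the code that was benchmarked above: the variable name `special` is made up, and the regex is kept in a variable so that the braces in it do not clash with attribute value template syntax in the regex attribute.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed character class: every character that needs escaping. -->
  <xsl:variable name="special" select="'[\\&quot;@#&amp;~/\|\^\{\}]'"/>

  <xsl:template name="escape-text">
    <xsl:param name="s" select="string(.)"/>
    <xsl:analyze-string select="$s" regex="{$special}">
      <!-- Each match is exactly one special character. -->
      <xsl:matching-substring>
        <xsl:text>"</xsl:text>
        <xsl:if test=". = ('\', '&quot;')">
          <xsl:text>\</xsl:text>
        </xsl:if>
        <xsl:value-of select="."/>
        <xsl:text>"</xsl:text>
      </xsl:matching-substring>
      <!-- Runs of ordinary text are copied through unchanged. -->
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:call-template name="escape-text"/>
  </xsl:template>
</xsl:stylesheet>
```

The stylesheet instantiates the matching or non-matching body once per segment, which is consistent with the observation above that many small pieces handled at the XSLT level cost more than two passes inside a built-in function.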