Writing more efficient XQuery code (avoiding redundant iteration)
Here's a simplified version of a problem I'm working on: I have a bunch of XML data that encodes information about people. Each person is uniquely identified by an 'id' attribute, but they may go by many names. For example, in one document, I might find:
<person id="1">Paul McCartney</person>
<person id="2">Ringo Starr</person>
And in another I might find:
<person id="1">Sir Paul McCartney</person>
<person id="2">Richard Starkey</person>
I want to use XQuery to produce a new document that lists every name associated with a given id, i.e.:
<person id="1">
  <name>Paul McCartney</name>
  <name>Sir Paul McCartney</name>
  <name>James Paul McCartney</name>
</person>
<person id="2">
  ...
</person>
The way I'm doing this now in XQuery is something like this (pseudocode-esque):
let $ids := distinct-values( [all the id attributes on people] )
for $id in $ids
return
  <person id="{$id}">
  {
    for $unique-name in distinct-values(
      for $name in ( [all names] )
      where $name/@id = $id
      return $name
    )
    return <name>{$unique-name}</name>
  }
  </person>
The problem is that this is really slow. I imagine the bottleneck is the innermost loop, which executes once for every id (of which there are about 1200). I'm dealing with a fair bit of data (300 MB, spread over about 800 XML files), so even a single execution of the query in the inner loop takes about 12 seconds, which means that repeating it 1200 times will take about 4 hours (which might be optimistic - the process has been running for 3 hours so far). Not only is it slow, it's using a whole lot of virtual memory. I'm using Saxon, and I had to set Java's maximum heap size to 10 GB (!) to avoid getting out-of-memory errors, and it's currently using 6 GB of physical memory.
So here's how I'd really like to do this (in Pythonic pseudocode):
persons = {}
for id in ids:
    persons[id] = set()
for person in all_the_people_in_my_xml_document:
    persons[person.id].add(person.name)
There, I just did it in linear time, with only one sweep of the XML documents. Now, is there some way to do something similar in XQuery? Surely if I can imagine it, a reasonable programming language should be able to do it (he said quixotically). The problem, I suppose, is that unlike Python, XQuery doesn't (as far as I know) have anything like an associative array.
Is there some clever way around this? Failing that, is there something better than XQuery that I might use to accomplish my goal? Because really, the computational resources I'm throwing at this relatively simple problem are kind of ridiculous.
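For comparison, the Pythonic pseudocode above can be turned into a runnable single-pass sketch with the standard library's streaming parser. This is only an illustration under assumptions, not a recommendation from the thread: the function name collect_names, the <people> wrapper element, and the two in-memory sample documents are all made up for the example, and real input would need well-formed XML with quoted id attributes.

```python
import io
import xml.etree.ElementTree as ET


def collect_names(sources):
    """Map each person id to its distinct names in one streaming pass."""
    names_by_id = {}
    for source in sources:
        # iterparse yields each element once its end tag has been read.
        for _event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == "person":
                name = (elem.text or "").strip()
                # An inner dict serves as an insertion-ordered set of names.
                names_by_id.setdefault(elem.get("id"), {})[name] = None
                elem.clear()  # drop the element's content to limit memory use
    return {pid: list(names) for pid, names in names_by_id.items()}


# Hypothetical in-memory stand-ins for two of the XML files
# (the <people> root element is an assumption).
doc_a = (b'<people><person id="1">Paul McCartney</person>'
         b'<person id="2">Ringo Starr</person></people>')
doc_b = (b'<people><person id="1">Sir Paul McCartney</person>'
         b'<person id="2">Richard Starkey</person></people>')

grouped = collect_names([io.BytesIO(doc_a), io.BytesIO(doc_b)])
print(grouped)
```

Each person element is visited once, so the work grows linearly with the input, and only the id-to-names table has to stay in memory rather than every parsed document.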
Answers (4)
This is, unfortunately, a shortcoming of XQuery 1.0.
XQuery 1.1 adds the group by clause to the syntax to resolve this problem, and your problem would be resolved with:
Unfortunately, XQuery 1.1 is not widely implemented, so for the moment you are stuck without the group by clause.
As a developer on XQSharp I cannot speak for any other implementations, but we have spent a lot of time tweaking our optimizer to spot common group-by patterns in XQuery 1.1 and perform them with the algorithm you have specified.
In particular, the following version of your query:
is spotted as a group-by, as is evidenced by the following query plan:
Note that the type annotation
as element(person, xs:untyped)*
is required because, without knowing that the nodes are untyped (not validated against a schema), the query processor has no way of knowing that $person/@id doesn't have multiple items in its data value. XQSharp does not yet support group by expressions where each node can have more than one key. However, in this case a left outer join is still spotted, so the complexity should be roughly n log n and not quadratic as you are experiencing.
Unfortunately, though, adding distinct-values around the set of people in the group (to filter out duplicate names) seems to stop XQSharp from finding the join; this has been filed as a bug. For now this could be solved by doing the query in two passes - grouping the names by id, and then removing duplicate names.
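The two-pass idea (group the names by id first, then remove duplicate names) can be sketched outside XQuery. The Python below is only an illustration of that shape and of a sort-based, roughly n log n grouping step; it is not XQSharp's implementation, and the function name group_then_dedupe and the sample pairs are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_then_dedupe(pairs):
    """Pass 1 groups (id, name) pairs by id; pass 2 removes duplicate names."""
    # Pass 1: sort by id (O(n log n)), then group adjacent pairs sharing an id.
    ordered = sorted(pairs, key=itemgetter(0))
    grouped = {
        pid: [name for _pid, name in group]
        for pid, group in groupby(ordered, key=itemgetter(0))
    }
    # Pass 2: drop repeated names while keeping first-seen order.
    return {pid: list(dict.fromkeys(names)) for pid, names in grouped.items()}


# Hypothetical (id, name) pairs, including one duplicate name.
pairs = [
    ("2", "Ringo Starr"),
    ("1", "Paul McCartney"),
    ("1", "Sir Paul McCartney"),
    ("2", "Richard Starkey"),
    ("1", "Paul McCartney"),
]
result = group_then_dedupe(pairs)
print(result)
```

Keeping the de-duplication out of the grouping step mirrors the workaround described above: the grouping stays a plain join-like operation, and duplicates are filtered afterwards.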
In summary, there is no better approach in XQuery 1.0, but some implementations (e.g. XQSharp) will be able to evaluate this efficiently. If in doubt, check the query plan.
For a more detailed look at the join optimizations performed by XQSharp, take a look at this blog post.
Another option: use a map.
Here is a simple XSLT 2.0 solution (for convenience, two of the three documents are represented by <xsl:variable>s):
When this transformation is applied to the following XML document:
the wanted, correct result is produced:
If you use an XML database that supports updates, such as eXist db, then you can do the grouping just as in the Pythonesque code, writing directly into an XML document, where the result is presumably needed anyway for later processing.
For my experiments with 10,000 person nodes over 100 distinct ids, eXist on our server has a throughput of about 100 nodes per second.
Note that the update extension to XQuery in eXist does not use quite the same syntax as the XQuery Update syntax.
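The update-in-place idea can be illustrated without eXist's own update syntax, which is not reproduced here. The Python sketch below is only a stand-in for the approach: it grows a result document incrementally, inserting each name under its person element as it is encountered. The helper add_person, the <people> root element, and the sample pairs are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def add_person(result, index, person_id, name):
    """Insert a <name> under the matching <person>, creating it if needed."""
    person = index.get(person_id)
    if person is None:
        person = ET.SubElement(result, "person", {"id": person_id})
        index[person_id] = person
    # Skip names already recorded for this id.
    if all(child.text != name for child in person):
        ET.SubElement(person, "name").text = name


result = ET.Element("people")  # hypothetical root of the result document
index = {}                     # person id -> its <person> element in result
for pid, name in [("1", "Paul McCartney"), ("2", "Ringo Starr"),
                  ("1", "Sir Paul McCartney"), ("1", "Paul McCartney")]:
    add_person(result, index, pid, name)

xml_text = ET.tostring(result, encoding="unicode")
print(xml_text)
```

The id-to-element index plays the role the database's own lookup would play, so each incoming person node costs one lookup plus a scan of the names already stored for that id.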