SQL assembly vs. application code for complex queries over a large XML column
I have a table with a few relational columns and one XML column which sometimes holds a fairly large chunk of data. I also have a simple webservice which uses the database. I need to be able to report on things like all the instances of a certain element within the XML column, a list of all the distinct values for a certain element, things like that.
I was able to get a list of all the distinct values for an element, but didn't get much further than that. I ended up writing incredibly complex T-SQL code to do something that seems pretty simple in C#: go through all the rows in this table, and apply this ( XPath | XQuery | XSLT ) to the XML column. I can filter on the relational columns to reduce the amount of data, but this is still a lot of data for some of the queries.
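To make the shape of the problem concrete, here is a sketch of the kind of T-SQL shredding involved. The table, column, and element names (`BuildResults`, `ResultXml`, `/Results/Result`) are hypothetical, not from the original schema:

```sql
-- Hypothetical schema: BuildResults(Id INT, JobName NVARCHAR(100), ResultXml XML)
-- Return every <Result> element's Name/Value pair across all rows,
-- filtering on a relational column first to cut the data down.
SELECT br.Id,
       r.value('(Name/text())[1]',  'NVARCHAR(100)') AS ResultName,
       r.value('(Value/text())[1]', 'NVARCHAR(400)') AS ResultValue
FROM   dbo.BuildResults AS br
CROSS APPLY br.ResultXml.nodes('/Results/Result') AS t(r)
WHERE  br.JobName = N'NightlyBuild';
```

`nodes()` produces one row per matching element, so a single table row holding a large XML blob fans out into many result rows, which is exactly where the cost shows up.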
My plan was to embed an assembly in SQL Server (I'm using 2008 SP2) and have it create an indexed view on the fly for a given query (I'd have other logic to clean this view up). This would allow me to keep the network traffic down, and possibly also allow me to use tools like Excel and MSRS reports as a cheap user interface, but I'm seeing a lot of people saying "just use application logic rather than SQL assemblies". (I could be barking entirely up the wrong tree here, I guess).
Grabbing the big chunk of data to the web service and doing the processing there would have benefits as well - I'm less constrained by the SQL Server environment (since I don't live inside it) and my setup process is easier. But it does mean I'm bringing a lot of data over the network, storing it in memory while I process it, then throwing some of it away.
Any advice here would be appreciated.
Thanks
Edit:
Thanks guys, you've all been a big help. The issue was that we were generating a row in the table for each file, each file could have multiple results, and we would do this each time we ran a particular build job. I wanted to flatten this out into a table view.
Each execution of this build job checked thousands of files for several attributes, and in some cases each of these tests generated thousands of results (MSIVAL tests were the worst culprit).
The answer (duh!) is to flatten it out before it goes into the database! Based on your feedback, I decided to try creating a row for each result for each test on each file, with the XML holding only the details of that one result - this made the query much simpler. Of course, we now have hundreds of thousands of rows each time we run this tool, but the performance is much better. I now have a view which creates a flattened version of one of the classes of results emitted by the build job - it returns >200,000 rows and takes <5 seconds, compared to around 3 minutes for the equivalent (complicated) query before I went the flatter route, and between 10 and 30 minutes for the XML file processing of the old (non-database) version.
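A sketch of what that flattened layout might look like - all table, column, and element names here are hypothetical, not the asker's actual schema:

```sql
-- One row per result per test per file; the XML column now holds
-- only the details of that single result.
CREATE TABLE dbo.TestResult (
    ResultId  INT IDENTITY PRIMARY KEY,
    BuildId   INT           NOT NULL,
    FileName  NVARCHAR(260) NOT NULL,
    TestName  NVARCHAR(100) NOT NULL,
    DetailXml XML           NULL
);
GO
-- A view exposing one class of results as plain columns; the xml
-- methods now touch only one small document per row.
CREATE VIEW dbo.MsivalResults AS
SELECT ResultId, BuildId, FileName,
       DetailXml.value('(/Detail/Severity/text())[1]', 'NVARCHAR(20)')  AS Severity,
       DetailXml.value('(/Detail/Message/text())[1]',  'NVARCHAR(400)') AS Message
FROM   dbo.TestResult
WHERE  TestName = N'MSIVAL';
```

Because each `DetailXml` document is tiny, the per-row shredding cost stays roughly constant as the row count grows, which is consistent with the speedup reported above.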
I now have some issues with the number of times I connect, but I have an idea of how to fix that.
Thanks again! +1's all round
I suggest using the standard xml tools in TSQL. (http://msdn.microsoft.com/en-us/library/ms189075.aspx). If you don't wish to use this I would recommend processing the xml on another machine.
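As a sketch of what those built-in tools already cover, the asker's "distinct values of an element" report needs no SQLCLR at all (table and element names are hypothetical):

```sql
-- Distinct values of a <Status> attribute anywhere in the XML column,
-- using only the built-in nodes()/value() methods.
SELECT DISTINCT
       r.value('(@Status)[1]', 'NVARCHAR(50)') AS Status
FROM   dbo.BuildResults AS br
CROSS APPLY br.ResultXml.nodes('//Result') AS t(r);
```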
SQLCLR is perfect for smaller functions, but the restrictions on the usable methods tend to make it an exercise in frustration once you try to do more advanced things.
What you're asking about is really a huge balancing act and it totally depends on several factors. First, what's the current load on your database? If you're running this on a database that is already under heavy load, you're probably going to want to do this parsing on the web service. XML shredding and querying is an incredibly expensive procedure in SQL Server, especially if you're doing it on un-indexed columns that don't have a schema defined for them. Schemas and indexes help with this processing overhead, but they can't eliminate the fact that XML parsing isn't cheap. Secondly, the amount of data you're working with. It's entirely possible that you just have too much data to push over the network. Depending on the location of your servers and the amount of data, you could face insurmountable problems here.
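As an illustration of the schema-and-index point, here is a minimal sketch - the table name, the toy XSD, and the element names are all assumptions, not the asker's schema:

```sql
-- Binding the column to a schema collection tells the optimizer the
-- shape and types of the XML (existing data must validate against it).
CREATE XML SCHEMA COLLECTION dbo.ResultSchema AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="Results">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="Result" type="xs:string" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>';
GO
ALTER TABLE dbo.BuildResults
    ALTER COLUMN ResultXml XML(dbo.ResultSchema);
GO
-- A primary XML index persists the shredded node table so queries
-- avoid re-parsing the blob (requires a clustered PK on the table).
CREATE PRIMARY XML INDEX IX_BuildResults_ResultXml
    ON dbo.BuildResults (ResultXml);
```

These reduce the per-query shredding cost but, as noted, don't make XML parsing free.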
Finally, what are the relative specs of your machines? If your web service machine has low memory, it's going to be thrashing data in and out of virtual memory trying to parse the XML which will destroy your performance. Maybe you're not running the most powerful database hardware and shredding XML is going to be performance prohibitive for the CPU you've got on your database machine.
At the end of the day, the only way to really know is to try both ways and figure out what makes sense for you. Doing the development on your web services machine will almost undoubtedly be easier as LINQ to XML is a more elegant way of parsing through XML than XQuery shoehorned into T-SQL is. My indication, given the information you provided in your question, is that T-SQL is going to perform better for you in the long run because you're doing XML parsing on every row or at least most rows in the database for reporting purposes. Pushing that kind of information over the network is just ugly. That said, if performance isn't that important, there's something to be said about taking the easier and more maintainable route of doing all the parsing on the application server.