How can I improve the performance of DataSet.ReadXml when I'm using a schema?
I have an ADO DataSet that I'm loading from its XML file via ReadXml. The data and the schema are in separate files.
Right now, it takes close to 13 seconds to load this DataSet. I can cut this to 700 milliseconds if I don't read the DataSet's schema and just let ReadXml infer the schema, but then the resulting DataSet doesn't contain any constraints.
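For reference, the fast inference-only load is just this (a sketch; xmlPath as in the snippet below):

var inferred = new DataSet();
inferred.ReadXml(xmlPath, XmlReadMode.InferSchema); // schema is inferred from the data; no constraints in the result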
I've tried doing this:
Console.WriteLine("Reading dataset with external schema.");
ds.ReadXmlSchema(xsdPath);
Console.WriteLine("Reading the schema took {0} milliseconds.", sw.ElapsedMilliseconds);
foreach (DataTable dt in ds.Tables)
{
dt.BeginLoadData();
}
ds.ReadXml(xmlPath);
Console.WriteLine("ReadXml completed after {0} milliseconds.", sw.ElapsedMilliseconds);
foreach (DataTable dt in ds.Tables)
{
dt.EndLoadData();
}
Console.WriteLine("Process complete at {0} milliseconds.", sw.ElapsedMilliseconds);
When I do this, reading the schema takes 27ms, and reading the DataSet takes 12000+ milliseconds. And that's the time reported before I call EndLoadData on all the DataTables.
This is not an enormous amount of data - it's about 1.5 MB, there are no nested relations, and all of the tables contain two or three columns of 6-30 characters. The only thing I can figure that's different if I read the schema up front is that the schema includes all of the unique constraints. But BeginLoadData is supposed to turn constraints off (as well as change notification, etc.). So that shouldn't apply here. (And yes, I've tried just setting EnforceConstraints to false.)
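For completeness, the EnforceConstraints variation mentioned above looks like this (a sketch against the same ds and xmlPath):

ds.EnforceConstraints = false;
ds.ReadXml(xmlPath);
ds.EnforceConstraints = true; // re-validates everything; throws ConstraintException if any row violates a constraint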
I've read many reports of people improving the load time of DataSets by reading the schema first instead of having the object infer the schema. In my case, inferring the schema makes for a process that's about 20 times faster than having the schema provided explicitly.
This is making me a little crazy. This DataSet's schema is generated off of metainformation, and I'm tempted to write a method that creates it programmatically and just deserializes it with an XmlReader. But I'd much prefer not to.
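If I went that route, the programmatic construction would look something like this (a sketch; the table, column, and constraint names are made up):

var data = new DataSet("MyData");
DataTable dt = data.Tables.Add("Items");
DataColumn id = dt.Columns.Add("Id", typeof(string));
dt.Columns.Add("Name", typeof(string));
dt.Constraints.Add(new UniqueConstraint("UQ_Items_Id", id)); // the kind of unique constraint the XSD carries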
What am I missing? What else can I do to improve the speed here?
3 Answers
I will try to give you a performance comparison between storing data in plain text files and in XML files.
The first function creates two files: one with 1,000,000 records in plain text and one with the same 1,000,000 records in XML. The first thing to notice is the difference in file size: ~64 MB (plain text) vs. ~102 MB (XML).
The second function reads both files: first it reads the plain text into a dictionary (just to simulate real-world use), then it reads the XML file. Both steps are measured in milliseconds, and the results are written to the console:
Start read Text file into memory
Text file loaded into memory in 7628 milliseconds
Start read XML file into memory
XML file loaded into memory in 21018 milliseconds
Conclusion: the XML file is almost twice the size of the text file, and it loads about three times slower.
XML handling is more convenient than plain text (because of the abstraction level), but it is more CPU- and disk-intensive.
So, if your files are small and the performance is acceptable, XML DataSets are perfectly fine. But if you need raw speed, I don't know of any method that makes an XML DataSet faster than plain text files. It basically comes down to the very first point: the XML file is bigger because it carries more markup.
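A minimal sketch of the kind of benchmark described above (the record layout, file names, and identifiers are assumptions, not the original code):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Xml.Linq;

class XmlVsTextBenchmark
{
    const int RecordCount = 1000000; // assumption: matches the answer's record count

    static void Main()
    {
        CreateFiles();
        ReadFiles();
    }

    // Writes the same records twice: once as "id|value" lines, once as XML elements
    static void CreateFiles()
    {
        using (var writer = new StreamWriter("records.txt"))
        {
            for (int i = 0; i < RecordCount; i++)
                writer.WriteLine("{0}|value{1}", i, i);
        }

        var root = new XElement("records");
        for (int i = 0; i < RecordCount; i++)
            root.Add(new XElement("record",
                new XAttribute("id", i),
                new XAttribute("value", "value" + i)));
        root.Save("records.xml");
    }

    // Times both loads: text into a dictionary, XML into an XDocument
    static void ReadFiles()
    {
        var sw = Stopwatch.StartNew();
        Console.WriteLine("Start read Text file into memory");
        var records = new Dictionary<int, string>(RecordCount);
        foreach (var line in File.ReadLines("records.txt"))
        {
            var parts = line.Split('|');
            records[int.Parse(parts[0])] = parts[1];
        }
        Console.WriteLine("Text file loaded into memory in {0} milliseconds", sw.ElapsedMilliseconds);

        sw.Restart();
        Console.WriteLine("Start read XML file into memory");
        var doc = XDocument.Load("records.xml");
        Console.WriteLine("XML file loaded into memory in {0} milliseconds", sw.ElapsedMilliseconds);
    }
}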
It's not an answer, exactly (though it's better than nothing, which is what I've gotten so far), but after a long time struggling with this problem I discovered that it's completely absent when my program's not running inside Visual Studio.
Something I didn't mention before, which makes this even more mystifying, is that when I loaded a different (but comparably large) XML document into the DataSet, the program performed just fine. I'm now wondering if one of my DataSets has some kind of metainformation attached to it that Visual Studio is checking at runtime while the other one doesn't. I dunno.
Another dimension to try is to read the dataset without the schema and then Merge it into a typed DataSet that has the constraints enabled. That way it has all of the data on hand as it builds the indexes used to enforce constraints -- maybe it would be more efficient?
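A sketch of that approach, with a schema-loaded DataSet standing in for the typed one (xmlPath and xsdPath as in the question):

var raw = new DataSet();
raw.ReadXml(xmlPath); // fast path: schema inferred, no constraints

var constrained = new DataSet();
constrained.ReadXmlSchema(xsdPath); // schema with the unique constraints
constrained.Merge(raw);             // constraints are checked once, with all rows on hand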