Handling very large data sets and just-in-time loading
I have a .NET application written in C# (.NET 4.0). In this application, we have to read a large data set from a file and display the contents in a grid-like structure. To accomplish this, I placed a DataGridView on the form. It has 3 columns, and all of the column data comes from the file. Initially, the file had about 600,000 records, corresponding to 600,000 rows in the DataGridView.
I quickly found out that the DataGridView collapses with such a large data set, so I had to switch to Virtual Mode. To accomplish this, I first read the file completely into 3 different arrays (corresponding to the 3 columns), and then, as the CellValueNeeded event fires, I supply the correct values from the arrays.
However, as we quickly found out, there can be a huge (HUGE!) number of records in this file. When the record count is very large, reading all the data into an array or a List<>, etc., is not feasible: we quickly run into memory allocation errors (OutOfMemoryException).
We got stuck there, but then realized: why read all the data into arrays first, why not read the file on demand as the CellValueNeeded event fires? So that's what we do now: we open the file, but do not read anything, and as CellValueNeeded events fire, we first Seek() to the correct position in the file and then read the corresponding data.
This is the best we could come up with, but, first of all, it is quite slow, which makes the application sluggish and not user friendly. Second, we can't help but think that there must be a better way to accomplish this. For example, some binary editors (like HxD) are blindingly fast for any file size, so I'd like to know how this can be achieved.
Oh, and to add to our problems: in the DataGridView's virtual mode, when we set the RowCount to the number of rows available in the file (say, 16,000,000), it even takes a while for the DataGridView to initialize itself. Any comments on this 'problem' would be appreciated as well.
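For reference, a stripped-down sketch of our current on-demand approach; the fixed 12-byte record layout and the RecordReader name here are simplified stand-ins, not our actual file format:

```csharp
using System;
using System.IO;

// Simplified sketch of the seek-on-demand reader described above.
// Assumes fixed-size records of three Int32 fields (12 bytes each);
// the real file format differs, this is only illustrative.
class RecordReader : IDisposable
{
    const int RecordSize = 12; // 3 columns x 4 bytes
    readonly FileStream stream;

    public RecordReader(string path)
    {
        stream = new FileStream(path, FileMode.Open, FileAccess.Read);
    }

    public long RowCount { get { return stream.Length / RecordSize; } }

    // Called from the CellValueNeeded handler:
    //   e.Value = reader.GetValue(e.RowIndex, e.ColumnIndex);
    public int GetValue(long rowIndex, int columnIndex)
    {
        stream.Seek(rowIndex * RecordSize + columnIndex * 4, SeekOrigin.Begin);
        var buffer = new byte[4];
        stream.Read(buffer, 0, 4);
        return BitConverter.ToInt32(buffer, 0);
    }

    public void Dispose() { stream.Dispose(); }
}
```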
Thanks
5 Answers
If you can't fit your entire data set in memory, then you need a buffering scheme. Rather than reading just the amount of data needed to fill the DataGridView in response to CellValueNeeded, your application should anticipate the user's actions and read ahead. So, for example, when the program first starts up, it should read the first 10,000 records (or maybe only 1,000, or perhaps 100,000 -- whatever is reasonable in your case). Then, CellValueNeeded requests can be filled immediately from memory.
As the user moves through the grid, your program stays one step ahead of the user as much as possible. There might be short pauses if the user jumps ahead of you (say, wants to jump from the front to the end) and you have to go out to disk in order to fulfill a request.
That buffering is usually best accomplished by a separate thread, although synchronization can sometimes be an issue if the thread is reading ahead in anticipation of the user's next action, and then the user does something completely unexpected, like jumping to the start of the list.
16 million records isn't really all that many to keep in memory, unless the records are very large, or unless you don't have much memory on your machine. Certainly, 16 million is nowhere near the maximum size of a List<T>, unless T is a value type (structure). How many gigabytes of data are you talking about here?
Well, here's a solution that appears to work much better:
Step 0: Set dataGridView.RowCount to a low value, say 25 (or the actual number that fits in your form/screen)
Step 1: Disable the scrollbar of the dataGridView.
Step 2: Add your own scrollbar.
Step 3: In your CellValueNeeded routine, respond to e.RowIndex+scrollBar.Value
Step 4: As for the dataStore, I currently open a Stream, and in the CellValueNeeded routine, first do a Seek() and Read() the required data.
With these steps, I get very reasonable performance scrolling through the dataGrid for very large files (tested up to 0.8GB).
So, in conclusion, it appears that the actual cause of the slowdown wasn't the constant Seek() and Read() calls, but the DataGridView itself.
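The wiring for steps 0 through 3 can be sketched as follows; the class and member names are illustrative, and the Seek()/Read() details from step 4 are omitted:

```csharp
using System;

// Sketch of steps 0-3: the grid only ever holds 'VisibleRows' rows
// (e.g. 25), and a separate scrollbar supplies the offset into the
// file. Names are illustrative, not from the original code.
class VirtualScrollMapper
{
    public int VisibleRows;  // what dataGridView.RowCount is set to
    public long TotalRows;   // number of rows available in the file

    // Value to assign to scrollBar.Maximum so the last page is reachable.
    public long ScrollMaximum
    {
        get { return Math.Max(0, TotalRows - VisibleRows); }
    }

    // Step 3: in CellValueNeeded, the file row is e.RowIndex + scrollBar.Value.
    public long ToFileRow(int gridRowIndex, long scrollValue)
    {
        return gridRowIndex + scrollValue;
    }
}
```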
Managing rows and columns that can be rolled up, sub-totalled, used in multi-column calculations, etc. presents a unique set of challenges; it's not really fair to compare this problem to the ones a binary editor would encounter. Third-party data grid controls have been addressing the problem of displaying and manipulating large data sets client-side since the VB6 days. It's not a trivial task to get really snappy performance using either load-on-demand or self-contained, gargantuan client-side data sets. Load-on-demand can suffer from server-side latency; manipulating the entire data set on the client can suffer from memory and CPU limits. Some third-party controls that support just-in-time loading supply both client-side and server-side logic, while others try to solve the problem 100% client-side.
Because .NET is layered on top of the native OS, runtime loading and management of data from disk to memory needs a different approach.
See why and how: http://www.codeproject.com/Articles/38069/Memory-Management-in-NET
To deal with this issue, I would suggest that you do not load all the data at once. Instead, load the data in chunks and display the most relevant data when needed. I just did a quick test and found that setting the DataSource property of a DataGridView is a good approach, but with a large number of rows it also takes time. So use the Merge function of DataTable to load the data in chunks and show the user the most relevant data. I have demonstrated an example here which can help you.
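A rough sketch of this idea; the two-column schema and the generated chunk contents are made-up stand-ins, while Merge itself is the standard DataTable method, which with no primary key defined simply appends the incoming rows:

```csharp
using System;
using System.Data;

// Sketch of chunked loading with DataTable.Merge: each chunk is read
// into its own DataTable and merged into the table bound to the grid.
// The chunk-producing code is illustrative stand-in data; the real
// application would read the next chunk from the file instead.
class ChunkedLoader
{
    public DataTable Target = new DataTable();

    public ChunkedLoader()
    {
        Target.Columns.Add("Id", typeof(int));
        Target.Columns.Add("Value", typeof(string));
    }

    public void MergeChunk(int startId, int count)
    {
        var chunk = Target.Clone(); // same schema, no rows
        for (int i = 0; i < count; i++)
            chunk.Rows.Add(startId + i, "value" + (startId + i));
        Target.Merge(chunk); // appends the chunk's rows to the bound table
    }
}
```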