Streaming data on demand through a web service
I have an assignment of exposing a service which will deliver potentially very large amounts of data (gigabytes). Thus it will have to stream data on demand, so that data is not buffered in memory. The data will undergo the following steps before being sent to the client.
- Extract data from database
- Serialize data to XML
- Compress the XML data with gzip
- Send data to the client as a stream
Step 3 might be left out, as compression can be done by WCF. Is there a recommended way to do this without buffering large amounts of data at any step? Buffering would obviously crash the application, since the data may be as large as 100 GB.
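As a rough sketch of what I have in mind (the query, table and column names are placeholders), the write side would chain a forward-only database reader, an XmlWriter and a GZipStream so that no step has to hold the whole data set:

```csharp
using System.Data.Common;
using System.IO;
using System.IO.Compression;
using System.Xml;

static class XmlExport
{
    // Pulls rows with a forward-only reader and writes them straight through
    // an XmlWriter and a GZipStream, so no step holds the whole data set.
    // The query and column layout are placeholders.
    public static void WriteCompressedXml(DbConnection connection, Stream output)
    {
        using (var gzip = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
        using (var xml = XmlWriter.Create(gzip))
        using (var command = connection.CreateCommand())
        {
            command.CommandText = "SELECT Id, Payload FROM BigTable"; // placeholder
            xml.WriteStartElement("rows");
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    xml.WriteStartElement("row");
                    xml.WriteElementString("id", reader.GetInt64(0).ToString());
                    xml.WriteElementString("payload", reader.GetString(1));
                    xml.WriteEndElement();
                }
            }
            xml.WriteEndElement();
        }
    }
}
```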
Since this is an assignment I am not sure what constraints you have or what the basic purpose of the exercise is, but optimizing a data transfer service like this, and making it stable, is not trivial. The chance of a communication problem occurring is substantial, so you will need to handle that possibility. But you don't want to just start over when a problem occurs, since that would waste all the work done up to that point.
At a basic level, the service should break the data into manageable pieces (say, 100 KB, depending on network speed, stability and environment). The chunk size is a balance between the likelihood of errors and the overhead of requesting each chunk: if errors are likely, chunks should be smaller.
This also addresses the problem of buffering huge amounts of data in memory, but the need for a robust error-handling mechanism is equally important. The service should therefore have one method to initiate a request, which responds to the client with the total size of the data stream and the number of chunks, and another method to request a specific chunk of data.
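As a sketch only (the operation, parameter and type names below are made up for illustration, not a prescribed WCF contract), the two operations might look like this:

```csharp
using System.Runtime.Serialization;
using System.ServiceModel;

// Names are illustrative only.
[ServiceContract]
public interface IChunkedTransfer
{
    // Starts a transfer and tells the client how much data to expect.
    [OperationContract]
    TransferInfo Initiate(string datasetId, int requestedChunkSize);

    // Returns a single chunk by index; the client asks for these one at a time.
    [OperationContract]
    byte[] GetChunk(string transferId, long chunkIndex);
}

[DataContract]
public class TransferInfo
{
    [DataMember] public string TransferId { get; set; }
    [DataMember] public long TotalBytes { get; set; }
    [DataMember] public long ChunkCount { get; set; }
    [DataMember] public int ChunkSize { get; set; }
}
```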
The client could optionally specify the chunk size, or the protocol could be designed to automatically adjust the chunk size in response to error conditions. That is, the chunk size should generally be reduced if errors are occurring frequently.
Either way, once the request has been initiated, the client calls another method that requests specific chunks sequentially, and as each one is successfully received, appends it to the end of the file. If a failure occurs, the client can re-request just that chunk.
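A minimal client loop against that hypothetical contract, assuming a fixed number of retries per chunk, could look like the following; because each received chunk is appended to the file, a failed transfer can be resumed rather than restarted:

```csharp
using System;
using System.IO;

static class ChunkedDownloader
{
    // Hypothetical client loop against the contract sketched above: request
    // chunks in order, append each one to the output file, and retry a chunk
    // a few times before giving up. A resumed transfer could start from the
    // number of chunks already on disk instead of zero.
    public static void Download(IChunkedTransfer service, string datasetId, string path)
    {
        TransferInfo info = service.Initiate(datasetId, requestedChunkSize: 100 * 1024);
        using (var file = new FileStream(path, FileMode.Append, FileAccess.Write))
        {
            for (long i = 0; i < info.ChunkCount; i++)
            {
                byte[] chunk = null;
                for (int attempt = 0; attempt < 3 && chunk == null; attempt++)
                {
                    try { chunk = service.GetChunk(info.TransferId, i); }
                    catch (TimeoutException) { /* transient failure: retry this chunk */ }
                }
                if (chunk == null)
                    throw new IOException("Chunk " + i + " failed after retries.");
                file.Write(chunk, 0, chunk.Length);
            }
        }
    }
}
```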
Finally, sending huge amounts of data in XML format is probably very inefficient, unless there is a very large amount of data compared to markup. That is, if the data structure has many elements (fields, records) compared to the volume of information contained by each element (e.g., lots of simple numeric data), it would make a lot more sense to establish a contract for the data format when it's initially requested. If, on the other hand, there are few fields that each contain large amounts of data (e.g., text) then it doesn't matter much.
If the data format is always the same (which is typical) then the client can just be designed to expect that. If not, the server could begin the exchange by providing a structure for the data it's going to transmit, and then transmit data in the established structure without the overhead of markup tags.
For a very efficient structured-data encoder, check out Protocol Buffers. The basic point (whether you use something like Protocol Buffers or just lay out the data in your own standardized format) is that markup tags add a lot of overhead and are entirely unnecessary when the client and the server have a contract for the format of the data being sent, and that you should break the data into manageable pieces which the client requests specifically.
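As one possible illustration (using the protobuf-net library for .NET; the record type and field names are invented), the layout is agreed once in the contract, and each record then travels as a compact, length-prefixed message with no element names on the wire:

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

// Illustrative record type; the field numbers are the contract, so no element
// names travel with the data.
[ProtoContract]
public class Measurement
{
    [ProtoMember(1)] public long Id { get; set; }
    [ProtoMember(2)] public double Value { get; set; }
}

static class MeasurementWriter
{
    public static void WriteAll(IEnumerable<Measurement> records, Stream output)
    {
        // Length-prefixed messages let the reader pull records back one at a
        // time without ever loading the whole stream.
        foreach (var record in records)
            Serializer.SerializeWithLengthPrefix(output, record, PrefixStyle.Base128);
    }
}
```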