Real-time data compression algorithm for medical data
I am looking for a robust, efficient data compression algorithm that I could use to provide a real-time transmission of medical data (primarily waveforms - heart rate, etc.).
I would appreciate any recommendations/links to scientific papers.
EDIT: The system will be based on a server (most probably installed within point-of-care infrastructure) and mobile devices (iOS & Android smartphones and tablets with native apps), to which the waveforms are going to be transferred. The server will gather all the data from the hospital (primarily waveform data). In my case, stability and speed are more important than latency.
That's the most detailed specification I can provide at the moment. I am going to investigate your recommendations and then test several algorithms, but I am looking for something that has been successfully implemented in a similar architecture. I am also open to any suggestions regarding server computation power or server software.
3 Answers
Don't think of it as real-time or as medical data - think of it as packets of data that need to be compressed for transmission (most likely in TCP packets). The details of the content only matter for the choice of compression algorithm, and even there the question is not whether the data is medical but how it is formatted/stored and what the actual values look like. What matters is the data itself and the constraints imposed by the overall system (e.g. is it data gathering, such as a Holter monitor, or real-time status reporting, such as a cardiac monitor in an ICU? What kind of system is receiving the data?).
Looking at the data, is it being presented for transmission as raw binary data, or is it being received from another component or device as (for example) structured XML or HL7 with numeric values represented as text? Will compressing the original data be the most efficient option, or should it be converted down to a proprietary binary format that only covers the actual data range (are 2, 3 or 4 bytes enough to cover the range of values)? What kind of savings could be achieved by converting, and what are the compatibility concerns (e.g. loss of HL7 compatibility)?
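To make the text-versus-binary point concrete, here is a minimal sketch (the sample values are made up) of repacking comma-separated text samples into fixed-width signed 2-byte integers with Python's struct module:

```python
import struct

# Hypothetical input: waveform samples received as comma-separated text,
# repacked as signed 16-bit little-endian integers (2 bytes per sample).
text = "12,-87,154,-200,3,41"      # made-up sample values
samples = [int(v) for v in text.split(",")]

packed = struct.pack("<%dh" % len(samples), *samples)

print(len(text))     # 20 bytes as text (digits, signs and separators)
print(len(packed))   # 12 bytes as fixed-width binary
```

If 2 bytes comfortably cover the value range, the binary form is both smaller and cheaper to parse than the text form - at the cost of losing the XML/HL7 representation.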
Choosing the absolutely best-compressing algorithm may also not be worth much additional work unless you're going to be in an extremely low-bandwidth scenario; if the data is coming from an embedded device you should be balancing compression efficiency against the capabilities and limitations of the embedded processor, toolset and surrounding system. If a custom-built compression routine saves you 5% over something already built into the tools, is it worth the extra coding and debugging time and the storage space in embedded flash? Existing validated software libraries that produce "good enough" output may be preferred, particularly for medical devices.
Finally, depending on the environment you may want to sacrifice a big chunk of compression in favor of some level of redundancy, such as transmitting a sliding window of the data such that loss of any X packets doesn't result in loss of data. This may let you change protocols as well and may change how the device is configured - the difference between streaming UDP (with no retransmission of lost packets) and TCP (where the sender may need to be able to retransmit) may be significant.
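The sliding-window redundancy idea can be sketched as follows. This is a toy illustration, not any standard protocol: every outgoing datagram repeats the last few chunks, so losing a short run of datagrams loses no samples.

```python
from collections import deque

# Hypothetical sketch: each datagram carries the last WINDOW chunks, so the
# loss of any WINDOW-1 consecutive datagrams causes no data loss.
WINDOW = 3

def packetize(chunks, window=WINDOW):
    """Yield one datagram per chunk; each datagram is a list of
    (sequence_number, chunk) pairs covering the current sliding window."""
    recent = deque(maxlen=window)
    for seq, chunk in enumerate(chunks):
        recent.append((seq, chunk))
        yield list(recent)

def reassemble(datagrams):
    """Recover chunks by sequence number, tolerating dropped datagrams."""
    received = {}
    for dgram in datagrams:
        for seq, chunk in dgram:
            received[seq] = chunk
    return [received[s] for s in sorted(received)]

chunks = [b"a", b"b", b"c", b"d", b"e"]
dgrams = list(packetize(chunks))

# Simulate the loss of two consecutive datagrams - everything still arrives:
survivors = [dgrams[0]] + dgrams[3:]
assert reassemble(survivors) == chunks
```

The cost is roughly WINDOW times the bandwidth, which is exactly the compression-versus-redundancy trade-off described above; over streaming UDP this buys loss tolerance without retransmission.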
And, now that I've blathered about the systems side: there's a lot of information out there on packetizing and streaming analog data, ranging from the development of streaming protocols such as RTP to the details of voice packetization for GSM/CDMA and VoIP. Still, the most important drivers for your decisions may end up being the toolsets available to you on the device and server sides. Using existing toolsets, even if they're not the most efficient option, may let you cut your development (and time-to-market) time significantly, and may also simplify the certification of your device/product for medical use. On the business side, spending an extra 3-6 months on software development, finding truly qualified developers, and dealing with regulatory approvals are likely to be the overriding factors.
UPDATE 2012/02/01: I just spent a few minutes looking at the XML export of a 12-lead cardiac stress EKG with a total observation time of 12+ minutes and an XML file size of ~6MB. I'm estimating that more than 25% of that file was repetitive and EXTREMELY compressible XML in the study headers, and the waveform data was comma-separated numbers in the range of -200 to 200 concentrated in the center of the range and changing slowly, with the numbers crossing the y-axis and staying on that side for a time. Assuming that most of what you want is the waveform values, for this example you'd be looking at a data rate with no compression of 4500KB / 763 seconds or around 59 Kbps. Completely uncompressed and using text formatting you could run that over a "2.5G" GPRS connection with ease. On any modern wireless infrastructure the bandwidth used will be almost unnoticeable.
I still think that the stock compression libraries would eat this kind of data for lunch (subject to issues with compression headers and possibly packet headers). If you insist on doing a custom compression I'd look at sending difference values rather than raw numbers (unless your raw data is already offsets). If your data looks anything like what I'm reviewing, you could probably convert each item into a 1-byte value of -127 to +127, possibly reserving the extreme ends as "special" values used for overflow (handle those as you see fit - special representation, error, etc.). If you'd rather be slightly less efficient on transmission and insignificantly faster in processing you could instead just send each value as a signed 2-byte value, which would still use less bandwidth than the text representation because currently every value is 2+ bytes anyway (values are 1-4 chars plus separators no longer needed).
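A minimal sketch of the difference-value idea, assuming raw samples in roughly the -200 to 200 range described above. The escape value and the 2-byte overflow encoding are my own illustrative choices, not part of any standard:

```python
import struct

# Delta encoding sketch: send the difference from the previous sample instead
# of the raw value. Slowly-changing waveforms usually fit in one signed byte;
# -128 is reserved as an "overflow" escape followed by the full 16-bit value.
ESCAPE = -128

def delta_encode(samples):
    out = bytearray()
    prev = 0
    for s in samples:
        d = s - prev
        if -127 <= d <= 127:
            out += struct.pack("b", d)            # common case: 1 byte
        else:
            out += struct.pack("<bh", ESCAPE, s)  # rare jump: 3 bytes
        prev = s
    return bytes(out)

def delta_decode(data):
    samples, prev, i = [], 0, 0
    while i < len(data):
        (d,) = struct.unpack_from("b", data, i)
        i += 1
        if d == ESCAPE:
            (prev,) = struct.unpack_from("<h", data, i)
            i += 2
        else:
            prev += d
        samples.append(prev)
    return samples

waveform = [0, 3, 8, 6, -2, -150, -151, 180]      # made-up samples
encoded = delta_encode(waveform)
assert delta_decode(encoded) == waveform           # 12 bytes for 8 samples
```

On slowly-varying data nearly every sample takes 1 byte, and the resulting small-magnitude byte stream also compresses better if you then run it through a stock library.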
Basically, don't worry about the size of the data unless this is going to be running 24/7 over a heavily metered wireless connection with low caps.
There is a category of compression software that is so fast that I see no scenario in which it couldn't be called "real time": it is necessarily fast enough. Such algorithms include LZ4, Snappy, LZO and QuickLZ, and they reach hundreds of MB/s per CPU.
A comparison of them is available here:
http://code.google.com/p/lz4/
"Real Time compression for transmission" can also be seen as a trade-off between speed and compression ratio. More compression, even if slower, can effectively save transmission time.
A study of the "optimal trade-off" between compression ratio and speed can be found on this page, for example: http://fastcompression.blogspot.com/p/compression-benchmark.html
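As a rough illustration of that trade-off using only the Python standard library (LZ4, Snappy and LZO need third-party bindings, so zlib stands in here), compare the fastest and strongest settings on repetitive, waveform-like text data with made-up values:

```python
import time
import zlib

# Synthetic, repetitive comma-separated "waveform" text (made-up values).
data = (",".join(str((n * 7) % 40 - 20) for n in range(200)) * 500).encode()

for level in (1, 9):                 # 1 = fastest, 9 = best compression
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(level, len(data), len(compressed), round(elapsed, 4))

# Round trip check: compression is lossless at every level.
assert zlib.decompress(zlib.compress(data, 1)) == data
```

The exact numbers depend on the machine and the data, but the pattern holds generally: the fast setting trades some ratio for a large speed gain, and on highly repetitive data like this even the fast setting shrinks it dramatically.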
I tested many compression libraries, and these are my conclusions:
LZO (http://www.oberhumer.com/opensource/lzo/) is very fast when compressing large amounts of data (more than 1 MB).
Snappy (http://code.google.com/p/snappy/) is good but requires more processing resources at decompression (better for data smaller than 1 MB).
http://objectegypt.com offers a library called IHCA, which is faster than LZO at compressing big data, provides good decompression speed and requires no license.
Finally, you may be best off writing your own compression functions, because no one knows your data better than you do.