Compressing digitized document images
We are now required by law to digitize all the financial documents in our company and submit them for evaluation every 3 months.
Since this is sensitive data, we decided to take matters into our own hands and build our own digital data archiver. The tool works perfectly, but after 7 months of usage we are beginning to worry about the disk space used by these images.
Here is some info on the number of documents digitized:
- 15K documents scanned and archived per day, with a final PNG size of roughly 860KB: 15 000 * 860 kilobits = 1.53779984 gigabytes
- 30 days of work per month: 1.53779984 gigabytes * 30 = 46.1339952 gigabytes
- Expectation of disk space usage after 1 year: 46.1339952 gigabytes * 12 = 553.607942 gigabytes
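For anyone who wants to re-run the estimate with their own numbers, here is a minimal Python sketch of the arithmetic in the list above. It assumes binary units (1 kilobit = 1024 bits, 1 gigabyte = 1024^3 bytes), which is the convention that reproduces the figures quoted here. The function name and the `unit` parameter are just illustrative, not part of any tool mentioned in this thread.

```python
# Storage estimate for the scanning workload described above.
# Assumption: binary units, which reproduce the 1.5378 GB/day figure.
BITS_PER_KILOBIT = 1024
BYTES_PER_KILOBYTE = 1024
BYTES_PER_GIGABYTE = 1024 ** 3


def daily_gigabytes(docs_per_day, size_per_doc, unit="kilobits"):
    """Return gigabytes written per day for a given per-document size."""
    if unit == "kilobits":
        bytes_per_doc = size_per_doc * BITS_PER_KILOBIT / 8
    elif unit == "kilobytes":
        bytes_per_doc = size_per_doc * BYTES_PER_KILOBYTE
    else:
        raise ValueError("unit must be 'kilobits' or 'kilobytes'")
    return docs_per_day * bytes_per_doc / BYTES_PER_GIGABYTE


per_day = daily_gigabytes(15_000, 860)   # same inputs as the list above
per_month = per_day * 30                 # 30 working days per month
per_year = per_month * 12

print(f"per day:   {per_day:.2f} GB")
print(f"per month: {per_month:.2f} GB")
print(f"per year:  {per_year:.2f} GB")
```

Note that the calculation treats 860 as kilobits. If the files are really 860 kilobytes each, calling `daily_gigabytes(15_000, 860, unit="kilobytes")` gives a result eight times larger.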
So far we're at 424 gigabytes of disk space used, not counting backups. We're using PNG as the image format, but I would like to know if anyone has advice on a better compression algorithm for images, alternative strategies for compressing the PNGs further, or better ways to archive images so as to save disk space.
Any help would be appreciated, thanks.
You'll be better off with DjVu, a relatively new format that was designed expressly to compress scanned documents. It works well for bitonal, grayscale, and color documents. It combines foreground/background separation with a sophisticated wavelet compression scheme. If you get the commercial version I believe you can also get your documents OCR'd so you can search them, but there is a completely open-source version called DjVuLibre.
Presumably these documents don't need to all be online constantly. If that is the case, from the information you've provided, I don't see any reason why you'd need to change your workflow.
PNG is a widely-supported format with lossless (zlib) compression, which I'm guessing you're using. If you don't need lossless compression, good ole JPEG will give you tighter compression at the expense of minor quality loss, provided you tweak the compression ratios appropriately. JPEG2000 may be another alternative, depending on your software stack. LZW-compressed TIFF offers no major advantages over PNG other than 16-bit-per-pixel support, which you probably don't need. Other options include proprietary specialty codecs (like MrSID) that offer extremely good compression of extremely large files, for a price.
Since these are scanned documents, I guess I would think of PDF as the "natural" format in which to encode them. PDF offers a variety of compression options depending on the contents of the files. But I wouldn't go to great lengths to fix something that isn't broken.
If you think about how much you're spending on drive space now, 1.5 GB per day is nothing. Drive space is cheap and constantly getting cheaper. Just buy three new 1 TB USB drives (primary / backup / offsite backup) every 6 months at a total cost of $240 or whatever. Even tape backup is not unreasonable.
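As a rough sanity check on the drive-buying suggestion, here is a small Python sketch using only the figures from this answer (1.5 GB per day, 1 TB drives, about $240 per set of three every 6 months). The helper names are made up for illustration, and 1 TB is treated as 1000 GB and a month as 30 days, both of which are simplifying assumptions.

```python
# Back-of-the-envelope check of the "buy three 1 TB drives every 6 months" plan.
# Assumptions: 1 TB = 1000 GB, 30 days per month, price taken from the answer.


def months_until_full(drive_gb, gb_per_day, days_per_month=30):
    """How many months of scans fit on one drive."""
    return drive_gb / (gb_per_day * days_per_month)


def yearly_drive_cost(cost_per_purchase, months_between_purchases):
    """Annual spend if a set of drives is bought at a fixed interval."""
    return cost_per_purchase * 12 / months_between_purchases


print(f"months per 1 TB drive: {months_until_full(1000, 1.5):.1f}")
print(f"drive cost per year:   ${yearly_drive_cost(240, 6):.0f}")
```

At that rate a 1 TB drive holds well over a year of scans, so replacing drives every 6 months leaves plenty of headroom.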
500 GB per year is not much, and hard drives are getting cheaper each year.