Storing documents as blobs in a database - are there any disadvantages?
The requirements for my document management system were:
- Must be secure from theft by simple copying of directories, files etc.
- Must be secure against traditional virus infection (infection of physical file)
- Must be fast to retrieve
- The repository must not be visible to casual (directory) browsing users etc.
I have decided to store all documents (and scanned images) as blobs in the database, and so far my experience is wonderful - document retrieval is blindingly fast as well. It meets all the criteria above, and there are even a couple of additional advantages, such as automatically storing documents together with the entity they relate to, easy and fast searching of contents, and removing all sorts of user activity around opening and naming documents.
My question is - are there any serious risks or things that I overlooked with this design and implementation?
EDIT Note: The DB is PostgreSQL; it handles BLOBs very well and scales exceptionally well. The environment is multi-user.
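For readers unfamiliar with the pattern, a minimal sketch of the blob approach looks like this. It uses Python's stdlib `sqlite3` purely as a runnable stand-in for a PostgreSQL `bytea` column accessed through a driver such as psycopg2; the table and column names are hypothetical:

```python
import sqlite3

# sqlite3 stands in for PostgreSQL here so the sketch is self-contained;
# with psycopg2 the content column would be bytea and the parameters the same.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        id        INTEGER PRIMARY KEY,
        entity_id INTEGER NOT NULL,   -- the business entity the file belongs to
        filename  TEXT NOT NULL,
        content   BLOB NOT NULL
    )
""")

scanned_image = b"\x89PNG...fake image bytes..."  # placeholder payload
conn.execute(
    "INSERT INTO documents (entity_id, filename, content) VALUES (?, ?, ?)",
    (42, "invoice_0042.png", scanned_image),
)

# Retrieval is a single indexed lookup; no file-system path is ever exposed
# to the user, which is what keeps casual directory browsing out.
row = conn.execute(
    "SELECT filename, content FROM documents WHERE entity_id = ?", (42,)
).fetchone()
print(row[0])
```

The same access-control rules that guard the rest of the database then guard the documents, which is where the theft- and browsing-related requirements are met.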
The main disadvantage I often hear of using blobs is that, above a certain size, the file system is much more efficient at storing and retrieving large files. It sounds like you've already taken this into account in your list of requirements.
There's a good reference (PDF) here that covers the pros and cons of blobs.
From my experience, some issues were:
- Speed vs. having files on the file system.
- Caching. IMO the web server will do a better job of caching static content. The DB will do a good job too, but if the DB is also handling all sorts of other queries, don't expect those large documents to stay cached for long. You essentially have to transfer the files twice: once from the DB to the web server, and then from the web server to the client.
- Memory constraints. At my last job we had a 40MB PDF in the database and kept getting Java OutOfMemoryErrors in the log file. We eventually realized that the entire 40MB PDF was read into the heap not just once, but TWICE, thanks to a setting in Hibernate ORM (if an object is mutable, it makes a copy for editing in memory). Once the PDF was streamed back to the user the heap was cleaned up, but it was a big hit to pull 80MB out of the heap at once just to stream a document. Know your code and how memory is being used!
Your web server should be able to handle most of your security concerns, but if documents are small and the DB isn't already under a big load, then I don't really see a big issue with having them in the DB.
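The memory problem above comes from materializing the whole blob at once. One common mitigation (a sketch, not the answerer's actual fix; `stream_blob` and the chunk size are made up for illustration) is to copy the document to the client in fixed-size chunks:

```python
import io

CHUNK_SIZE = 64 * 1024  # stream in 64 KB pieces instead of one 40 MB read

def stream_blob(blob_reader, write_chunk):
    """Copy a large blob to the client in fixed-size chunks so the whole
    document never sits in the heap at once. Returns bytes transferred."""
    total = 0
    while True:
        chunk = blob_reader.read(CHUNK_SIZE)
        if not chunk:
            break
        write_chunk(chunk)
        total += len(chunk)
    return total

# Simulate a 1 MB document coming out of the database and going to a socket.
source = io.BytesIO(b"x" * (1024 * 1024))
sink = io.BytesIO()
sent = stream_blob(source, sink.write)
```

Whether this is possible depends on the driver and ORM: the Hibernate copy described above defeats chunking unless the blob is mapped as a stream rather than a byte array.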
I've just started researching SQL Server 2008's FILESTREAM for BLOBs and have run across a HUGE limitation (IMO): it only works with integrated security. If you don't use Windows Authentication to connect to the DB server, you're unable to read/write the BLOBs. Many application environments can't use Windows Authentication - certainly not heterogeneous ones.
A better solution for storing BLOBs must exist. What are the best practices?
This article covers most of the issues. If you are using SQL Server 2008, check out the use of the new FILESTREAM type as discussed by Paul Randal here.
It depends on the database type - Oracle or SQL Server? Be aware of one disadvantage: restoring a single document.
In my experience, storing content files as blobs works OK, in both SQL Server and Oracle, with a small database and a low number of logged-in users.
ECM systems separate them out and use dedicated services for streaming content. Depending on the size of the files, server resources can be strained by simultaneous retrieval of large files. Archiving databases with large sets of files becomes problematic because of the time to restore and the inability to retrieve documents from the archive.
If these files are corporate records, and this is the authoritative copy of the records, you may have compliance and retention-management issues, especially if you archive the files. Search and version control may also become a huge issue going forward.
You may want to investigate an ECM system with an API of some sort, rather than re-inventing the wheel.
Sorry - the answer I offered was based on SQL Server, so the maintenance portion doesn't apply. But file I/O is done at the hardware level, and any database adds extra processing steps.
The database imposes extra overhead when retrieving a document. When the file is on disk, you are only as slow or as fast as the I/O on the server. You certainly should manage your metadata in a database, but in the end you want the UNC path of the file: point the user to the source and get out of the way.
From a maintenance and administration perspective, you will limit yourself to a SAN when dealing with MS SQL Server. Solutions like Documentum take a different approach, with simple storage on disk, and allow you to implement whatever storage solution you see fit.
EDIT
Let me clarify my statement - with SQL Server you have limited options when you exceed the physical storage capacity of the box. This is in fact one of the big weaknesses of SharePoint: you are not able to simply attach any type of network storage.
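The "metadata in the database, bytes on disk" approach this answer favors can be sketched as follows. This is an illustrative pattern, not Documentum's actual design; `store`, the repository directory, and the schema are all made-up names, and sqlite3 stands in for whichever database holds the metadata:

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

# Metadata lives in the database; the bytes live on disk, so file I/O
# stays at the hardware level and the DB only answers small queries.
repo = Path(tempfile.mkdtemp())  # would be a UNC/SAN path in production
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_meta "
    "(id INTEGER PRIMARY KEY, filename TEXT, sha256 TEXT, path TEXT)"
)

def store(filename: str, content: bytes) -> int:
    """Write the bytes to the repository and record their location."""
    digest = hashlib.sha256(content).hexdigest()
    target = repo / digest  # content-addressed name hides the original filename
    target.write_bytes(content)
    cur = conn.execute(
        "INSERT INTO doc_meta (filename, sha256, path) VALUES (?, ?, ?)",
        (filename, digest, str(target)),
    )
    return cur.lastrowid

doc_id = store("contract.pdf", b"%PDF-1.4 fake")
path, = conn.execute(
    "SELECT path FROM doc_meta WHERE id = ?", (doc_id,)
).fetchone()
```

Note the trade-off against the original question's requirements: the files are back on the file system, so theft-by-copying and virus exposure have to be handled with file-system ACLs instead of database permissions.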
When your DB grows bigger and bigger, it will become harder to back up.
Restoring a backup of a table with over 100 GB of data is not something that makes you happy.
Another issue is that all the table-management functions get slower and slower as the dataset grows.
But this can be overcome by making your data table contain just two fields: ID and BLOB.
Retrieving data (by primary key) will likely only become a problem long after you hit a wall backing up the dataset.
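One way to read the "ID and BLOB" suggestion is to split the large payload into its own two-column table, so metadata scans (and, depending on the backup strategy, most backups) never drag the blobs along. A small sketch with hypothetical table names, again using stdlib sqlite3 so it runs anywhere:

```python
import sqlite3

# Narrow metadata table plus a two-column blob table: listing, searching and
# managing documents only ever touches the small rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doc_meta  (id INTEGER PRIMARY KEY, filename TEXT, size INTEGER);
    CREATE TABLE doc_blobs (id INTEGER PRIMARY KEY REFERENCES doc_meta(id),
                            content BLOB);
""")

payload = b"a large scanned image" * 1000
conn.execute("INSERT INTO doc_meta VALUES (1, 'scan.tif', ?)", (len(payload),))
conn.execute("INSERT INTO doc_blobs VALUES (1, ?)", (payload,))

# Listing documents touches only the narrow table ...
names = [r[0] for r in conn.execute("SELECT filename FROM doc_meta")]

# ... and the blob is fetched by primary key only when actually needed.
blob, = conn.execute("SELECT content FROM doc_blobs WHERE id = 1").fetchone()
```

In PostgreSQL specifically, large `bytea` values are TOASTed into a side table automatically, which gives a similar effect without the manual split; the explicit two-table layout mainly helps with per-tablespace placement and selective backups.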