Document/Image Database Repository Design Question
Question:
Should I write my application to access a database image repository directly, or write a middleware piece to handle document requests?
Background:
I have a custom document imaging and workflow application that currently stores about 15 million documents/document images (90%+ single-page Group 4 TIFFs, the rest PDF, Word and Excel documents). The image repository is a commercial third-party application that is very expensive and, frankly, has too much overhead. I just need a system to store and retrieve document images.
I'm considering moving the images directly into a SQL Server 2005 database. The indexing information is very limited: basically 2 index fields. It's a life insurance policy administration system, so I index images with a policy number and a system-wide unique ID number. There are other index values, but they're stored and maintained separately from the image data. Those index values let me look up the unique ID value for individual image retrieval.
The database server is a dual quad-core Windows 2003 box with SAN drives hosting the DB files. The current image repository size is about 650GB. I haven't done any testing to see how large the converted database will be. I'm not really asking about the database design; I'm working with our DBAs on that aspect. If that changes, I'll be back :-)
The current system to be replaced is obviously a middleware application, but it's a very heavyweight system spread across 3 Windows servers. If I go the middleware route, it would be a single-server system.
My primary concerns are scalability and performance, heavily weighted toward performance. I have about 100 users, and usage growth will probably be slow for the next few years.
Most users are primarily read users; they don't add images to the system very often. We have a department that handles scanning and otherwise adding images to the repository. We also have a few other applications that receive documents (via FTP) and insert them into the repository automatically as they are received, either with full index information or as "batches" that a user reviews and indexes.
Most (90%+) of the documents/images are very small, < 100K and probably < 50K, so I believe that storing the images in the database file will be more efficient than getting SQL Server 2008 and using FILESTREAM.
Comments (3)
Oftentimes scalability and performance are ultimately married to each other, in the sense that six months from now management comes back and says "Function Y in Application X is running unacceptably slow, how do we speed it up?" All too often the answer is to upgrade the back-end solution. And when it comes to upgrading back ends, it's almost always going to be less expensive to scale out than to scale up in terms of hardware.
So, long story short, I would recommend building a middleware app that specifically handles incoming requests from the user app and then routes them to the appropriate destination. This will sufficiently abstract your front-end user app from the back-end storage solution, so that when scalability does become an issue, only the middleware app will need to be updated.
This is straightforward. Write the application to an interface, use some kind of factory mechanism to supply that interface, and implement that interface however you want.
Once you're happy with your interface, then the application is (mostly) isolated from the implementation, whether it's talking straight to a DB or to some other component.
Thinking ahead a bit on your interface design while writing bone-stupid, "it's simple, it works here, it works now" implementations offers a good balance: it future-proofs the system without over-engineering it.
It's easy to argue you don't even need an interface at this juncture, rather just a simple class that you instantiate. But if your contract is well defined (i.e. the interface or class signature), that is what protects you from change (such as redoing the back end implementation). You can always replace the class with an interface later if you find it necessary.
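To make the interface-plus-factory idea concrete, here is a minimal sketch in Python. The names `DocumentStore`, `InMemoryStore`, and `create_store` are hypothetical and chosen only for illustration, and the in-memory class is just a stand-in for whatever back end is eventually chosen.

```python
from typing import Protocol


class DocumentStore(Protocol):
    """Contract the application codes against (hypothetical name)."""

    def get(self, doc_id: int) -> bytes: ...

    def put(self, doc_id: int, policy_number: str, data: bytes) -> None: ...


class InMemoryStore:
    """Deliberately simple implementation; a stand-in for a real back end."""

    def __init__(self) -> None:
        self._docs: dict[int, tuple[str, bytes]] = {}

    def get(self, doc_id: int) -> bytes:
        return self._docs[doc_id][1]

    def put(self, doc_id: int, policy_number: str, data: bytes) -> None:
        self._docs[doc_id] = (policy_number, data)


def create_store(kind: str = "memory") -> DocumentStore:
    """Factory: the only place that knows which implementation is in use."""
    if kind == "memory":
        return InMemoryStore()
    raise ValueError(f"unknown store kind: {kind}")
```

The application would only ever call `create_store()` and the two methods on the result, so swapping in a direct-to-database or middleware-backed implementation later should touch the factory and nothing else.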
As for scalability, test it. Then you know not only whether you may need to scale, but perhaps when as well. "Works great for 100 users, problematic for 200; if we hit 150 we might want to consider taking another look at the back end, but it's good for now."
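One rough way to run that kind of test is a small concurrent load script. The sketch below is only an illustration: `fake_fetch` is a hypothetical stand-in that sleeps briefly and returns 50 KB, and it would need to be replaced with a real retrieval call before the numbers mean anything.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def fake_fetch(doc_id: int) -> bytes:
    """Stand-in for a real retrieval call; replace with the actual client."""
    time.sleep(0.001)
    return b"x" * 50_000


def timed_fetch(doc_id: int) -> float:
    """Return the wall-clock time of one fetch, in seconds."""
    start = time.perf_counter()
    fake_fetch(doc_id)
    return time.perf_counter() - start


def run_load(users: int, requests_per_user: int) -> dict[str, float]:
    """Issue requests from `users` concurrent workers; report latency in seconds."""
    total = users * requests_per_user
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = sorted(pool.map(timed_fetch, range(total)))
    return {
        "median": median(latencies),
        "p95": latencies[int(0.95 * (total - 1))],
        "max": latencies[-1],
    }
```

Calling `run_load(100, 10)`, then `run_load(150, 10)` and `run_load(200, 10)`, and comparing the median and 95th-percentile figures gives a first impression of where latency starts to degrade.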
That's due diligence and a responsible design tactic, IMHO.
I agree with gabriel1836. An added benefit is that you could run a hybrid system for a time, since you aren't going to convert 14 million documents from your proprietary system to your home-grown system overnight.
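A hybrid period like that could be handled inside the middleware by checking the new store first and falling back to the legacy one. The following is a minimal sketch under that assumption; `HybridStore` is a hypothetical name, plain dictionaries stand in for both systems, and copying on read is just one possible migration policy.

```python
class HybridStore:
    """Read from the new store first; fall back to the legacy system on a miss."""

    def __init__(self, new_store: dict[int, bytes], legacy_store: dict[int, bytes]) -> None:
        self.new_store = new_store
        self.legacy_store = legacy_store

    def get(self, doc_id: int) -> bytes:
        try:
            return self.new_store[doc_id]
        except KeyError:
            # Not migrated yet: fetch from the legacy system and copy it over,
            # so the next read is served by the new store.
            data = self.legacy_store[doc_id]
            self.new_store[doc_id] = data
            return data

    def put(self, doc_id: int, data: bytes) -> None:
        # New documents go only to the new store.
        self.new_store[doc_id] = data
```

With this in place, a bulk conversion job can run in the background while users keep working, and the legacy system can be retired once nothing falls through to it any more.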
Also, I would strongly encourage you to store the documents outside of the database. Store them on a file system (local, SAN, or NAS; it doesn't matter) and store pointers to the documents in the database.
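As an illustration of the files-plus-pointers layout, here is a small sketch. It uses SQLite purely as a stand-in for the real database, and the `documents` table, its column names, the `.bin` extension, and the 1000-way directory split are all assumptions made for the example.

```python
import sqlite3
from pathlib import Path


class FilePointerStore:
    """Document bytes live on the file system; the database holds only the path."""

    def __init__(self, root: Path, conn: sqlite3.Connection) -> None:
        self.root = root
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "doc_id INTEGER PRIMARY KEY, "
            "policy_number TEXT NOT NULL, "
            "path TEXT NOT NULL)"
        )

    def put(self, doc_id: int, policy_number: str, data: bytes) -> None:
        # Spread files across subdirectories so no single directory grows huge.
        rel = Path(f"{doc_id % 1000:03d}") / f"{doc_id}.bin"
        target = self.root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
                (doc_id, policy_number, rel.as_posix()),
            )

    def get(self, doc_id: int) -> bytes:
        row = self.conn.execute(
            "SELECT path FROM documents WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            raise KeyError(doc_id)
        return (self.root / row[0]).read_bytes()
```

Storing the path relative to a configurable root means the files can be moved between local disk, SAN, or NAS by changing one setting rather than rewriting every row.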
I'd love to know what document management system you are using now.
Also, don't underestimate the effort of replacing the capture (scanning and importing) provided by the proprietary system.