Suggestions for a database (NoSQL? SQL?) to store metadata about 200 million images (1 million books)
Friends,
We are undertaking a knowledge preservation project to scan more than 1 million books. We need suggestions on implementing a database for storing and retrieving metadata, and for tracking the scanning status of each object (book).
Should we go for SQL or NoSQL? The metadata could vary from project to project; this project, say, could have 15 fields.
We are thinking of something based on Lucene/Solr or a scalable RDF database.
Is there an open source solution that lets us define custom metadata fields and store the information with a search feature?
Comments (2)
Disclaimer: I have never attempted this type of project.
I have seen very good performance from Microsoft SQL Server's FILESTREAM storage. It uses the NTFS file APIs to store binary data on the file system and keeps a pointer in the rows of your table.
If the metadata has no fixed structure you could use XML, but if it does have a repeating structure, put it into relational tables; you can then use indexing etc. to get the performance you need.
Filestream Type
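To make the "repeating structure in relational tables, with indexes" idea concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` purely as a stand-in for whatever relational database is chosen (it is not SQL Server), and every table, column, and function name is a hypothetical choice for illustration. Fixed fields and the scan status live in a `book` table; per-project custom fields go into a key/value `book_field` table.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: fixed columns in `book`, variable fields in `book_field`.
    conn.executescript("""
        CREATE TABLE book (
            id          INTEGER PRIMARY KEY,
            title       TEXT NOT NULL,
            scan_status TEXT NOT NULL DEFAULT 'pending'
        );
        CREATE TABLE book_field (
            book_id INTEGER NOT NULL REFERENCES book(id),
            name    TEXT NOT NULL,
            value   TEXT NOT NULL,
            PRIMARY KEY (book_id, name)
        );
        -- Indexes support status reports and lookups by custom field.
        CREATE INDEX idx_book_status  ON book(scan_status);
        CREATE INDEX idx_field_lookup ON book_field(name, value);
    """)


def add_book(conn, title, fields):
    """Insert a book plus its custom metadata fields; return the new book id."""
    cur = conn.execute("INSERT INTO book (title) VALUES (?)", (title,))
    book_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO book_field (book_id, name, value) VALUES (?, ?, ?)",
        [(book_id, name, value) for name, value in fields.items()],
    )
    return book_id


def set_status(conn, book_id, status):
    """Record the scanning status of one book."""
    conn.execute("UPDATE book SET scan_status = ? WHERE id = ?", (status, book_id))


def find_by_field(conn, name, value):
    """Return (id, title) for every book whose custom field matches exactly."""
    rows = conn.execute(
        "SELECT b.id, b.title FROM book b "
        "JOIN book_field f ON f.book_id = b.id "
        "WHERE f.name = ? AND f.value = ? ORDER BY b.id",
        (name, value),
    )
    return rows.fetchall()


def count_by_status(conn):
    """Return a {status: number_of_books} mapping."""
    return dict(conn.execute(
        "SELECT scan_status, COUNT(*) FROM book GROUP BY scan_status"
    ))
```

A new project with a different set of fields then only adds rows to `book_field`; no schema change is needed. Exact-match lookups are covered by the index, while full-text search would still need something like Lucene/Solr on top.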
A solution like this can be created using any database and some custom code, but it is probably made easier by using a CMS (content management system). CMS solutions hide the details of the underlying database and allow you to work with an extensible set of metadata for describing your documents.
Which CMS you use will depend on your budget, in-house expertise and your needs, amongst other factors. I have been using Alfresco (commercial open source), partly because my company had already decided on it, but if I were to do a low-budget website I might consider the non-Enterprise version. Oh, and Alfresco uses Lucene for search.
If your needs are very basic, then a database for the metadata, a filesystem for the images and some code for your server should be sufficient. Avoid trying to store images in the database, since in my experience this is not what databases do best.
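As a rough illustration of the "filesystem for the images" part, the sketch below derives a relative file path from a book id and page number, so the database only has to store a short string per image. The function name, the zero-padded directory layout, and the default `tif` extension are all assumptions made for this example, not something prescribed above.

```python
from pathlib import PurePosixPath


def image_path(book_id: int, page: int, ext: str = "tif") -> str:
    """Build a relative path for one scanned page.

    The 9-digit, zero-padded book id is split into two 3-digit directory
    levels so that no single directory collects millions of entries.
    """
    book = f"{book_id:09d}"
    return str(PurePosixPath(book[0:3], book[3:6], book, f"{page:04d}.{ext}"))
```

For example, `image_path(123456, 7)` gives `000/123/000123456/0007.tif`. With 1 million books, each second-level directory holds at most 1,000 book folders, and each book folder holds only that book's pages.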