BLOB Storage - 100+ GB, MySQL, SQLite, or PostgreSQL + Python

Posted on 2024-07-04 22:54:11

I have an idea for a simple application that will monitor a group of folders and index any files it finds. A GUI will let me quickly tag new files and move them into a single database for storage, and will also provide an easy mechanism for querying the database by tag, name, file type, and date. At the moment I have 100+ GB of files on a couple of removable hard drives, so the database will be at least that big. If possible, I would like to support full-text search of the embedded binary and text documents. This will be a single-user application.

I'm not trying to start a DB war, but which open source database is going to work best for me? I am pretty sure SQLite is off the table, but I could be wrong.

Comments (6)

和影子一齐双人舞 2024-07-11 22:54:15

Why are you wasting time emulating something that the filesystem should be able to handle? More storage plus grep is your answer.

2024-07-11 22:54:14

Pretty much any of them would work (even though SQLite wasn't meant to be used in a concurrent multi-user environment, which could be a problem...), since you don't want to index the actual contents of the files.

The only limiting factor is the maximum "packet" size of the given DB (by packet I mean a single query or response). Usually these limits are around 2MB, meaning that your files must be smaller than 2MB. Of course you could increase this limit, but the whole process is rather inefficient, since to insert a file, for example, you would have to:

  • Read the entire file into memory
  • Transform the file into a query (which usually means hex-encoding it, doubling its size from the start)
  • Execute the generated query (which means the database has to parse it)
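The three steps above can be sketched roughly as follows. This is only an illustration, not how any particular driver works internally: it uses Python's built-in `sqlite3` module as a stand-in for whichever database is chosen, the `files` table and the `insert_blob_as_literal` helper are made up for the example, and random bytes stand in for a file read from disk.

```python
import os
import sqlite3

def insert_blob_as_literal(conn, name, payload):
    """Hypothetical helper: embed the payload in the SQL text as a hex literal."""
    hex_text = payload.hex()  # step 2: two hex characters per byte
    # Embedding is safe here only because hex_text contains nothing but [0-9a-f].
    query = f"INSERT INTO files (name, data) VALUES (?, X'{hex_text}')"
    conn.execute(query, (name,))  # step 3: the engine parses the whole literal
    return len(query)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, data BLOB)")

payload = os.urandom(100_000)  # step 1: stands in for a file read fully into memory
query_len = insert_blob_as_literal(conn, "sample.bin", payload)
stored = conn.execute(
    "SELECT data FROM files WHERE name = ?", ("sample.bin",)
).fetchone()[0]

print(len(payload), query_len)  # the query text is a little over twice the payload
```

Python's `sqlite3` module can also bind `bytes` as a query parameter, which skips the hex step, though the whole payload is still held in memory.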

I would go with a simple DB plus the associated files stored on disk under a naming convention that makes them easy to find (for example, one based on the primary key). Of course this design is not "pure", but it will perform much better and is also easier to use.
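A minimal sketch of that layout, assuming SQLite for the metadata and a directory tree keyed by the row id. The table, the function names, and the 1000-files-per-directory bucketing are all choices made for this example, not anything prescribed above.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

def open_catalog(db_path):
    """Open (or create) the metadata database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " id INTEGER PRIMARY KEY,"
        " original_name TEXT NOT NULL)"
    )
    return conn

def stored_path(store_dir, file_id):
    """Derive the on-disk location from the primary key alone."""
    bucket = f"{file_id // 1000:06d}"  # at most 1000 files per directory
    return Path(store_dir) / bucket / str(file_id)

def add_file(conn, store_dir, source):
    """Record a file's metadata, then copy its bytes into the store."""
    source = Path(source)
    with conn:  # commits on success, rolls the row back if the copy fails
        cur = conn.execute(
            "INSERT INTO files (original_name) VALUES (?)", (source.name,)
        )
        target = stored_path(store_dir, cur.lastrowid)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
    return cur.lastrowid

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.txt"
    sample.write_bytes(b"hello")
    conn = open_catalog(":memory:")
    file_id = add_file(conn, Path(tmp) / "store", sample)
    location = stored_path(Path(tmp) / "store", file_id)
    roundtrip_ok = location.read_bytes() == b"hello"
    conn.close()
```

Because the path is computed from the id, the database never has to store it, and moving the whole store only means changing `store_dir`.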

香草可樂 2024-07-11 22:54:14

I always hate to answer "don't", but you'd be better off indexing with something like Lucene (PyLucene). Beyond that, storing the paths in the database rather than the file contents is almost always recommended.

To add to that, none of those database engines will store LOBs in a separate dataspace (they'll be embedded in the table's data space), so any of those engines should perform about equally well (except SQLite). You would need to move to Informix, DB2, SQL Server, or others to get that kind of binary object handling.

日久见人心 2024-07-11 22:54:13

My preference would be to store the document with the metadata. One reason is relational integrity: if the files live in the database, they can't easily be moved or modified without the action being brokered by the DB. I am sure I can handle these problems with external files, but it isn't as clean as I would like, and my experience has been that most vendors can handle huge amounts of binary data in the database these days. I guess I was wondering whether PostgreSQL or MySQL has any obvious advantages in these areas; I am primarily familiar with Oracle. Anyway, thanks for the response. If the DB knows where the external file is, it will also be easy to bring the file in at a later date if I want. Another aspect of the question was whether either database is easier to work with from Python. I'm assuming that is a wash.

我们只是彼此的过ke 2024-07-11 22:54:13

Why store the files in the database at all? Simply store your metadata and a filename. If you need to copy the files to a new location for some reason, just do that as a filesystem copy.

Once you take the file contents out, any competent database will be able to handle the metadata for a few hundred thousand files.
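As a rough idea of what a metadata-only catalog could look like, here is a sketch using SQLite. The schema, the `find_files` helper, and the sample paths are invented for illustration; the columns simply mirror the tag, file type, and date lookups mentioned in the question.

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (
    id        INTEGER PRIMARY KEY,
    path      TEXT NOT NULL UNIQUE,
    file_type TEXT NOT NULL,
    modified  TEXT NOT NULL          -- ISO 8601 date, so it sorts as text
);
CREATE TABLE tags (
    file_id INTEGER NOT NULL REFERENCES files(id),
    tag     TEXT NOT NULL,
    PRIMARY KEY (file_id, tag)
);
CREATE INDEX tags_by_tag ON tags(tag);
"""

def find_files(conn, tag, file_type=None, since=None):
    """Return paths carrying `tag`, optionally narrowed by type and date."""
    sql = "SELECT f.path FROM files f JOIN tags t ON t.file_id = f.id WHERE t.tag = ?"
    params = [tag]
    if file_type is not None:
        sql += " AND f.file_type = ?"
        params.append(file_type)
    if since is not None:
        sql += " AND f.modified >= ?"
        params.append(since)
    sql += " ORDER BY f.path"
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample rows; only paths and metadata are stored, never file contents.
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", [
    (1, "/mnt/disk1/taxes-2023.pdf", "pdf", "2024-01-15"),
    (2, "/mnt/disk1/notes.txt", "txt", "2023-06-01"),
    (3, "/mnt/disk2/invoice.pdf", "pdf", "2022-11-30"),
])
conn.executemany("INSERT INTO tags VALUES (?, ?)", [
    (1, "finance"), (3, "finance"), (2, "personal"),
])

finance_pdfs = find_files(conn, "finance", file_type="pdf")
recent_finance = find_files(conn, "finance", since="2023-01-01")
```

Rows this small stay cheap to index and query even at a few hundred thousand files, which is the point being made above.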

玻璃人 2024-07-11 22:54:12

I'm still researching this option for one of my own projects, but CouchDB may be worth a look.
