Storing a large amount of data: database or file system?
Let's say my application creates, stores and retrieves a very large number of entries (tens of millions). Each entry holds a variable amount of data (for example, some entries have only a few bytes, such as an ID and title, while others may have megabytes of supplementary data). The basic structure of each entry is the same, and it is in XML format.
Entries are created and edited arbitrarily (most likely by appending, not rewriting).
Does it make sense to store entries as separate files in a file system while keeping the necessary indexes in the DB, as opposed to saving everything in the DB?
Comments (7)
It really depends on how you're going to use it. Databases can handle more entries in a table than most people think, especially with proper indexing. On the other hand, if you aren't going to use the functionality that a relational database provides, there might not be much reason to use one.
OK, enough generalizing. Given that a database eventually boils down to "files on disk" anyway, I wouldn't worry too much about what "the right thing to do" is. If the primary purpose of the database is just to retrieve these files efficiently, I think it would be perfectly fine to keep the DB entries small and look up file paths instead of the actual data - especially since your file system should be pretty efficient at retrieving data given a specific location.
In case you're interested, this is actually a common data storage pattern for search engines - the index stores the indexed data and a pointer to the stored data on disk, rather than storing everything in the index.
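A minimal sketch of the "small DB rows pointing at files" layout described above, assuming SQLite as a stand-in for whatever database is actually used. The table name, column names and helper functions are all hypothetical, chosen only for illustration. Note that appending to a file this way leaves a sequence of XML fragments rather than one well-formed document.

```python
import sqlite3
from pathlib import Path


def open_index(db_path):
    """Open (or create) the small index DB. Table/column names are illustrative."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " path TEXT NOT NULL)"
    )
    return conn


def _path_for(conn, entry_id):
    """Look up the file path recorded for an entry, or raise KeyError."""
    row = conn.execute(
        "SELECT path FROM entries WHERE id = ?", (entry_id,)
    ).fetchone()
    if row is None:
        raise KeyError(entry_id)
    return Path(row[0])


def create_entry(conn, root, entry_id, title, xml_text):
    """Write the entry body to its own file; store only ID, title and path in the DB."""
    path = Path(root) / f"{entry_id}.xml"
    path.write_text(xml_text, encoding="utf-8")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO entries (id, title, path) VALUES (?, ?, ?)",
            (entry_id, title, str(path)),
        )
    return path


def append_to_entry(conn, entry_id, more_text):
    """Append to the entry's file; the DB row does not change."""
    with open(_path_for(conn, entry_id), "a", encoding="utf-8") as fh:
        fh.write(more_text)


def load_entry(conn, entry_id):
    """Resolve the path through the DB, then read the data from the file system."""
    return _path_for(conn, entry_id).read_text(encoding="utf-8")
```

The DB rows stay a few dozen bytes each no matter how large an entry grows, which is the point of the pattern; the cost is that the file write and the DB insert are not one atomic operation.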
I would definitely store the data on the file system and a hash or the path in the DB.
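One possible reading of "a hash" in the short answer above is to derive each file's location from a hash of its ID, so that tens of millions of files are spread across many directories instead of sitting in one. The sketch below is an assumption about how that might look, not something the answer spells out; the function name and the two-level layout are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(entry_id: str, levels: int = 2, width: int = 2) -> str:
    """Map an entry ID to a relative path of the form 'ab/cd/<sha256 hex>.xml'.

    The leading hex characters of the digest pick the subdirectories, so the
    same ID always maps to the same path.
    """
    digest = hashlib.sha256(entry_id.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return str(PurePosixPath(*parts, digest + ".xml"))
```

With the defaults this fans out over 256 * 256 = 65,536 directories. Because the path can be recomputed from the ID, the DB could store just the hash, or nothing beyond the ID itself.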
Well, depending on your costs, MS SQL Server has what's called a "Primary XML Index" that can be created even on unstructured data. This allows you to write XQuery to search within the XML columns, and the database will assist you.
If there is any coherency at all in the data, or it can be placed into a schema, then you may see a benefit from this.
If you have large amounts of binary data such as images, might I recommend that you strip these out and place them somewhere else, such as a file system. Or, if you use SQL Server 2008, there is a storage option called "Filestream" (cheers @Marc_s) which allows you to index, store and secure all the files you write and use NTFS APIs to retrieve them (i.e. fast block transfer), while still keeping them as columns in the database.
If your application puts large demands on searching through the XML data, having the database there might give you a good layer of abstraction and scaling, which means that you don't have to build that yourself.
Just my 2c.
At work I often have to accumulate large sets of XML documents for later analysis. Normally this is done by sticking them into a directory, and the analysis is done by grep (or a bespoke Java program with all its XML factory/builder/wrapper/API paraphernalia).
One slow day I thought I'd try putting them in PostgreSQL. There are two features that I wanted to try out:
Regarding the first feature, the DB size was less than half the size of the raw files. Doing a full text search (a table scan using WHERE data::TEXT LIKE '%pattern%') was actually quicker than running grep on the files. When you are dealing with a few GB of XML, this alone makes the DB worthwhile.
The second feature, indexing, is a bit more work to maintain. There were a few particular elements I guessed would be good to index. An index on xpath('//tradeHeader/tradeId/text()', data) works, but it can be a pain to duplicate the expression in each query. I found it easier to add ordinary columns for some fields, and use insert/update triggers to keep them in sync.
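The "ordinary column kept in sync by triggers" approach from the PostgreSQL answer can be sketched as follows. This is an illustration under stated assumptions: it uses SQLite (from the Python standard library) instead of PostgreSQL so that it runs on its own, and the table, trigger and function names are hypothetical. Since SQLite has no xpath() function, the extraction is done by a small Python function registered with the connection.

```python
import sqlite3
import xml.etree.ElementTree as ET


def extract_trade_id(xml_text):
    """Return the text of the first tradeHeader/tradeId below the root, else None."""
    if xml_text is None:
        return None
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    node = root.find(".//tradeHeader/tradeId")
    return node.text if node is not None else None


def open_store(db_path=":memory:"):
    """Create the docs table, an index on the extracted column, and sync triggers."""
    conn = sqlite3.connect(db_path)
    # Make the Python extractor callable from SQL, including from trigger bodies.
    conn.create_function("extract_trade_id", 1, extract_trade_id)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS docs (
            id INTEGER PRIMARY KEY,
            data TEXT NOT NULL,      -- the raw XML document
            trade_id TEXT            -- ordinary column derived from data
        );
        CREATE INDEX IF NOT EXISTS docs_trade_id ON docs (trade_id);

        CREATE TRIGGER IF NOT EXISTS docs_after_insert AFTER INSERT ON docs
        BEGIN
            UPDATE docs SET trade_id = extract_trade_id(NEW.data) WHERE id = NEW.id;
        END;

        -- Fires only when data changes, so the UPDATE of trade_id cannot retrigger it.
        CREATE TRIGGER IF NOT EXISTS docs_after_update AFTER UPDATE OF data ON docs
        BEGIN
            UPDATE docs SET trade_id = extract_trade_id(NEW.data) WHERE id = NEW.id;
        END;
        """
    )
    return conn
```

Queries can then filter on the plain column (WHERE trade_id = ?) and use the ordinary index, without repeating the extraction expression each time. The function has to be registered on every connection that writes to the table, which is a SQLite-specific cost of this stand-in.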
A couple of considerations:
These are generally easier to marshal with a database than with a file system. But probably the hardest thing is to synchronise a file system backup with a database's roll-forward (redo) logging. The more transactional your application, the more these factors matter.
It appears from your question that you do not intend to use any of the normal database functionality (relational integrity, joins). In that case you should give strong consideration to a third option: store your data in the file system and, instead of a database, use a file-based text retrieval engine like Solr (or Lucene), Sphinx, Autonomy, etc.
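To give a feel for what a text retrieval engine does with files on disk, here is a toy inverted index: a mapping from each term to the files that contain it. It is only a sketch of the idea and is not how Solr, Lucene or Sphinx are used or implemented; the function names are hypothetical, and the tokenizer is deliberately naive (it indexes tag names along with the text).

```python
import re
from collections import defaultdict
from pathlib import Path

# Naive tokenizer: runs of ASCII letters and digits, lowercased by the callers.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def build_index(root):
    """Scan every .xml file under root; map each token to the set of file paths."""
    index = defaultdict(set)
    for path in sorted(Path(root).rglob("*.xml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for token in TOKEN.findall(text.lower()):
            index[token].add(str(path))
    return index


def search(index, query):
    """Return the sorted paths of files that contain every token in the query."""
    tokens = TOKEN.findall(query.lower())
    if not tokens:
        return []
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return sorted(hits)
```

The data stays in the files; the index holds only terms and pointers back to those files. A real engine adds persistence, ranking and incremental updates on top of this basic structure.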
I would use HDFS (Hadoop Distributed File System) to store the data. The main idea is that you get high availability, scalability and replication. Any queries to your application can be written as MapReduce queries, and the main fields can be stored as a distributed index on top of Hadoop using Katta.
Try googling for these technologies.
It depends on how you are going to use the data, as a previous response says.
The data in a database can be used to support a lot of different kinds of queries, and to feed the results to reports, forms, OLAP engines and lots of other kinds of tools. Appropriate indexing can speed up searches dramatically.
If you know SQL, and if the database is well designed, coming up with queries is easier, quicker and less error-prone than doing the equivalent thing with files. But, as others have noted, you can plug your XML data into SQL without moving it to a database.
Designing a good multipurpose schema is harder than most beginners think it is. There's a lot to learn, and it isn't just about how to manipulate one tool or another. And a bad multipurpose schema can be even harder to work with than files.
If you decide to go with a database, be prepared to make a significant investment, and make sure you are going to get the benefits of that investment.