Row count limits for an open-source database?
I have a project in which I'm doing data mining on a large database. I currently store all of the data in text files, and I'm trying to understand the costs and benefits of storing the data in a relational database instead. The points look like this:
CREATE TABLE data (
    source1 CHAR(5),
    source2 CHAR(5),
    idx11   INT,
    idx12   INT,
    idx21   INT,
    idx22   INT,
    point1  FLOAT,
    point2  FLOAT
);
How many points like this can I have with reasonable performance? I currently have ~150 million data points, and I probably won't have more than 300 million. Assume that I am using a box with four dual-core 2 GHz Xeon CPUs and 8 GB of RAM.
3 Answers
PostgreSQL should be able to amply accommodate your data -- it supports up to 32 terabytes per table, and so on. If I understand correctly, you're talking about roughly 5 GB currently and 10 GB at most (about 36 bytes/row and up to 300 million rows), so almost any database should in fact be able to accommodate you easily.
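As a quick sanity check once the data is loaded, PostgreSQL can report the actual on-disk footprint directly; a minimal sketch, assuming the table is named data as in the question:

-- Total on-disk size of the table, including its indexes and TOAST storage.
SELECT pg_size_pretty(pg_total_relation_size('data'));

-- Size of the heap alone, without indexes:
SELECT pg_size_pretty(pg_relation_size('data'));

Comparing the two numbers also shows how much of the footprint comes from indexes, which is useful when deciding how many to add on a 300-million-row table.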
FYI: Postgres scales better than MySQL on multi-processor / overlapping requests, according to a review I was reading a few months back (sorry, no link).
I assume from your profile this is some sort of bioinformatics (codon sequences, enzyme vs protein amino acid sequences, or some such) problem. If you are going to attack this with concurrent requests, I'd go with Postgres.
OTOH, if the data is going to be loaded once, then scanned by a single thread, maybe MySQL in its "ACID not required" mode would be the best match.
You've got some planning to do around your access use case(s) before you can select the "best" stack.
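If the load-once, scan-later pattern fits, the one-shot import into MySQL could look like the sketch below. The file path and tab-delimited format are assumptions about your text files, not something stated in the question:

-- Defer index maintenance during the bulk load (effective for MyISAM tables).
ALTER TABLE data DISABLE KEYS;

-- Hypothetical bulk load from one of the existing text files.
LOAD DATA INFILE '/data/points.txt'
INTO TABLE data
FIELDS TERMINATED BY '\t';

-- Rebuild the indexes in one pass after the load.
ALTER TABLE data ENABLE KEYS;

Building indexes once after the load is generally much faster than maintaining them row by row across 150 million inserts.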
MySQL is more than capable of serving your needs, as is Alex's suggestion of PostgreSQL. Reasonable performance shouldn't be difficult to achieve, but if the table is going to be heavily accessed and see a large amount of DML, you will want to know more about the locking used by the database you end up choosing.
I believe PostgreSQL uses row-level locking out of the box, whereas in MySQL it depends on the storage engine you choose. MyISAM only locks at the table level, so concurrency suffers, but storage engines such as InnoDB can and will use row-level locking to increase throughput. My suggestion would be to start with MyISAM and move to InnoDB only if you find you need row-level locking. MyISAM works well in most situations and is extremely lightweight. I've had tables of over 1 billion rows in MySQL using MyISAM, and with good indexing and partitioning you can get great performance. You can read more about storage engines in MySQL at
MySQL Storage Engines, and about table partitioning at Table Partitioning. Here is an article on partitioning in practice on a table of 113M rows that you may find useful as well.
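To make the engine and partitioning advice concrete, here is a minimal sketch of the question's table declared with MyISAM and hash-partitioned. The choice of idx11 as the partitioning key, the 16 partitions, and the k_sources index are illustrative assumptions, not recommendations from the question:

-- Same schema as the question, with an explicit engine and hash
-- partitioning to spread the ~300M rows across multiple partitions.
CREATE TABLE data (
    source1 CHAR(5),
    source2 CHAR(5),
    idx11   INT,
    idx12   INT,
    idx21   INT,
    idx22   INT,
    point1  FLOAT,
    point2  FLOAT,
    KEY k_sources (source1, source2)  -- supports lookups by source pair
) ENGINE=MyISAM
PARTITION BY HASH(idx11) PARTITIONS 16;

With a scheme like this, queries that filter on the partitioning key only touch the relevant partition, which is where much of the performance on very large MyISAM tables comes from.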
I think the benefits of storing the data in a relational database far outweigh the costs. There are so many things you can do once your data is in a database: point-in-time recovery, data integrity guarantees, finer-grained security access, partitioning of data, availability to other applications through a common language (SQL), etc.
Good luck with your project.