Large primary key: 1+ billion rows, MySQL + InnoDB?
I was wondering whether InnoDB would be the best engine for this table. The table contains a single field, the primary key, and will gain an estimated 816k rows a day, so it will get very large very quickly. I'm also working on a file-based storage approach; would that be faster? The table will store the ID numbers of Twitter IDs that have already been processed.
Also, what is the estimated memory usage of a SELECT min('id') statement? Any other ideas are greatly appreciated!
Comments (7)
I'd recommend you start partitioning your table by ID or date. Partitioning splits a large table into several smaller tables according to some defined logic (such as date ranges), which makes them much more manageable in terms of performance and memory. MySQL 5.1 has this feature built in, or you can implement it with a custom solution.
If you implement storage in a flat file, you lose all the advantages of a database: you can no longer run queries against the data.
The only definitive answer is to try both, test, and see what happens.
Generally, MyISAM is faster for writes and for reads, but not for both at the same time. When you write to a MyISAM table, the entire table is locked until the insert completes. InnoDB has more overhead but uses row-level locking, so reads and writes can happen concurrently without the problems that MyISAM's table locking incurs.
However, your problem, if I understand it correctly, is a little different. Having only one column, with that column being the primary key, matters because MyISAM and InnoDB handle primary key indexes differently.
In MyISAM, the primary key index is just like any other secondary index. Internally each row has a row id, and the index nodes simply point to the row ids in the data pages. A primary key index is not handled differently from any other index.
In InnoDB, however, primary keys are clustered, meaning they stay attached to the data pages and ensure that the row contents remain in physically sorted order on disk according to the primary key (but only within single data pages, which themselves could be scattered in any order).
This being the case, I would expect InnoDB to have an advantage, in that MyISAM would essentially have to do double work: write the integer once in the data pages, and then write it again in the index pages. InnoDB wouldn't do this; the primary key index would be identical to the data pages, so the value would only have to be written once. InnoDB would only have to manage the data in one place, where MyISAM would needlessly have to manage two copies.
For either storage engine, doing something like min() or max() on an indexed column should be trivial, as should checking whether a number exists in the index. Since the table has only one column, no bookmark lookups would even be necessary, because the data would be represented entirely within the index itself. This should be a very efficient index.
I also wouldn't be all that worried about the size of the table. Where the width of a row is only one integer, you can fit a huge number of rows per index/data page.
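To put rough numbers on the size question, here is a back-of-the-envelope sketch in Python. The 816k rows/day figure comes from the question; everything else is an assumption made for illustration: an 8-byte key (as for a BIGINT column), InnoDB's default 16 KB page, and roughly 18 bytes of per-row overhead in the clustered index. It ignores page headers and fill factor, so the rows-per-page figure is an upper bound rather than a measurement.

```python
import math

ROWS_PER_DAY = 816_000      # from the question (estimated)
KEY_BYTES = 8               # assumption: IDs stored as an 8-byte BIGINT
ROW_OVERHEAD_BYTES = 18     # assumption: approximate InnoDB per-row overhead
PAGE_BYTES = 16 * 1024      # assumption: default 16 KB InnoDB page


def days_to_reach(rows):
    """Days needed to accumulate `rows` rows at the estimated insert rate."""
    return math.ceil(rows / ROWS_PER_DAY)


def rows_per_page():
    """Upper bound on rows per page, ignoring page headers and fill factor."""
    return PAGE_BYTES // (KEY_BYTES + ROW_OVERHEAD_BYTES)


def raw_key_bytes(rows):
    """Bytes needed for the key values alone, with no storage overhead."""
    return rows * KEY_BYTES


if __name__ == "__main__":
    billion = 1_000_000_000
    print("days to 1 billion rows:", days_to_reach(billion))
    print("rows per page (upper bound):", rows_per_page())
    print("raw key data in GB:", raw_key_bytes(billion) / 1e9)
```

Under these assumptions a billion rows is a bit over three years away and the raw key data is about 8 GB, which supports the point that a one-integer row packs densely into pages.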
If these ID numbers are monotonically increasing and your writes only append data (never modify it), it'll probably be a lot faster to use a single file. A SELECT min('id') then just becomes reading the first line of the file, and anything else is a binary search.
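A minimal sketch of that single-file idea in Python, under stated assumptions: IDs arrive in strictly increasing order, and each one is stored as a fixed-width 8-byte record rather than a text line, so that a binary search can seek directly to any record. The function names and the record format are hypothetical, chosen only for illustration.

```python
import os
import struct

# Assumption: each ID is one 8-byte big-endian unsigned integer record.
RECORD = struct.Struct(">Q")


def append_id(path, new_id):
    """Append an ID; the caller guarantees IDs arrive in increasing order."""
    with open(path, "ab") as f:
        f.write(RECORD.pack(new_id))


def min_id(path):
    """The smallest ID is simply the first record in the file."""
    with open(path, "rb") as f:
        data = f.read(RECORD.size)
    return RECORD.unpack(data)[0] if len(data) == RECORD.size else None


def contains(path, target):
    """Binary search over the sorted fixed-width records."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)
            (value,) = RECORD.unpack(f.read(RECORD.size))
            if value == target:
                return True
            if value < target:
                lo = mid + 1
            else:
                hi = mid
    return False
```

Each lookup touches about log2(n) records, roughly 30 seeks for a billion IDs. This sketch does nothing about concurrent writers or crash safety, which is part of what a database would provide.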
If you have an index on your id column, select min(id) should be effectively O(1), and there shouldn't be much of a memory requirement for it.
If your primary key is on the Twitter ID, then you already have an index on it.
There is a good comparison of storage engines on the MySQL Dev Zone.
From your description I would say MyISAM would be better, but it depends quite a lot on the relative read and write patterns of your app.
With one single field, which is the primary key, and records only ever being added, this is not really suited to a regular database.
For a start, you're storing twice as much information as you need to, with every field going into both the data table and the index.
As an aside, relational databases are so called because, for one thing, they store related data in a single row; it's hard to see how your data qualifies :-) If you were storing other stuff as well, a database would be worth it.
You don't mention whether the data will be accessed by multiple processes at once. If not, then you don't need all the advantages conferred by database ACID principles. Even if you do want ACID, that can still be achieved without a full-blown database.
My first thought would be to construct your own B-tree or B+-tree data file to store the Twitter IDs and avoid the data duplication. The only queries I can see you doing (based on the question) are finding the lowest ID and checking whether a given ID is already stored.
The first can be made O(1) by simply storing the lowest value in another file outside of the B-tree structure (and replacing it when you get a lower one). I'm not sure of the business case for this one, unless it's to find out quickly that a certain Twitter ID isn't in the table (in which case you'd probably want max as well).
The second is standard tree searching, which is what a database generally uses under the covers anyway.
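The idea of caching the lowest (and highest) value outside the tree can be sketched as follows. This is an illustration only: a sorted Python list with bisect stands in for the B-tree file, and the class and attribute names are hypothetical.

```python
import bisect


class IdStore:
    """Hypothetical in-memory stand-in for the B-tree file described above."""

    def __init__(self):
        self._ids = []        # sorted list standing in for the tree
        self.lowest = None    # cached outside the "tree" so min is O(1)
        self.highest = None   # cached as well, for fast rejection

    def add(self, new_id):
        """Insert an ID; return False if it was already stored."""
        pos = bisect.bisect_left(self._ids, new_id)
        if pos < len(self._ids) and self._ids[pos] == new_id:
            return False
        self._ids.insert(pos, new_id)
        if self.lowest is None or new_id < self.lowest:
            self.lowest = new_id
        if self.highest is None or new_id > self.highest:
            self.highest = new_id
        return True

    def __contains__(self, candidate):
        # IDs outside the cached range cannot be present, so skip the search.
        if self.lowest is None or not (self.lowest <= candidate <= self.highest):
            return False
        pos = bisect.bisect_left(self._ids, candidate)
        return pos < len(self._ids) and self._ids[pos] == candidate
```

Reading `store.lowest` never touches the search structure, and membership tests for IDs outside the stored range return immediately. A real on-disk B-tree would replace the list; the list insert here is O(n) and is used only to keep the sketch short.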
I've also seen some trading firms use tick databases, e.g. kdb+:
http://kx.com/