Database vs fstream access
I have a (local) database (MySQL 5.1 on Ubuntu 10.10) with some 15,000 tables, each with ~1,000,000 rows on average. Each table has 6 DOUBLE columns. The storage engine is MyISAM. I have a C++ application that loads the data one table at a time and performs some calculations.
The way I retrieve the data from the database is simply by: SELECT * FROM table ORDER BY timestamp; (timestamp is the first column (DOUBLE) marked as UNIQUE)
By far most of the time is spent in loading and fetching. It takes ~15s to load and fetch all the rows in one table (tried with the native C API, C++ Connector and MySQL Query Browser).
When I load the same dataset from disk (plain text file) using fstream the same operation takes only ~4s.
Is it possible for MySQL or any other database (SQLite?) to get anywhere near this value?
Although I have mostly simple SELECTs and INSERTs (+ one simple JOIN), I like the idea of a database because it is somewhat easier to manage large data sets, so I would stick with it even at the cost of some performance loss. But 15 s vs. 4 s per table is way too much given the number of tables. I would be fine with 6 s vs. 4 s though...
Thanks.
Petr
Sequential scan of all records isn't exactly the most convincing use case for a relational database, but I definitely would encourage you to benchmark SQLite as well. It's generally considered a high-performance replacement for custom file I/O.
Reading a file is not the same as using SQL to fetch the data. Reading a file only involves reading from the disk and putting it into memory. That's it.
Now, using SQL to fetch structured data is different. First, MySQL has to parse the query and structure it so it can execute it and read the data. When executing the query, MySQL opens the database file and reads some metadata related to that database.
Then, when that is done, it parses the file and fetches the data according to the query. There is also a small overhead because the communication between client and server is done via sockets.
So, there is a huge difference between file access and what MySQL does. With MySQL you get much, much more, at the cost of speed.
Why do you need 15 000 tables anyway? I sense a flaw in your design if you need so many tables...
If performance is of absolute concern, you can also experiment with mmap. This allows you to have a disk-backed memory area, exploiting very well optimized virtual memory and caching code.
I've seen an application (used in a major social networking site) that, for a very specific need, replaced a cluster of 8 large MySQL servers with optimized C++ code running on a single blade at ~5-10% utilization. (It calculated the social graph and shortest paths between users).
In general, you end up paying for the generalized solution. Analyse your needs carefully, apply algorithmic knowledge, then choose your weapon.
Many designers make the mistake of choosing what they know, then trying to fudge algorithms into it, and only afterwards taking care of the actual needs.
Firstly, you're abusing the database fairly badly by having 15,000 tables. This is not how these databases are intended to work.
Secondly, any client-server database is likely to need several copy operations in memory, which will impose an upper limit on the speed even when the data are already in memory. Something like sqlite may avoid (some of) these copies by using data directly from the buffer.
You're using a SQL database for something it's not intended for - and abusing it, at that. I wouldn't expect it to do a very good job.