High read-performance database solution for storing simple data on Windows
I thought I had a simple problem, but I have been struggling with it for the last few days. To make a long story short, here is a description of it:
- I have about 1 million new records daily; each record consists of record_name (string, about 20 characters), date, and value. The records are stored for about two years (~700 million records in the database);
- most of the record_names repeat every day;
- I need to be able to find the biggest value gains between given dates, while preserving the ability to filter the results by wildcarding the record_name;
- the software running this needs to work on Windows XP;
- the most important thing is the execution time of each query.
So far I have tried MySQL and Cassandra. While MySQL has rather acceptable performance on Linux (by acceptable I mean that my not-so-high skills were sufficient to program something that works), it is very slow on Windows. The same is true of Cassandra.
The data inserted into those databases is imported from .csv files. The first import takes about 5 minutes into MySQL and 20 minutes into Cassandra; later imports take more time. I suspect that I may have misconfigured something, but to be honest, I didn't change anything performance-related in the config files.
The question is: what is the best solution for working with this kind of data with performance in mind? The programming language is not important; almost any will work, as the queries are simple and implementing them in any language will not take much time.
Thank you very much for your interest in helping.
Comments (4)
Without more detail, the problem posed seems to be a fairly classic BI problem, where the solution is usually to pre-materialize aggregate data, optimizing for queries rather than insertion.
I'd ask a simple question first: do you need all the data in the DB? I'd be tempted to just cook the different aggregates (all, by record type, etc.) by time range, e.g. by day or hour, into simpler, smaller rowsets, using simple console apps to read rows, compute aggregates, then dump them into SQL for access by query.
While it's cool to select * across all rows, resist it unless there's a real business need.
With aggregate queries in mind, NoSQL storage is a distraction. Plain ol' files would be great for retaining your raw rows, and tools like SQL are more than enough for the aggregate queries.
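To make the "console app that reads rows, computes aggregates, and dumps them into SQL" idea concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the CSV header names, the choice of min/max/count as the aggregates, the `daily_agg` table, and SQLite (from the standard library) standing in for whatever SQL database you end up using.

```python
import csv
import io
import sqlite3

# Hypothetical sample input; real data would come from the daily .csv files.
SAMPLE_CSV = """record_name,date,value
alpha_01,2011-01-01,10.0
alpha_01,2011-01-01,12.5
beta_02,2011-01-01,7.0
alpha_01,2011-01-02,15.0
"""


def aggregate_rows(rows):
    """Collapse raw rows into one [min, max, count] summary per (name, day)."""
    summary = {}
    for row in rows:
        key = (row["record_name"], row["date"])
        value = float(row["value"])
        if key not in summary:
            summary[key] = [value, value, 1]
        else:
            agg = summary[key]
            agg[0] = min(agg[0], value)
            agg[1] = max(agg[1], value)
            agg[2] += 1
    return summary


def store_aggregates(conn, summary):
    """Write the summaries into a small table that queries can hit directly."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_agg ("
        " record_name TEXT NOT NULL, day TEXT NOT NULL,"
        " min_value REAL, max_value REAL, row_count INTEGER,"
        " PRIMARY KEY (record_name, day))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_agg VALUES (?, ?, ?, ?, ?)",
        [(name, day, lo, hi, n) for (name, day), (lo, hi, n) in summary.items()],
    )
    conn.commit()


def build_demo():
    """Run the whole pipeline on the sample data using an in-memory database."""
    rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
    conn = sqlite3.connect(":memory:")
    store_aggregates(conn, aggregate_rows(rows))
    return conn
```

The raw .csv files stay on disk as the archive, and only the much smaller summary table has to be fast to query. Which aggregates are worth keeping depends on the queries you actually need to answer.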
You need to identify the bottleneck first. Possible candidates: hard drive, database, driver, application code. You should also try PostgreSQL, but honestly your choices for running something reliably on Windows are slim, unless you go with Microsoft SQL Server (not free). Also, Cassandra and MySQL are VERY different and solve different problems.
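One rough way to see which stage dominates is to time each stage separately. The sketch below is only a harness shape under stated assumptions: the data is synthetic, the table and stage names are made up, and an in-memory SQLite database stands in for the real one, so it exercises no disk and no network driver. Pointed at your actual files and database, the same structure would show whether parsing, inserting, or querying is the slow part.

```python
import csv
import io
import sqlite3
import time


def timed(label, func, timings):
    """Run func(), record its wall-clock duration under label, return its result."""
    start = time.perf_counter()
    result = func()
    timings[label] = time.perf_counter() - start
    return result


def profile_pipeline(n_rows=10000):
    """Time CSV parsing, bulk insert, and one wildcard query as separate stages."""
    timings = {}

    # Synthetic CSV text: 100 distinct names cycling over 28 days.
    text = "".join(
        f"name_{i % 100:04d},2011-01-{(i % 28) + 1:02d},{i * 0.5}\n"
        for i in range(n_rows)
    )

    rows = timed("parse_csv", lambda: list(csv.reader(io.StringIO(text))), timings)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (record_name TEXT, day TEXT, value REAL)")

    def insert():
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
        conn.commit()

    timed("insert", insert, timings)

    result = timed(
        "query",
        lambda: conn.execute(
            "SELECT record_name, MAX(value) - MIN(value) FROM records "
            "WHERE record_name LIKE 'name_000%' GROUP BY record_name"
        ).fetchall(),
        timings,
    )
    return timings, result


if __name__ == "__main__":
    stage_times, _ = profile_pipeline()
    for stage, seconds in stage_times.items():
        print(f"{stage}: {seconds:.4f} s")
```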
Another option to consider is Berkeley DB. It's small, fast, scalable and very lightweight. It supports various APIs, including SQL, key-value pairs (NoSQL) and a Java Object persistence API. Generally speaking, Berkeley DB will outperform a general-purpose RDBMS because it's much smaller, more efficient and has much less overhead. Berkeley DB runs on Windows and is an excellent choice for developers who are looking for simple, easy to use, embedded data management.
SQL Server Express should fit your needs just fine.
Just aggregate the data: use SQL to get the values you need and store those in other tables.
Using this method, you can easily get around any performance difficulties and size limitations of SQL Server Express.