30 million records a day, SQL Server can't keep up: do we need a different kind of database system?
Some time ago I thought out a new statistics system for our multi-million-user website, to log and report user actions for our customers.
The database design is quite simple: one table with a foreignId (200,000 distinct ids), a datetime field, an actionId (30 distinct ids), and two more fields containing some meta-information (just smallints). There are no constraints to other tables. Furthermore we have two indexes, each containing 4 fields, which cannot be dropped, as users get timeouts when we use smaller indexes. The foreignId is the most important field, as each and every query contains this field.
We chose to use SQL Server, but after implementation a relational database doesn't seem like a good fit: we cannot insert 30 million records a day (it's insert only, we don't do any updates) while also doing a lot of random reads on the database, because the indexes cannot be updated fast enough. Ergo: we have a massive problem :-) We have temporarily solved the problem, yet
a relational database doesn't seem to be suited for this problem!
Would a database like BigTable be a better choice, and why? Or are there other, better choices when dealing with this kind of problem?
NB. At this point we use a single 8-core Xeon system with 4 GB memory and Win 2003 32-bit. RAID10 SCSI as far as I know. The index size is about 1.5x the table size.
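For reference, a minimal sketch of the table and indexes as described above. All names and exact column types here are assumptions for illustration; the real schema will differ. The sketches in the answers below reuse these hypothetical names.

-- Hypothetical reconstruction of the statistics table.
CREATE TABLE dbo.UserAction (
    foreignId  int      NOT NULL,  -- ~200,000 distinct values, present in every query
    actionDate datetime NOT NULL,
    actionId   smallint NOT NULL,  -- ~30 distinct values
    meta1      smallint NOT NULL,
    meta2      smallint NOT NULL
);

-- The two 4-column indexes, both leading on foreignId.
CREATE INDEX IX_UserAction_ForeignId_Date   ON dbo.UserAction (foreignId, actionDate, actionId, meta1);
CREATE INDEX IX_UserAction_ForeignId_Action ON dbo.UserAction (foreignId, actionId, actionDate, meta2);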
Comments (8)
You say that your system is capable of inserting 3,000 records per second without indexes, but only about 100 with two additional non-clustered indexes. If 3k/s is the maximum throughput your I/O permits, adding two indexes should in theory reduce the throughput to about 1,000-1,500/sec. Instead you see a degradation ten times worse than that. The proper answer is 'it depends', and some serious troubleshooting and bottleneck identification would have to be carried out. With that in mind, if I were to venture a guess, I'd name two possible culprits:
A. The additional non-clustered indexes distribute the writes of dirty pages across more allocation areas. The solution would be to place the clustered index and each non-clustered index into its own filegroup, and place the three filegroups onto separate LUNs on the RAID (a sketch of both options follows after point B).
B. The low selectivity of the non-clustered indexes creates high contention between reads and writes (key conflicts as well as %lockres% conflicts), resulting in long lock wait times for both inserts and selects. Possible solutions would be using SNAPSHOT isolation with read committed snapshot mode, but I must warn about the danger of adding a lot of IO to the version store (i.e. in tempdb) on a system that may already be under high IO stress. A second solution is using database snapshots for reporting; they cause lower IO stress and can be better controlled (no tempdb version store involved), but the reporting is no longer on real-time data.
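A minimal sketch of both remedies, using the hypothetical dbo.UserAction schema sketched under the question (database, filegroup, file and path names are assumptions):

-- A: move each index into its own filegroup, each filegroup on its own LUN.
ALTER DATABASE StatsDb ADD FILEGROUP FG_IX1;
ALTER DATABASE StatsDb ADD FILE (NAME = StatsDb_IX1, FILENAME = 'E:\Data\StatsDb_IX1.ndf') TO FILEGROUP FG_IX1;
CREATE INDEX IX_UserAction_ForeignId_Date
    ON dbo.UserAction (foreignId, actionDate, actionId, meta1)
    WITH (DROP_EXISTING = ON)
    ON FG_IX1;
-- (repeat for the second non-clustered index and a third filegroup/LUN)

-- B: row versioning so readers stop blocking writers, at the cost of extra tempdb IO.
ALTER DATABASE StatsDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Alternative for B: a database snapshot for reporting (not real-time; needs one sparse file
-- per data file of the source database; the logical file name below is an assumption).
CREATE DATABASE StatsDb_Reporting
    ON (NAME = StatsDb_Data, FILENAME = 'E:\Snapshots\StatsDb_Reporting.ss')
    AS SNAPSHOT OF StatsDb;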
I tend to believe B) is the likely cause, but I must again stress the need for proper investigation and proper root cause analysis.
'RAID10' is not a very precise description.
As to the question whether relational databases are appropriate for something like this: yes, absolutely. There are many more factors to consider: recoverability, availability, the toolset ecosystem, available expertise, ease of development, ease of deployment, ease of management and so on. Relational databases can easily handle your workload; they just need the proper tuning. 30 million inserts a day, about 350 per second, is small change for a database server. But a 32-bit system with 4 GB of RAM is hardly a database server, regardless of the number of CPUs.
It sounds like you may be suffering from two particular problems. The first issue you are hitting is that your indexes require updating every time you perform an insert - are you really trying to run live reports off a transactional server (this is usually considered a no-no)? Secondly, you may also be hitting issues with the server having to resize the database - check to ensure that you have allocated enough space and aren't relying on the database to do this for you.
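For the second point, a minimal sketch of pre-sizing the data file so the server never has to autogrow in the middle of the insert load (database and logical file names are assumptions):

-- Pre-allocate plenty of space and use a fixed growth increment as a safety net,
-- instead of relying on the default autogrow settings.
ALTER DATABASE StatsDb
    MODIFY FILE (NAME = StatsDb_Data, SIZE = 200GB, FILEGROWTH = 10GB);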
Have you considered looking into something like indexed views in SQL Server? They are a good way to remove the indexing from the main table, and move it into a materialised view.
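A minimal sketch of an indexed view over the hypothetical dbo.UserAction table from the question (view and index names, and the chosen grouping, are illustrative):

-- The view must be schema-bound and, with GROUP BY, must include COUNT_BIG(*).
CREATE VIEW dbo.vUserActionCounts
WITH SCHEMABINDING
AS
SELECT foreignId,
       actionId,
       COUNT_BIG(*) AS actionCount
FROM dbo.UserAction
GROUP BY foreignId, actionId;
GO

-- Materialise it; reports can read the small aggregate instead of the base-table indexes.
CREATE UNIQUE CLUSTERED INDEX IX_vUserActionCounts
    ON dbo.vUserActionCounts (foreignId, actionId);

Note that SQL Server maintains the materialised aggregate on every insert, so this trades some per-row insert cost for much cheaper reporting queries.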
You could try making the table a partitioned one. This way the index updates will affect smaller sets of rows. Probably daily partitioning will be sufficient. If not, try partitioning by the hour!
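A minimal sketch of daily partitioning on the datetime column of the hypothetical dbo.UserAction table (function, scheme and boundary values are illustrative; table partitioning requires Enterprise Edition on SQL Server versions of that era):

-- Daily boundaries; a scheduled job would keep adding new ones with ALTER PARTITION FUNCTION.
CREATE PARTITION FUNCTION pfUserActionDay (datetime)
    AS RANGE RIGHT FOR VALUES ('20090701', '20090702', '20090703');

CREATE PARTITION SCHEME psUserActionDay
    AS PARTITION pfUserActionDay ALL TO ([PRIMARY]);

-- Cluster the table on the partition scheme so each day's index updates stay local.
CREATE CLUSTERED INDEX CIX_UserAction_Date
    ON dbo.UserAction (actionDate, foreignId)
    ON psUserActionDay (actionDate);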
You aren't providing enough information; I'm not certain why you say that a relational database seems like a bad fit, other than the fact that you're experiencing performance problems now. What sort of machine is the RDBMS running on? Given that you have foreign IDs, it seems that a relational database is exactly what's called for here. SQL Server should be able to handle 30 million inserts per day, assuming that it's running on sufficient hardware.
Replicating the database for reporting seems like the best route, given heavy traffic. However, a couple of things to try first...
Go with a single index, not two indexes. A clustered index is probably going to be a better choice than a non-clustered one. Fewer, wider indexes will generally perform better than more, narrower indexes. And, as you say, it's the indexing that's killing your app.
You don't say what you're using for IDs, but if you're using GUIDs, you might want to change your keys over to bigints. Because GUIDs are random, they put a heavy burden on indexes, both in building indexes and in using them. Using a bigint identity column will keep the index running pretty much chronologically, and if you're really interested in real-time access for queries on your recent data, your access pattern is much better suited to monotonically increasing keys.
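A minimal sketch of both suggestions applied to the hypothetical dbo.UserAction table from the question (column and index names are illustrative):

-- A monotonically increasing surrogate key: inserts always append at the end of the clustered index.
ALTER TABLE dbo.UserAction
    ADD userActionId bigint IDENTITY(1, 1) NOT NULL;

CREATE UNIQUE CLUSTERED INDEX CIX_UserAction_Id
    ON dbo.UserAction (userActionId);

-- A single non-clustered index on foreignId, instead of two, for the per-customer queries.
CREATE INDEX IX_UserAction_ForeignId
    ON dbo.UserAction (foreignId, actionDate)
    INCLUDE (actionId, meta1, meta2);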
Sybase IQ seems pretty good for this goal, as our architects/DBAs indicated (as in, they explicitly moved all our stats onto IQ, citing that capability as the reason). I cannot substantiate this myself though - I merely nod at the people in our company who generally know what they are talking about from past experience.
However, I'm wondering whether you MUST store all 30 million records. Would it not be better to store some pre-aggregated data?
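A minimal sketch of what pre-aggregation could look like on the hypothetical dbo.UserAction table (the summary table, hourly granularity and interval values are all illustrative; the right granularity depends on the reports):

-- Hourly roll-up per customer and action; raw rows could then be archived or trimmed.
CREATE TABLE dbo.UserActionHourly (
    foreignId   int      NOT NULL,
    actionId    smallint NOT NULL,
    actionHour  datetime NOT NULL,
    actionCount int      NOT NULL,
    CONSTRAINT PK_UserActionHourly PRIMARY KEY (foreignId, actionId, actionHour)
);

-- Run periodically (e.g. from a SQL Agent job) for the interval that just closed.
DECLARE @intervalStart datetime, @intervalEnd datetime;
SET @intervalStart = '20090701 10:00';
SET @intervalEnd   = '20090701 11:00';

INSERT INTO dbo.UserActionHourly (foreignId, actionId, actionHour, actionCount)
SELECT foreignId,
       actionId,
       DATEADD(HOUR, DATEDIFF(HOUR, 0, actionDate), 0) AS actionHour,
       COUNT(*)
FROM dbo.UserAction
WHERE actionDate >= @intervalStart AND actionDate < @intervalEnd
GROUP BY foreignId, actionId, DATEADD(HOUR, DATEDIFF(HOUR, 0, actionDate), 0);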
Not sure about SQL Server, but in another database system I used long ago, the ideal method for this type of activity was to store the updates and then, as a batch, turn off the indexes, add the new records, and then reindex. We did this once per night. I'm not sure whether your reporting needs would fit this type of solution, or even whether it can be done in MS SQL, but I'd think it could.
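In SQL Server this pattern can be approximated by disabling the non-clustered indexes before a nightly bulk load and rebuilding them afterwards; a minimal sketch, using the hypothetical index names from the question and an assumed staging table:

-- Disable the non-clustered indexes before the nightly bulk load...
ALTER INDEX IX_UserAction_ForeignId_Date   ON dbo.UserAction DISABLE;
ALTER INDEX IX_UserAction_ForeignId_Action ON dbo.UserAction DISABLE;

-- ...load the staged rows in one go...
INSERT INTO dbo.UserAction (foreignId, actionDate, actionId, meta1, meta2)
SELECT foreignId, actionDate, actionId, meta1, meta2
FROM dbo.UserAction_Staging;

-- ...then rebuild each index once instead of updating it row by row.
ALTER INDEX IX_UserAction_ForeignId_Date   ON dbo.UserAction REBUILD;
ALTER INDEX IX_UserAction_ForeignId_Action ON dbo.UserAction REBUILD;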
You don't say how the inserts are managed. Are they batched or is each statistic written separately? Because inserting one thousand rows in a single operation would probably be way more efficient than inserting a single row in one thousand separate operations. You could still insert frequently enough to offer more-or-less real time reporting ;)
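A minimal sketch of batching on the SQL side, assuming the hypothetical dbo.UserAction table and SQL Server 2008 or later for the multi-row VALUES syntax (the values shown are made up):

-- One round trip and one transaction for many rows instead of one per row.
INSERT INTO dbo.UserAction (foreignId, actionDate, actionId, meta1, meta2)
VALUES (117,  '20090701 10:00:01', 3, 1, 0),
       (2045, '20090701 10:00:01', 7, 0, 2),
       (117,  '20090701 10:00:02', 3, 1, 0);
-- ...up to 1000 rows per statement; older versions can batch by wrapping many
-- single-row INSERTs in one transaction, or by using bulk copy (SqlBulkCopy / BCP).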