Is there a fast and scalable solution to save data?
I'm developing a service that needs to be scalable on the Windows platform.
Initially it will receive approximately 50 connections per second (each connection will send approximately 5 KB of data), but it needs to scale to more than 500 connections per second in the future.
It's impractical (I guess) to save the received data to a general-purpose database like Microsoft SQL Server.
Is there another solution for saving the data, considering that it will receive more than 6 million "records" per day?
There are 5 steps:
- Receive the data via an HTTP handler (C#);
- Save the received data; <- HERE
- Request the saved data for processing;
- Process the requested data;
- Save the processed data. <- HERE
My pre-solution is:
- Receive the data via an HTTP handler (C#);
- Save the received data to a message queue;
- Have a Windows service request the saved data from the message queue (MSMQ) for processing;
- Process the requested data;
- Save the processed data to Microsoft SQL Server (here's the bottleneck).
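A minimal sketch of the queue plumbing in that pipeline, assuming MSMQ via the System.Messaging assembly; the queue path is hypothetical and would be created once at deployment time:

    using System.IO;
    using System.Messaging; // .NET Framework assembly; requires the MSMQ Windows feature

    public static class IncomingQueue
    {
        // Hypothetical private queue path; create it once at deployment time.
        private const string QueuePath = @".\private$\incoming";

        // Called from the HTTP handler: dump the raw bytes and return quickly.
        public static void Enqueue(byte[] payload)
        {
            using (var queue = new MessageQueue(QueuePath))
            using (var message = new Message { BodyStream = new MemoryStream(payload) })
            {
                queue.Send(message);
            }
        }

        // Called from the Windows service: blocks until a message arrives.
        public static byte[] Dequeue()
        {
            using (var queue = new MessageQueue(QueuePath))
            using (var message = queue.Receive())
            {
                var buffer = new MemoryStream();
                message.BodyStream.CopyTo(buffer);
                return buffer.ToArray();
            }
        }
    }

With this split, the HTTP handler only pays the cost of a local queue send, while the Windows service drains the queue at its own pace.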
3 Answers
6 million records per day doesn't sound particularly huge. In particular, that's not 500 per second for 24 hours a day: 6 million spread over 86,400 seconds averages out to roughly 70 per second. Do you expect traffic to be "bursty"?
I personally wouldn't use a message queue - I've been bitten by instability and general difficulties before now. I'd probably just write straight to disk. In memory, use a producer/consumer queue with a single thread writing to disk; the producers just dump records to be written into the queue.
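A rough sketch of that in-memory producer/consumer arrangement, assuming .NET 4's BlockingCollection; the file path and the string record format are placeholders:

    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    public class DiskWriter
    {
        // Unbounded in-memory queue; producers never touch the disk directly.
        private readonly BlockingCollection<string> _pending = new BlockingCollection<string>();

        public DiskWriter(string path)
        {
            // A single consumer thread owns the file, so no locking is needed.
            Task.Factory.StartNew(() =>
            {
                using (var writer = new StreamWriter(path, true)) // append mode
                {
                    foreach (string record in _pending.GetConsumingEnumerable())
                    {
                        writer.WriteLine(record);
                    }
                }
            }, TaskCreationOptions.LongRunning);
        }

        // Producers (e.g. HTTP handler threads) just dump records into the queue.
        public void Enqueue(string record)
        {
            _pending.Add(record);
        }

        // Call at shutdown; the consumer drains the queue and the file is closed.
        public void Complete()
        {
            _pending.CompleteAdding();
        }
    }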
Have a separate batch task which will insert a bunch of records into the database at a time.
Benchmark the optimal (or at least a "good") number of records to batch-upload at a time. You may well want to have one thread reading from disk and a separate one writing to the database (with the file thread blocking if the database thread has a big backlog) so that you don't wait for both file access and the database at the same time.
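For the batch upload itself, one option worth benchmarking is SqlBulkCopy; a minimal sketch, with a hypothetical dbo.Records destination table and an arbitrary starting batch size:

    using System.Data;
    using System.Data.SqlClient;

    public static class BatchUploader
    {
        // Pushes a whole batch in one call; far cheaper than row-by-row INSERTs.
        public static void BulkInsert(DataTable batch, string connectionString)
        {
            using (var bulk = new SqlBulkCopy(connectionString))
            {
                bulk.DestinationTableName = "dbo.Records"; // hypothetical table
                bulk.BatchSize = 5000;                     // tune by benchmarking
                bulk.WriteToServer(batch);
            }
        }
    }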
I suggest that you do some tests nice and early, to see what the database can cope with (and to let you test various different configurations). Work out where the bottlenecks are, and how much they're going to hurt you.
I think that you're prematurely optimizing. If you need to send everything into a database, then see if the database can handle it before assuming that the database is the bottleneck.
If the database can't handle it, then maybe turn to a disk-based queue like Jon Skeet is describing.
Why not do this:
1.) Receive data
2.) Process data
3.) Save original and processed data at once
That would save you the trouble of requesting it again if you already have it. I'd be more worried about your table structure and your database machine than the actual flow, though. I'd make sure that your inserts are as cheap as possible. If that isn't possible then queuing up the work makes some sense. I wouldn't use a message queue myself. Assuming you have a decent SQL Server machine, 6 million records a day should be fine, assuming you're not writing a ton of data in each record.
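On keeping inserts cheap: one common approach is a single prepared, parameterized command reused across all rows; a minimal sketch, with hypothetical table, column, and record shapes:

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    public static class CheapInserter
    {
        // Hypothetical record shape: a timestamp plus the raw ~5 KB payload.
        public static void Insert(IEnumerable<KeyValuePair<DateTime, byte[]>> records,
                                  string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(
                "INSERT INTO dbo.Records (ReceivedAt, Payload) VALUES (@at, @payload)", conn))
            {
                conn.Open();
                cmd.Parameters.Add("@at", SqlDbType.DateTime);
                cmd.Parameters.Add("@payload", SqlDbType.VarBinary, 8000);
                cmd.Prepare(); // parse and plan once, execute many times

                foreach (var record in records)
                {
                    cmd.Parameters["@at"].Value = record.Key;
                    cmd.Parameters["@payload"].Value = record.Value;
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }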