Recommended document structure for CouchDB
We are currently considering a change from Postgres to CouchDB for a usage monitoring application. Some numbers:
Approximately 2000 connections, polled every 5 minutes, for approximately 600,000 new rows per day. In Postgres, we store this data, partitioned by day:
t_usage {service_id, timestamp, data_in, data_out}
t_usage_20100101 inherits t_usage.
t_usage_20100102 inherits t_usage. etc.
We write data with an optimistic stored proc that presumes the partition exists and creates it if necessary. We can insert very quickly.
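(For illustration, a rough sketch of that optimistic pattern done from application code with psycopg2 rather than as a stored proc; the table and column names are the ones above, everything else here is assumed:)

import datetime

import psycopg2


def insert_usage(conn, service_id, ts, data_in, data_out):
    """Optimistic insert: try the daily partition, create it if it's missing."""
    day = datetime.datetime.utcfromtimestamp(ts).strftime("%Y%m%d")
    table = "t_usage_%s" % day
    insert_sql = ("INSERT INTO " + table +
                  " (service_id, timestamp, data_in, data_out)"
                  " VALUES (%s, %s, %s, %s)")
    with conn.cursor() as cur:
        try:
            cur.execute(insert_sql, (service_id, ts, data_in, data_out))
        except psycopg2.errors.UndefinedTable:
            # The partition doesn't exist yet: roll back, create it, retry.
            # (The real version also adds a CHECK constraint on the day so
            # constraint exclusion keeps the single-day reads fast.)
            conn.rollback()
            cur.execute("CREATE TABLE IF NOT EXISTS " + table +
                        " () INHERITS (t_usage)")
            cur.execute(insert_sql, (service_id, ts, data_in, data_out))
    conn.commit()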
For reading the data, our use cases, in order of importance and current performance, are:
* Single Service, Single Day Usage : Good Performance
* Multiple Services, Month Usage : Poor Performance
* Single Service, Month Usage : Poor Performance
* Multiple Services, Multiple Months : Very Poor Performance
* Multiple Services, Single Day : Good Performance
This makes sense because the partitions are optimised for days, which is by far our most important use case. However, we are looking at methods of improving the secondary requirements.
We often need to parameterise the query by hours as well, for example, only giving results between 8am and 6pm, so summary tables are of limited use. (These parameters change with enough frequency that creating multiple summary tables of data is prohibitive).
With that background, the first question is: Is CouchDB appropriate for this data? If it is, given the above use cases, how would you best model the data in CouchDB documents? Some options I've put together so far, which we are in the process of benchmarking, are (_id and _rev excluded):
One Document Per Connection Per Day
{
  "service_id": 555,
  "day": 20100101,
  "usage": {
    "1265248762": {"in": 584, "out": 11342},
    "1265249062": {"in": 94, "out": 1242}
  }
}
Approximately 60,000 new documents a month. Most new data would be updates to existing documents, rather than new documents.
(Here, the objects in usage are keyed on the timestamp of the poll, and the values are the bytes in and bytes out.)
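(As a side note, the write path for this model would be a read-modify-write against Couch's HTTP API, roughly the sketch below, where the URL, database name and _id scheme are placeholders I've made up:)

import requests

COUCH = "http://localhost:5984/usage"   # placeholder URL / database name


def record_poll(service_id, day, ts, bytes_in, bytes_out):
    """Append one poll result to the per-connection-per-day document."""
    doc_id = "%s-%s" % (service_id, day)            # e.g. "555-20100101"
    resp = requests.get("%s/%s" % (COUCH, doc_id))
    if resp.status_code == 404:
        doc = {"_id": doc_id, "service_id": service_id, "day": day, "usage": {}}
    else:
        doc = resp.json()                           # includes the current _rev
    doc["usage"][str(ts)] = {"in": bytes_in, "out": bytes_out}
    # PUT back with the _rev from the GET; a 409 would mean a concurrent
    # update won and we would have to re-read and retry.
    requests.put("%s/%s" % (COUCH, doc_id), json=doc).raise_for_status()

Since every update writes a complete new revision of the document, most of our traffic under this option would be rewrites of existing documents rather than cheap inserts.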
One Document Per Connection Per Month
{
  "service_id": 555,
  "month": 201001,
  "usage": {
    "1265248762": {"in": 584, "out": 11342},
    "1265249062": {"in": 94, "out": 1242}
  }
}
Approximately 2,000 new documents a month. Moderate updates to existing documents required.
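(For either of the two options above I'm imagining a view along these lines, with made-up design document and view names, which explodes the usage object into one row per poll, keyed by service and time, with a reduce for totals:)

import requests

COUCH = "http://localhost:5984/usage"   # placeholder URL / database name

# Emit one row per poll sample so we can range-query by service and time
# and roll results up with group_level.
MAP_FN = """
function (doc) {
  if (!doc.usage) return;
  for (var ts in doc.usage) {
    var d = new Date(parseInt(ts, 10) * 1000);
    emit([doc.service_id, d.getUTCFullYear(), d.getUTCMonth() + 1,
          d.getUTCDate(), d.getUTCHours()],
         {"in": doc.usage[ts]["in"], "out": doc.usage[ts]["out"]});
  }
}
"""

# Sum the byte counts; the same code works for the rereduce pass because the
# reduced value has the same shape as the mapped value.
REDUCE_FN = """
function (keys, values, rereduce) {
  var total = {"in": 0, "out": 0};
  for (var i = 0; i < values.length; i++) {
    total["in"] += values[i]["in"];
    total["out"] += values[i]["out"];
  }
  return total;
}
"""

design = {
    "_id": "_design/usage",
    "views": {"by_service_time": {"map": MAP_FN, "reduce": REDUCE_FN}},
}
requests.put("%s/%s" % (COUCH, design["_id"]), json=design).raise_for_status()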
One Document Per Row of Data Collected
{
  "service_id": 555,
  "timestamp": 1265248762,
  "in": 584,
  "out": 11342
}
{
  "service_id": 555,
  "timestamp": 1265249062,
  "in": 94,
  "out": 1242
}
Approximately 15,000,000 new documents a month. All data would be inserted as new documents. Inserts are faster, but I have questions about how efficient it's going to be after a year or two with hundreds of millions of documents. The file I/O would seem prohibitive (though I'm the first to admit I don't fully understand the mechanics of it).
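(If we went this route, I assume we would batch each 5-minute poll cycle through _bulk_docs rather than issue ~2,000 individual PUTs; again just a sketch with a made-up URL and database name:)

import requests

COUCH = "http://localhost:5984/usage"   # placeholder URL / database name


def store_poll_cycle(samples):
    """Insert one 5-minute poll cycle, one new document per reading.

    samples is a list of (service_id, timestamp, bytes_in, bytes_out).
    """
    docs = [{"service_id": sid, "timestamp": ts, "in": b_in, "out": b_out}
            for (sid, ts, b_in, b_out) in samples]
    requests.post("%s/_bulk_docs" % COUCH, json={"docs": docs}).raise_for_status()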
I'm trying to approach this in a document-oriented way, though breaking the RDBMS habit is difficult :) The fact that views can only be minimally parameterised also has me a bit concerned. That said, which of the above would be the most appropriate? Are there other formats I haven't considered that would perform better?
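(To make the parameterisation worry concrete, here is how I picture querying the by_service_time view sketched above; for the per-row option the map would emit from doc.timestamp instead, but the query shape is the same. A whole-month rollup is one range query, but the 8am to 6pm restriction can't be expressed as a single key range, so it turns into one request per day or client-side filtering:)

import json

import requests

COUCH = "http://localhost:5984/usage"   # placeholder URL / database name
VIEW = "%s/_design/usage/_view/by_service_time" % COUCH


def month_usage(service_id, year, month):
    """Total in/out for one service for one month: a single range + reduce."""
    params = {
        "startkey": json.dumps([service_id, year, month]),
        "endkey": json.dumps([service_id, year, month, {}]),
        "group_level": 3,
    }
    return requests.get(VIEW, params=params).json()["rows"]


def month_usage_office_hours(service_id, year, month):
    """The same month restricted to 08:00-17:59: one range per day, because
    an hour window inside every day is not a single contiguous key range."""
    rows = []
    for day in range(1, 32):                # days that don't exist return no rows
        params = {
            "startkey": json.dumps([service_id, year, month, day, 8]),
            "endkey": json.dumps([service_id, year, month, day, 17, {}]),
            "group_level": 4,               # roll up to daily totals
        }
        rows.extend(requests.get(VIEW, params=params).json()["rows"])
    return rows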
Thanks in advance,
Jamie.
I don't think it's a horrible idea.
Let's consider your Connection/Month scenario.
Given that an entry is ~40 (that's generous) characters long, and you get ~8,200 entries per month, your final document size will be ~350K long at the end of the month.
That means, going full bore, you'll be reading and writing 2,000 350K documents every 5 minutes.
I/O-wise, this is less than 6 MB/s, considering read and write, averaged over the 5-minute window. That's well within even low-end hardware today.
However, there is another issue. When you store that document, Couch is going to evaluate its contents in order to build its views, so Couch will be parsing those 350K documents. My fear is that (as of my last check, though it's been some time) Couch doesn't scale well across CPU cores, so this could easily pin the single CPU core that Couch will be using. I would like to hope that Couch can read, parse, and process 2 MB/s, but I frankly don't know. For all its benefits, Erlang isn't the best at hauling ass in a straight line.
The final concern is keeping up with the database. By the end of the month you will be writing 700 MB every 5 minutes, and with Couch's append-only architecture that works out to over 8 GB per hour and roughly 200 GB after 24 hrs.
After DB compaction it crushes back down to 700 MB (for a single month), but during that process the file will get big, and quite quickly.
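Spelling out the arithmetic behind those numbers (assuming a 30-day month and the ~40-character entries above, so the results land near the rounder figures I've been quoting):

# Back-of-the-envelope numbers for the connection/month layout.
entries_per_month = 12 * 24 * 30          # polls/hour * hours/day * days = 8,640
doc_size = entries_per_month * 40         # ~40 bytes per entry ~= 346 KB ("350K")
connections = 2000

write_per_cycle = connections * doc_size              # ~691 MB every 5 minutes
io_rate = 2 * write_per_cycle / (5 * 60)              # read + write ~= 4.6 MB/s
append_per_hour = write_per_cycle * 12                # ~8.3 GB appended per hour
append_per_day = append_per_hour * 24                 # ~199 GB/day before compaction

print(doc_size, write_per_cycle, io_rate, append_per_hour, append_per_day)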
On the retrieve side, these large documents don't scare me. Loading up a 350K JSON document, yes it's big, but it's not that big, not on modern hardware. There are avatars on bulletin boards bigger than that. So anything you want to do regarding the activity of a connection over a month will be pretty fast, I think. Across connections, obviously the more you grab, the more expensive it will get (700 MB for all 2,000 connections), and 700 MB is a real number that has real impact. Plus, your process will need to be aggressive about throwing out the data you don't care about (unless you want to load up 700 MB of heap in your report process).
Given these numbers, Connection/Day may be a better bet, as you can control the granularity a bit better. Frankly, though, I would go for the coarsest document you can, because I think that gives you the best value from the database: head seeks and platter rotations are what kill a lot of I/O performance today, while many disks stream data very well. Larger documents (assuming the data is well located, which shouldn't be a problem since Couch gets compacted) stream more than they seek, and seeking in memory is "free" compared to a disk.
By all means run your own tests on your hardware, but take all these considerations to heart.
EDIT:
After more experiments...
A couple of interesting observations.
During the import of large documents, CPU is just as important as I/O speed. That's because of the amount of marshalling and CPU consumed by converting the JSON into the internal model for use by the views. With the large (350K) documents my CPUs were pretty much maxed out (350%); in contrast, with the smaller documents they were humming along at 200%, even though, overall, it was the same information, just chunked up differently.
For I/O, during the 350K-document load I was charting 11 MB/sec, but with the smaller docs it was only 8 MB/sec.
Compaction appeared to be almost I/O bound. It's hard for me to get good numbers on my I/O potential: a copy of a cached file pushes 40+ MB/sec, while compaction ran at about 8 MB/sec, which is consistent with the raw load (assuming Couch is moving stuff message by message). The CPU usage is lower, as it's doing less processing (it isn't interpreting the JSON payloads or rebuilding the views), plus it was a single CPU doing the work.
Finally, for reading, I tried to dump out the entire database. A single CPU was pegged for this, and my I/O was pretty low. I made a point of ensuring that the CouchDB file wasn't actually cached; my machine has a lot of memory, so a lot of stuff gets cached. The raw dump through _all_docs was only about 1 MB/sec, and that's almost all seek and rotational delay. When I did the same with the large documents, the I/O hit 3 MB/sec, which shows the streaming effect I mentioned as a benefit of larger documents.
It should be noted that there are techniques on the Couch website for improving performance that I was not following; notably, I was using random IDs. Also, this wasn't done as a gauge of Couch's absolute performance, but rather of where the load ends up. I thought the large vs. small document differences were interesting.
Finally, ultimate performance isn't as important as simply performing well enough for your application on your hardware. As you mentioned, you're doing your own testing, and that's all that really matters.