Do I need to wait for background compaction to finish after creating test data in order to get a good read benchmark?

Posted 2025-01-13 20:12:06

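I am using RocksDB Java to run some benchmarks on my own application data, and I want to make sure the data I create is stored as optimally as possible before I start measuring read performance (that is, if any background compaction is running during or after the inserts, I want to wait for it to finish). Is this something I need to care about? If so, how can I tell programmatically when it is OK to start my read benchmark?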

作妖 2025-01-20 20:12:06

This is a tricky subject. The overly general answer is to test what you care about. If you care about overall system performance under a mixed read-write workload, that's probably what you should test. If you care about read performance under those conditions, then you should probably test under those conditions. (Note that the RocksDB LOG file reports operation counts and latency statistics, though those don't include penalties associated with the Java layer.) However, it can take hours of testing under such chaotic conditions to get reliable data about a single aspect of performance, such as read latency or max throughput.

If you are willing to sacrifice some statistical validity for more statistical reliability (that is, for faster measurements that are still accurate), then you can run just the read path. As you note, you want to avoid background compactions in order to consistently isolate the read path. For this I recommend re-opening the database as read-only and then performing your reads. Alternatively, you can wait for running compactions to finish by periodically polling the DB property kNumRunningCompactions (property string "rocksdb.num-running-compactions") until it is zero, perhaps several times in a row. This approach generally leaves the LSM in some random, average-ish state that reflects how reads will perform in an active read-write system, though the particular LSM state can vary considerably, so you might want to average over several such states.
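The polling idea can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class `CompactionWaiter`, the method `awaitQuiet`, and all of its parameters are hypothetical names chosen for illustration. To keep the helper independent of the RocksDB jar, the property is read through a `LongSupplier`; the comment shows one way it might be wired to RocksJava's `getLongProperty`, assuming an already opened `RocksDB db`.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: waits until a "running compactions" probe stays at zero. */
public final class CompactionWaiter {
    private CompactionWaiter() {}

    // Assumed wiring to RocksJava (not compiled here):
    //   LongSupplier probe = () -> {
    //       try {
    //           return db.getLongProperty("rocksdb.num-running-compactions");
    //       } catch (RocksDBException e) {
    //           throw new IllegalStateException(e);
    //       }
    //   };

    /**
     * Polls the probe until it reports zero requiredZeros times in a row.
     *
     * @return true once the streak of zero readings is reached; false if
     *         maxPolls polls were used up or the thread was interrupted
     */
    public static boolean awaitQuiet(LongSupplier probe, int requiredZeros,
                                     long pollIntervalMillis, int maxPolls) {
        if (requiredZeros < 1) {
            throw new IllegalArgumentException("requiredZeros must be >= 1");
        }
        int streak = 0;
        for (int poll = 0; poll < maxPolls; poll++) {
            // Any non-zero reading resets the streak, since a new compaction
            // may have been scheduled between two polls.
            streak = (probe.getAsLong() == 0) ? streak + 1 : 0;
            if (streak >= requiredZeros) {
                return true;
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return false;
    }
}
```

A call such as `CompactionWaiter.awaitQuiet(probe, 3, 1000, 600)` would poll once per second for up to ten minutes and return as soon as three consecutive readings are zero. The streak requirement and the timeout values are arbitrary choices here and should be tuned to the write load that preceded the benchmark.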

The problem with running a full compaction before testing read performance is that your LSM will always be in an "optimized" state, so reads will be as fast as they can be. If your actual workload is always read-only after compaction, then by all means test this way, but it's considered to have low validity for real-world read performance for most workloads.

If you are doing A-B testing on a change that doesn't affect how the DB is written, then the best approach is to build a single DB and test read performance under both A and B configurations on that DB, opened read-only. You can even run the A and B tests simultaneously so that each is similarly affected by any noise from other processes on the system.
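A simultaneous A/B run could look something like the sketch below. The class `AbReadBench` and its methods are hypothetical, and the read operations are passed in as plain `Runnable`s so the harness itself needs nothing beyond the JDK. In a real test each `Runnable` would presumably issue reads against its own handle obtained from `RocksDB.openReadOnly` on the same DB path with configuration A or B; that wiring is an assumption and is only shown in a comment.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical harness: runs two read workloads side by side. */
public final class AbReadBench {
    private AbReadBench() {}

    // Assumed wiring to RocksJava (not compiled here):
    //   RocksDB dbA = RocksDB.openReadOnly(optionsA, path);
    //   RocksDB dbB = RocksDB.openReadOnly(optionsB, path);
    //   Runnable readA = () -> { /* dbA.get(someKey) with error handling */ };
    //   Runnable readB = () -> { /* dbB.get(someKey) with error handling */ };

    /** Runs readOp `reads` times and returns the mean latency in nanoseconds. */
    public static double meanLatencyNanos(Runnable readOp, int reads) {
        if (reads < 1) {
            throw new IllegalArgumentException("reads must be >= 1");
        }
        long start = System.nanoTime();
        for (int i = 0; i < reads; i++) {
            readOp.run();
        }
        return (System.nanoTime() - start) / (double) reads;
    }

    /**
     * Runs both workloads at the same time on two threads, so that noise
     * from other processes affects A and B in a similar way.
     *
     * @return { mean latency of A in ns, mean latency of B in ns }
     */
    public static double[] runConcurrently(Runnable readA, Runnable readB, int reads) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<Double> a =
                CompletableFuture.supplyAsync(() -> meanLatencyNanos(readA, reads), pool);
            CompletableFuture<Double> b =
                CompletableFuture.supplyAsync(() -> meanLatencyNanos(readB, reads), pool);
            return new double[] { a.join(), b.join() };
        } finally {
            pool.shutdown();
        }
    }
}
```

The mean is the simplest possible summary. For latency work, percentiles over per-read timings are usually more informative, and a JVM warm-up pass before the measured run helps keep JIT compilation out of the numbers.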

And of course one of the big challenges is that performance characteristics can change dramatically for small DBs vs. large DBs, and large DBs take a very long time to construct.
