Fast way to copy a large database table

Posted 2024-09-09 04:30:55

We are currently trying to solve a performance problem: searching for data and presenting it in a paginated way takes about 2-3 minutes.

Upon further investigation (and after several rounds of SQL tuning), it seems that searching is slow simply because of the sheer amount of data.

A possible solution that I'm currently investigating is to replicate the data into a searchable cache. This cache could live in the database (i.e., a materialized view) or outside it (a NoSQL approach). However, since I would like the cache to be horizontally scalable, I am leaning towards caching outside the database.

I've created a proof of concept, and indeed, searching in my cache is faster than in the db. However, the initial full replication takes a long time to complete. Although the full replication will happen only once, with subsequent replications being incremental over the rows that changed since the last run, it would still be great if I could speed up the initial full replication.
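
As a side note on that incremental step, a common pattern is to re-query only rows touched since the last run. This is only a hedged sketch: it assumes the source table carries an indexed updated_at column and a big_table with id/name/city columns, none of which the post states.

import java.sql.*;
import java.time.Instant;

public class IncrementalSync {

    // Re-copies only the rows changed since lastSync; returns the timestamp
    // to persist for the next run. Table/column names are placeholders.
    public static Instant syncSince(Connection conn, Instant lastSync) throws SQLException {
        Instant now = Instant.now(); // capture *before* reading so no update window is missed
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id, name, city FROM big_table WHERE updated_at > ?")) {
            ps.setTimestamp(1, Timestamp.from(lastSync));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    upsertIntoCache(rs.getLong("id"), rs.getString("name"), rs.getString("city"));
                }
            }
        }
        return now;
    }

    static void upsertIntoCache(long id, String name, String city) {
        // placeholder for the actual cache upsert
    }
}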

However, during the full replication, aside from the slow query execution, I also have to battle network latency. I can deal with the slow query execution time, but the network latency is really slowing the replication down.

Which leads me to my question: how can I speed up my replication? Should I spawn several threads, each one doing a query? Should I use a scrollable result set?
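
A minimal sketch of one way to attack both problems at once, assuming a numeric primary key you can range-partition on: each worker copies one key range over its own connection, and a large fetch size amortizes the network round trips. The table name (big_table), columns, JDBC URL, credentials, and cache writer below are all placeholders, not from the original post.

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelFullCopy {

    static final String URL = "jdbc:postgresql://dbhost/mydb"; // placeholder
    static final String USER = "user", PASS = "pass";          // placeholders
    static final int PARTITIONS = 8;                           // one worker per key range

    public static void main(String[] args) throws Exception {
        long minId, maxId;
        // Find the key range once; assumes the table is non-empty.
        try (Connection c = DriverManager.getConnection(URL, USER, PASS);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT MIN(id), MAX(id) FROM big_table")) {
            rs.next();
            minId = rs.getLong(1);
            maxId = rs.getLong(2);
        }

        long step = Math.max(1, (maxId - minId + 1) / PARTITIONS);
        ExecutorService pool = Executors.newFixedThreadPool(PARTITIONS);
        List<Future<?>> jobs = new ArrayList<>();
        for (long lo = minId; lo <= maxId; lo += step) {
            final long from = lo, to = Math.min(lo + step - 1, maxId);
            jobs.add(pool.submit(() -> copyRange(from, to)));
        }
        for (Future<?> job : jobs) job.get(); // surface any worker failure
        pool.shutdown();
    }

    static void copyRange(long from, long to) {
        try (Connection c = DriverManager.getConnection(URL, USER, PASS)) {
            c.setAutoCommit(false); // some drivers (e.g. PostgreSQL) only honor
                                    // fetchSize-based streaming with autocommit off
            try (PreparedStatement ps = c.prepareStatement(
                    "SELECT id, name, city FROM big_table WHERE id BETWEEN ? AND ?")) {
                ps.setFetchSize(5_000); // one round trip per 5k rows, not per row
                ps.setLong(1, from);
                ps.setLong(2, to);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        writeToCache(rs.getLong("id"), rs.getString("name"), rs.getString("city"));
                    }
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException("copy of [" + from + "," + to + "] failed", e);
        }
    }

    static void writeToCache(long id, String name, String city) {
        // placeholder for whatever cache API is in use
    }
}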

Comments (2)

最单纯的乌龟 2024-09-16 04:30:55

Replicating the data in a cache seems like replicating the functionality of the database.

From reading the other comments, I see that you are not doing this to avoid network round trips, but because of costly joins. In many DBMSs you can create a temporary table, like this:

CREATE TEMPORARY TABLE abTable AS SELECT * FROM a, b;

If a and b are large (relatively permanent) tables, then you will have a one-time cost of 2-3 minutes to create the temporary table. However, if you use abTable for many queries, the subsequent per-query cost will be much smaller than re-running the join each time:

SELECT name, city, ... FROM a, b;
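
For what it's worth, here is a rough JDBC sketch of that pattern; in most DBMSs the temporary table is scoped to the session, so the join cost is paid once per connection and every later query on that connection is cheap. The URL, credentials, and filter value are placeholders, and the FROM a, b clause simply follows the answer's own example.

import java.sql.*;

public class TempTableSession {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://dbhost/mydb", "user", "pass")) {
            // One-time cost (the 2-3 minutes): materialize the join for this session.
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TEMPORARY TABLE abTable AS SELECT * FROM a, b");
            }
            // Every later query on the same connection hits the pre-joined
            // table instead of redoing the join.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT name, city FROM abTable WHERE city = ?")) {
                ps.setString(1, "somecity"); // hypothetical filter value
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name"));
                    }
                }
            }
        }
    }
}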

Other database systems have a view concept, which lets you do something like this:

CREATE VIEW abView AS SELECT * FROM a, b;

Changes in the underlying a and b tables will be reflected in abView.

If you really are concerned about network round trips, then you may be able to replicate parts of the database on the local computer.

A good database management system should be able to handle your data needs. So why reinvent the wheel?

茶花眉 2024-09-16 04:30:55

  1. SELECT * FROM YOUR_TABLE
  2. Map results into an object or data structure
  3. Assign a unique key for each object or data structure
  4. Load the key and object or data structure into a WeakHashMap to act as your cache (a sketch follows below).
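
A minimal sketch of those four steps, assuming a big_table with id/name/city columns (placeholders, not from the answer); the synchronized wrapper anticipates the thread-safety point below.

import java.sql.*;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class TableCache {

    // Step 2's "object or data structure": a simple immutable row holder.
    record Row(long id, String name, String city) {}

    // Steps 1-4: query, map, key, and load into a WeakHashMap-backed cache.
    public static Map<Long, Row> load(Connection conn) throws SQLException {
        // Caveat: WeakHashMap holds its *keys* weakly, so entries whose boxed
        // Long keys are not referenced elsewhere can be garbage-collected;
        // use a HashMap or a cache library if entries must stay resident.
        Map<Long, Row> cache = Collections.synchronizedMap(new WeakHashMap<>());
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM big_table")) {      // step 1
            while (rs.next()) {
                Row row = new Row(rs.getLong("id"),                            // step 3: unique key
                        rs.getString("name"), rs.getString("city"));           // step 2
                cache.put(row.id(), row);                                      // step 4
            }
        }
        return cache;
    }
}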

I don't see why you need sorting, because your cache should access values by unique key in O(1) time. What is sorting buying you?

Be sure to think about thread safety.

I'm assuming that this is a read-only cache, and you're doing this to avoid the constant network latency. I'm also assuming that you'll do this once on start up.

How much data per record? 12M records at 1KB per record means you'll need 12GB of RAM just to hold your cache.
