Application-level JOIN with WHERE and ORDER BY on N PostgreSQL shards
I have a PostgreSQL cluster in which different tables reside on different shards (different physical PostgreSQL servers). For example:
shard A
+ user_group (user_group_id, user_group_name)
shard B
+ user (user_id, user_group_id (NULL), user_name)
shard C
+ comment (comment_id, user_id, comment_content)
I need to run queries that, if all 3 tables were on the same shard, would look something like:
SELECT comment_id, comment_content
FROM comment
INNER JOIN user ON comment.user_id = user.user_id
LEFT JOIN user_group ON user.user_group_id = user_group.user_group_id
WHERE user.user_group_id > 10 AND user_name LIKE 'foo%'
ORDER BY user_group_name ASC, user_name ASC, comment_id ASC
How would such a query be implemented if the 3 tables are residing within 3 different physical postgresql shards?
I've read references saying that one has to 'do the join in the application layer', but I am not sure how to go about this. Some of the complexities include:
1. The cardinality of the different tables is unknown (or can change over time), so from the application layer (e.g. PHP, Python, etc.) we would not know which order to query in. Should we first query user_group and get all the user groups, then query user and get all the users, and so on? Or should we first query comment and get all the comments, then filter the retrieved comments by user, then filter by user group?
I am looking for a generic way to translate SQL into application-level joins; the above schema is only a hypothetical example.
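To make the question concrete, here is a minimal sketch of what 'doing the join in the application layer' could look like for the query above. It is not a general planner. The three in-memory lists and the `fetch_*` helpers are hypothetical stand-ins for one query per shard, and the sketch assumes that the `WHERE user_group_id > 10` predicate refers to `user.user_group_id`. It starts from user only because both WHERE predicates can be evaluated there; with other cardinalities a different starting table may be cheaper.

```python
# Hypothetical stand-ins for rows stored on three separate shards.
SHARD_A_USER_GROUP = [
    {"user_group_id": 11, "user_group_name": "beta"},
    {"user_group_id": 12, "user_group_name": "alpha"},
    {"user_group_id": 5, "user_group_name": "gamma"},
]
SHARD_B_USER = [
    {"user_id": 1, "user_group_id": 11, "user_name": "foobar"},
    {"user_id": 2, "user_group_id": 12, "user_name": "foo"},
    {"user_id": 3, "user_group_id": 5, "user_name": "foolish"},
    {"user_id": 4, "user_group_id": None, "user_name": "food"},
    {"user_id": 5, "user_group_id": 12, "user_name": "bar"},
]
SHARD_C_COMMENT = [
    {"comment_id": 100, "user_id": 1, "comment_content": "first"},
    {"comment_id": 101, "user_id": 2, "comment_content": "second"},
    {"comment_id": 102, "user_id": 2, "comment_content": "third"},
    {"comment_id": 103, "user_id": 3, "comment_content": "fourth"},
    {"comment_id": 104, "user_id": 5, "comment_content": "fifth"},
]


def fetch_users(prefix, min_group_id):
    # Stand-in for a query on shard B that applies both WHERE predicates.
    return [
        u for u in SHARD_B_USER
        if u["user_group_id"] is not None
        and u["user_group_id"] > min_group_id
        and u["user_name"].startswith(prefix)
    ]


def fetch_groups(group_ids):
    # Stand-in for a query on shard A restricted to the needed ids.
    return [g for g in SHARD_A_USER_GROUP if g["user_group_id"] in group_ids]


def fetch_comments(user_ids):
    # Stand-in for a query on shard C restricted to the matching users.
    return [c for c in SHARD_C_COMMENT if c["user_id"] in user_ids]


def app_level_join(prefix="foo", min_group_id=10):
    # Step 1: filter on the shard that can evaluate the WHERE clause.
    users_by_id = {u["user_id"]: u for u in fetch_users(prefix, min_group_id)}
    # Step 2: LEFT JOIN user_group by looking up only the referenced ids.
    group_ids = {u["user_group_id"] for u in users_by_id.values()}
    group_name = {
        g["user_group_id"]: g["user_group_name"] for g in fetch_groups(group_ids)
    }
    # Step 3: INNER JOIN comment by fetching comments of the surviving users.
    comments = fetch_comments(set(users_by_id))

    # Step 4: ORDER BY in the application; a missing group name sorts last,
    # mirroring PostgreSQL's default NULLS LAST for ascending order.
    def sort_key(comment):
        user = users_by_id[comment["user_id"]]
        name = group_name.get(user["user_group_id"])
        return (name is None, name or "", user["user_name"], comment["comment_id"])

    return [
        (c["comment_id"], c["comment_content"])
        for c in sorted(comments, key=sort_key)
    ]


result = app_level_join()
```

With this sample data, `result` holds the comments of the two users whose names start with 'foo' and whose group id is above 10, ordered by group name, user name and comment id.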
Comments (2)
Usually data is divided across shards in a way that avoids cross-server JOINs altogether, because that operation is difficult and expensive. If your example is hypothetical, I would recommend dividing all data by the user_id field or by user_group_id.
For example, shard A would contain all tables with information about users where user_id % 3 = 0, shard B those where user_id % 3 = 1, and shard C those where user_id % 3 = 2. Most of the needed JOINs would then stay inside one shard. For some complex cross-server queries you could keep a common NoSQL store such as memcached or Redis that holds copies of the needed data from all shards (not a full copy of all tables, of course). Such stores can easily be replicated on as many servers as you need. This is how high-load projects work.
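The routing rule described above can be sketched in a few lines. This is only an illustration of the user_id % 3 scheme; `SHARD_NAMES`, `shard_for_user` and `group_by_shard` are hypothetical names, and a real setup would map each shard name to a connection.

```python
# Hypothetical shard labels, indexed by user_id % 3 as in the text above.
SHARD_NAMES = ["shard A", "shard B", "shard C"]


def shard_for_user(user_id: int) -> str:
    # Rows keyed by this user_id (user row, comments, ...) live on one shard.
    return SHARD_NAMES[user_id % len(SHARD_NAMES)]


def group_by_shard(user_ids):
    # Split a list of ids so that each shard is queried only once.
    buckets = {}
    for user_id in user_ids:
        buckets.setdefault(shard_for_user(user_id), []).append(user_id)
    return buckets


routing = group_by_shard([1, 2, 3, 4])
```

Because a user's comments are stored on the same shard as the user row, the JOIN between user and comment can run as ordinary SQL on that shard; only queries spanning several users need the per-shard results merged in the application.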
You need to look into the dblink contrib module.