Optimizing queries - use a field or use another table
I have a query which takes a long time and I want to optimize it. I'm looking for the most efficient way to do it.
I'm working with Hibernate/JPA on a PostgreSQL DB, but any solution should be a generic JPA one.
Terminology
- User: A user in the system.
- Friend: A friend of the user. A user will have N friends.
- Session: A session of using the system. Can be open or closed.
- Context: A context of the session. A user may have one open session per context at any given time, and may have many past closed sessions per context.
The query
I need to implement a query that, given a user name, gives me the following:
- Fetch all the friends of that user
- For each friend:
  - If the friend has any open sessions, fetch all the open sessions (for all the contexts)
  - Otherwise, get the friend's latest session out of all the contexts.
Note that the friendships are stored in a different DB so I cannot incorporate that into one big query in any case.
Example
User A has three friends: B, C, D. There are two contexts, 1 and 2. The friends have the following data:
(The format below is Session ID - User, Context.)
- 1 - B,1: Open session
- 2 - B,2: Closed session that started on Feb-27
- 3 - B,2: Closed session that started on Feb-26
- 4 - C,1: Closed session that started on Feb-27
- 5 - C,1: Closed session that started on Feb-26
- 6 - C,2: Closed session that started on Feb-26
- 7 - C,2: Closed session that started on Feb-25
- 8 - D,1: Open session
- 9 - D,2: Open session
The query should get me:
B: Session 1 (All open sessions)
C: Session 4 (Latest closed session)
D: Sessions 8,9 (All open sessions)
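To make the expected result concrete, here is a minimal in-memory sketch of the selection rule applied to the example data. The `Session` type and the `pick_sessions` function are hypothetical names used only for illustration, and the year and the start dates of the open sessions are made up, since the example does not give them.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class Session(NamedTuple):
    session_id: int
    user: str
    context: int
    started: date   # start dates of open sessions are assumed
    is_open: bool


# Example data from the question (year 2024 is an assumption).
SESSIONS = [
    Session(1, "B", 1, date(2024, 2, 28), True),
    Session(2, "B", 2, date(2024, 2, 27), False),
    Session(3, "B", 2, date(2024, 2, 26), False),
    Session(4, "C", 1, date(2024, 2, 27), False),
    Session(5, "C", 1, date(2024, 2, 26), False),
    Session(6, "C", 2, date(2024, 2, 26), False),
    Session(7, "C", 2, date(2024, 2, 25), False),
    Session(8, "D", 1, date(2024, 2, 28), True),
    Session(9, "D", 2, date(2024, 2, 28), True),
]


def pick_sessions(sessions, friends):
    """Return {friend: [session ids]}: all open sessions, else the latest one."""
    by_user = defaultdict(list)
    for s in sessions:
        if s.user in friends:
            by_user[s.user].append(s)

    result = {}
    for user, rows in by_user.items():
        open_ids = sorted(s.session_id for s in rows if s.is_open)
        if open_ids:
            result[user] = open_ids
        else:
            # No open session: fall back to the most recently started one.
            latest = max(rows, key=lambda s: s.started)
            result[user] = [latest.session_id]
    return result
```

Calling `pick_sessions(SESSIONS, {"B", "C", "D"})` on this data gives session 1 for B, session 4 for C, and sessions 8 and 9 for D.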
Current state
My query works in three steps:
- Get all the friends of the user
- For each friend:
  - Get all the open sessions for the friend
  - If there is any open session, return all the open sessions
  - Otherwise, get the latest session for the friend and return that session
Obviously this is a lot of queries. For starters, I'm going to take step 2 above (the per-friend loop) and convert it into a single query. My concerns are related to that second query. The question is how to optimize it. The problem can therefore be rephrased:
"Given a set of N friend IDs, get all the open sessions or the latest session for all these friends."
Suggested solutions
There are basically two solutions we came up with and we're contemplating what would be better.
The table solution says to keep a new table that will correlate between user, context, and latest session. The implications of this solution are:
- Create a new entity & table for "latest sessions"
- The table will have these columns:
  - User
  - Context
  - Latest session ID
- The table will be updated by the session entity on post persist, so that any newly persisted session will automatically update this table.
- The new query will fetch all the records for all the friends of the user from this table and work on them to create the final result.
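The table solution can be sketched roughly as follows. This is an illustration only: it uses Python's `sqlite3` in place of JPA and PostgreSQL, the table, column, and function names are hypothetical, and `INSERT OR REPLACE` is SQLite syntax (PostgreSQL would use `INSERT ... ON CONFLICT ... DO UPDATE` for the same upsert). The update that the post-persist hook would perform is done here in the same transaction as the insert.

```python
import sqlite3


def create_latest_table_schema(conn):
    # Hypothetical schema: ended_at is NULL while a session is open (assumption).
    conn.executescript("""
        CREATE TABLE sessions (
            session_id INTEGER PRIMARY KEY,
            user_id    TEXT    NOT NULL,
            context_id INTEGER NOT NULL,
            started_at TEXT    NOT NULL,
            ended_at   TEXT
        );
        CREATE TABLE latest_sessions (
            user_id           TEXT    NOT NULL,
            context_id        INTEGER NOT NULL,
            latest_session_id INTEGER NOT NULL REFERENCES sessions(session_id),
            PRIMARY KEY (user_id, context_id)
        );
    """)


def persist_session_with_latest(conn, session_id, user_id, context_id,
                                started_at, ended_at=None):
    """Insert a session and point (user, context) at it as the latest one."""
    with conn:  # one transaction: both statements commit or roll back together
        conn.execute(
            "INSERT INTO sessions VALUES (?, ?, ?, ?, ?)",
            (session_id, user_id, context_id, started_at, ended_at),
        )
        # Assumes, as in the proposal, that a newly persisted session is the latest.
        conn.execute(
            "INSERT OR REPLACE INTO latest_sessions VALUES (?, ?, ?)",
            (user_id, context_id, session_id),
        )


def latest_sessions_for(conn, friend_ids):
    """Fetch (user, context, latest session id) rows for all given friends."""
    friend_ids = list(friend_ids)
    if not friend_ids:
        return []
    placeholders = ", ".join("?" for _ in friend_ids)
    return conn.execute(
        "SELECT user_id, context_id, latest_session_id FROM latest_sessions "
        f"WHERE user_id IN ({placeholders}) ORDER BY user_id, context_id",
        friend_ids,
    ).fetchall()
```

Note that this returns one row per user and context, so the caller still has to reduce the rows to "all open sessions or the single latest session" per friend.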
The column solution says to keep a "latest" flag column on the sessions table. The implications of this solution are:
- Add a new boolean field that marks a session as the latest
- The column will be set by the post persist of the session entity, so that the former "latest" session will no longer be the latest, and the new session will become the latest one.
- The new query will fetch all the latest records (by incorporating the new column into the condition of the statement) for all the friends of the user from the original sessions table and work on them to create the final result.
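A comparable sketch of the column solution, again using `sqlite3` as a stand-in for JPA and PostgreSQL. The names are hypothetical, and the sketch assumes the flag is kept per user and context (mirroring the table solution); the description above does not say whether it is per user or per user and context.

```python
import sqlite3


def create_flag_schema(conn):
    # Hypothetical schema with an is_latest flag on the sessions table itself.
    conn.execute("""
        CREATE TABLE sessions (
            session_id INTEGER PRIMARY KEY,
            user_id    TEXT    NOT NULL,
            context_id INTEGER NOT NULL,
            started_at TEXT    NOT NULL,
            ended_at   TEXT,
            is_latest  INTEGER NOT NULL DEFAULT 0
        )
    """)
    conn.execute(
        "CREATE INDEX idx_sessions_user_latest ON sessions (user_id, is_latest)"
    )


def persist_flagged_session(conn, session_id, user_id, context_id,
                            started_at, ended_at=None):
    """Clear the old 'latest' flag for (user, context), then insert the new one."""
    with conn:  # both statements run in one transaction
        conn.execute(
            "UPDATE sessions SET is_latest = 0 "
            "WHERE user_id = ? AND context_id = ? AND is_latest = 1",
            (user_id, context_id),
        )
        conn.execute(
            "INSERT INTO sessions VALUES (?, ?, ?, ?, ?, 1)",
            (session_id, user_id, context_id, started_at, ended_at),
        )


def latest_flagged_for(conn, friend_ids):
    """Return ids of the sessions flagged as latest for the given friends."""
    friend_ids = list(friend_ids)
    if not friend_ids:
        return []
    placeholders = ", ".join("?" for _ in friend_ids)
    rows = conn.execute(
        "SELECT session_id FROM sessions "
        f"WHERE is_latest = 1 AND user_id IN ({placeholders}) "
        "ORDER BY session_id",
        friend_ids,
    ).fetchall()
    return [r[0] for r in rows]
```

Compared with the table sketch, every insert here also updates an existing row in the same table, which is the main trade-off to measure.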
There are pros and cons to each of these, and we don't seem to have a winner yet. Obviously there may be other, better solutions we have not considered. What I'd like to see is which of the above is better and why, or a new better approach of your own.
Comments (3)
The difference between your two solutions should be marginal. The table solution might be cleaner, depending on activity.
However, do note that 'you are doing it wrong' (according to the theory).
The RDBMS application design principle clearly states that you should not try to specify how your queries are executed, only what data you want. The database will find an optimal path to your result. The RDBMS sits closest to the data and, depending on your architecture, might save network round trips, storage round trips and so on; scalability can be seriously crippled here, and you might not be aware of it unless you do decent stress testing. Furthermore, the RDBMS knows about indexes and internal statistics that determine whether scans or seeks will be more effective, and it knows how to execute joins optimally.
In practice, try raising the question of why friendships live in a different database (is it really a different DB, or a different schema on the same DB?).
Furthermore, if you really want to go the way you are going (preventing the RDBMS from looking for an optimal execution plan), then the most important factors are:
EDIT:
So, considering "Given a set of N friend IDs, get all the open sessions or the latest session for all these friends", here is a query that should be tested before introducing new structures
Sessions (SessionID, User, Context, Start, End)
There are other ways to write the above that may help the optimizer; in particular, if your DB supports CTEs, the above can be rewritten more efficiently.
Notes:
:friendsList is the list of Users that are friends.
Also, I am assuming open sessions have NULL as the value of End. You might already be using some other approach (maybe you have a field denoting it, or there are two tables, one for open sessions and one for closed).
The above query will benefit from certain indexes (the principle is to first try to optimize with indexes, then with restructuring; the first index I would try is a composite index on User, End), and with a relatively small number of friends (assumed from the fact that the list is passed around as a string), it should already perform decently.
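A minimal sketch of what such a single query could look like, under the assumptions stated in the notes (open sessions have a NULL end). It is an illustration, not necessarily the exact query meant above. The columns are renamed from (SessionID, User, Context, Start, End) to hypothetical snake_case names because USER and END are keywords in some SQL dialects, `sqlite3` stands in for the real database, the year and the open sessions' start dates are made up, and ties on the start time are broken by the higher session ID (also an assumption).

```python
import sqlite3

SCHEMA = """
CREATE TABLE sessions (
    session_id INTEGER PRIMARY KEY,
    user_id    TEXT    NOT NULL,
    context_id INTEGER NOT NULL,
    started_at TEXT    NOT NULL,
    ended_at   TEXT               -- NULL means the session is still open
);
-- The composite index suggested in the notes (User, End).
CREATE INDEX idx_sessions_user_end ON sessions (user_id, ended_at);
"""

# Example data from the question; end dates are assumed equal to start dates.
ROWS = [
    (1, "B", 1, "2024-02-28", None),
    (2, "B", 2, "2024-02-27", "2024-02-27"),
    (3, "B", 2, "2024-02-26", "2024-02-26"),
    (4, "C", 1, "2024-02-27", "2024-02-27"),
    (5, "C", 1, "2024-02-26", "2024-02-26"),
    (6, "C", 2, "2024-02-26", "2024-02-26"),
    (7, "C", 2, "2024-02-25", "2024-02-25"),
    (8, "D", 1, "2024-02-28", None),
    (9, "D", 2, "2024-02-28", None),
]

QUERY = """
SELECT s.session_id, s.user_id
FROM sessions s
WHERE s.user_id IN ({placeholders})
  AND (
        s.ended_at IS NULL                      -- every open session
     OR (
            NOT EXISTS (                        -- the friend has no open session
                SELECT 1 FROM sessions o
                WHERE o.user_id = s.user_id AND o.ended_at IS NULL)
        AND NOT EXISTS (                        -- and no session is more recent
                SELECT 1 FROM sessions n
                WHERE n.user_id = s.user_id
                  AND (n.started_at > s.started_at
                       OR (n.started_at = s.started_at
                           AND n.session_id > s.session_id)))
        )
  )
ORDER BY s.user_id, s.session_id
"""


def open_or_latest(conn, friend_ids):
    """Return (session_id, user_id) rows: open sessions, else the latest one."""
    friend_ids = list(friend_ids)
    if not friend_ids:
        return []
    placeholders = ", ".join("?" for _ in friend_ids)
    return conn.execute(QUERY.format(placeholders=placeholders), friend_ids).fetchall()


def run_example():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?, ?)", ROWS)
    return open_or_latest(conn, ["B", "C", "D"])
```

On the example data, `run_example()` returns session 1 for B, session 4 for C, and sessions 8 and 9 for D, without any extra table or flag column.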
Why not cache the objects? Then you do not need to hit the DB.
Your main bottleneck appears to be the fact that the information you need is distributed over two databases. Thus, you acquire a list of friends and iterate through them.
I would suggest that you attempt to remove the iteration, reducing it to a single query instead.
The way I would achieve this is to build up a comma-delimited string of user IDs and pass that string to the second database. The SQL in the second database could then (using a function, for example) translate the string into a single-field table of IDs, and join on that.
It feels very inelegant to me, yet it's something I do all the time.
The only practical alternative that I have used is to build up a single query that inserts the IDs into a table, then join on that. This can be either a temporary table, or a permanent table with a SessionID field allowing multiple sessions to use it concurrently.
Whatever approach you use, having a single query for step 2, using a set-based approach rather than iteration, should yield significant benefits.
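As a rough illustration of the temporary-table variant, the sketch below loads the friend IDs into a temp table and joins on it in one set-based query. It uses Python's `sqlite3` purely as a stand-in; the table, column, and function names are hypothetical, NULL in `ended_at` is assumed to mean "open", and for brevity the query only fetches the open sessions rather than the full "open or latest" rule.

```python
import sqlite3


def make_demo_connection():
    """Build an in-memory sessions table with a few rows from the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE sessions (
            session_id INTEGER PRIMARY KEY,
            user_id    TEXT NOT NULL,
            context_id INTEGER NOT NULL,
            ended_at   TEXT            -- NULL while the session is open
        )
    """)
    conn.executemany(
        "INSERT INTO sessions VALUES (?, ?, ?, ?)",
        [(1, "B", 1, None), (2, "B", 2, "2024-02-27"),
         (4, "C", 1, "2024-02-27"), (8, "D", 1, None), (9, "D", 2, None)],
    )
    conn.commit()
    return conn


def open_sessions_of_friends(conn, friend_ids):
    """Load the IDs into a temp table, then fetch with a single join."""
    with conn:
        # SQLite temp tables are private to the connection, so concurrent
        # callers on other connections do not see each other's IDs.
        conn.execute(
            "CREATE TEMP TABLE IF NOT EXISTS friend_ids (user_id TEXT PRIMARY KEY)"
        )
        conn.execute("DELETE FROM friend_ids")
        conn.executemany(
            "INSERT OR IGNORE INTO friend_ids (user_id) VALUES (?)",
            [(f,) for f in friend_ids],
        )
        return conn.execute("""
            SELECT s.session_id, s.user_id
            FROM sessions s
            JOIN friend_ids f ON f.user_id = s.user_id
            WHERE s.ended_at IS NULL
            ORDER BY s.session_id
        """).fetchall()
```

The point of the design is that the number of round trips no longer grows with the number of friends: one batch insert and one join replace the per-friend loop.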