Django 使用 Annotate 而不是 Distinct()
我读到过,distinct() API 调用有时会出现一些性能问题。我想尝试通过 orm 重写查询,避免使用不同的(至少分析差异)。
我的理解是,values() 在底层执行 Group By。但是,当我测试这两种方法时,对象的计数会有所不同,具体取决于我使用的是distinct()还是values()/annotate()。
zip_codes = Location.objects.values('zip_code').annotate(zip_count=Count('zip_code')).exclude(zip_code=None).count()
VS。
zip_codes = Location.objects.values_list('zip_code', flat=True).exclude(zip_code=None).distinct()
对这里出了什么问题有什么想法吗?
谢谢!
I have read that distinct() API call has some performance issues at times. I wanted to try to rewrite a query through the orm which avoided using distinct (at least profile the difference).
My understanding is that values() performs a Group By under the hood. When I test out the two methods, though, the Count of objects differs depending on whether I use distinct() or values()/annotate().
zip_codes = Location.objects.values('zip_code').annotate(zip_count=Count('zip_code')).exclude(zip_code=None).count()
VS.
zip_codes = Location.objects.values_list('zip_code', flat=True).exclude(zip_code=None).distinct()
any thoughts on what is wrong here?
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我刚刚用类似的查询快速对照我拥有的数据库检查了您的查询。计数是相同的,因此我不确定您的数据是否导致了问题。
不过,我也对这个前提表示高度怀疑。 DISTINCT确实是一个CPU密集型查询。但是,COUNT(*) 也是如此,您的第二个查询将首先使用 group by 运行计数聚合,然后对结果运行 COUNT。我会把钱花在更快的单个 DISTINCT 调用上(我还会检查您使用的任何数据库后端)。所有这些与 django 的 ORM 关系不大,而与数据库后端关系更大。
也请思考一下这一点。与基于注释的查询相比,基于不同的查询要完成的任务要清晰一个数量级。您是否有证据支持 DISTINCT 在您的情况下会很慢,或者更好的是它现在正在形成瓶颈?如果不是,您就已经进入了过早优化的范围,应该认真重新考虑您的路径。
过早优化。
换句话说,清楚地编写您的软件,然后当您发现问题时,追溯到源头并修复它。在此之前你所做的任何事情都会适得其反。花时间考虑哪些索引对您的数据库很重要,以及在哪里使用 select_lated。这些比您在这里担心的有效 10000%(除非您一直在计算邮政编码,在这种情况下,让我向您介绍缓存)
I just quickly checked your queries against a database I have with a similar query. The counts was identical so I'm not sure what about your data is resulting in issues.
I'd also be HIGHLY skeptical of the premise though. DISTINCT is indeed a cpu intensive query. However, so is COUNT(*) and your second query is going to first run an count aggregate with a group by and then run a COUNT on the results. I'd be put money on the single DISTINCT call being faster (I'd also check with whichever database backend you're using to see). All of this has very little to do with django's ORM and a whole heck of a lot more to do with your database backend.
Also think about this. The distinct based query is an order of magnitude clearer as to what it's accomplishing compared to the annotate based one. Do you have evidence to support that DISTINCT is going to be slow in your situation, or better still that it's forming a bottlneck right now? If not you're well into the range of premature optimization and should heavily reconsider your path.
Premature Optimization.
In other words write your software clearly and then when you find a problem trace it to the source and fix it. Anything you do before that is counterproductive. Spend your time worrying about which indexes are going to matter on your db, and where to use select_related. Those are 10000% more effective than what you are worrying about here (unless you are counting zip codes all the time, in which case let me introduce you to caching)