Django:为什么 Foo.objects.extra(...) 比 Foo.objects.raw 快得多?
所以我试图优化一个相当奇怪的查询,但这是一个遗留数据库,所以我用我所拥有的来凑合。这些是我正在尝试的查询。此时它们提供相同的输出。 w 是我的查询集。
def future_schedule(request):
past = datetime.date.today()-datetime.timedelta(days=730)
extra_select = {
'addlcomplete': 'SELECT Complete FROM tblAdditionalDates WHERE Checkin.ShortSampleID = tblAdditionalDates.ShortSampleID',
'addldate': 'SELECT AddlDate FROM tblAdditionalDates WHERE Checkin.ShortSampleID = tblAdditionalDates.ShortSampleID'
}
extra_where = ['''(Checkin.Description <> "Sterilization Permit" AND Checkin.Description <> "Registration State" AND Checkin.Description <> "Miscellaneous" AND Checkin.Description <> "Equipment Purchase" AND Checkin.DateArrived > %s AND Checkin.DateCompleted IS NULL AND Checkin.Canceled = 0) OR (Checkin.Description <> "Sterilization Permit" AND Checkin.Description <> "Registration State" AND Checkin.Description <> "Miscellaneous" AND Checkin.Description <> "Equipment Purchase" AND Checkin.DateArrived > %s AND Checkin.DateCompleted IS NOT NULL AND Checkin.DateFinalCompleted IS NULL AND Checkin.DateFinalExpected IS NOT NULL AND Checkin.Canceled = 0) '''
]
extra_params = [past, past]
w = Checkin.objects.extra(select=extra_select, where=extra_where, params=extra_params)
# OR This one
w = Checkin.objects.raw('''SELECT Checkin.SampleID, Checkin.ShortSampleID, Checkin.Company, A.Complete, Checkin.HasDates, A.AddlDate FROM Checkin LEFT JOIN (SELECT ShortSampleID, Complete, AddlDate FROM tblAdditionalDates) A ON A.ShortSampleID = Checkin.ShortSampleID WHERE (Checkin.Description <> "Sterilization Permit" AND Checkin.Description <> "Registration State" AND Checkin.Description <> "Miscellaneous" AND Checkin.Description <> "Equipment Purchase" AND Checkin.DateArrived > "2009-01-01" AND Checkin.DateCompleted IS NULL AND Checkin.Canceled = 0) OR (Checkin.Description <> "Sterilization Permit" AND Checkin.Description <> "Registration State" AND Checkin.Description <> "Miscellaneous" AND Checkin.Description <> "Equipment Purchase" AND Checkin.DateArrived > "2009-01-01" AND Checkin.DateCompleted IS NOT NULL AND Checkin.DateFinalCompleted IS NULL AND Checkin.DateFinalExpected IS NOT NULL AND Checkin.Canceled = 0)''')
这两者都返回相同数量的记录 (322)。 .extra 在渲染 HTML 方面比 .raw 查询快大约 10 秒,并且对于所有密集目的,.raw 查询甚至稍微简单一些。有谁知道这可能是为什么?根据我的结构,.raw 可能是我获取所需数据的唯一方法(我需要 extra_select 字典中的 addlcomplete 和 addldate 并在having子句中使用它们来进一步过滤查询集),但我当然不喜欢如何这需要很长时间。是模板层慢还是实际查询层慢?我怎样才能最好地调试这个?
感谢您在糟糕的数据结构中寻求优化的帮助。
更新 1:2011-10-03
所以我安装了 django-debugtoolbar 来窥探一下,我启用了 MySQL 常规日志记录并提出了以下内容:
使用 .filter()
或 .extra()
总查询计数为 2。使用 .raw()
总查询计数为 1984!!! (不忽略幽灵般的文学参考)
我的模板使用重新组合,然后循环该重新组合。没有遵循任何关系,没有使用除内置函数之外的模板标签。 Select_lated 没有被使用,我仍然只收到 2 个查询。查看mysql日志,果然——1984条查询。
当查看执行的查询时,基本上看起来每个 {{ Modelinstance.field }}
django 都在执行 SELECT pk, field FROM Model WHERE Model.pk = Modelinstance.pk如果你问我的话,这似乎是完全错误的。我在这里遗漏了一些东西还是 django 真的在查询中疯狂运行吗?
结束更新 1
更新 2 请参阅下面的答案
格雷格
So I am trying to optimize a fairly odd query, but this is a legacy database so I make do with what I have. These are the queries I am trying. They provide the same output at this point. w is my queryset.
def future_schedule(request):
past = datetime.date.today()-datetime.timedelta(days=730)
extra_select = {
'addlcomplete': 'SELECT Complete FROM tblAdditionalDates WHERE Checkin.ShortSampleID = tblAdditionalDates.ShortSampleID',
'addldate': 'SELECT AddlDate FROM tblAdditionalDates WHERE Checkin.ShortSampleID = tblAdditionalDates.ShortSampleID'
}
extra_where = ['''(Checkin.Description <> "Sterilization Permit" AND Checkin.Description <> "Registration State" AND Checkin.Description <> "Miscellaneous" AND Checkin.Description <> "Equipment Purchase" AND Checkin.DateArrived > %s AND Checkin.DateCompleted IS NULL AND Checkin.Canceled = 0) OR (Checkin.Description <> "Sterilization Permit" AND Checkin.Description <> "Registration State" AND Checkin.Description <> "Miscellaneous" AND Checkin.Description <> "Equipment Purchase" AND Checkin.DateArrived > %s AND Checkin.DateCompleted IS NOT NULL AND Checkin.DateFinalCompleted IS NULL AND Checkin.DateFinalExpected IS NOT NULL AND Checkin.Canceled = 0) '''
]
extra_params = [past, past]
w = Checkin.objects.extra(select=extra_select, where=extra_where, params=extra_params)
# OR This one
w = Checkin.objects.raw('''SELECT Checkin.SampleID, Checkin.ShortSampleID, Checkin.Company, A.Complete, Checkin.HasDates, A.AddlDate FROM Checkin LEFT JOIN (SELECT ShortSampleID, Complete, AddlDate FROM tblAdditionalDates) A ON A.ShortSampleID = Checkin.ShortSampleID WHERE (Checkin.Description <> "Sterilization Permit" AND Checkin.Description <> "Registration State" AND Checkin.Description <> "Miscellaneous" AND Checkin.Description <> "Equipment Purchase" AND Checkin.DateArrived > "2009-01-01" AND Checkin.DateCompleted IS NULL AND Checkin.Canceled = 0) OR (Checkin.Description <> "Sterilization Permit" AND Checkin.Description <> "Registration State" AND Checkin.Description <> "Miscellaneous" AND Checkin.Description <> "Equipment Purchase" AND Checkin.DateArrived > "2009-01-01" AND Checkin.DateCompleted IS NOT NULL AND Checkin.DateFinalCompleted IS NULL AND Checkin.DateFinalExpected IS NOT NULL AND Checkin.Canceled = 0)''')
Both of these return the same number of records (322). .extra is about 10 seconds faster in rendering the HTML than the .raw query and for all intensive purposes, the .raw query is mildly less complex even. Does anyone have any insight as to why this might be? Based on my structure, .raw may be the only way I get the data I need (I need the addlcomplete and addldate in the extra_select dict and use them in a Having clause to further filter the queryset) but I certainly don't like how long it is taking. Is it on the template layer that it is slower or the actual query layer? How can I best debug this?
Thank for your help in this quest for optimization amidst poor data structures.
UPDATE 1: 2011-10-03
So I installed django-debugtoolbar to snoop around a bit and I eneabled MySQL general logging and came up with the following:
using .filter()
or .extra()
Total Query count is 2. Using .raw()
Total Query count is 1984!!! (Spooky literary reference not ignored)
My template is using a regroup and then looping through that regroup. No relations are being followed, no template tags other than builtins are being used. Select_related is NOT being used and I still only get the 2 queries. Looking at the mysql log, sure enough - 1984 queries.
When looking at the queries that were executed, basically it looks like for every {{ Modelinstance.field }}
django was doing a SELECT pk, field FROM Model WHERE Model.pk = Modelinstance.pk
This seems completely wrong if you ask me. Am I missing something here or is django really running wild with queries?
END UPDATE 1
UPDATE 2
See answer below
Greg
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
好的。这是我的最终结论。虽然 Furbeenator 关于 Django 内部优化的说法是正确的,但事实证明,存在一个更大的用户错误,导致了速度减慢和上述数千个查询。
原始查询集文档 当您推迟字段(即不使用
SELECT * FROM ...
)并仅专门选择某些字段时(SELECT Checkin.Sampleid, ...
这些字段您不选择仍然可以访问,但是因此,如果您在原始查询中选择字段的子集,并且忘记了在模板中使用的查询中的字段,Django 会执行数据库查找来查找您在模板中引用的字段。比抱怨它不存在或其他什么所以,假设您从查询中遗漏了 5 个字段(这就是我所做的),您最终在模板中引用了这些字段,并且您有 300 条记录需要循环。 1500 次额外的数据库点击才能获取每条记录的这 5 个字段,因此,请注意隐藏的引用并感谢上帝 Django 调试工具栏
Ok. Here are my final conclusions. While Furbeenator is correct about the internal Django optimizations, turns out there is a much larger, user error that caused the slowdown and the aforementioned thousands of queries.
It is clearly documented in the Raw queryset docs that when you defer fields (i.e. not using
SELECT * FROM ...
) and are selecting only certain fields specifically (SELECT Checkin.Sampleid, ...
the fields that you don't select can still be accessed but with another database call. So, if you are selecting a subset of fields in your raw query and you forgot a field in your query that you use in your template, Django performs a database lookup to find that field you are referencing in your template rather than complaining about it not existing or whatever. So, let's say you leave out 5 fields from your query (which is what I did) that you end up referencing in your template and you have 300 records that you are looping through. This incurs 1500 extra database hits to get those 5 fields for each record.So, beware of hidden references and thank god for Django Debug Toolbar
从优化部分:数据库访问优化,他们提出了优化方法,其中之一是 extra() 方法。然后他们提到了.raw()。我的假设是他们使 raw() 更加健壮和强大,因此它在优化方面提供了最大的灵活性。 执行原始 SQL 查询 可以让你做的不仅仅是额外的( )。我的直觉是,它只是更注重灵活性而不是性能,并且在可能的情况下应该使用 extra() 而不是 raw() 。
From the Optimization section: Database access optimization, they suggest ways to optimize, one of which is the extra() method. Then they mention .raw(). It is my assumption that they made raw() much more robust and powerful so it offered maximum flexibility over optimization. Performing raw SQL queries allows you to do a lot more than extra(). My hunch is that it is just geared more toward flexibility than performance and extra() should be used over raw() where possible.