最小化在大型 SQL 表上计算多个排名(基于等级)的运行时间,它能得到多短(毫秒访问)
我已经被一个相当著名的按年级对学生排名的问题困扰了几个星期,虽然我学到了很多东西,但我仍然没有解决我的问题(排名已生成,但过程太慢) :
我有一个大表(320,000 行),其中包含学生代码(用作标识符,而不是他们的姓名)、学生教室、测试、测试日期、主题、问题编号以及学生的成绩问题。该表是其他所有计算的基础,它的大小使得所有这些计算非常非常慢,以至于我发现我几乎破坏了工作中的所有内容。
首先,一些关于学校的信息(信息很少,需要理解问题)
在学校,我们每周都会进行几个科目的测试。学校还分为不同用途的教室(一间专注于数学、物理和化学,另一间专注于生物,最后一间专注于历史、葡萄牙语和地理)。但他们每周都会进行相同的测试。
我们想要做的是计算学校中每个人(不是每个教室)每个问题的标准差和每个问题的平均成绩(也包括学校中每个人),然后生成以下排名(所有排名均按日期):
-每个教室每个科目的排名(使用“原始”成绩),考虑整个学校的每个科目排名(使用“原始”成绩)以及考虑整个学校的每个科目排名(使用标准化成绩,每个问题的标准差以及每题平均成绩信息)
-与上面提到的相同的排名,但不是按科目,而是考虑所有科目
如您所见,在计算平均成绩和标准差后,我们仍然需要计算每个问题的成绩总和,并根据到这些总和(实际科目/测试成绩)。我通过以下几种方式解决了这个问题:
1) 创建了两张表,一张包含每个学生每个科目的成绩(字段:学生代码、学生课堂、测试日期、科目、成绩、标准化成绩、课堂排名、学校排名、使用标准化成绩的学校排名),另一个表格包含每个学生每次测试的成绩(考虑所有科目,字段:学生代码、学生课堂、测试日期、成绩、标准化成绩、课堂排名、学校排名、使用标准化成绩的学校排名)。
在这些表中插入数据大约需要50秒
然后,我尝试使用SQL进行排名,但是,我遇到了一些问题:
-Access 没有 ROW_NUMBER 或 RANK 函数,因此我必须使用 COUNT 查询,例如(下面只是一个简化版本):
SELECT 1+(SELECT Count(*) FROM grades_table_per_subject t2 WHERE
t2.Grade > t1.Grade AND t1.Date=t2.Date AND t1.Subject=t2.Subject) AS [Global Rank],
1+(SELECT Count(*) FROM grades_table_per_subject t3 WHERE t3.Grade > t1.Grade AND
t3.Date=t1.Date AND t3.Subject=t1.Subject AND t3.Classroom=t1.Classroom) AS
[Rank in classroom] FROM grades_table_per_subject;
上面的查询中仍然有标准化成绩的排名,但我省略了它.
表 Grades_table_per_subject 大约有 45,000 行,即使使用索引,此查询也需要超过 15 分钟(尝试了许多不同的索引组合,甚至当我看到应该工作的索引组合不起作用时,还尝试了一些奇怪的组合)。
我还尝试对内部选择进行 ORDER BY Count() DESC ,但 7 分钟后按 ctrl+break 却没有结果。
2) 将以下字段添加到上表中: 课堂排名、学校排名、使用标准化成绩的学校排名
然后我尝试使用 VBA 和 DAO 并手动更新排名字段,运行以下代码(简化版本):
Set rs = CurrentDb.OpenRecordset("SELECT Classroom, Date, Subject, Grade, [Rank in classroom] FROM
grades_table_per_subject ORDER BY Date, Classroom, Subject, Grade DESC;", dbOpenDynaset)
...
...
rs.movefirst
i=1
While Not rs.eof
'Verifies if there was a change on either one of Subject, Classroom, Date and if so:
...
i = 1
...
rs.Edit
rs![Rank in classroom]=i
rs.Update
i = i + 1
rs.movenext
Wend
rs.close
这显然只构建了一个排名(在本例中为每个排名)每个教室的科目),仅需要 3 分 10 秒。
我验证了由于表上的写入而需要很长时间(rs.Edit 和 rs.Update 是罪魁祸首,注释它们使整个过程仅在 4 秒内运行),但我需要写入表的排名来生成稍后提供访问报告。
最后:
我可以一次生成所有排名,并为用户提供快速访问所有数据的方法,但想法是一切都应该即时计算。然而,我们所取得的成就使这成为不可能。
总的来说,要问的问题如下:
- 有没有办法通过 Access 查询在 10 秒内计算上面显示的排名,或者使用 VBA 并计算 - 考虑到此处使用的表的大小,在类似的时间内将这些排名插入到表中?
另外,我很想看到一个有效的排名算法列表,这样即使我不能快速完成所有事情,我也可以尽可能地改进它。
I've been stuck with the rather famous problem of ranking students by grade for a couple weeks now, and while I've learned much, I still haven't solved my problem (the ranks are generated, but the process is too slow):
I have a large table (320,000 rows) that contains the student codes (serves as an identifier, instead of their names), the students classroom, the test , the tests date, the subject, the question number and the students grade on that question. This table is the base for everything else that is calculated and its size makes all these calculations very very slow, to the point where I find me almost breaking everything here at work.
First, some intel on the school (very little info, required to understand the problem)
Here at the school we have weekly tests over several subjects. The school is also separated in classrooms with different purposes (one is focused on math, physics and chemistry, another one is focused on biology, and the last one focuses on history, Portuguese and geography). But they all do the same tests every week.
What we want to do is calculate the standard deviation for each question for everyone in the school (not per-classroom) and the average grade per question (also for everyone in the school), and then generate the following ranks (all of them per date):
-Rank per subject per classroom (with "raw" grades), Rank per subject considering the whole school (with "raw" grades) and Rank per subject considering the whole school (using normalized grades, with the standard deviation per question and the average grade per question information)
-The same ranks that were mentioned above, but not per Subject, considering instead all subjects
As you can see, after calculating the average grades and the standard deviations, we still need to calculate the sums of the grades on each question, and rank according to these sums (the actual subject/test grades). I've attacked this problem in a few ways:
1)
Created two tables, one with the grades per student per subject (fields: Students code, Students classroom, Date of test, Subject, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades) and another with the grades per student per test (all subjects taken into account, fields: Students code, Students classroom, Date of test, Grade, Normalized Grade, Rank in Classroom, Rank in School, Rank in School using normalized grades).
The insertion of data in these tables takes about 50 seconds
Then, I tried using SQL to rank, however, I ran into some problems:
-Access has no ROW_NUMBER or RANK functions, and thus I have to use queries with COUNT, like (below is just a simplified version):
SELECT 1+(SELECT Count(*) FROM grades_table_per_subject t2 WHERE
t2.Grade > t1.Grade AND t1.Date=t2.Date AND t1.Subject=t2.Subject) AS [Global Rank],
1+(SELECT Count(*) FROM grades_table_per_subject t3 WHERE t3.Grade > t1.Grade AND
t3.Date=t1.Date AND t3.Subject=t1.Subject AND t3.Classroom=t1.Classroom) AS
[Rank in classroom] FROM grades_table_per_subject;
There still is the rank with the normalized grades in the query above, but I omitted it.
The table grades_table_per_subject has about 45,000 lines and this query takes more than 15 minutes here, even with indexing (tried many different index combinations, even some odd ones when I saw that the ones that should work didn't).
I also tried to ORDER BY Count() DESC the inner selects, but I hit ctrl+break after 7 minutes and no results.
2)
Added the following fields to the tables above:
Rank in Classroom, Rank in School, Rank in School using normalized grades
Then I tried using VBA with DAO and manually update the Rank fields, running the following code (simplified version):
Set rs = CurrentDb.OpenRecordset("SELECT Classroom, Date, Subject, Grade, [Rank in classroom] FROM
grades_table_per_subject ORDER BY Date, Classroom, Subject, Grade DESC;", dbOpenDynaset)
...
...
rs.movefirst
i=1
While Not rs.eof
'Verifies if there was a change on either one of Subject, Classroom, Date and if so:
...
i = 1
...
rs.Edit
rs![Rank in classroom]=i
rs.Update
i = i + 1
rs.movenext
Wend
rs.close
This obviously builds only one of the ranks (in this case per subject per classroom), and it takes alone 3min 10sec.
I verified that it takes so long due to the writes on the table (rs.Edit and rs.Update are the culprits, commenting them makes the whole thing run in only 4 seconds), but I need the ranks written to the table to generate an access report later.
FINALLY:
I could generate all the ranks once and make ways for the users to access all the data very quickly, but the idea is that everything should be calculated on-the-fly. The times we have achieved, however, make this impossible.
Overall, the question to be asked is the following:
-Is there a way to calculate the ranks shown above through an Access Query under 10 seconds, or to use VBA and calculate-insert these ranks to the table in a similar time considering the size of the tables used here?
Also, I would love to see a list of efficient ranking algorithms, so that even if I can't do everything quickly, I can improve it as much as possible.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
为什么?
为什么要一遍又一遍地重新生成相同的数据呢?最可能的做法是在数据发生变化时生成这些统计数据,然后每隔一段时间就查找它们。每当有人想要检查某些内容时,重做已经完成的工作是愚蠢的。
Why?
Why bother regenerating the same data over and over? It's most likely preferable to generate these statistics when the data changes and just look them up every other time. Redoing work you've already done whenever somebody wants to check something is just silly.
我刚刚看到您说仅限 ms access,
因此请忽略此答案 - 或者如果您希望能够进行这种类型的电源处理,请考虑迁移到真正的数据库。
下面的原始答案
我无权访问您的测试数据,但是它运行的速度有多快?
我的猜测是,您将无法在 VBA 中击败 SQL Server 的排名速度,如果这还不够快,那么您需要查看探查器并查看它建议您创建哪些索引。
I just saw you say ms access only
so ignore this answer -- or consider moving to a real DB if you want to be able to do this type of power processing.
original answer below
I don't have access to your test data, but how fast does this run?
My guess is you are not going to be able to beat SQL Servers ranking speed in VBA, if this is not fast enough then you need to look in the profiler and see what indexes it suggests you make.