如何查询数据库中用户的排名,但只考虑每个用户的最新条目?
假设我有一个名为“Scrape”的数据库表,其设置可能如下:
UserID (int)
UserName (varchar)
Wins (int)
Losses (int)
ScrapeDate (datetime)
我试图根据用户的赢/输比对他们进行排名。但是,每周我都会抓取有关用户的新数据,并在抓取表中添加另一个条目。
如何查询按赢/输排序的用户列表,但只考虑最近的条目 (ScrapeDate)?
另外,您认为人们会访问该网站并且刮擦可能正在完成中这很重要吗?
例如,我可以:
1 - Bob - Wins: 320 - Losses: 110 - ScrapeDate: 7/8/09
1 - Bob - Wins: 360 - Losses: 122 - ScrapeDate: 7/17/09
2 - Frank - Wins: 115 - Losses: 20 - ScrapeDate: 7/8/09
其中,这表示到目前为止仅更新了 Bob 的抓取,并且正在更新 Frank 但尚未插入。您也将如何处理这种情况?
所以,我的问题是:
- 您如何处理仅查询每个用户最近的抓取以确定排名
- 您是否认为数据库可能处于更新状态(特别是如果抓取可能需要长达 1 天的时间)完成),并且并非所有用户都已完全更新,这还重要吗?如果是这样,你会如何处理?
谢谢您,也感谢您对我的相关问题的答复:
Lets say I have a database table called "Scrape" possibly setup like:
UserID (int)
UserName (varchar)
Wins (int)
Losses (int)
ScrapeDate (datetime)
I'm trying to be able to rank my users based on their Wins/Loss ratio. However, each week I'll be scraping for new data on the users and making another entry in the Scrape table.
How can I query a list of users sorted by wins/losses, but only taking into consideration the most recent entry (ScrapeDate)?
Also, do you think it matters that people will be hitting the site and the scrape may possibly be in the middle of completing?
For example I could have:
1 - Bob - Wins: 320 - Losses: 110 - ScrapeDate: 7/8/09
1 - Bob - Wins: 360 - Losses: 122 - ScrapeDate: 7/17/09
2 - Frank - Wins: 115 - Losses: 20 - ScrapeDate: 7/8/09
Where, this represents a scrape that has only updated Bob so far, and is in the process of updating Frank but has yet to be inserted. How would you handle this situation as well?
So, my question is:
- How would you handle querying only the most recent scrape of each user to determine the rankings
- Do you think the fact that the database may be in a state of updating (especially if a scrape could take up to 1 day to complete), and not all users have completely updated yet matters? If so, how would you handle this?
Thank you, and thank you for your responses you have given me on my related question:
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
这就是我所说的“每组最大的”问题。它每周在 StackOverflow 上出现几次。
我使用外连接技术解决了此类问题:
这将为每个用户名仅返回一行 -
ScrapeDate
列中具有最大值的行。这就是外连接的用途,尝试将s1
与具有相同用户名和更大日期的其他行s2
进行匹配。如果没有这样的行,则外连接将为s2
的所有列返回 NULL,然后我们就知道s1
对应于给定用户名具有最大日期的行。当您正在进行部分完成的抓取时,这也应该有效。
这种技术不一定像其他答案给出的 CTE 和 RANKING 解决方案那么快。您应该两者都尝试一下,看看哪种更适合您。我更喜欢我的解决方案的原因是它适用于任何 SQL 风格。
This is what I call the "greatest-n-per-group" problem. It comes up several times per week on StackOverflow.
I solve this type of problem using an outer join technique:
This will return only one row for each username -- the row with the greatest value in the
ScrapeDate
column. That's what the outer join is for, to try to matchs1
with some other rows2
with the same username and a greater date. If there is no such row, the outer join returns NULL for all columns ofs2
, and then we knows1
corresponds to the row with the greatest date for that given username.This should also work when you have a partially-completed scrape in progress.
This technique isn't necessarily as speedy as the CTE and RANKING solutions other answers have given. You should try both and see what works better for you. The reason I prefer my solution is that it works in any flavor of SQL.
尝试如下操作:
这应该可行,但取决于您的数据库大小。
我没有测试此代码,因此它无法在首次运行时编译。
Try something like:
This should work, however depends on your database size.
I do not test this code, so it could not compile on first run.
您问题第一部分的答案取决于您使用的 SQL Server 版本 - SQL 2005+ 提供 排名函数使这种查询比 SQL 2000 及之前的版本更简单。如果您指出您正在使用哪个平台,我将更详细地更新此内容。
我怀疑处理第 2 部分的最清晰方法是显示最新完整抓取练习的统计数据,否则您不会显示时间一致的排名(尽管,如果您的数据收集练习需要 24 小时,则存在一定的自由度)已经)。
为了简化这一过程,您可以创建一个表来保存有关每个抓取操作的元数据,为每个操作提供一个 ID、开始日期和完成日期(至少),并显示与最新完整抓取相关的记录。为了使这更容易,您可以从数据收集表中删除“抓取日期”,并将其替换为将每个数据行链接到抓取表中的行的外键。
编辑
以下代码说明了如何根据用户的最新分数对用户进行排名,而不管它们是否具有时间一致性:
编辑2
使用元数据表进行选择的说明最新完整的抓取:
The answer to part one of your question depends on the version of SQL server you are using - SQL 2005+ offers ranking functions which make this kind of query a bit simpler than in SQL 2000 and before. I'll update this with more detail if you will indicate which platform you're using.
I suspect the clearest way to handle part 2 is to display the stats for the latest complete scraping exercise, otherwise you aren't showing a time-consistent ranking (although, if your data collection exercise takes 24 hours, there's a certain amount of latitude already).
To simplify this, you could create a table to hold metadata about each scrape operation, giving each one an id, start date and completion date (at a minimum), and display those records which relate to the latest complete scrape. To make this easier, you could remove the "scrape date" from the data collection table, and replace it with a foreign key linking each data row to a row in the scrape table.
EDIT
The following code illustrates how to rank users by their latest score, regardless of whether they are time-consistent:
EDIT 2
An illustration of the use of a metadata table to select the latest complete scrape: