如何查询数据库中用户的排名，但只考虑每个用户的最新条目？

发布于 2024-08-04 16:11:46 字数 940 浏览 12 评论 0原文

假设我有一个名为“Scrape”的数据库表，其设置可能如下：

UserID (int)   
UserName (varchar)  
Wins (int)   
Losses (int)  
ScrapeDate (datetime)

我试图根据用户的赢/输比对他们进行排名。但是，每周我都会抓取有关用户的新数据，并在抓取表中添加另一个条目。

如何查询按赢/输排序的用户列表，但只考虑最近的条目 (ScrapeDate)？

另外，您认为人们会访问该网站并且刮擦可能正在完成中这很重要吗？

例如，我可以：

1 - Bob - Wins: 320 - Losses: 110 - ScrapeDate: 7/8/09  
1 - Bob - Wins: 360 - Losses: 122 - ScrapeDate: 7/17/09  
2 - Frank - Wins: 115 - Losses: 20 - ScrapeDate: 7/8/09

其中，这表示到目前为止仅更新了 Bob 的抓取，并且正在更新 Frank 但尚未插入。您也将如何处理这种情况？

所以，我的问题是：

您如何处理仅查询每个用户最近的抓取以确定排名
您是否认为数据库可能处于更新状态（特别是如果抓取可能需要长达 1 天的时间）完成），并且并非所有用户都已完全更新，这还重要吗？如果是这样，你会如何处理？

谢谢您，也感谢您对我的相关问题的答复：

从网页中抓取大量统计信息时，我应该多久将收集到的结果插入到数据库中？

原文

Lets say I have a database table called "Scrape" possibly setup like:

UserID (int)   
UserName (varchar)  
Wins (int)   
Losses (int)  
ScrapeDate (datetime)

I'm trying to be able to rank my users based on their Wins/Loss ratio. However, each week I'll be scraping for new data on the users and making another entry in the Scrape table.

How can I query a list of users sorted by wins/losses, but only taking into consideration the most recent entry (ScrapeDate)?

Also, do you think it matters that people will be hitting the site and the scrape may possibly be in the middle of completing?

For example I could have:

1 - Bob - Wins: 320 - Losses: 110 - ScrapeDate: 7/8/09  
1 - Bob - Wins: 360 - Losses: 122 - ScrapeDate: 7/17/09  
2 - Frank - Wins: 115 - Losses: 20 - ScrapeDate: 7/8/09

Where, this represents a scrape that has only updated Bob so far, and is in the process of updating Frank but has yet to be inserted. How would you handle this situation as well?

So, my question is:

How would you handle querying only the most recent scrape of each user to determine the rankings
Do you think the fact that the database may be in a state of updating (especially if a scrape could take up to 1 day to complete), and not all users have completely updated yet matters? If so, how would you handle this?

Thank you, and thank you for your responses you have given me on my related question:

When scraping a lot of stats from a webpage, how often should I insert the collected results in my DB?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

挖个坑埋了你 2024-08-11 16:11:46

这就是我所说的“每组最大的”问题。它每周在 StackOverflow 上出现几次。

我使用外连接技术解决了此类问题：

SELECT s1.*, s1.wins / s1.losses AS win_loss_ratio
FROM Scrape s1
LEFT OUTER JOIN Scrape s2
  ON (s1.username = s2.username AND s1.ScrapeDate < s2.ScrapeDate)
WHERE s2.username IS NULL
ORDER BY win_loss_ratio DESC;

这将为每个用户名仅返回一行 - ScrapeDate 列中具有最大值的行。这就是外连接的用途，尝试将s1与具有相同用户名和更大日期的其他行s2进行匹配。如果没有这样的行，则外连接将为 s2 的所有列返回 NULL，然后我们就知道 s1 对应于给定用户名具有最大日期的行。

当您正在进行部分完成的抓取时，这也应该有效。

这种技术不一定像其他答案给出的 CTE 和 RANKING 解决方案那么快。您应该两者都尝试一下，看看哪种更适合您。我更喜欢我的解决方案的原因是它适用于任何 SQL 风格。

This is what I call the "greatest-n-per-group" problem. It comes up several times per week on StackOverflow.

I solve this type of problem using an outer join technique:

SELECT s1.*, s1.wins / s1.losses AS win_loss_ratio
FROM Scrape s1
LEFT OUTER JOIN Scrape s2
  ON (s1.username = s2.username AND s1.ScrapeDate < s2.ScrapeDate)
WHERE s2.username IS NULL
ORDER BY win_loss_ratio DESC;

This will return only one row for each username -- the row with the greatest value in the ScrapeDate column. That's what the outer join is for, to try to match s1 with some other row s2 with the same username and a greater date. If there is no such row, the outer join returns NULL for all columns of s2, and then we know s1 corresponds to the row with the greatest date for that given username.

This should also work when you have a partially-completed scrape in progress.

This technique isn't necessarily as speedy as the CTE and RANKING solutions other answers have given. You should try both and see what works better for you. The reason I prefer my solution is that it works in any flavor of SQL.

回复收藏 0 原文

明媚殇 2024-08-11 16:11:46

尝试如下操作：

为每个用户选择用户 ID 和上次条目的最大日期。
根据以上查询结果对记录进行选择和排序以获得排名。

这应该可行，但取决于您的数据库大小。

DECLARE 
    @last_entries TABLE(id int, dte datetime)

-- insert date (dte) of last entry for each user (id)
INSERT INTO
    @last_entries (id, dte)
SELECT
    UserID,
    MAX(ScrapeDate)
FROM
    Scrape WITH (NOLOCK)
GROUP BY
    UserID

-- select ranking
SELECT
    -- optionally you can use RANK OVER() function to get rank value
    UserName,
    Wins,
    Losses
FROM
    @last_entries
    JOIN
        Scraps WITH (NOLOCK)
    ON
        UserID = id
        AND ScrapeDate = dte
ORDER BY
    Winds,
    Losses

我没有测试此代码，因此它无法在首次运行时编译。

Try something like:

Select user id and max date of last entry for each user.
Select and order records to get ranking based on above query results.

This should work, however depends on your database size.

DECLARE 
    @last_entries TABLE(id int, dte datetime)

-- insert date (dte) of last entry for each user (id)
INSERT INTO
    @last_entries (id, dte)
SELECT
    UserID,
    MAX(ScrapeDate)
FROM
    Scrape WITH (NOLOCK)
GROUP BY
    UserID

-- select ranking
SELECT
    -- optionally you can use RANK OVER() function to get rank value
    UserName,
    Wins,
    Losses
FROM
    @last_entries
    JOIN
        Scraps WITH (NOLOCK)
    ON
        UserID = id
        AND ScrapeDate = dte
ORDER BY
    Winds,
    Losses

I do not test this code, so it could not compile on first run.

回复收藏 0 原文

携君以终年 2024-08-11 16:11:46

您问题第一部分的答案取决于您使用的 SQL Server 版本 - SQL 2005+ 提供排名函数使这种查询比 SQL 2000 及之前的版本更简单。如果您指出您正在使用哪个平台，我将更详细地更新此内容。

我怀疑处理第 2 部分的最清晰方法是显示最新完整抓取练习的统计数据，否则您不会显示时间一致的排名（尽管，如果您的数据收集练习需要 24 小时，则存在一定的自由度）已经）。

为了简化这一过程，您可以创建一个表来保存有关每个抓取操作的元数据，为每个操作提供一个 ID、开始日期和完成日期（至少），并显示与最新完整抓取相关的记录。为了使这更容易，您可以从数据收集表中删除“抓取日期”，并将其替换为将每个数据行链接到抓取表中的行的外键。

编辑

以下代码说明了如何根据用户的最新分数对用户进行排名，而不管它们是否具有时间一致性：

create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeDate datetime
)

INSERT #scrape
      select 'Alice',100,200,'20090101'
union select 'Alice',120,210,'20090201'
union select 'Bob'  ,200,200,'20090101'
union select 'Clara',300,100,'20090101'
union select 'Clara',300,210,'20090201'
union select 'Dave' ,100,10 ,'20090101'


;with latestScrapeCTE
AS
(
        SELECT *
               ,ROW_NUMBER() OVER (PARTITION BY userName
                                   ORDER BY scrapeDate desc
                                  ) AS rn
               ,wins + losses AS totalPlayed
               ,wins - losses as winDiff
        from #scrape
)
SELECT userName
       ,wins
       ,losses
       ,scrapeDate
       ,winDiff
       ,totalPlayed
       ,RANK() OVER (ORDER BY winDiff desc
                              ,totalPlayed desc
                    ) as rankPos
FROM latestScrapeCTE
WHERE rn = 1
ORDER BY rankPos

编辑2

使用元数据表进行选择的说明最新完整的抓取：

create table #scrape_run
(runID int identity
,startDate datetime
,completedDate datetime
)

create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeRunID int
)


INSERT #scrape_run
select '20090101', '20090102'
union select '20090201', null --null completion date indicates that the scrape is not complete

INSERT #scrape
      select 'Alice',100,200,1
union select 'Alice',120,210,2
union select 'Bob'  ,200,200,1
union select 'Clara',300,100,1
union select 'Clara',300,210,2
union select 'Dave' ,100,10 ,1


;with latestScrapeCTE
AS
(
        SELECT TOP 1 runID
                     ,startDate
        FROM #scrape_run
        WHERE completedDate IS NOT NULL
)
SELECT userName
       ,wins
       ,losses
       ,startDate     AS scrapeDate
       ,wins - losses AS winDiff
       ,wins + losses AS totalPlayed
       ,RANK() OVER (ORDER BY (wins - losses)  desc
                              ,(wins + losses) desc
                    ) as rankPos
FROM #scrape
JOIN latestScrapeCTE
ON   runID = scrapeRunID
ORDER BY rankPos

The answer to part one of your question depends on the version of SQL server you are using - SQL 2005+ offers ranking functions which make this kind of query a bit simpler than in SQL 2000 and before. I'll update this with more detail if you will indicate which platform you're using.

I suspect the clearest way to handle part 2 is to display the stats for the latest complete scraping exercise, otherwise you aren't showing a time-consistent ranking (although, if your data collection exercise takes 24 hours, there's a certain amount of latitude already).

To simplify this, you could create a table to hold metadata about each scrape operation, giving each one an id, start date and completion date (at a minimum), and display those records which relate to the latest complete scrape. To make this easier, you could remove the "scrape date" from the data collection table, and replace it with a foreign key linking each data row to a row in the scrape table.

EDIT

The following code illustrates how to rank users by their latest score, regardless of whether they are time-consistent:

create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeDate datetime
)

INSERT #scrape
      select 'Alice',100,200,'20090101'
union select 'Alice',120,210,'20090201'
union select 'Bob'  ,200,200,'20090101'
union select 'Clara',300,100,'20090101'
union select 'Clara',300,210,'20090201'
union select 'Dave' ,100,10 ,'20090101'


;with latestScrapeCTE
AS
(
        SELECT *
               ,ROW_NUMBER() OVER (PARTITION BY userName
                                   ORDER BY scrapeDate desc
                                  ) AS rn
               ,wins + losses AS totalPlayed
               ,wins - losses as winDiff
        from #scrape
)
SELECT userName
       ,wins
       ,losses
       ,scrapeDate
       ,winDiff
       ,totalPlayed
       ,RANK() OVER (ORDER BY winDiff desc
                              ,totalPlayed desc
                    ) as rankPos
FROM latestScrapeCTE
WHERE rn = 1
ORDER BY rankPos

EDIT 2

An illustration of the use of a metadata table to select the latest complete scrape:

create table #scrape_run
(runID int identity
,startDate datetime
,completedDate datetime
)

create table #scrape
(userName varchar(20)
,wins int
,losses int
,scrapeRunID int
)


INSERT #scrape_run
select '20090101', '20090102'
union select '20090201', null --null completion date indicates that the scrape is not complete

INSERT #scrape
      select 'Alice',100,200,1
union select 'Alice',120,210,2
union select 'Bob'  ,200,200,1
union select 'Clara',300,100,1
union select 'Clara',300,210,2
union select 'Dave' ,100,10 ,1


;with latestScrapeCTE
AS
(
        SELECT TOP 1 runID
                     ,startDate
        FROM #scrape_run
        WHERE completedDate IS NOT NULL
)
SELECT userName
       ,wins
       ,losses
       ,startDate     AS scrapeDate
       ,wins - losses AS winDiff
       ,wins + losses AS totalPlayed
       ,RANK() OVER (ORDER BY (wins - losses)  desc
                              ,(wins + losses) desc
                    ) as rankPos
FROM #scrape
JOIN latestScrapeCTE
ON   runID = scrapeRunID
ORDER BY rankPos

回复收藏 0 原文

~没有更多了~