Select values that meet different conditions on different rows

Posted on 2024-07-12 07:01:53

Let's say I have a two-column table like this:

userid  |  roleid
--------|--------
   1    |    1
   1    |    2
   1    |    3
   2    |    1

I want to get all distinct userids that have roleids 1, 2 AND 3. Using the above example, the only result I want returned is userid 1. How do I do this?

Comments (6)

轻拂→两袖风尘 2024-07-19 07:01:54

The classic way to do this is to treat it as a relational division problem.

In English: Select those users for whom none of the desired roleid values is missing.

I'll assume you have a Users table to which the UserRole table refers, and I'll assume the desired roleid values are in a table:

create table RoleGroup(
  roleid int not null,
  primary key(roleid)
);
insert into RoleGroup values (1);
insert into RoleGroup values (2);
insert into RoleGroup values (3);

I'll also assume all the relevant columns are not NULLable, so there are no surprises with IN or NOT EXISTS. Here's a SQL query that expresses the English above:

select userid from Users as U
where not exists (
  select * from RoleGroup as G
  where not exists (
    select R.roleid from UserRole as R
    where R.roleid = G.roleid
    and R.userid = U.userid
  )
);

Another way to write it is this:

select userid from Users as U
where not exists (
  select * from RoleGroup as G
  where G.roleid not in (
    select R.roleid from UserRole as R
    where R.userid = U.userid
  )
);

This may or may not end up being efficient, depending on indexes, platform, data, etc. Search the web for "relational division" and you'll find a lot.
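To make the double NOT EXISTS easier to try out, here is a minimal sketch that runs the first query above against the question's sample data. It uses Python's built-in `sqlite3` module with an in-memory database purely for convenience (an assumption; the answer does not name a platform), and the helper name `users_with_all_roles` is made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (userid INTEGER PRIMARY KEY);
    CREATE TABLE UserRole (
        userid INTEGER NOT NULL REFERENCES Users(userid),
        roleid INTEGER NOT NULL,
        PRIMARY KEY (userid, roleid)
    );
    CREATE TABLE RoleGroup (roleid INTEGER NOT NULL PRIMARY KEY);

    -- Sample data from the question: user 1 has roles 1, 2, 3; user 2 has role 1.
    INSERT INTO Users VALUES (1), (2);
    INSERT INTO UserRole VALUES (1, 1), (1, 2), (1, 3), (2, 1);
    INSERT INTO RoleGroup VALUES (1), (2), (3);
""")

# "Users for whom no required role is missing."
DIVISION_SQL = """
    SELECT userid FROM Users AS U
    WHERE NOT EXISTS (
        SELECT * FROM RoleGroup AS G
        WHERE NOT EXISTS (
            SELECT R.roleid FROM UserRole AS R
            WHERE R.roleid = G.roleid
              AND R.userid = U.userid
        )
    )
    ORDER BY userid
"""

def users_with_all_roles(connection):
    """Hypothetical helper: return userids holding every role in RoleGroup."""
    return [row[0] for row in connection.execute(DIVISION_SQL)]

division_result = users_with_all_roles(conn)
print(division_result)
```

Because the required roles live in the RoleGroup table, changing the requirement means changing rows rather than rewriting the query.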

当梦初醒 2024-07-19 07:01:54

Assuming userid, roleid are contained in a unique index (meaning there cannot be two records where userid = x and roleid = 1):

select count(*), userid from t
where roleid in (1,2,3)
group by userid
having count(*) = 3

长不大的小祸害 2024-07-19 07:01:54

select userid from userrole where roleid = 1
intersect
select userid from userrole where roleid = 2
intersect
select userid from userrole where roleid = 3

Won't this solve the problem? How good a solution is this on typical relational databases? Will the query optimizer optimize it automatically?
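As a quick check of the INTERSECT idea, here is a minimal sketch in which each branch filters on roleid (the column that holds the role) and the intersection keeps only users present in all three branches. It runs on an in-memory SQLite database through Python's `sqlite3` module, which is an assumption made for convenience, so it says nothing about how other engines plan the query.

```python
import sqlite3

# In-memory SQLite database holding the question's sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE userrole (
        userid INTEGER NOT NULL,
        roleid INTEGER NOT NULL,
        PRIMARY KEY (userid, roleid)
    );
    INSERT INTO userrole VALUES (1, 1), (1, 2), (1, 3), (2, 1);
""")

# One branch per required role; INTERSECT keeps userids found in every branch.
INTERSECT_SQL = """
    SELECT userid FROM userrole WHERE roleid = 1
    INTERSECT
    SELECT userid FROM userrole WHERE roleid = 2
    INTERSECT
    SELECT userid FROM userrole WHERE roleid = 3
"""

intersect_result = [row[0] for row in conn.execute(INTERSECT_SQL)]
print(intersect_result)
```

Whether an optimizer turns this into something as cheap as a join or an aggregate depends on the engine, so it is worth checking the plan on the target database.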

伏妖词 2024-07-19 07:01:54

If you need any kind of generality here (different 3-role combinations or different n-role combinations)...I'd suggest you use a bit masking system for your roles and use the bitwise operators to perform your queries...
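One way this suggestion could look in practice is sketched below. Everything about the layout is an assumption for illustration: role N is mapped to bit `1 << (N - 1)`, and the per-user mask is stored in a hypothetical table `usermask`. SQLite (via Python's `sqlite3`) is used only to keep the sketch self-contained.

```python
import sqlite3

def role_bit(roleid):
    """Assumed encoding: role 1 -> bit 0, role 2 -> bit 1, and so on."""
    return 1 << (roleid - 1)

def mask_for(roleids):
    """Combine several role ids into a single integer bitmask."""
    mask = 0
    for roleid in roleids:
        mask |= role_bit(roleid)
    return mask

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per user, all roles packed into rolemask.
conn.execute(
    "CREATE TABLE usermask (userid INTEGER PRIMARY KEY, rolemask INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO usermask VALUES (?, ?)",
    [(1, mask_for([1, 2, 3])), (2, mask_for([1]))],
)

# A user qualifies when every required bit is set in their mask.
required = mask_for([1, 2, 3])
bitmask_result = [
    row[0]
    for row in conn.execute(
        "SELECT userid FROM usermask WHERE (rolemask & ?) = ? ORDER BY userid",
        (required, required),
    )
]
print(bitmask_result)
```

The trade-off is that the number of roles is capped by the integer width (64 bits in SQLite), and a bitwise predicate like this typically cannot use an ordinary index on the mask column.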

榕城若虚 2024-07-19 07:01:53

Ok, I got downvoted on this, so I decided to test it:

CREATE TABLE userrole (
  userid INT,
  roleid INT,
  PRIMARY KEY (userid, roleid)
);

CREATE INDEX idx_userrole_roleid ON userrole (roleid); -- MySQL requires an index name; this one is arbitrary

Run this:

<?php
ini_set('max_execution_time', 120); // takes over a minute to insert 500k+ records

$start = microtime(true);

echo "<pre>\n";
// Note: the mysql_* functions were removed in PHP 7.0; on newer PHP, port this to mysqli or PDO.
mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
    echo "Connect error: " . mysql_error() . "\n";
}
mysql_select_db('scratch');
if (mysql_error()) {
    echo "Select DB error: " . mysql_error() . "\n";
}

$users = 200000;
$count = 0;
for ($i=1; $i<=$users; $i++) {
    $roles = rand(1, 4);
    $available = range(1, 5);
    for ($j=0; $j<$roles; $j++) {
        $extract = array_splice($available, rand(0, sizeof($available)-1), 1);
        $id = $extract[0];
        query("INSERT INTO userrole (userid, roleid) VALUES ($i, $id)");
        $count++;
    }
}

$stop = microtime(true);
$duration = $stop - $start;
$insert = $duration / $count;

echo "$count users added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $insert seconds.\n";
echo "</pre>\n";

function query($str) {
    mysql_query($str);
    if (mysql_error()) {
        echo "$str: " . mysql_error() . "\n";
    }
}
?>

Output:

499872 users added.
Program ran for 56.5513510704 seconds.
Insert time 0.000113131663847 seconds.

That adds 500,000 random user-role combinations and there are approximately 25,000 that match the chosen criteria.

First query:

SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP BY userid
HAVING COUNT(1) = 3

Query time: 0.312s

Second query:

SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
WHERE t1.roleid = 1

Query time: 0.016s

That's right. The join version I proposed is twenty times faster than the aggregate version.

Sorry, but I do this for a living and work in the real world, and in the real world we test SQL and the results speak for themselves.

The reason for this should be pretty clear. The aggregate query will scale in cost with the size of the table. Every row is processed, aggregated and filtered (or not) through the HAVING clause. The join version will (using an index) select a subset of the users based on a given role, then check that subset against the second role and finally that subset against the third role. Each selection (in relational algebra terms) works on an increasingly small subset. From this you can conclude:

The performance of the join version gets even better with a lower incidence of matches.

If there were only 500 users (out of the 500k sample above) that had the three stated roles, the join version would get significantly faster. The aggregate version would not (and any performance improvement would be a result of transporting 500 users instead of 25k, which the join version obviously gets too).
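For readers who want to repeat this kind of comparison without a MySQL or Oracle instance, here is a rough sketch that generates similar random data and runs both query shapes on an in-memory SQLite database via Python's `sqlite3`. The scale (20,000 users), the seed, the index name and the `timed` helper are all assumptions for illustration, and SQLite's planner is not MySQL's or Oracle's, so the timings it prints are not comparable to the figures quoted above.

```python
import random
import sqlite3
import time

random.seed(0)  # fixed seed so repeated runs use the same data
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE userrole (
        userid INTEGER NOT NULL,
        roleid INTEGER NOT NULL,
        PRIMARY KEY (userid, roleid)
    );
    CREATE INDEX idx_userrole_roleid ON userrole (roleid);
""")

# Each user gets 1 to 4 distinct roles drawn from roles 1..5.
rows = []
for userid in range(1, 20001):
    for roleid in random.sample(range(1, 6), random.randint(1, 4)):
        rows.append((userid, roleid))
conn.executemany("INSERT INTO userrole VALUES (?, ?)", rows)

AGGREGATE_SQL = """
    SELECT userid FROM userrole
    WHERE roleid IN (1, 2, 3)
    GROUP BY userid
    HAVING COUNT(*) = 3
"""
JOIN_SQL = """
    SELECT t1.userid
    FROM userrole t1
    JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
    JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
    WHERE t1.roleid = 1
"""

def timed(sql):
    """Run a query and return (sorted userids, elapsed seconds)."""
    start = time.perf_counter()
    result = sorted(row[0] for row in conn.execute(sql))
    return result, time.perf_counter() - start

aggregate_users, aggregate_seconds = timed(AGGREGATE_SQL)
join_users, join_seconds = timed(JOIN_SQL)
print(len(aggregate_users), aggregate_seconds, join_seconds)
```

The useful part is less the absolute numbers than the habit: confirm that both queries return the same users, then time them on data shaped like your own.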

I was also curious to see how a real database (i.e. Oracle) would deal with this. So I basically repeated the same exercise on Oracle XE (running on the same Windows XP desktop machine as the MySQL from the previous example) and the results were almost identical.

Joins seem to be frowned upon, but as I've demonstrated, aggregate queries can be an order of magnitude slower.

Update: After some extensive testing, the picture is more complicated and the answer will depend on your data, your database and other factors. The moral of the story is test, test, test.

意中人 2024-07-19 07:01:53

SELECT userid
FROM UserRole
WHERE roleid IN (1, 2, 3)
GROUP BY userid
HAVING COUNT(DISTINCT roleid) = 3;

Just thinking out loud, another way to write the self-join described by cletus is:

SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid
JOIN userrole t3 ON t2.userid = t3.userid
WHERE (t1.roleid, t2.roleid, t3.roleid) = (1, 2, 3);

This might be easier to read for you, and MySQL supports comparisons of tuples like that. MySQL also knows how to use covering indexes intelligently for this query. Just run it through EXPLAIN and see "Using index" in the notes for all three tables, which means it's reading the index and doesn't even have to touch the data rows.
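The tuple comparison can also be tried outside MySQL. The sketch below runs the same self-join on an in-memory SQLite database through Python's `sqlite3`, assuming a bundled SQLite of version 3.15 or later (the first release with row-value support). It also prints SQLite's `EXPLAIN QUERY PLAN` output, which is that engine's rough counterpart to MySQL's EXPLAIN; the wording of the plan differs between versions, so no particular text is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE userrole (
        userid INTEGER NOT NULL,
        roleid INTEGER NOT NULL,
        PRIMARY KEY (userid, roleid)
    );
    INSERT INTO userrole VALUES (1, 1), (1, 2), (1, 3), (2, 1);
""")

# Row-value comparison: requires SQLite 3.15+ (assumption about the runtime).
TUPLE_SQL = """
    SELECT t1.userid
    FROM userrole t1
    JOIN userrole t2 ON t1.userid = t2.userid
    JOIN userrole t3 ON t2.userid = t3.userid
    WHERE (t1.roleid, t2.roleid, t3.roleid) = (1, 2, 3)
"""

tuple_result = [row[0] for row in conn.execute(TUPLE_SQL)]
print(tuple_result)

# The last column of each plan row is the human-readable description.
for plan_row in conn.execute("EXPLAIN QUERY PLAN " + TUPLE_SQL):
    print(plan_row[-1])
```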

I ran this query over 2.1 million rows (the Stack Overflow July data dump for PostTags) using MySQL 5.1.48 on my MacBook, and it returned the result in 1.08 seconds. On a decent server with enough memory allocated to innodb_buffer_pool_size, it should be even faster.

To anyone reading this: my answer is simple and straightforward, and got the 'accepted' status, but please do go read the answer given by cletus. It has much better performance.
