使用 JOIN 时如何避免结果集中出现冗余数据字段？

发布于 2024-09-08 15:32:27 字数 1687 浏览 9 评论 0原文

以下联接应该检索具有特定状态的用户的用户信息及其消息：

SELECT * FROM user, message WHERE message.user_id=user.id AND user.status=1

问题是结果集中有关特定用户的所有行都包含重复有关该用户的相同数据的冗余列（从用户检索的那些字段）表），只有消息表中的字段包含非冗余信息。像这样的事情：

user.id  username email            message.id  subject
1        jane     [email protected]   120         Notification 
1        jane     [email protected]   122         Re:Hello 
1        jane     [email protected]   125         Quotation
2        john     [email protected]   127         Hi jane
2        john     [email protected]   128         Fix thiss 
2        john     [email protected]   129         Ok
3        jim      [email protected]      140         Re:Re:Quotation

正如你所看到的，许多数据都是冗余的，我们不想首先找到用户，然后在类似结构或类似结构的循环中处理他们的消息。应不惜一切代价避免导致微查询的循环。

我不关心程序的输出，这在 UI 中处理得很好。我认为，如果我能够设法消除与该用户相关的所有行中的用户数据重复，那么返回该查询结果所产生的网络流量可能会大大减少。

原文

Following join is supposed to retrieve user info along with their messages for users with a certain status:

SELECT * FROM user, message WHERE message.user_id=user.id AND user.status=1

The problem is that all rows about a certain user in the result set contain redundant columns that repeat the same data about that user (those fields retrieved from user table), only fields from the message table contain non-redundant information. Something like this:

user.id  username email            message.id  subject
1        jane     [email protected]   120         Notification 
1        jane     [email protected]   122         Re:Hello 
1        jane     [email protected]   125         Quotation
2        john     [email protected]   127         Hi jane
2        john     [email protected]   128         Fix thiss 
2        john     [email protected]   129         Ok
3        jim      [email protected]      140         Re:Re:Quotation

As you can see many data are redundant and we do not want to first find the users and then go about their messages in loop like structures or something like that. Loops that cause micro-queries should be avoided at all costs.

I am not concerned about the output of my program, that is well handled in the UI. I think perhaps the network traffic produced by returning the result of this query could be substantially reduced if somehow I can manage to eliminate the repetition of user data in all rows pertaining to that user.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

小嗲 2024-09-15 15:32:27

有几件事你应该知道。

第一个是默认的 SQL JOIN 构造本质上是一个集合叉积，受 WHERE 子句的限制。这意味着它是乘法的 - 您会得到重复的结果，然后将其修剪掉。您还必须小心 NULL 字段的存在。

第二个是有一个“DISTINCT”关键字。当您在选择的列中添加此前缀时，您将在结果中最多获得该列的某个值的一个实例。因此，根据您的查询，“SELECT DISTINCT user.id FROM”将消除服务器端的冗余。

第三，解决这个问题的正确方法很可能不使用 * 运算符。我建议：

SELECT user.id,username,email,subject FROM message m,user WHERE m.user_id=user.id AND user.status=1

这使用简单、易于理解的隐式连接语法，并且在任何服务器上都应该是有效的 SQL。至少我可以保证它可以与 MySQL 一起使用。它还将“message”表别名为“m”作为简写。

正如您所猜测的，这将减少从 SQL 服务器到数据库的流量。

编辑：如果您想消除“冗余”电子邮件信息，则不能 - 您必须进行两个不同的查询。 SQL 结果是表格，并且必须是矩形，并填充了所有已知值。没有“同上”条目。

编辑2：您只需进行两个查询。例如：

SELECT subject FROM message WHERE message.id IN (SELECT user.id FROM user WHERE status=1)

这是一个包含嵌套查询的查询，因此它实际上会产生两次数据库命中。但它没有任何编程循环。

There are several things you should know.

The first is that the default SQL JOIN construct is essentially a set cross product, restricted by the WHERE clause. This means it's multiplicative - you get duplicate results out which you then prune down. You also have to be careful in the presence of NULL fields.

The second is that there is a 'DISTINCT' keyword. When you prefix a column in the selection with this, you'll get at most one instance of a certain value for that column in the results. So, as per your query, 'SELECT DISTINCT user.id FROM' will eliminate the redundancies on the server side.

The third is that the correct way to solve this problem is likely not using the * operator. I suggest:

SELECT user.id,username,email,subject FROM message m,user WHERE m.user_id=user.id AND user.status=1

This uses the simple, easy-to-understand implicit-join syntax and should be valid SQL on whatever server. I can vouch for it working with MySQL, at least. It also aliases the 'message' table to 'm' as shorthand.

As you surmise, this will reduce the traffic from the SQL server to your database.

edit: if you want to eliminate the "redundant" email information, you can't - you must make two distinct queries. SQL results are tables and must be rectangular, with all known values filled. There's no 'ditto' entry.

edit 2: You only have to make two queries. For instance:

SELECT subject FROM message WHERE message.id IN (SELECT user.id FROM user WHERE status=1)

This is one query that contains a nested query, so it's really making two database hits. But it doesn't have any programmatic loops.

回复收藏 0 原文

白衬杉格子梦 2024-09-15 15:32:27

在直接的 sql 查询中，如果将它们保留为单个查询，则不会出现这种情况。如果您以编程方式打印此信息，那么您将按用户 ID 排序，并且仅在用户 ID 更改时重新打印该信息。

回复收藏 0 原文

黯淡〆 2024-09-15 15:32:27

在 SQL 标准中，您将使用 NATURAL JOIN；这会连接公共列名称，并且仅保留这些公共名称的一份副本。

在实践中，您仔细列出所需的列，而不是诉诸“*”速记符号。

回复收藏 0 原文

酷遇一生 2024-09-15 15:32:27

假设您可以使用存储过程，您可以编写一个来运行上述查询，然后使用游标来存储“冗余信息”的空值，以获得类似的信息

user.id 用户名 电子邮件 message.id 主题 
1 jane [电子邮件受保护] 120 通知  
空 空 空 122 回复：你好  
null null null 125 报价 
2 约翰 [电子邮件受保护] 127 你好，简 
null null null 128 修复此问题  
空 空 空 129 好的 
3 jim [电子邮件受保护] 140 Re:Re:Quotation

然后将此结果集返回到临时表中。但是，虽然这可能会减少网络流量，但会增加处理开销

另一种方法是运行 2 个查询，一个查询获取用户信息，另一个查询仅使用链接的用户 ID 获取消息信息，然后执行“加入”使用应用程序服务器端代码。类似于

SELECT DISTINCT user.* FROM user, message WHERE message.user_id=user.id AND user.status=1

和

SELECT user.id, message.* FROM user, message WHERE message.user_id=user.id AND user.status=1

的内容将导致 2 次访问数据库，而不是 1 次，即使网络流量减少，最终也可能会变慢。

另一种方法是将这两个结果集中到一个结果集中，例如

SELECT user.* FROM user, message WHERE message.user_id=user.id AND user.status=1
UNION ALL
SELECT user.id, message.* FROM user, message WHERE message.user_id=user.id AND user.status=1

得到类似的结果

 user.id 用户名/message.id 电子邮件/主题 
    1 jane [电子邮件受保护]   
    2 john [电子邮件受保护]   
    3 jim [电子邮件受保护]      
    1 120 通知           
    1 122 回复：你好           
    1 125 报价          
    2 127 嗨，简          
    2 128 解决这个问题           
    2 129 好的          
    3 140 回复：回复：报价

然后使用应用程序服务器逻辑将其分离出来。网络流量减少，但应用程序服务器负载增加/数据库服务器负载略有增加。

但所节省的网络流量根本不值得增加复杂性。

Assuming you can use stored procedure, you could write one to run the above query and then use a cursor to store nulls for 'redundant information' to get something like

user.id  username email            message.id  subject 
1        jane     [email protected]   120         Notification  
null     null     null             122         Re:Hello  
null     null     null             125         Quotation 
2        john     [email protected]   127         Hi jane 
null     null     null             128         Fix thiss  
null     null     null             129         Ok 
3        jim      [email protected]      140         Re:Re:Quotation

and then return this resultset in a temporary table. but while this may reduce network traffic, it will add a processing overhead

Another way is to run 2 queries, one to get the user information, and the other to get the message information with just the linked user id and then do the "join" using application server side code. something like

SELECT DISTINCT user.* FROM user, message WHERE message.user_id=user.id AND user.status=1

and

SELECT user.id, message.* FROM user, message WHERE message.user_id=user.id AND user.status=1

which will result in 2 trips to the database, instead of 1, which might eventually be slower, even if the network traffic is reduced.

And another way is to bunch these 2 into a single resultset with something like

SELECT user.* FROM user, message WHERE message.user_id=user.id AND user.status=1
UNION ALL
SELECT user.id, message.* FROM user, message WHERE message.user_id=user.id AND user.status=1

to get something like

   user.id  username/message.id    email/subject 
    1        jane                   [email protected]   
    2        john                   [email protected]   
    3        jim                    [email protected]      
    1        120                    Notification           
    1        122                    Re:Hello           
    1        125                    Quotation          
    2        127                    Hi jane          
    2        128                    Fix thiss           
    2        129                    Ok          
    3        140                    Re:Re:Quotation

and then use application server logic to separate it out. reduced network traffic but more application server load / marginally more database server load.

But the saved network traffic is rarely worth the added complexity.

回复收藏 0 原文

~没有更多了~