通过 SQL 查询安全地标准化数据

发布于 2024-07-23 08:42:24 字数 3476 浏览 7 评论 0原文

假设我有一个客户表：

CREATE TABLE customers (
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

该表没有有主键。但是，对于任何给定的 customer_number，customer_name 和 customer_address 应该是唯一的。

该表包含许多重复客户的情况并不罕见。为了避免这种重复，使用以下查询来仅隔离唯一的客户：

SELECT
  DISTINCT customer_number, customer_name, customer_address
FROM customers

幸运的是，该表传统上包含准确的数据。也就是说，任何 customer_number 都不会出现冲突的 customer_name 或 customer_address。然而，假设表中确实出现了冲突的数据。我希望编写一个失败的查询，而不是返回相关 customer_number 的多行。

例如，我尝试了这个查询，但没有成功：

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number

Is there a way to write such a query using standard SQL? 如果没有的话，Oracle特定的SQL有解决方案吗？

编辑：奇怪查询背后的基本原理：

说实话，这个客户表实际上并不存在（谢天谢地）。我创建它是希望它能够足够清晰地展示查询的需求。然而，基于该示例，人们（幸运的是）意识到对此类查询的需求是我最不用担心的。因此，我现在必须剥离一些抽象，并希望恢复我提出如此令人厌恶的表格的声誉......

我从外部系统收到一个包含发票（每行一张）的平面文件。我逐行读取该文件，将其字段插入到该表中：

CREATE TABLE unprocessed_invoices (
    invoice_number   INTEGER,
    invoice_date     DATE,
    ...
    // other invoice columns
    ...
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

如您所见，从外部系统到达的数据是非规范化的。也就是说，外部系统在同一行上包含发票数据及其关联的客户数据。多张发票可能共享同一客户，因此可能存在重复的客户数据。

在保证所有客户都在系统中注册之前，系统无法开始处理发票。因此，系统必须识别唯一客户并根据需要对其进行注册。这就是我想要查询的原因：因为我正在处理我无法控制的非规范化数据。

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number

希望这有助于澄清问题的初衷。

编辑：好/坏数据的示例

澄清一下：customer_name 和 customer_address 仅对于特定 customer_number 来说必须是唯一的。

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 3               | 'Fred'        | '456 Avenue'
 3               | 'Fred'        | '789 Crescent'

前两行很好，因为 customer_number 1 的 customer_name 和 customer_address 是相同的。

中间两行很好，因为它是相同的customer_number 2 的 customer_name 和 customer_address（即使另一个 customer_number 具有相同的 customer_name和customer_address）。

最后两行不行，因为customer_number有两个不同的customer_addresses 3。

如果运行的话，我正在查找的查询将会失败所有这六行。但是，如果仅前四行实际存在，则视图应返回：

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'

我希望这能澄清我所说的“customer_name 和 customer_address 冲突”的含义。每个 customer_number 必须是唯一的。

我很欣赏那些解释如何正确从外部系统导入数据的人。事实上，我已经做了其中的大部分工作。我故意隐藏了我正在做的事情的所有细节，以便更容易专注于手头的问题。此查询并不是唯一的验证形式。我只是认为这会起到很好的画龙点睛作用（可以说是最后的防御）。这个问题只是为了调查 SQL 的可能性。 :)

原文

Suppose I have a table of customers:

CREATE TABLE customers (
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

This table does not have a primary key. However, customer_name and customer_address should be unique for any given customer_number.

It is not uncommon for this table to contain many duplicate customers. To get around this duplication, the following query is used to isolate only the unique customers:

SELECT
  DISTINCT customer_number, customer_name, customer_address
FROM customers

Fortunately, the table has traditionally contained accurate data. That is, there has never been a conflicting customer_name or customer_address for any customer_number. However, suppose conflicting data did make it into the table. I wish to write a query that will fail, rather than returning multiple rows for the customer_number in question.

For example, I tried this query with no success:

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number

Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?

EDIT: The rationale behind the bizarre query:

Truth be told, this customers table does not actually exist (thank goodness). I created it hoping that it would be clear enough to demonstrate the needs of the query. However, people are (fortunately) catching on that the need for such a query is the least of my worries, based on that example. Therefore, I must now peel away some of the abstraction and hopefully restore my reputation for suggesting such an abomination of a table...

I receive a flat file containing invoices (one per line) from an external system. I read this file, line-by-line, inserting its fields into this table:

CREATE TABLE unprocessed_invoices (
    invoice_number   INTEGER,
    invoice_date     DATE,
    ...
    // other invoice columns
    ...
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

As you can see, the data arriving from the external system is denormalized. That is, the external system includes both the invoice data and its associated customer data on the same line. It is possible that multiple invoices will share the same customer, therefore it is possible to have duplicate customer data.

The system cannot begin processing the invoices until all customers are guaranteed to be registered with the system. Therefore, the system must identify the unique customers and register them as necessary. This is why I wanted the query: because I was working with denormalized data I had no control over.

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number

Hopefully this helps clarify the original intent of the question.

EDIT: Examples of good/bad data

To clarify: customer_name and customer_address only have to be unique for a particular customer_number.

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 3               | 'Fred'        | '456 Avenue'
 3               | 'Fred'        | '789 Crescent'

The first two rows are fine because it is the same customer_name and customer_address for customer_number 1.

The middle two rows are fine because it is the same customer_name and customer_address for customer_number 2 (even though another customer_number has the same customer_name and customer_address).

The last two rows are not okay because there are two different customer_addresses for customer_number 3.

The query I am looking for would fail if run against all six of these rows. However, if only the first four rows actually existed, the view should return:

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'

I hope this clarifies what I meant by "conflicting customer_name and customer_address". They have to be unique per customer_number.

I appreciate those that are explaining how to properly import data from external systems. In fact, I am already doing most of that already. I purposely hid all the details of what I'm doing so that it would be easier to focus on the question at hand. This query is not meant to be the only form of validation. I just thought it would make a nice finishing touch (a last defense, so to speak). This question was simply designed to investigate just what was possible with SQL. :)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

戏蝶舞 2024-07-30 08:42:25

使查询失败可能很棘手...

这将显示表中是否有重复记录：

select customer_number, customer_name, customer_address
from customers
group by customer_number, customer_name, customer_address
having count(*) > 1

如果您只是为所有三个字段添加唯一索引，则没有人可以在表中创建重复记录。

Making the query fail may be tricky...

This will show you if there are any duplicate records in the table:

select customer_number, customer_name, customer_address
from customers
group by customer_number, customer_name, customer_address
having count(*) > 1

If you just add a unique index for all the three fields, noone can create a duplicate record in the table.

回复收藏 0 原文

披肩女神 2024-07-30 08:42:25

事实上的键是名称+地址，所以这就是您需要分组的依据。

SELECT
  Customer_Name,
  Customer_Address,
  CASE WHEN Count(DISTINCT Customer_Number) > 1
    THEN 1/0 ELSE 0 END as LandMine
FROM Customers
GROUP BY Customer_Name, Customer_Address

如果您想从 Customer_Number 的角度进行操作，那么这也很好。

SELECT *, 
CASE WHEN Exists((
  SELECT top 1 1
  FROM Customers c2
  WHERE c1.Customer_Number != c2.Customer_Number
    AND c1.Customer_Name = c2.Customer_Name
    AND c1.Customer_Address = c2.Customer_Address
)) THEN 1/0 ELSE 0 END as LandMine
FROM Customers c1
WHERE Customer_Number = @Number

The defacto key is Name+Address, so that's what you need to group by.

SELECT
  Customer_Name,
  Customer_Address,
  CASE WHEN Count(DISTINCT Customer_Number) > 1
    THEN 1/0 ELSE 0 END as LandMine
FROM Customers
GROUP BY Customer_Name, Customer_Address

If you want to do it from the point of view of a Customer_Number, then this is good too.

SELECT *, 
CASE WHEN Exists((
  SELECT top 1 1
  FROM Customers c2
  WHERE c1.Customer_Number != c2.Customer_Number
    AND c1.Customer_Name = c2.Customer_Name
    AND c1.Customer_Address = c2.Customer_Address
)) THEN 1/0 ELSE 0 END as LandMine
FROM Customers c1
WHERE Customer_Number = @Number

回复收藏 0 原文

∝单色的世界 2024-07-30 08:42:25

如果你有脏数据，我会先清理它。

用它来查找重复的客户记录...

Select * From customers
Where customer_number in 
  (Select Customer_number from customers
  Group by customer_number Having count(*) > 1)

If you have dirty data, I would clean it up first.

Use this to find the duplicate customer records...

Select * From customers
Where customer_number in 
  (Select Customer_number from customers
  Group by customer_number Having count(*) > 1)

回复收藏 0 原文

土豪 2024-07-30 08:42:25

如果你希望它失败，你将需要有一个索引。如果您不想有索引，那么您可以创建一个临时表来完成这一切。

CREATE TABLE #temp_customers 
    (customer_number int, 
    customer_name varchar(50), 
    customer_address varchar(50),
    PRIMARY KEY (customer_number),
     UNIQUE(customr_name, customer_address))

）

INSERT INTO #temp_customers
SELECT DISTINCT customer_number, customer_name, customer_address
FROM customers

SELECT customer_number, customer_name, customer_address
FROM #temp_customers

DROP TABLE #temp_customers

如果存在问题，这将失败，但会防止您的重复记录引起问题。

If you want it to fail you're going to need to have an index. If you don't want to have an index, then you can just create a temp table to do this all in.

CREATE TABLE #temp_customers 
    (customer_number int, 
    customer_name varchar(50), 
    customer_address varchar(50),
    PRIMARY KEY (customer_number),
     UNIQUE(customr_name, customer_address))

)

INSERT INTO #temp_customers
SELECT DISTINCT customer_number, customer_name, customer_address
FROM customers

SELECT customer_number, customer_name, customer_address
FROM #temp_customers

DROP TABLE #temp_customers

This will fail if there are issues but will keep your duplicate records from causing issues.

回复收藏 0 原文

梦言归人 2024-07-30 08:42:25

让我们将数据放入临时表或带有不同查询的表变量中。

select distinct customer_number, customer_name, customer_address, 
  IDENTITY(int, 1,1) AS ID_Num
into #temp 
from unprocessed_invoices

如果可能的话，我个人也会向未处理的发票添加不确定性。我从来不会在不创建具有标识列的临时表的情况下进行导入，只是因为删除重复记录更容易。

现在让我们查询表来查找您的问题记录。我想你会想看看是什么原因导致了问题，而不仅仅是让他们失望。

Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number

您可以使用这些查询的变体从 #temp 中删除问题记录（取决于您选择保留一个问题还是删除所有可能的问题），然后从 #temp 插入到生产表中。您还可以将问题记录提供给向您提供数据以供最终修复的人员。

Let's put the data into a temp table or table variable with your distinct query

select distinct customer_number, customer_name, customer_address, 
  IDENTITY(int, 1,1) AS ID_Num
into #temp 
from unprocessed_invoices

Personally I would add an indetity to unporcessed invoices if possible as well. I never do an import without creating a staging table that has an identity column just because it is easier to delete duplicate records.

Now let's query the table to find your problem records. I assume you would want to see what is causing the problem not just fail them.

Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number

You can use a variation on these queries to delete the problem records from #temp (depends on if you choose to keep one or delete all possible problems) and then insert from #temp to your production table. You can also porvide the problem records back to whoever is providing you data to be fixed at their end.

回复收藏 0 原文

百善笑为先 2024-07-30 08:42:25

Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number

Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number

回复收藏 0 原文

红颜悴 2024-07-30 08:42:24

你的方法有缺陷。您不希望成功存储的数据在选择时抛出错误 - 这是一个等待发生的地雷，意味着您永远不知道选择何时会失败。

我建议您向表中添加一个唯一的键，然后慢慢开始修改您的应用程序以使用该键，而不是依赖于有意义数据的任何组合。

然后，您就可以不再关心重复数据，因为这些数据本来就不是真正重复的。两个同名的人完全有可能共享同一个地址。

您还将通过这种方法获得性能改进。

顺便说一句，我强烈建议您标准化数据，即将名称分解为 FirstName 和 LastName（也可以选择 MiddleName），并将地址字段分解为每个组件的单独字段（Address1、Address2、City、State、Country）、邮政编码或其他）

更新：如果我正确理解您的情况（我不确定我是否正确），您希望防止表中出现重复的名称和地址组合（即使这在现实生活中可能发生）。最好通过对这两个字段设置唯一约束或索引来防止插入数据。也就是说，在插入之前捕获错误。这将告诉您导入文件或生成的应用程序逻辑很糟糕，然后您可以选择采取适当的措施。

我仍然认为，当你查询时抛出错误已经太晚了，无法采取任何措施。

回复收藏 0 原文

童话里做英雄 2024-07-30 08:42:24

标量子查询必须只返回一行（每个结果集行...），因此您可以执行以下操作：

select distinct
       customer_number,
       (
       select distinct
              customer_address
         from customers c2
        where c2.customer_number = c.customer_number
       ) as customer_address
  from customers c

A scalar sub-query must only return one row (per result set row...) so you could do something like:

select distinct
       customer_number,
       (
       select distinct
              customer_address
         from customers c2
        where c2.customer_number = c.customer_number
       ) as customer_address
  from customers c

回复收藏 0 原文

~没有更多了~