通过 SQL 查询安全地标准化数据
假设我有一个客户表:
CREATE TABLE customers (
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
该表没有有主键。 但是,对于任何给定的 customer_number
,customer_name
和 customer_address
应该是唯一的。
该表包含许多重复客户的情况并不罕见。 为了避免这种重复,使用以下查询来仅隔离唯一的客户:
SELECT
DISTINCT customer_number, customer_name, customer_address
FROM customers
幸运的是,该表传统上包含准确的数据。 也就是说,任何 customer_number
都不会出现冲突的 customer_name
或 customer_address
。 然而,假设表中确实出现了冲突的数据。 我希望编写一个失败的查询,而不是返回相关 customer_number
的多行。
例如,我尝试了这个查询,但没有成功:
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number
Is there a way to write such a query using standard SQL? 如果没有的话,Oracle特定的SQL有解决方案吗?
编辑:奇怪查询背后的基本原理:
说实话,这个客户表实际上并不存在(谢天谢地)。 我创建它是希望它能够足够清晰地展示查询的需求。 然而,基于该示例,人们(幸运的是)意识到对此类查询的需求是我最不用担心的。 因此,我现在必须剥离一些抽象,并希望恢复我提出如此令人厌恶的表格的声誉......
我从外部系统收到一个包含发票(每行一张)的平面文件。 我逐行读取该文件,将其字段插入到该表中:
CREATE TABLE unprocessed_invoices (
invoice_number INTEGER,
invoice_date DATE,
...
// other invoice columns
...
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
如您所见,从外部系统到达的数据是非规范化的。 也就是说,外部系统在同一行上包含发票数据及其关联的客户数据。 多张发票可能共享同一客户,因此可能存在重复的客户数据。
在保证所有客户都在系统中注册之前,系统无法开始处理发票。 因此,系统必须识别唯一客户并根据需要对其进行注册。 这就是我想要查询的原因:因为我正在处理我无法控制的非规范化数据。
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number
希望这有助于澄清问题的初衷。
编辑:好/坏数据的示例
澄清一下:customer_name
和 customer_address
仅对于特定 customer_number 来说必须是唯一的
。
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
3 | 'Fred' | '456 Avenue'
3 | 'Fred' | '789 Crescent'
前两行很好,因为 customer_number
1 的 customer_name
和 customer_address
是相同的。
中间两行很好,因为它是相同的customer_number
2 的 customer_name
和 customer_address
(即使另一个 customer_number
具有相同的 customer_name
和customer_address
)。
最后两行不行,因为customer_number
有两个不同的customer_address
es 3。
如果运行的话,我正在查找的查询将会失败所有这六行。 但是,如果仅前四行实际存在,则视图应返回:
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
我希望这能澄清我所说的“customer_name
和 customer_address
冲突”的含义。 每个 customer_number
必须是唯一的。
我很欣赏那些解释如何正确从外部系统导入数据的人。事实上,我已经做了其中的大部分工作。 我故意隐藏了我正在做的事情的所有细节,以便更容易专注于手头的问题。 此查询并不是唯一的验证形式。 我只是认为这会起到很好的画龙点睛作用(可以说是最后的防御)。 这个问题只是为了调查 SQL 的可能性。 :)
Suppose I have a table of customers:
CREATE TABLE customers (
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
This table does not have a primary key. However, customer_name
and customer_address
should be unique for any given customer_number
.
It is not uncommon for this table to contain many duplicate customers. To get around this duplication, the following query is used to isolate only the unique customers:
SELECT
DISTINCT customer_number, customer_name, customer_address
FROM customers
Fortunately, the table has traditionally contained accurate data. That is, there has never been a conflicting customer_name
or customer_address
for any customer_number
. However, suppose conflicting data did make it into the table. I wish to write a query that will fail, rather than returning multiple rows for the customer_number
in question.
For example, I tried this query with no success:
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number
Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?
EDIT: The rationale behind the bizarre query:
Truth be told, this customers table does not actually exist (thank goodness). I created it hoping that it would be clear enough to demonstrate the needs of the query. However, people are (fortunately) catching on that the need for such a query is the least of my worries, based on that example. Therefore, I must now peel away some of the abstraction and hopefully restore my reputation for suggesting such an abomination of a table...
I receive a flat file containing invoices (one per line) from an external system. I read this file, line-by-line, inserting its fields into this table:
CREATE TABLE unprocessed_invoices (
invoice_number INTEGER,
invoice_date DATE,
...
// other invoice columns
...
customer_number INTEGER,
customer_name VARCHAR(...),
customer_address VARCHAR(...)
)
As you can see, the data arriving from the external system is denormalized. That is, the external system includes both the invoice data and its associated customer data on the same line. It is possible that multiple invoices will share the same customer, therefore it is possible to have duplicate customer data.
The system cannot begin processing the invoices until all customers are guaranteed to be registered with the system. Therefore, the system must identify the unique customers and register them as necessary. This is why I wanted the query: because I was working with denormalized data I had no control over.
SELECT
customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number
Hopefully this helps clarify the original intent of the question.
EDIT: Examples of good/bad data
To clarify: customer_name
and customer_address
only have to be unique for a particular customer_number
.
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
3 | 'Fred' | '456 Avenue'
3 | 'Fred' | '789 Crescent'
The first two rows are fine because it is the same customer_name
and customer_address
for customer_number
1.
The middle two rows are fine because it is the same customer_name
and customer_address
for customer_number
2 (even though another customer_number
has the same customer_name
and customer_address
).
The last two rows are not okay because there are two different customer_address
es for customer_number
3.
The query I am looking for would fail if run against all six of these rows. However, if only the first four rows actually existed, the view should return:
customer_number | customer_name | customer_address
----------------------------------------------------
1 | 'Bob' | '123 Street'
2 | 'Bob' | '123 Street'
I hope this clarifies what I meant by "conflicting customer_name
and customer_address
". They have to be unique per customer_number
.
I appreciate those that are explaining how to properly import data from external systems. In fact, I am already doing most of that already. I purposely hid all the details of what I'm doing so that it would be easier to focus on the question at hand. This query is not meant to be the only form of validation. I just thought it would make a nice finishing touch (a last defense, so to speak). This question was simply designed to investigate just what was possible with SQL. :)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
使查询失败可能很棘手...
这将显示表中是否有重复记录:
如果您只是为所有三个字段添加唯一索引,则没有人可以在表中创建重复记录。
Making the query fail may be tricky...
This will show you if there are any duplicate records in the table:
If you just add a unique index for all the three fields, noone can create a duplicate record in the table.
事实上的键是名称+地址,所以这就是您需要分组的依据。
如果您想从 Customer_Number 的角度进行操作,那么这也很好。
The defacto key is Name+Address, so that's what you need to group by.
If you want to do it from the point of view of a Customer_Number, then this is good too.
如果你有脏数据,我会先清理它。
用它来查找重复的客户记录...
If you have dirty data, I would clean it up first.
Use this to find the duplicate customer records...
如果你希望它失败,你将需要有一个索引。 如果您不想有索引,那么您可以创建一个临时表来完成这一切。
)
如果存在问题,这将失败,但会防止您的重复记录引起问题。
If you want it to fail you're going to need to have an index. If you don't want to have an index, then you can just create a temp table to do this all in.
)
This will fail if there are issues but will keep your duplicate records from causing issues.
让我们将数据放入临时表或带有不同查询的表变量中。
如果可能的话,我个人也会向未处理的发票添加不确定性。 我从来不会在不创建具有标识列的临时表的情况下进行导入,只是因为删除重复记录更容易。
现在让我们查询表来查找您的问题记录。 我想你会想看看是什么原因导致了问题,而不仅仅是让他们失望。
您可以使用这些查询的变体从 #temp 中删除问题记录(取决于您选择保留一个问题还是删除所有可能的问题),然后从 #temp 插入到生产表中。 您还可以将问题记录提供给向您提供数据以供最终修复的人员。
Let's put the data into a temp table or table variable with your distinct query
Personally I would add an indetity to unporcessed invoices if possible as well. I never do an import without creating a staging table that has an identity column just because it is easier to delete duplicate records.
Now let's query the table to find your problem records. I assume you would want to see what is causing the problem not just fail them.
You can use a variation on these queries to delete the problem records from #temp (depends on if you choose to keep one or delete all possible problems) and then insert from #temp to your production table. You can also porvide the problem records back to whoever is providing you data to be fixed at their end.
你的方法有缺陷。 您不希望成功存储的数据在选择时抛出错误 - 这是一个等待发生的地雷,意味着您永远不知道选择何时会失败。
我建议您向表中添加一个唯一的键,然后慢慢开始修改您的应用程序以使用该键,而不是依赖于有意义数据的任何组合。
然后,您就可以不再关心重复数据,因为这些数据本来就不是真正重复的。 两个同名的人完全有可能共享同一个地址。
您还将通过这种方法获得性能改进。
顺便说一句,我强烈建议您标准化数据,即将名称分解为 FirstName 和 LastName(也可以选择 MiddleName),并将地址字段分解为每个组件的单独字段(Address1、Address2、City、State、Country) 、邮政编码或其他)
更新:如果我正确理解您的情况(我不确定我是否正确),您希望防止表中出现重复的名称和地址组合(即使这在现实生活中可能发生)。 最好通过对这两个字段设置唯一约束或索引来防止插入数据。 也就是说,在插入之前捕获错误。 这将告诉您导入文件或生成的应用程序逻辑很糟糕,然后您可以选择采取适当的措施。
我仍然认为,当你查询时抛出错误已经太晚了,无法采取任何措施。
Your approach is flawed. You do not want data that was successfully able to be stored to then throw an error on a select - that is a land mine waiting to happen and means you never know when a select could fail.
What I recommend is that you add a unique key to the table, and slowly start modifying your application to use this key rather than relying on any combination of meaningful data.
You can then stop caring about duplicate data, which is not really duplicate in the first place. It is entirely possible for two people with the same name to share the same address.
You will also gain performance improvements from this approach.
As an aside, I highly recommend you normalize your data, that is break up the name into FirstName and LastName (optionally MiddleName too), and break up the address field into separate fields for each component (Address1, Address2, City, State, Country, Zip, or whatever)
Update: If I understand your situation correctly (which I am not sure I do), you want to prevent duplicate combinations of name and address from ever occurring in the table (even though that is a possible occurrence in real life). This is best done by a unique constraint or index on these two fields to prevent the data from being inserted. That is, catch the error before you insert it. That will tell you the import file or your resulting app logic is bad and you can choose to take the appropriate measures then.
I still maintain that throwing the error when you query is too late in the game to do anything about it.
标量子查询必须只返回一行(每个结果集行...),因此您可以执行以下操作:
A scalar sub-query must only return one row (per result set row...) so you could do something like: