带有 NULL 的唯一键
这个问题需要一些假设的背景。让我们考虑一个 employee
表,其中包含列 name
、date_of_birth
、title
、salary
,使用 MySQL 作为 RDBMS。因为如果任何给定的人与另一个人具有相同的名字和出生日期,那么根据定义,他们是同一个人(除非有令人惊奇的巧合,我们有两个人都名叫亚伯拉罕·林肯,出生于 1809 年 2 月 12 日),所以我们将name
和 date_of_birth
上的唯一键,表示“不要存储同一个人两次”。现在考虑这个数据:
id name date_of_birth title salary
1 John Smith 1960-10-02 President 500,000
2 Jane Doe 1982-05-05 Accountant 80,000
3 Jim Johnson NULL Office Manager 40,000
4 Tim Smith 1899-04-11 Janitor 95,000
如果我现在尝试运行以下语句,它应该并且将会失败:
INSERT INTO employee (name, date_of_birth, title, salary)
VALUES ('Tim Smith', '1899-04-11', 'Janitor', '95,000')
如果我尝试这个,它将成功:
INSERT INTO employee (name, title, salary)
VALUES ('Jim Johnson', 'Office Manager', '40,000')
现在我的数据将如下所示:
id name date_of_birth title salary
1 John Smith 1960-10-02 President 500,000
2 Jane Doe 1982-05-05 Accountant 80,000
3 Jim Johnson NULL Office Manager 40,000
4 Tim Smith 1899-04-11 Janitor 95,000
5 Jim Johnson NULL Office Manager 40,000
这不是我想要的,但我可以'并不是说我完全不同意所发生的事情。如果我们用数学集合来讨论,
{'Tim Smith', '1899-04-11'} = {'Tim Smith', '1899-04-11'} <-- TRUE
{'Tim Smith', '1899-04-11'} = {'Jane Doe', '1982-05-05'} <-- FALSE
{'Tim Smith', '1899-04-11'} = {'Jim Johnson', NULL} <-- UNKNOWN
{'Jim Johnson', NULL} = {'Jim Johnson', NULL} <-- UNKNOWN
我的猜测是 MySQL 会说:“因为我不知道出生日期为 NULL 的吉姆·约翰逊 (Jim Johnson) 不在其中这桌,我加他吧。”
我的问题是:即使date_of_birth
并不总是已知,如何防止重复?到目前为止我想出的最好办法是移动date_of_birth
code> 到不同的表。然而,这样做的问题是,我最终可能会遇到两个收银员,他们的姓名、头衔和工资相同,出生日期不同,而且无法在不重复的情况下存储他们。
This question requires some hypothetical background. Let's consider an employee
table that has columns name
, date_of_birth
, title
, salary
, using MySQL as the RDBMS. Since if any given person has the same name and birth date as another person, they are, by definition, the same person (barring amazing coincidences where we have two people named Abraham Lincoln born on February 12, 1809), we'll put a unique key on name
and date_of_birth
that means "don't store the same person twice." Now consider this data:
id name date_of_birth title salary
1 John Smith 1960-10-02 President 500,000
2 Jane Doe 1982-05-05 Accountant 80,000
3 Jim Johnson NULL Office Manager 40,000
4 Tim Smith 1899-04-11 Janitor 95,000
If I now try to run the following statement, it should and will fail:
INSERT INTO employee (name, date_of_birth, title, salary)
VALUES ('Tim Smith', '1899-04-11', 'Janitor', '95,000')
If I try this one, it will succeed:
INSERT INTO employee (name, title, salary)
VALUES ('Jim Johnson', 'Office Manager', '40,000')
And now my data will look like this:
id name date_of_birth title salary
1 John Smith 1960-10-02 President 500,000
2 Jane Doe 1982-05-05 Accountant 80,000
3 Jim Johnson NULL Office Manager 40,000
4 Tim Smith 1899-04-11 Janitor 95,000
5 Jim Johnson NULL Office Manager 40,000
This is not what I want but I can't say I entirely disagree with what happened. If we talk in terms of mathematical sets,
{'Tim Smith', '1899-04-11'} = {'Tim Smith', '1899-04-11'} <-- TRUE
{'Tim Smith', '1899-04-11'} = {'Jane Doe', '1982-05-05'} <-- FALSE
{'Tim Smith', '1899-04-11'} = {'Jim Johnson', NULL} <-- UNKNOWN
{'Jim Johnson', NULL} = {'Jim Johnson', NULL} <-- UNKNOWN
My guess is that MySQL says, "Since I don't know that Jim Johnson with a NULL
birth date isn't already in this table, I'll add him."
My question is: How can I prevent duplicates even though date_of_birth
is not always known? The best I've come up with so far is to move date_of_birth
to a different table. The problem with that, however, is that I might end up with, say, two cashiers with the same name, title and salary, different birth dates and no way to store them both without having duplicates.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(11)
唯一密钥的一个基本属性是
它必须是独一无二的。将该键的一部分设置为 Null 会破坏此属性。
您的问题有两种可能的解决方案:
一种方法,错误的方法,是使用一些神奇的日期来表示未知。这只会让你过去
DBMS 的“问题”,但并没有从逻辑意义上解决问题。
预计两个日期未知的“John Smith”条目会出现问题
出生的。这些人是同一个人还是独一无二的个体?
如果你知道它们是不同的,那么你又回到了同样的老问题 -
您的唯一密钥并不唯一。甚至不要考虑分配整个范围的神奇日期
代表“未知”——这确实是通往地狱的道路。
更好的方法是创建 EmployeeId 属性作为代理键。这只是一个
您分配给您知道唯一的个人的任意标识符。这
标识符通常只是一个整数值。
然后创建一个 Employee 表来关联 EmployeeId(唯一的、不可为空的
key)到您认为的依赖属性,在这种情况下
姓名和出生日期(其中任何一项都可以为空)。在您所在的任何地方使用 EmployeeId 代理键
以前使用过姓名/出生日期。这会向您的系统添加一个新表,但是
以稳健的方式解决了未知值的问题。
A fundamental property of a unique key is that
it must be unique. Making part of that key Nullable destroys this property.
There are two possible solutions to your problem:
One way, the wrong way, would be to use some magic date to represent unknown. This just gets you past
the DBMS "problem" but does not solve the problem in a logical sense.
Expect problems with two "John Smith" entries having unknown dates
of birth. Are these guys one and the same or are they unique individuals?
If you know they are different then you are back to the same old problem -
your Unique Key just isn't unique. Don't even think about assigning a whole range of magic dates
to represent "unknown" - this is truly the road to hell.
A better way is to create an EmployeeId attribute as a surrogate key. This is just an
arbitrary identifier that you assign to individuals that you know are unique. This
identifier is often just an integer value.
Then create an Employee table to relate the EmployeeId (unique, non-nullable
key) to what you believe are the dependant attributers, in this case
Name and Date of Birth (any of which may be nullable). Use the EmployeeId surrogate key everywhere that you
previously used the Name/Date-of-Birth. This adds a new table to your system but
solves the problem of unknown values in a robust manner.
我认为 MySQL 就在这里做到了。其他一些数据库(例如 Microsoft SQL Server)将 NULL 视为只能在 UNIQUE 列中插入一次的值,但我个人认为这是奇怪且意外的行为。
但是,由于这是您想要的,因此您可以使用一些“神奇”值而不是 NULL,例如过去很长时间的日期
I think MySQL does it right here. Some other databases (for example Microsoft SQL Server) treat NULL as a value that can only be inserted once into a UNIQUE column, but personally I find this to be strange and unexpected behaviour.
However since this is what you want, you can use some "magic" value instead of NULL, such as a date a long time in the past
我建议创建额外的表列
checksum
,其中包含name
和date_of_birth
的 md5 哈希值。删除唯一键(name, date_of_birth)
因为它不能解决问题。在校验和上创建一个唯一的密钥。该解决方案产生的技术开销很小,因为您需要为每个插入的对生成散列(每个搜索查询都是相同的)。为了进一步改进,您可以添加触发器,该触发器将在每次插入中为您生成哈希:
I recommend to create additional table column
checksum
which will contain md5 hash ofname
anddate_of_birth
. Drop unique key(name, date_of_birth)
because it doesn't solve the problem. Create one unique key on checksum.This solution creates small technical overhead, cause for every inserted pairs you need to generate hash (same thing for every search query). For further improvements you can add trigger that will generate hash for you in every insert:
您的基于名称不重复的问题无法解决,因为您没有自然键。为出生日期未知的人输入虚假日期并不能解决您的问题。出生于 1900/01/01 的约翰·史密斯与出生于 1960/03/09 的约翰·史密斯仍然是不同的人。
我每天都会处理来自大大小小的组织的姓名数据,我可以向您保证,他们始终有两个不同的人具有相同的名字。有时具有相同的职称。出生日期也不能保证唯一性,许多约翰·史密斯出生在同一天。当我们使用医生办公室数据时,我们经常有两名医生具有相同的姓名、地址和电话号码(父子组合)。
如果您要插入员工数据来唯一地识别每个员工,最好的选择是拥有员工 ID。然后检查用户界面中的唯一名称,如果有一个或多个匹配项,则询问用户是否是这个意思,如果他说不是,则插入该记录。然后构建一个 deupping 流程来修复某人意外分配了两个 id 时出现的问题。
Your problem of not having duplicates based on name is not solvable because you do not have a natural key. Putting a fake date in for people whose date of birth is unknown will not solve your problem. John Smith born 1900/01/01 is still going to be a differnt person than John Smithh born 1960/03/09.
I work with name data from large and small organizations every day and I can assure you they have two different people with the same name all the time. Sometimes with the same job title. Birthdate is no guarantee of uniqueness either, plenty of John Smiths born on the same date. Heck when we work with physicians office data we have often have two doctors with the same name, address and phone number (father and son combinations)
Your best bet is to have an employee ID if you are inserting employee data to identify each employee uniquely. Then check for the uniquename in the user interface and if there are one or more matches, ask the user if he meant them and if he says no, insert the record. Then build a deupping process to fix problems if someone gets assigned two ids by accident.
还有另一种方法可以做到这一点。添加一列(不可为空)来表示 date_of_birth 列的字符串值。如果 date_of_birth 为 null,则新列值将为“”(空字符串)。
我们将该列命名为date_of_birth_str,并创建一个唯一约束employee(name, date_of_birth_str)。因此,当两条记录具有相同名称且 date_of_birth 值为空时,唯一约束仍然有效。
但对于两个相同含义的色谱柱的维护工作量以及新色谱柱的性能危害,应慎重考虑。
There is another way to do it. Adding a column(non-nullable) to represent the String value of date_of_birth column. The new column value would be ""(empty string) if date_of_birth is null.
We name the column as date_of_birth_str and create a unique constraint employee(name, date_of_birth_str). So when two records come with the same name and null date_of_birth value, the unique constraint still works.
But the efforts of maintenance for the two same-meaning columns, and, the performance harm of new column, should be considered carefully.
您可以添加一个生成列,其中
NULL
值被未使用的常量(例如零)替换。然后您可以对该列应用唯一约束:You can add a generated column where the
NULL
value is replaced by an unused constant, e.g. zero. Then you can apply the unique constraint to this column:完美的解决方案是支持基于函数的 UK,但这会变得更加复杂,因为 mySQL 还需要支持基于函数的索引。这将防止需要使用“假”值来代替 NULL,同时还允许开发人员决定如何处理 UK 中的 NULL 值。不幸的是,据我所知,mySQL 目前不支持此类功能,因此我们只能采取解决方法。
(注意在唯一键定义中使用 IFNULL() 函数)
The perfect solution would be support for function based UK's, but that becomes more complex as mySQL would also then need to support function based indexes. This would prevent the need to use "fake" values in place of NULL, while also allowing developers the ability to decide how to treat NULL values in UK's. Unfortunately, mySQL doesn't currently support such functionality that I am aware of, so we're left with workarounds.
(Note the use of the IFNULL() function in the unique key definition)
我遇到了与此类似的问题,但有所不同。就您而言,每个员工都有生日,尽管可能未知。在这种情况下,系统为生日未知但其他信息相同的员工分配两个值是合乎逻辑的。 NealB 接受的答案非常准确。
然而,我遇到的问题是数据字段不一定有值。例如,如果您在表中添加了“name_of_spouse”字段,则表的每一行不一定都有一个值。在这种情况下,NealB 的第一个要点(“错误的方式”)实际上是有道理的。在这种情况下,对于没有已知配偶的每一行,应在 name_of_spouse 列中插入字符串“None”。
我遇到这个问题的情况是编写一个带有数据库的程序来对 IP 流量进行分类。目标是创建专用网络上的 IP 流量图表。每个数据包都被放入一个数据库表中,该表具有基于其 IP 源和目标、端口源和目标、传输协议和应用程序协议的唯一连接索引。然而,许多数据包根本没有应用程序协议。例如,所有没有应用协议的 TCP 数据包应归为一类,并且应在连接索引中占据一个唯一的条目。这是因为我希望这些数据包形成我的图的单个边。在这种情况下,我采纳了上面的建议,并在应用程序协议字段中存储了一个字符串“None”,以确保这些数据包形成一个唯一的组。
I had a similar problem to this, but with a twist. In your case, every employee has a birthday, although it may be unknown. In that case, it makes logical sense for the system to assign two values for employees with unknown birthdays but otherwise identical information. NealB's accepted answer is very accurate.
However, the problem I encountered was one in which the data field did not necessarily have a value. For example, if you added a 'name_of_spouse' field to your table, there wouldn't necessarily be a value for each row of the table. In that case, NealB's first bullet point (the 'wrong way') actually makes sense. In this case, a string 'None' should be inserted in the column name_of_spouse for each row in which there was no known spouse.
The situation where I ran into this problem was in writing a program with database to classify IP traffic. The goal was to create a graph of IP traffic on a private network. Each packet was put into a database table with a unique connection index based on its ip source and dest, port source and dest, transport protocol, and application protocol. However, many packets simply don't have an application protocol. For example, all TCP packets without an application protocol should be classed together, and should occupy one unique entry in the connections index. This is because I want those packets to form a single edge of my graph. In this situation, I took my own advice from above, and stored a string 'None' in the application protocol field to ensure that these packets formed a unique group.
我正在寻找一种解决方案,Alexander Yancharuk 建议对我来说是个好主意。但就我而言,我的列是外键,employee_id 可以为空。
我有这样的结构:
并且带有employee_id NULL的room_id不能重复
我解决了在插入之前添加触发器的问题,如下所示:
我还添加了room_id和<的唯一约束强>employee_id
I were looking for one solution and the Alexander Yancharuk suggested was good idea for me. But in my case my columns are foreign keys and employee_id can be null.
I have this structure:
And the room_id with employee_id NULL can not be duplicated
I solved adding a trigger before insert, like this:
I also added a constraint unique for room_id and employee_id
我认为这里的根本问题是你真正的意思
您自己对人的定义是姓名和出生日期,那么此语句在该上下文中意味着什么?我想说,解决你的问题的方法是通过在你的 name 和 date_of_birth 列上添加 NOT NULL 来禁止插入一半身份,就像上面的那样。这样,该语句将失败并强制您输入完整的身份,并且唯一密钥将发挥作用以防止您两次输入同一个人。
I think the fundamental question here is what you actually mean with
Your own definition of a person is name AND birth date, so what does this statement mean in that context? I'd say that the solution to your problem is to prohibit inserting half identities, like the one above, by adding NOT NULL on both your name and date_of_birth columns. That way, the statement will fail and force you to enter complete identities and the unique key will do its job to prevent you from entering the same person twice.
简单来说,唯一约束的作用就是使字段或列。
null 会破坏此属性,因为数据库将 null 视为未知
为了避免重复并允许 null:
In simple words,the role of Unique constraint is to make the field or column.
The null destroys this property as database treats null as unknown
Inorder to avoid duplicates and allow null: