从数据库模型中消除 NULLable 列的选项(以避免 SQL 的三值逻辑)?

发布于 2024-09-06 04:42:26 字数 2546 浏览 6 评论 0 原文

不久前,我读了这本书SQL 和关系理论由 CJ 日期。作者因批评 SQL 的三值逻辑 (3VL) 而闻名。1)

作者提出了一些关于为什么在 SQL 中应避免 3VL 的强项观点,但是他没有概述如果不允许可为空的列,数据库模型会是什么样子。我对此进行了思考并提出了以下解决方案。如果我错过了其他设计选项,我想听听它们!

1) Date 对 SQL 3VL 的批评反过来也受到了批评:参见 Claude Rubinson 的这篇论文(包括 CJ Date 的原始评论)。


示例表:

作为示例,请使用下表,其中有一个可为 null 的列 (DateOfBirth):

#  +-------------------------------------------+
#  |                   People                  |
#  +------------+--------------+---------------+
#  |  PersonID  |  Name        |  DateOfBirth  |
#  +============+--------------+---------------+
#  |  1         |  Banana Man  |  NULL         |
#  +------------+--------------+---------------+

选项 1:模拟 NULL通过标志和默认值:

不是使列可以为空,而是指定任何默认值(例如 1900-01-01)。附加的 BOOLEAN 列将指定是否应简单地忽略 DateOfBirth 中的值,或者它是否实际包含数据。

#  +------------------------------------------------------------------+
#  |                              People'                             |
#  +------------+--------------+----------------------+---------------+
#  |  PersonID  |  Name        |  IsDateOfBirthKnown  |  DateOfBirth  |
#  +============+--------------+----------------------+---------------+
#  |  1         |  Banana Man  |  FALSE               |  1900-01-01   |
#  +------------+--------------+----------------------+---------------+

选项 2:将可为空的列转换为单独的表

可为空的列被新表 (DatesOfBirth) 替换。如果记录没有该列的数据,则新表中不会有记录:

#  +---------------------------+ 1    0..1 +----------------------------+
#  |         People'           | <-------> |         DatesOfBirth       |
#  +------------+--------------+           +------------+---------------+
#  |  PersonID  |  Name        |           |  PersonID  |  DateOfBirth  |
#  +============+--------------+           +============+---------------+
#  |  1         |  Banana Man  |
#  +------------+--------------+

虽然这似乎是更好的解决方案,但这可能会导致需要连接许多表以进行单个查询。由于不允许使用 OUTER JOIN(因为它们会将 NULL 引入到结果集中),因此可能无法再通过单个查询获取所有必要的数据和以前一样。


问题: 是否还有其他选项可以消除 NULL(如果有,它们是什么)?

Some while ago, I've been reading through the book SQL and Relational Theory by C. J. Date. The author is well-known for criticising SQL's three-valued logic (3VL).1)

The author makes some strong points about why 3VL should be avoided in SQL, however he doesn't outline how a database model would look like if nullable columns weren't allowed. I've thought on this for a bit and have come up with the following solutions. If I missed other design options, I would like to hear about them!

1) Date's critique of SQL's 3VL has in turn been criticized too: see this paper by Claude Rubinson (includes the original critique by C. J. Date).


Example table:

As an example, take the following table where we have one nullable column (DateOfBirth):

#  +-------------------------------------------+
#  |                   People                  |
#  +------------+--------------+---------------+
#  |  PersonID  |  Name        |  DateOfBirth  |
#  +============+--------------+---------------+
#  |  1         |  Banana Man  |  NULL         |
#  +------------+--------------+---------------+

Option 1: Emulating NULL through a flag and a default value:

Instead of making the column nullable, any default value is specified (e.g. 1900-01-01). An additional BOOLEAN column will specify whether the value in DateOfBirth should simply be ignored or whether it actually contains data.

#  +------------------------------------------------------------------+
#  |                              People'                             |
#  +------------+--------------+----------------------+---------------+
#  |  PersonID  |  Name        |  IsDateOfBirthKnown  |  DateOfBirth  |
#  +============+--------------+----------------------+---------------+
#  |  1         |  Banana Man  |  FALSE               |  1900-01-01   |
#  +------------+--------------+----------------------+---------------+

Option 2: Turning a nullable column into a separate table:

The nullable column is replaced by a new table (DatesOfBirth). If a record doesn't have data for that column, there won't be a record in the new table:

#  +---------------------------+ 1    0..1 +----------------------------+
#  |         People'           | <-------> |         DatesOfBirth       |
#  +------------+--------------+           +------------+---------------+
#  |  PersonID  |  Name        |           |  PersonID  |  DateOfBirth  |
#  +============+--------------+           +============+---------------+
#  |  1         |  Banana Man  |
#  +------------+--------------+

While this seems like the better solution, this would possibly result in many tables that need to be joined for a single query. Since OUTER JOINs won't be allowed (because they would introduce NULL into the result set), all the necessary data could possibly no longer be fetched with just a single query as before.


Question:
Are there any other options for eliminating NULL (and if so, what are they)?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

卖梦商人 2024-09-13 04:42:26

我看到 Date 的同事 Hugh Darwen 在一篇精彩的演讲“如何在不使用 NULL 的情况下处理缺失信息”中讨论了这个问题,该演讲可在 第三宣言网站

他的解决方案是第二种方法的变体。它是第六范式,其中的表保存出生日期和未知的标识符:

#  +-----------------------------+ 1    0..1 +----------------------------+
#  |         People'             | <-------> |         DatesOfBirth       |
#  +------------+----------------+           +------------+---------------+
#  |  PersonID  |  Name          |           |  PersonID  |  DateOfBirth  |
#  +============+----------------+           +============+---------------+
#  |  1         |  Banana Man    |           ! 2          | 20-MAY-1991   |
#  |  2         |  Satsuma Girl  |           +------------+---------------+
#  +------------+----------------+
#                                  1    0..1 +------------+
#                                  <-------> | DobUnknown |
#                                            +------------+
#                                            |  PersonID  |
#                                            +============+
#                                            | 1          |
#                                            +------------+

然后从人员中选择需要连接所有三个表,包括指示未知出生日期的样板。

当然,这有些理论性。目前 SQL 的状态仍然不够先进,无法处理所有这些问题。休的演讲涵盖了这些缺点。他提到的一件事并不完全正确:某些 SQL 风格确实支持多重赋值 - 例如 Oracle 的 INSERT ALL 语法

I saw Date's colleague Hugh Darwen discuss this issue in an excellent presentation "How To Handle Missing Information Without Using NULL", which is available on the Third Manifesto website.

His solution is a variant on your second approach. It's sixth normal form, with tables to hold both Date of Birth and identifiers where it is unknown:

#  +-----------------------------+ 1    0..1 +----------------------------+
#  |         People'             | <-------> |         DatesOfBirth       |
#  +------------+----------------+           +------------+---------------+
#  |  PersonID  |  Name          |           |  PersonID  |  DateOfBirth  |
#  +============+----------------+           +============+---------------+
#  |  1         |  Banana Man    |           ! 2          | 20-MAY-1991   |
#  |  2         |  Satsuma Girl  |           +------------+---------------+
#  +------------+----------------+
#                                  1    0..1 +------------+
#                                  <-------> | DobUnknown |
#                                            +------------+
#                                            |  PersonID  |
#                                            +============+
#                                            | 1          |
#                                            +------------+

Selecting from People then requires joining all three tables, including boilerplate to indicate the unknown Dates Of Birth.

Of course, this is somewhat theoretical. The state of SQL these days is still not sufficiently advanced to handle all this. Hugh's presentation covers these shortcomings. One thing he mentions is not entirely correct: some flavours of SQL do support multiple assignment - for instance Oracle's INSERT ALL syntax.

猥︴琐丶欲为 2024-09-13 04:42:26

我建议你选择选项 2。我相当肯定 Chris Date 也会这么做,因为本质上你所做的就是完全标准化为 6NFDate 共同负责引入的最高可能范式
我赞同推荐的 Darwen 的论文关于处理缺失信息。

因为不允许使用 OUTER JOIN(因为它们会引入 NULL
到结果集中),所有必要的数据可能不再
像以前一样只需一个查询即可获取。

…事实并非如此,但我同意 Darwen 论文中没有明确提及外连接问题;这是让我想要的一件事。明确的答案可以在 Date 的另一本书中找到……

首先,请注意 Date 和 Darwen 自己的真正关系语言 教程 D 只有一种连接类型,即自然连接。理由是实际上只需要一种连接类型。

我提到的 Date 书是一本优秀的 SQL 和关系理论:如何编写准确的SQL代码

4.6:关于外连接的评论:“相对而言,[外连接是]
一种霰弹枪婚姻:它迫使桌子进入一种联盟——是的,我
确实意味着联合,而不是联接——即使有问题的表无法
符合工会的通常要求...它这样做,在
效果,通过在执行之前用空值填充一个或两个表
工会,从而使他们符合那些通常的要求
毕竟。但没有理由不应该进行填充
使用正确的值而不是空值

使用您的示例和默认值“1900-01-01”作为“填充”,外部联接的替代方案可能如下所示:

SELECT p.PersonID, p.Name, b.DateOfBirth
  FROM Person AS p
       INNER JOIN BirthDate AS b
          ON p.PersonID = b.PersonID
UNION
SELECT p.PersonID, p.Name, '1900-01-01' AS DateOfBirth
  FROM Person AS p
 WHERE NOT EXISTS (
                   SELECT * 
                     FROM BirthDate AS b
                    WHERE p.PersonID = b.PersonID
                  );

Darwen 的论文散文两个显式表,例如 BirthDateBirthDateKnown,但 SQL 不会有太大不同,例如与 BirthDateKnown 的半连接代替上面与 BirthDate 的半差异。

请注意,上面使用 JOININNER JOIN 只是因为标准 SQL-92 NATURAL JOINUNION CORRESPONDING 并不广泛在现实生活中的 SQL 产品中实现(找不到引用,但 IIRC Darwen 在很大程度上负责将后两者纳入标准)。

进一步注意,上面的语法看起来很冗长,只是因为 SQL 一般来说很冗长。在纯关系代数中,它更像是(伪代码):

Person JOIN BirthDate UNION Person NOT MATCHING BirthDate ADD '1900-01-01' AS DateOfBirth;

I recommend you go for your option 2. I'm fairly certain Chris Date would too because essentially what you are doing is fully normalizing to 6NF, the highest possible normal form which Date was jointly responsible for introducing.
I second the recommended Darwen's paper on handling missing information.

Since OUTER JOINs won't be allowed (because they would introduce NULL
into the result set), all the necessary data could possibly no longer
be fetched with just a single query as before.

…this is not the case, but I agree the issue of outer join is not explicitly mentioned in the Darwen paper; it was the one thing that left me wanting. The explicit answer may be found in another of Date's book…

First, note that Date and Darwen's own truly relational language Tutorial D has but one join type being the natural join. The justification is that only one join type is actually needed.

The Date book I alluded to is the excellent SQL and Relational Theory: How to Write Accurate SQL Code:

4.6: A Remark on Outer Join: "Relationally speaking, [outer join is] a
kind of shotgun marriage: It forces tables into a kind of union—yes, I
do mean union, not join—even when the tables in question fail to
conform to the usual requirements for union... It does this, in
effect, by padding one or both of the tables with nulls before doing
the union, thereby making them conform to those usual requirements
after all. But there's no reason why that padding shouldn't be done
with proper values instead of nulls

Using your example and default value '1900-01-01' as 'padding', the alternative to outer join could look like this:

SELECT p.PersonID, p.Name, b.DateOfBirth
  FROM Person AS p
       INNER JOIN BirthDate AS b
          ON p.PersonID = b.PersonID
UNION
SELECT p.PersonID, p.Name, '1900-01-01' AS DateOfBirth
  FROM Person AS p
 WHERE NOT EXISTS (
                   SELECT * 
                     FROM BirthDate AS b
                    WHERE p.PersonID = b.PersonID
                  );

Darwen's paper proses two explicit tables, say BirthDate and BirthDateKnown, but the SQL would not be much different e.g. a semi join to BirthDateKnown in place of the semi difference to BirthDate above.

Note the above uses JOIN and INNER JOIN only because Standard SQL-92 NATURAL JOIN and UNION CORRESPONDING are not widely implemented in real life SQL products (can't find a citation but IIRC Darwen was largely responsible for the latter two making it into the Standard).

Further note the above syntax looks long-winded only because SQL in general is long-winded. In pure relational algebra it is more like (pseudo code):

Person JOIN BirthDate UNION Person NOT MATCHING BirthDate ADD '1900-01-01' AS DateOfBirth;
煮酒 2024-09-13 04:42:26

我还没有读过,但是在 如何使用 S-by-C 处理缺失信息的文章="nofollow noreferrer">第三宣言 网站,由 Hugh Darwen 和 CJ Date 运营。
这不是 CJ Date 写的,但我认为由于这是该网站上的文章之一,因此可能与他的观点相似。

I haven't read it, but there's an article called How To Handle Missing Information Using S-by-C on the the Third Manifesto website that's run by Hugh Darwen and C.J. Date.
This isn't written by C.J. Date, but I'd assume that since it's one of the articles on that website it's probably similar to his opinions.

你与清晨阳光 2024-09-13 04:42:26

一种替代方案可能是 entity-attribute-value 模型:

 entity  attribute    value
 1       name         Banana Man
 1       birthdate    1968-06-20

如果出生日期是未知,您只需省略其行即可。

One alternative may be the entity-attribute-value model:

 entity  attribute    value
 1       name         Banana Man
 1       birthdate    1968-06-20

If the birthdate was unknown, you'd just omit its row.

谜泪 2024-09-13 04:42:26

选项 3:记录编写者的责任:

CREATE TABLE Person
(
  PersonId int PRIMARY KEY IDENTITY(1,1),
  Name nvarchar(100) NOT NULL,
  DateOfBirth datetime NOT NULL
)

当您的目标是消除空表示时,为什么要扭曲模型以允许空表示?

Option 3: Onus on the record writer:

CREATE TABLE Person
(
  PersonId int PRIMARY KEY IDENTITY(1,1),
  Name nvarchar(100) NOT NULL,
  DateOfBirth datetime NOT NULL
)

Why contort a model to allow null representation when your goal is to eliminate them?

追星践月 2024-09-13 04:42:26

您还可以使用 < 消除输出中的 null代码>合并

SELECT personid  /*primary key, will never be null here*/
       , COALESCE(name, 'no name') as name
       , COALESCE(birthdate,'no date') as birthdate
FROM people

并非所有数据库都支持 COALESCE,但几乎所有数据库都有一个名为
的后备选项
IFNULL(arg1, arg2) 或类似的东西会执行相同的操作(但仅适用于 2 个参数)

You can eliminate null in the output as well by using COALESCE.

SELECT personid  /*primary key, will never be null here*/
       , COALESCE(name, 'no name') as name
       , COALESCE(birthdate,'no date') as birthdate
FROM people

Not all databases support COALESCE, but almost all have a fallback option called
IFNULL(arg1, arg2) or something simular that will do the same (but only for 2 arguments).

陪你到最终 2024-09-13 04:42:26

一种选择是使用显式的选项类型,类似于Haskell的Maybe函子。

不幸的是,许多现有的 SQL 实现对用户定义的代数数据类型的支持很差,甚至对用户定义的类型构造函数的支持也很差,而您确实需要干净地完成此操作。

这只会为那些您明确要求的属性恢复某种“null”,但不会出现 null 愚蠢的三值逻辑。 Nothing == NothingTrue,而不是 unknownnull

当由于多种原因导致信息丢失时,对用户定义的代数类型的支持也会有所帮助,例如,对于明显的应用程序来说,相当于以下 Haskell 类型的数据库将是一个很好的解决方案:(

data EmploymentStatus = Employed EmployerID | Unemployed | Unknown

当然,支持此功能的数据库也需要以支持随之而来的比平常更复杂的外键约束。)

除此之外,我同意 APConedaywhen 关于 6NF 的回答。

One option is to use explicit option types, analogous to Haskell's Maybe functor.

Unfortunately a lot of existing SQL implementations have poor support for user-defined algebraic data types and even poorer support for user-defined type constructors that you really need to do this cleanly.

This recovers a sort of "null" for only those attributes where you explicitly ask for it, but without null's silly three-valued logic. Nothing == Nothing is True, not unknown or null.

Support for user-defined algebraic types also helps when there are a few reasons for missing information, for example a database equivalent of the following Haskell type would be a good solution for the obvious application:

data EmploymentStatus = Employed EmployerID | Unemployed | Unknown

(Of course, a database supporting this would also need to support the more-complicated-than-usual foreign key constraint that comes with it.)

Short of this, I agree with APC's and onedaywhen's answers about 6NF.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文