数据库设计:用时间戳列替换布尔列?
早些时候,我以这种方式创建了表:
create table workflow (
id number primary key,
name varchar2(100 char) not null,
is_finished number(1) default 0 not null,
date_finished date
);
列 is_finished 指示工作流程是否完成。 date_finished 列是工作流程完成的时间。
然后我有了一个想法“我不需要 is_finished 因为我可以说:其中 data_finished 不为空”,并且我设计时没有 is_finished 列:(
create table workflow (
id number primary key,
name varchar2(100 char) not null,
date_finished date
);
我们使用 Oracle 10)
这是一个好主意还是坏主意?我听说不能在具有 NULL 值的列上建立索引,因此 where data_finished is not null
在大表上会非常慢。
Earlier I have created tables this way:
create table workflow (
id number primary key,
name varchar2(100 char) not null,
is_finished number(1) default 0 not null,
date_finished date
);
Column is_finished indicates whether the workflow finished or not. Column date_finished is when the workflow was finished.
Then I had the idea "I don't need is_finished as I can just say: where data_finished is not null", and I designed without is_finished column:
create table workflow (
id number primary key,
name varchar2(100 char) not null,
date_finished date
);
(We use Oracle 10)
Is it a good or bad idea? I've heard you can not have an index on a column with NULL values, so where data_finished is not null
will be very slow on big tables.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
好主意。
您已经消除了冗余列占用的空间; DATE 列有双重作用——您知道工作何时完成以及何时。
这是不正确的。 Oracle 索引忽略 NULL 值。
您可以创建基于函数的索引,以避免未索引的 NULL 值,但我遇到的大多数 DBA 确实不喜欢他们,所以要做好战斗的准备。
Good idea.
You've eliminated space taken by a redundant column; the DATE column serves double duty--you know the work was finished, and when.
That's incorrect. Oracle indexes ignore NULL values.
You can create a function based index in order to get around the NULL values not being indexed, but most DBAs I've encountered really don't like them so be prepared for a fight.
有一种正确的方法来索引空值,并且它不使用 FBI。 Oracle将索引空值,但不会索引树中的空LEAF值。因此,您可以消除列
is_finished
并创建这样的索引。然后,如果您检查此查询的解释计划:
您可能会看到正在使用的索引(如果优化器满意)。
回到最初的问题:看看这里的各种答案,我认为没有正确答案。我个人可能倾向于在不必要的情况下删除列,但我也不喜欢过多地解释列的含义。这里有两个概念:
is_finished
date_finished
也许您需要将它们分开,也许您不需要。当我考虑删除
is_finished
列时,我很烦恼。接下来,可能会出现记录完成的情况,但您不知道确切的时间。也许您必须从其他来源导入数据并且日期未知。当然,这不属于现在的业务需求,但情况发生了变化。那你怎么办?好吧,您必须在date_finished
列中放入一些虚拟值,现在您已经对数据进行了一些妥协。并不可怕,但那里有一个摩擦。当我做那样的事情时,我脑子里的小声音在喊你做错了。我的建议是,将其分开。您正在谈论一个很小的列和一个非常瘦的索引。存储在这里不应该成为问题。
There is a right way to index null values, and it doesn't use a FBI. Oracle will index null values, but it will NOT index null LEAF values in the tree. So, you could eliminate the column
is_finished
and create the index like this.Then, if you check the explain plan on this query:
You might see the index being used (if the optimizer is happy).
Back to the original question: looking at the variety of answers here, I think there is no right answer. I may have a personal preference to eliminate a column if it is unnecessary, but I also don't like overloading the meaning of columns either. There are two concepts here:
is_finished
date_finished
Maybe you need to keep these separate, maybe you don't. When I think about eliminating the
is_finished
column, it bothers me. Down the road, the situation may arise where the record finished, but you don't know precisely when. Perhaps you have to import data from another source and the date is unknown. Sure, that's not in the business requirements now, but things change. What do you do then? Well, you have to put some dummy value in thedate_finished
column, and now you've compromised the data a bit. Not horribly, but there is a rub there. The little voice in my head is shouting YOU'RE DOING IT WRONG when I do things like that.My advice, keep it separate. You're talking about a tiny column and a very skinny index. Storage should not be an issue here.
对于所有认为该专栏浪费空间的人:
双重职责在数据库中并不是一件好事。您的主要目标应该是清晰。许多系统、工具、人们都会使用你的数据。如果您通过将含义隐藏在其他列中来掩饰值,那么您就是在乞求其他系统或用户弄错。
任何认为它可以节省空间的人都是完全错误的。
您将需要该日期列上的两个索引...一个将是基于函数的,正如 OMG 建议的那样。它看起来像这样:
NVL(Date_finished, TO_DATE('01-JAN-9999'))
因此,要查找未完成的作业,您必须确保正确编写 where 子句
它看起来像这样:
WHERE
NVL(Date_finished, TO_DATE('01-JAN-9999')) = TO_DATE('01-JAN-9999')
是的。那是那么清楚。 好
完全比WHERE
IS_Unfinished = 'YES'
您希望在同一列上有第二个索引的原因是对于该日期的所有其他查询...您不会希望使用该索引按日期查找工作。
那么让我们看看您根据 OMG 的建议等取得了哪些成果。
您使用了更多的空间,您混淆了数据的含义,您更有可能犯错误......胜利者!
有时,程序员似乎还生活在 70 年代,当时 1MB 的硬盘空间就是一套房子的首付。
您可以在不放弃很多清晰度的情况下节省空间。将 Is_unfinished 设置为 Y 或 NULL...如果您将仅使用该列来查找“待办事项”。这将使该索引保持紧凑。它只会与未完成的行一样大(通过这种方式,您可以利用未索引的空值,而不是被它搞砸)。你在桌子上放了一点空间,但总的来说,它比联邦调查局要少。该列需要 1 个字节,并且只需对未完成的行进行索引,这样,这只是工作的一小部分,并且可能保持相当恒定。无论您是否试图找到它们,FBI 的每一行都需要 7 个字节。该索引将与表的大小保持同步,而不仅仅是与未完成作业的大小保持同步。
回复 OMG 的评论
在他/她的评论中,他/她指出要找到未完成的工作,您只需使用
但在他的回答中他说
如果您按照他指向的链接操作,使用 NVL 将 null 值替换为其他任意值,那么我不确定还有什么解释。
To all those who said the column is a waste of space:
Double Duty isn't a good thing in a database. Your primary goal should be clarity. Lots of systems, tools, people will use your data. If you disguise values by burying meaning inside of other columns you're BEGGING for another system or user to get it wrong.
And anyone who thinks it saves space is utterly wrong.
You'll need two indexes on that date column... one will be Function Based as OMG suggests. It will look like this:
NVL(Date_finished, TO_DATE('01-JAN-9999'))
So to find unfinished jobs you'll have to make sure to write the where clause correctly
It will look like this:
WHERE
NVL(Date_finished, TO_DATE('01-JAN-9999')) = TO_DATE('01-JAN-9999')
Yep. That's so clear. It's completely better than
WHERE
IS_Unfinished = 'YES'
The reason you'll want to have a second index on the same column is for EVERY OTHER query on that date... you won't want to use that index for finding jobs by date.
So let's see what you've accomplish with OMG's suggestion et al.
You've used more space, you've obfuscated the meaning of the data, you've made errors more likely... WINNER!
Sometime it seems programmers are still living in the 70's when a MB of hard drive space was a down payment on a house.
You can be space efficient about this without giving up a lot of clarity. Make the Is_unfinished either Y or NULL... IF you will only use that column to find 'work to do'. This will keep that index compact. It will only be as big as rows which are unfinished (in this way you exploit the unindexed nulls instead of being screwed by it). You put a little bit of space in your table, but over all it's less than the FBI. You need 1 byte for the column and you'll only index the unfinished rows so that' a small fraction of job and probably stays pretty constant. The FBI will need 7 bytes for EVERY ROW whether you're trying to find them or not. That index will keep pace with the size of the table, not just the size of the unfinished jobs.
Reply to the comment by OMG
In his/her comment he/she states that to find unfinished jobs you'd just use
But in his answer he says
If you follow the link he points you toward, using NVL to replace null values with some other arbitrary value then I'm not sure what else there is to explain.
Oracle 会索引可空字段,但不会索引
NULL
值这意味着您可以在标记为
NULL<的字段上创建索引/code>,但该字段中包含
NULL
的记录不会进入索引。这反过来意味着,如果您将
date_finished NULL
设置为date_finished NULL
,则索引将尺寸减小,因为NULL
值不会存储在索引中。因此,涉及
date_finished
上范围搜索相等的查询实际上会执行得更好。当然,此解决方案的缺点是涉及
date_finished
的NULL
值的查询必须恢复为全表扫描。您可以通过创建两个索引来解决此问题:
并使用此查询查找未完成的作品:
这将表现得像分区索引:完整的作品将由第一个索引索引;不完整的将由第二个索引。
如果您不需要搜索完整或不完整的作品,您始终可以删除适当的索引。
Oracle does index nullable fields, but does not index
NULL
valuesThis means that you can create an index on a field marked
NULL
, but the records holdingNULL
in this field won't make it into the index.This, on its turn, means that if you make
date_finished NULL
, the index will be less in size, as theNULL
values won't be stored in the index.So the queries involving equality of range searches on
date_finished
will in fact perform better.The downside of this solution, of course, is that the queries involving the
NULL
values ofdate_finished
will have to revert to full table scan.You can work around this by creating two indexes:
and use this query to find unfinished work:
This will behave like partitioned index: the complete works will be indexed by the first index; the incomplete ones will be indexed by the second.
If you don't need to search for complete or incomplete works, you can always get rid of the appropriate indexes.
在表设计方面,我认为您删除了 is_finished 列是件好事,正如您所说的那样,它是不必要的(它是多余的)。如果没有必要,就没有必要存储额外的数据,只会浪费空间。就性能而言,我认为这对于 NULL 值来说不是问题。他们应该被忽略。
In terms of table design, I think it's good that you removed the
is_finished
column as you said that it isn't necessary (it's redundant). There's no need to store extra data if it isn't necessary, it just wastes space. In terms of performance, I don't see this being a problem for NULL values. They should be ignored.正如其他答案中已经提到的那样,对于除“WHERE date_finished IS NULL”之外的所有查询(因此这取决于您是否需要使用该查询),我将使用空值作为索引工作。我绝对不会使用像 9999 年这样的异常值,正如答案所建议的:
像 9999 年这样的异常值会影响性能,因为(来自 http://richardfoote.wordpress.com/2007/12/13/outlier-values-an-enemy-of -索引/):
如果您使用像9999这样的值,那么数据库会认为存储在该字段中的值的范围是例如2008-9999,而不是实际的2008-2010;因此,任何范围查询(例如“2008 年和 2009 年之间”)似乎只覆盖了可能值范围的一小部分,而实际上覆盖了大约一半的范围。它使用这个统计数据来表示,如果覆盖的可能值的百分比很高,则可能会匹配很多行,然后全表扫描将比索引扫描更快。如果数据中存在异常值,它将无法正确执行此操作。
I would use nulls as indexes work, as already mentioned in other answers, for all queries apart from "WHERE date_finished IS NULL" (so it depends if you need to use that query). I definitely wouldn't use outliers like year 9999 as suggested by the answer:
Outliers like year 9999 affect performance, because (from http://richardfoote.wordpress.com/2007/12/13/outlier-values-an-enemy-of-the-index/):
If you use a value like 9999 then the DB will think the range of values being stored in the field is e.g. 2008-9999 rather than the actual 2008-2010; so any range query (e.g. "between 2008 and 2009") will appear to be covering a tiny % of the range of possible values, vs. actually covering about half the range. It uses this statistic to say, if the % of the ths possible values covered is high, probably a lot of rows will match, and then a full table scan will be faster than an index scan. It won't do this correctly if there are outliers in the data.
正如其他人所说,删除可导出值列是个好主意。
另一种想法是,通过删除该列,您将避免需要编写代码的自相矛盾的情况,例如当 is_finished = No 和 finish_date = Saturday 时会发生什么......等。
good idea to remove the deriveable value column as others have said.
one more thought is that by removing the column, you will avoid paradoxical conditions that you will need to code around, such as what happens when the is_finished = No and the finished_date = yesterday... etc.
要解析索引/非索引列,简单地 JOIN 两个表不是更容易吗,如下所示:
因此,如果workflow_finished 中存在记录,则此工作流已完成,否则未完成。在我看来,这相当简单。
当查询未完成的工作流程时,查询变为:
也许您想要原始查询?有旗帜和日期吗?然后像这样查询:
对于数据的消费者,可以而且应该根据他们的需要创建视图。
To resolve the indexed / non-indexed columns, wouldn't it be easier to simply JOIN two tables, like this:
Thus, if a record exists in workflow_finished, this workflow's completed, else it isn't. It seems to me this is rather simple.
When querying for unfinished workflows, the query becomes:
Maybe you want the original query? With a flag and the date? Query like this then:
For consumers of the data, views can and should be created for their needs.
作为基于函数的索引的替代方法,您还可以使用“虚拟”值(例如 9999 年 12 月 31 日,或者最早预期 date_finished 值的前一天)作为未完成工作流的 date_finished 值。
编辑:替代虚拟日期值,以下评论。
As an alternative to a function-based index, you could also use a "dummy" value (such as 31 December 9999, or alternatively one day before the earliest expected date_finished value) as the date_finished value for unfinished workflows.
EDIT: Alternative dummy date value, following comments.
我更喜欢单列解决方案。
但是,在我最常使用的数据库中,索引中包含 NULL,因此搜索开放工作流程的常见情况会很快,而在您的情况下会较慢。由于搜索开放工作流程的情况可能是您最常做的事情之一,因此您可能需要冗余列来支持该搜索。
测试性能,看看是否可以使用性能更好的解决方案,然后在必要时回退到不太好的解决方案。
I prefer the single-column solution.
However, in the databases I use most often NULLs are included in indexes, so your common case of searching for open workflows will be fast whereas in your case it will be slower. Because the case of searching for open workflows is likely to be one of the most common things you do, you may need the redundant column simply to support that search.
Test for performance to see if you can use the better solution performance-wise, then fall back to the less-good solution if necessary.