在每个数据库引擎中存储二进制标志/布尔值的最佳方法是什么?

发布于 2024-10-09 20:57:57 字数 645 浏览 2 评论 0原文

我已经看到了一些可能的方法(在某些数据库引擎中,其中一些是同义词):

  1. TINYINT(1)
  2. BOOL
  3. BIT(1)
  4. ENUM(0,1)
  5. CHAR(0) NULL

应该注意 PHP 支持的所有主要数据库引擎,但作为参考,如果能注明其他引擎就更好了。

我要求的是一种最适合阅读的设计。 例如,在 WHERE 条件中使用标志字段进行选择,或使用标志进行 GROUP BY。 性能比存储空间重要得多(除非大小对性能有影响)。

还有一些更多细节:

在创建表时,我不知道它是否会稀疏(如果大多数标志打开或关闭),但我可以稍后更改表,所以如果我知道有什么可以优化的对此,应当指出。

此外,如果每行只有一个(或几个)标志与许多(或很多)标志有区别,则应注意这一点。

顺便说一句,我在某处读过以下内容:

使用布尔值可以做同样的事情 使用tinyint,但是它有 语义传达的优势 你的意图是什么,那就是 值得一些东西。

好吧,就我而言,它没有任何价值,因为每个表都由我的应用程序中的一个类表示,并且所有内容都在类中显式定义并有详细记录。

I've seen some possible approaches (in some database engines some of them are synonyms):

  1. TINYINT(1)
  2. BOOL
  3. BIT(1)
  4. ENUM(0,1)
  5. CHAR(0) NULL

All major database engine supported by PHP should be noted, but just as a refference it'll be even better if also other engines will be noted.

I'm asking for a design that is best optimized for reading.
e.g. SELECTing with the flag field in the WHERE condition, or GROUP BY the flag.
Performance is much more important than storage space (except when the size has an impact on performance).

And some more details:

While creating the table I can't know if it'll be sparse (if most flags are on or off), but I can ALTER the tables later on, so if there is something I can optimize if I know that, it should be noted.

Also if it's make a difference if there is only one flag (or a few) per row, versus many (or a lot of) flags it should be noted.

BTW, I've read somewhere in SO the following:

Using boolean may do the same thing as
using tinyint, however it has the
advantage of semantically conveying
what your intention is, and that's
worth something.

Well, in my case it doesn't worth nothing, because each table is represented by a class in my application and everything is explicitly defined in the class and well documented.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

假装不在乎 2024-10-16 20:57:57

这个答案适用于 ISO/IEC/ANSI 标准 SQL,并且包括更好的免费软件假装 SQL。

第一个问题是您已经确定了两个类别,而不是一个类别,因此无法对它们进行合理比较。

A.类别一

(1)、(4)和(5)包含多个可能的值并且是一个类别。所有这些都可以在 WHERE 子句中轻松有效地使用。它们具有相同的存储,因此存储和读取性能都不是问题。因此,剩下的选择只是基于列的实际数据类型。

ENUM 是非标准的;更好或标准的方法是使用查找表;那么这些值在表中可见,而不是隐藏,并且可以通过任何报表工具进行枚举。由于内部处理,ENUM 的读取性能会受到轻微影响。

B. 第二类

(2) 和 (3) 是二值元素:True/False;男/女;死/活。该类别与第一类不同。它在数据模型和每个平台中的处理方式都是不同的。 BOOLEAN 只是 BIT 的同义词,它们是同一个东西。从法律上讲(SQL 方面),所有兼容 SQL 的平台都会进行相同的处理,并且在 WHERE 子句中使用它没有问题。

性能差异取决于平台。 Sybase 和 DB2 将最多 8 位打包为一个字节(这里存储并不重要),并动态映射 2 的幂,因此性能非常好。 Oracle 在每个版本中做了不同的事情,我看到建模者使用 CHAR(1) 而不是 BIT 来克服性能问题。 MS 在 2005 年之前表现还不错,但在 2008 年就打破了这一局面,因为结果是不可预测的;所以简短的答案可能是将其实现为 CHAR(1)。

当然,假设您不会做愚蠢的事情,例如将 8 个单独的列打包到一个 TINYINT 中。这不仅是一个严重的规范化错误,而且对编码人员来说是一场噩梦。保持每列离散且数据类型正确。

C. 多重指标&可空列

这与 (A) 和 (B) 无关且独立。列的正确数据类型是什么,与您拥有的数量以及是否可为 Null 无关。可空意味着(通常)该列是可选的。本质上您还没有完成建模或标准化练习。功能依赖性不明确。如果完成规范​​化练习,将不会有可为 Null 的列,也不会存在可选列;它们要么对于特定关系明显存在,要么不存在。这意味着使用普通的父类型-子类型的关系结构。

当然,这意味着更多的表,但没有空值。企业 DBMS 对于更多表或更多连接没有问题,这就是它们优化的目的。规范化数据库的性能比非规范化或非规范化数据库要好得多,并且无需“重构”即可扩展它们。您可以通过为每个子类型提供视图来简化使用。

如果您想了解有关此主题的更多信息,请查看此问题/答案如果。如果您需要建模方面的帮助,请提出一个新问题,在您的提问级别,我建议您坚持使用 5NF。

D. 空值的性能

如果性能对您很重要,则排除每个可空列。存储为可变长度;这需要对每行/列进行额外处理,以允许日志记录等在不妨碍固定行的情况下移动队列。索引中的长度列(包括可为 Null 的列):需要在每次访问时进行解包。

E. 民意调查

最后,我认为这个问题没有民意调查的意义。很公平,你会得到技术答案,甚至意见,但民意调查是为了人气竞赛,而SO的响应者的技术能力涵盖了非常广泛的范围,所以最受欢迎的答案和技术上最正确的答案是在两个不同的答案频谱的两端。

This answer is for ISO/IEC/ANSI Standard SQL, and includes the better freeware pretend-SQLs.

First problem is you have identified two Categories, not one, so they cannot be reasonably compared.

A. Category One

(1) (4) and (5) contain multiple possible values and are one category. All can be easily and effectively used in the WHERE clause. They have the same storage so neither storage nor read performance is an issue. Therefore the remaining choice is simply based on the actual Datatype for the purpose of the column.

ENUM is non-standard; the better or standard method is to use a lookup table; then the values are visible in a table, not hidden, and can be enumerated by any report tool. The read performance of ENUM will suffer a small hit due to the internal processing.

B. Category Two

(2) and (3) are Two-Valued elements: True/False; Male/Female; Dead/Alive. That category is different to Category One. Its treatment both in your data model, and in each platform, is different. BOOLEAN is just a synonym for BIT, they are the same thing. Legally (SQL-wise) there are handled the same by all SQL-compliant platforms, and there is no problem using it in the WHERE clause.

The difference in performance depends on the platform. Sybase and DB2 pack up to 8 BITs into one byte (not that storage matters here), and map the power-of-two on the fly, so performance is really good. Oracle does different things in each version, and I have seen modellers use CHAR(1) instead of BIT, to overcome performance problems. MS was fine up to 2005 but they have broken it with 2008, as in the results are unpredictable; so the short answer may be to implement it as CHAR(1).

Of course, the assumption is that you do not do silly things such as pack 8 separate columns in to one TINYINT. Not only is that a serious Normalisation error, it is a nightmare for coders. Keep each column discrete and of the correct Datatype.

C. Multiple Indicator & Nullable Columns

This has nothing to do with, and is independent of, (A) and (B). What the columns correct Datatype is, is separate to how many you have and whether it is Nullable. Nullable means (usually) the column is optional. Essentially you have not completed the modelling or Normalisation exercise. The Functional Dependencies are ambiguous. if you complete the Normalisation exercise, there will be no Nullable columns, no optional columns; either they clearly exist for a particular relation, or they do not exist. That means using the ordinary Relational structure of Supertype-Subtypes.

Sure, that means more tables, but no Nulls. Enterpise DBMS have no problem with more tables or more joins, that is what they are optimised for. Normalised databases perform much better than unnormalised or denormalised ones, and they can be extended without "re-factoring'. You can ease the use by supplying a View for each Subtype.

If you want more information on this subject, look at this question/answer. If you need help with the modelling, please ask a new question. At your level of questioning, I would advise that you stick with 5NF.

D. Performance of Nulls

Separately, if performance is important to you, then exclude Nulls. Each Nullable column is stored as variable length; that requires additional processing for each row/column. The enterprise databases use a "deferred" handling for such rows, to allow the logging, etc to move thought the queues without impeding the fixed rows. In particular never use variable length columns (that includes Nullable columns) in an Index: that requires unpacking on every access.

E. Poll

Finally, I do not see the point in this question being a poll. It is fair enough that you will get technical answers, and even opinions, but polls are for popularity contests, and the technical ability of responders at SO covers a very range, so the most popular answers and the most technically correct answers are at two different ends of the spectrum.

心如荒岛 2024-10-16 20:57:57

我知道这不是您想要的答案,但除了最极端的特殊情况外,这种差异实际上可以忽略不计。在每种此类特定情况下,简单地切换数据类型不足以解决性能问题。

例如,这里有一些替代方案,其性能将大大优于任何数据类型更改。当然,每种方法都有其缺点。

如果您有 200 个可选标志,并且一次最多查询 1-2 个大量行,则通过将每个标志放在其自己的表中可以获得更好的性能。如果数据确实稀疏,情况会更好。

如果您有 200 个强制标志并且仅执行单个记录提取,则应该将它们放在同一个表中。

如果您有一小组标志,您可以使用位掩码将它们打包到一列中,这是高效的存储方式,但您将无法(轻松)查询各个标志。当然,当标志可以为 NULL 时,这不起作用...

或者您可以发挥创意并使用“垃圾维度”概念,在其中创建一个单独的表,其中所有 200 个布尔标志表示为列。为每个不同的标志值组合创建一行。每行都有一个自动增量主键,您可以在主记录中引用该主键。瞧,主表现在包含 1 个 int,而不是 200 个
列。黑客的天堂,DBA 的噩梦。

我想说的一点是,尽管争论哪个是“最好的”很有趣,但还有其他更重要的问题(例如您引用的评论)。很简单,因为当您遇到真正的性能问题时,数据类型既不是问题也不是解决方案。

I know this is not the answer you want, but the difference is really negligeble in all but the most extreme special cases. And in each such specific case, simply switching datatype won't be enough to fix a performance problem.

For example, here are some alternatives that will outperform any datatype changes by a large factor. Each carries with it a downside of course.

If you have 200 optional flags and you query for at most 1-2 at a time for lots of rows, you would get better performance by having each flag in its own table. If the data is really sparse this gets even better.

If you have 200 mandatory flags and you only perform single record fetches, you should put them in the same table.

If you have a small set of flags, you could pack them in one column using a bitmask, which is efficient storage wise, but you won't be able to (easily) query individual flags. Of course, this doesn't work when flags can be NULL...

Or you could get creative and use a "junk dimension" concept, in which you create a separate table with all 200 boolean flags represented as columns. Create one row for each distinct combination of flag values. Each row gets an autoincrement primary key, which you reference in the master record. Voila, the master table now contains 1 int, instead of 200
columns. Hackers heaven, DBA nightmare.

The point I'm trying to make is that even though it is interesting to argue over which is "the best", there are other concerns that are of much greater importance (like the comment you quoted). Simply because when you encounter a real performance issue, the datatype will neither be the problem nor the solution.

落花随流水 2024-10-16 20:57:57

上述任何一个都可以,如果得到适当支持,我个人倾向于使用 BOOL,因为这最好地传达您的意图,但我会避免使用 ENUM(0,1) >。

ENUM 的第一个问题是它要求其值是字符串。 01 看起来像一个数字,因此程序员倾向于向其发送一个数字。

ENUM 的第二个问题是,如果你向它发送了错误的值,它会默认为第一个枚举,并且在某些数据库中它甚至不会指示错误(我正在看你的 MySQL)。这使得第一个问题变得更糟,因为如果您不小心发送了 1 而不是 "1" ,它将存储值 "0" --非常违反直觉!

我不认为这会影响所有数据库引擎(不知道,还没有尝试过所有数据库引擎),但它影响了足够多的数据库引擎,我认为避免它是一个很好的做法。

Any of the above is fine and I have a personal preference of using BOOL if it is properly supported because that best conveys your intent but I would avoid using ENUM(0,1).

The first problem with ENUM is that it requires its value to be a string. 0 and 1 looks like a number so programmers have a tendency to send it a number.

The second problem with ENUM is that if you send it a wrong value it defaults to the first enumeration and in some databases it won't even indicate an error (I'm looking at you MySQL). This makes the first problem much worse since if you accidentally send it 1 instead of "1" it will store the value "0" -- very counter-intuitive!

I don't think this affects all database engines (don't know, havent't tried them all) but it affects enough of them that I consider avoiding it to be good practice.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文