在数据库中存储性别(性别)
我想以尽可能小的(大小/性能)成本将用户的性别存储在数据库中。
到目前为止,我想到了 3 个场景
- Int - 与代码中的 Enum 一致(1 = 男性,2 = 女性,3 = ...)
- char(1 ) - 存储 m、f 或其他单个字符标识符
- 位 (布尔值) - 这个选项有合适的字段名称吗?
我问的原因是因为这个answer 其中提到字符小于布尔。
我应该澄清一下,我使用的是 MS SQL 2008,它实际上具有位数据类型。
I want to store a user's gender in a database with as little (size/performance) cost as possible.
So far, 3 scenarios come to mind
- Int - aligned with Enum in code (1 = Male, 2 = Female, 3 = ...)
- char(1) - Store m, f or another single character identifier
- Bit (boolean) - is there an appropriate field name for this option?
The reason I ask is because of this answer which mentions that chars are smaller than booleans.
I should clarify that I'm using MS SQL 2008, which DOES in fact have the bit datatype.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
对此已经有一个 ISO 标准;无需发明自己的方案:
http://en.wikipedia.org/wiki/ISO_5218
根据标准,该列应称为“性别”,“最接近”的数据类型将是带有 CHECK 约束或查找表的tinyint(视情况而定)。
There is already an ISO standard for this; no need to invent your own scheme:
http://en.wikipedia.org/wiki/ISO_5218
Per the standard, the column should be called "Sex" and the 'closest' data type would be tinyint with a CHECK constraint or lookup table as appropriate.
我将这一栏称为“性别”。
可以排除 BIT 数据类型,因为它仅支持两种可能的性别这是不够的。虽然 INT 支持两个以上的选项,但它需要 4 个字节——性能使用更小/更窄的数据类型会更好。
CHAR(1)
优于 TinyINT - 两者都采用相同的字节数,但 CHAR 提供的值数量更窄。使用CHAR(1)
将使用“m”、“f”等自然键,而不是使用被称为代理/人工键的数字数据。如果需要移植,任何数据库都支持CHAR(1)
。结论
我会使用选项 2:CHAR(1)。
附录
性别列上的索引可能没有帮助,因为低基数列上的索引没有任何价值。这意味着,索引的值没有足够的多样性来提供任何值。
I'd call the column "gender".
The BIT data type can be ruled out because it only supports two possible genders which is inadequate. While INT supports more than two options, it takes 4 bytes -- performance will be better with a smaller/more narrow data type.
CHAR(1)
has the edge over TinyINT - both take the same number of bytes, but CHAR provides a more narrow number of values. UsingCHAR(1)
would make using "m", "f",etc natural keys, vs the use of numeric data which are referred to as surrogate/artificial keys.CHAR(1)
is also supported on any database, should there be a need to port.Conclusion
I would use Option 2: CHAR(1).
Addendum
An index on the gender column likely would not help because there's no value in an index on a low cardinality column. Meaning, there's not enough variety in the values for the index to provide any value.
医学上有四种性别:男性、女性、不确定和未知。您可能不需要全部四个,但您肯定需要 1、2 和 4。为此数据类型设置默认值是不合适的。更不用说将其视为具有“是”和“不是”状态的布尔值了。
In medicine there are four genders: male, female, indeterminate, and unknown. You mightn't need all four but you certainly need 1, 2, and 4. It's not appropriate to have a default value for this datatype. Even less to treat it as a Boolean with 'is' and 'isn't' states.
与
Enum
字段对齐的Int
(或TinyInt
)将是我的方法。首先,如果数据库中有一个
bit
字段,该行仍将使用一个完整字节,因此就节省空间而言,只有在有多个bit
时才有意义代码>字段。其次,字符串/字符有一种“神奇的价值”感觉,无论它们在设计时看起来多么明显。更不用说,它允许人们存储几乎任何他们不一定映射到任何明显的值。
第三,为了强制引用完整性,数值更容易(也是更好的实践)创建查找表,并且可以与枚举进行一对一关联,因此在将值存储在内存中时存在奇偶校验应用程序或数据库中。
An
Int
(orTinyInt
) aligned to anEnum
field would be my methodology.First, if you have a single
bit
field in a database, the row will still use a full byte, so as far as space savings, it only pays off if you have multiplebit
fields.Second, strings/chars have a "magic value" feel to them, regardless of how obvious they may seem at design time. Not to mention, it lets people store just about any value they would not necessarily map to anything obvious.
Third, a numeric value is much easier (and better practice) to create a lookup table for, in order to enforce referential integrity, and can correlate 1-to-1 with an enum, so there is parity in storing the value in memory within the application or in the database.
选项 3 是最好的选择,但并非所有数据库引擎都有“位”类型。如果您没有一点,那么 TinyINT 将是您最好的选择。
Option 3 is your best bet, but not all DB engines have a "bit" type. If you don't have a bit, then TinyINT would be your best bet.
我使用字符“f”、“m”和“u”,因为我从姓名、声音和对话中推测性别,有时不知道性别。最终的决定是他们的意见。
这实际上取决于您对这个人的了解程度以及您的标准是外貌还是个人身份。心理学家可能需要额外的选择——跨性别、跨性别、跨性别、跨性别、跨性别、雌雄同体和犹豫不决。有 9 个选项,没有由单个字符明确定义,我可能会接受 Hugo 的小整数建议。
I use char 'f', 'm' and 'u' because I surmise the gender from name, voice and conversation, and sometimes don't know the gender. The final determination is their opinion.
It really depends how well you know the person and whether your criteria is physical form or personal identity. A psychologist might need additional options - cross to female, cross to male, trans to female, trans to male, hermaphrodite and undecided. With 9 options, not clearly defined by a single character, I might go with Hugo's advice of tiny integer.
在此处输入链接说明
enter link description here
我会选择选项 3,但使用多个 NON NULLABLE 位列而不是一个。
是男性(1=是/0=否)
IsFemale(1=是/0=否)
(如果需要):
性别未知(1=是/0=否)
等等...
这使得定义易于阅读、易于扩展、易于编程、不可能使用域之外的值,并且不需要第二个查找表+FK或CHECK约束来锁定值。
编辑:更正,您确实需要至少一个约束来确保设置的标志有效。
I would go with Option 3 but multiple NON NULLABLE bit columns instead of one.
IsMale (1=Yes / 0=No)
IsFemale (1=Yes / 0=No)
if requried:
IsUnknownGender (1=Yes / 0=No)
and so on...
This makes for easy reading of the definitions, easy extensibility, easy programmability, no possibility of using values outside the domain and no requirement of a second lookup table+FK or CHECK constraints to lock down the values.
EDIT: Correction, you do need at least one constraint to ensure the set flags are valid.