Are string (or varchar) fields generally used as join fields?
We have two tables. The first contains a name (varchar) field. The second contains a field that references the name field from the first table. This foreign key in the second table will be repeated for every row associated with that name. Is it generally discouraged to use a varchar/string field as a join between two tables? When is the best case where a string field can be used as a join field?
Comments (8)
It's certainly possible to use a varchar as a key field (or simply something to join on). The main problem with it stems from what you normally store in a varchar field: mutable data. Strictly speaking, it's not advisable to have key fields change. A person's name, telephone number, even their SSN can all change. However, the employee with internal ID 3 will always be ID 3, even if there are two John Smiths.
Second, string comparison depends on a number of nit-picky details, such as culture, collation, whitespace translation, etc., that can break a join for no immediately apparent reason. Say you use a tab character \t in a certain string you're joining on. Later, you change your software to replace \t with 3 spaces to reduce character escapes in your raw strings. You have now broken any functionality that requires a string with escaped tabs to match an identical-looking, but differently composed, string.
Lastly, even given two perfectly identical strings, there is a slight performance benefit to comparing two integers over comparing two strings. Integer comparison is effectively constant-time. String comparison is linear in the length of the string in the worst case, which is exactly the case of two matching strings.
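To make the whitespace point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `employee` and `timesheet` tables and their columns are invented for illustration; the child row stores the name after a hypothetical "tab to 3 spaces" cleanup, so the two strings look alike but no longer compare equal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE timesheet (employee_id INTEGER, employee_name TEXT, hours INTEGER);
""")

# Parent row keeps the original tab; the child row was written after the
# (hypothetical) cleanup that replaced the tab with 3 spaces.
conn.execute("INSERT INTO employee (id, name) VALUES (?, ?)", (3, "John\tSmith"))
conn.execute("INSERT INTO timesheet VALUES (?, ?, ?)", (3, "John   Smith", 8))

# Join on the string: the names differ byte-for-byte, so nothing matches.
by_name = conn.execute(
    "SELECT COUNT(*) FROM employee e JOIN timesheet t ON t.employee_name = e.name"
).fetchone()[0]

# Join on the integer ID: unaffected by how the name is spelled.
by_id = conn.execute(
    "SELECT COUNT(*) FROM employee e JOIN timesheet t ON t.employee_id = e.id"
).fetchone()[0]

print("matches by name:", by_name)
print("matches by id:", by_id)
```

The join on the name silently returns no rows, while the join on the ID still finds the timesheet entry.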
If there's a natural key to be used (extremely rare in real life, but state/province abbreviations are a good example), then VARCHAR fields are fine.
It depends on the database, because of how much storage each data type is allocated, but generally a VARCHAR(4) or shorter takes around the same amount of space as an INT would (less, the fewer characters it holds).
Generally speaking, you shouldn't use anything that is editable by end users as a FK, as an edit would require not one update, but one update per table that references that key.
Everyone else has already mentioned the potential performance implications for queries, but the update cost is also worth noting. I strongly suggest using a generated key instead.
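A rough sketch of that update cost, again with sqlite3 and made-up table names: design A uses the editable name as the key, design B uses a generated integer key. The helper functions `rename_a` and `rename_b` are hypothetical and simply count how many rows a rename has to touch in each design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Design A: the editable name itself is the key.
    CREATE TABLE customer_a (name TEXT PRIMARY KEY);
    CREATE TABLE order_a (id INTEGER PRIMARY KEY, customer_name TEXT NOT NULL);
    -- Design B: a generated integer key; the name is just an attribute.
    CREATE TABLE customer_b (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE order_b (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer_b(id)
    );
""")
conn.execute("INSERT INTO customer_a VALUES ('Acme')")
conn.execute("INSERT INTO customer_b VALUES (1, 'Acme')")
for _ in range(1000):
    conn.execute("INSERT INTO order_a (customer_name) VALUES ('Acme')")
    conn.execute("INSERT INTO order_b (customer_id) VALUES (1)")

def rename_a(old, new):
    """Rename in design A: the parent row plus every referencing row."""
    touched = conn.execute(
        "UPDATE customer_a SET name = ? WHERE name = ?", (new, old)).rowcount
    touched += conn.execute(
        "UPDATE order_a SET customer_name = ? WHERE customer_name = ?",
        (new, old)).rowcount
    return touched

def rename_b(customer_id, new):
    """Rename in design B: only the parent row changes."""
    return conn.execute(
        "UPDATE customer_b SET name = ? WHERE id = ?", (new, customer_id)).rowcount

rows_a = rename_a("Acme", "Acme Corp")
rows_b = rename_b(1, "Acme Corp")
print("rows touched with name as key:", rows_a)
print("rows touched with generated key:", rows_b)
```

With only one referencing table the difference is already 1001 rows versus 1; every additional table that references the name adds another UPDATE statement to `rename_a`.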
If you're concerned about performance, the best way to know is to create tables that implement your potential design choices, then load them up with massive amounts of data to see what happens.
In theory, very small strings should perform as well as a number in joins. In practice, it would definitely depend upon the database, indexing, and other implementation choices.
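As a starting point for that kind of experiment, the sketch below builds the same parent/child data twice, once keyed by integer and once by a short string, and times a join on each. Everything here is an assumption made for illustration (SQLite in memory, the table names, the row counts, the `build` and `timed` helpers); real numbers need to come from your own database, schema, and data volumes.

```python
import sqlite3
import time

def build(conn, n_parents=1000, children_per_parent=50):
    """Create one parent table and two child tables (int FK vs. string FK)."""
    conn.executescript("""
        CREATE TABLE parent (id INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE);
        CREATE TABLE child_int (parent_id INTEGER NOT NULL);
        CREATE TABLE child_str (parent_code TEXT NOT NULL);
    """)
    parents = [(i, f"code-{i:08d}") for i in range(n_parents)]
    conn.executemany("INSERT INTO parent VALUES (?, ?)", parents)
    for pid, code in parents:
        conn.executemany("INSERT INTO child_int VALUES (?)",
                         [(pid,)] * children_per_parent)
        conn.executemany("INSERT INTO child_str VALUES (?)",
                         [(code,)] * children_per_parent)
    conn.commit()

def timed(conn, sql):
    """Run a COUNT(*) query and return (row count, elapsed seconds)."""
    start = time.perf_counter()
    count = conn.execute(sql).fetchone()[0]
    return count, time.perf_counter() - start

conn = sqlite3.connect(":memory:")
build(conn)
int_count, int_secs = timed(
    conn, "SELECT COUNT(*) FROM child_int c JOIN parent p ON p.id = c.parent_id")
str_count, str_secs = timed(
    conn, "SELECT COUNT(*) FROM child_str c JOIN parent p ON p.code = c.parent_code")
print(f"int join: {int_count} rows in {int_secs:.4f}s")
print(f"str join: {str_count} rows in {str_secs:.4f}s")
```

A single run like this is noisy; repeating the measurement and scaling the row counts up toward production size gives a more trustworthy picture.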
In a relational database, you shouldn't use a string in one table that references the same string in another table. If the second table is a look-up, create an identity column for the table, and then reference the integer value in the first. When displaying the data, use a join to the second table. Just make sure in the second table you never actually delete records.
The only exception would be if you are creating an archive table where you want to store exactly what was chosen at a given time.
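One way the look-up arrangement described above might look, sketched with sqlite3. The `status` and `ticket` tables are invented, and the `is_active` flag is only one possible way (an assumption, not something the answer prescribes) to retire a look-up value without deleting its row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE status (
        id INTEGER PRIMARY KEY,               -- identity column
        label TEXT NOT NULL UNIQUE,
        is_active INTEGER NOT NULL DEFAULT 1  -- hypothetical soft-delete flag
    );
    CREATE TABLE ticket (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        status_id INTEGER NOT NULL REFERENCES status(id)
    );
    INSERT INTO status (id, label) VALUES (1, 'open'), (2, 'closed');
    INSERT INTO ticket (title, status_id) VALUES ('Fix login', 1), ('Update docs', 2);
""")

# Retire a look-up value instead of deleting it, so existing rows still resolve.
conn.execute("UPDATE status SET is_active = 0 WHERE label = 'closed'")

# For display, join back to the look-up table to get the text.
rows = conn.execute("""
    SELECT t.title, s.label
    FROM ticket t JOIN status s ON s.id = t.status_id
    ORDER BY t.id
""").fetchall()
print(rows)
```

Because the `closed` row is only flagged rather than deleted, the ticket that references it still displays its label.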
Sometimes a join will happen on fields that are not "join fields", because that's just the nature of the query (e.g. most ways of identifying records that are duplicates in a particular column). If the query you want relates to those values, then that's what the join will be on, end of story.
If a field genuinely identifies a row, then it is possible to use it as a key. It's even possible to do so if it could change (it brings issues, but not insurmountable issues) as long as it remains a genuine identifier (it'll never change to a value that exists for another row).
The performance impact varies by common query and by database. On the database side, the indexing strategies some engines offer make them better at handling varchar and other textual keys than others (hash indices in particular are nice).
Common queries can be such that it becomes more performant to use varchar even without hash indices. A classic example is storing pieces of text for a multi-lingual website. Each such piece of text will have a particular languageID relating to the language it is in. However, obtaining other information about that language (its name etc.) is rarely needed; what's much more often needed is either to filter by the RFC 5646 code, or to find out what that RFC 5646 code is. If we use a numeric ID, then we will have to join for both types of query to obtain that code. If we use the code as the ID, then the most common queries concerned with the language won't need to look in the language table at all. Most queries that do care about the details of the language also won't need to do any join; pretty much the only time the key will be used as a foreign key is in maintaining referential integrity on update and insert of text or on deletion of languages. Hence, while the join is less efficient when it is used, the system as a whole will be more efficient by using fewer joins.
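The multi-lingual case could be sketched like this, with the language code itself as the key. The `language` and `snippet` tables and their sample rows are made up for illustration; the point is only that both common queries read the code straight from the text table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE language (
        code TEXT PRIMARY KEY,   -- RFC 5646 tag, e.g. 'en-GB'
        name TEXT NOT NULL
    );
    CREATE TABLE snippet (
        id INTEGER PRIMARY KEY,
        language_code TEXT NOT NULL REFERENCES language(code),
        body TEXT NOT NULL
    );
    INSERT INTO language VALUES ('en-GB', 'British English'), ('fr-FR', 'French');
    INSERT INTO snippet (language_code, body) VALUES
        ('en-GB', 'Welcome'), ('fr-FR', 'Bienvenue'), ('en-GB', 'Goodbye');
""")

# Filter by language: no join to the language table is needed.
english = [body for (body,) in conn.execute(
    "SELECT body FROM snippet WHERE language_code = ? ORDER BY id", ("en-GB",))]

# Find out which language a given snippet is in: again, no join.
code = conn.execute(
    "SELECT language_code FROM snippet WHERE id = ?", (2,)).fetchone()[0]

print(english)
print(code)
```

With a numeric language ID, both queries would need a join to `language` just to translate between the ID and the code.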
It depends on the nature of your data.
If the string is some user-entered and updated value, then I would probably shy away from joining on it. You may run into consistency problems from storing the name in both the parent and the detail table.
Can nothing ever have a duplicate name?
I have used a string field as a join when using GUIDs or single-char identifiers, or when I know the string to be a natural key (though I almost always prefer a surrogate).
Natural primary keys like a zip code, phone number, email address or user name are by definition strings. They are unique and relatively short.
If you put an index on such a column, there is no problem with using it in a join. The impact on performance will usually be minimal.