Is VARCHAR like totally 1990s?
1. VARCHAR does not store Unicode characters.
2. NVARCHAR does store Unicode characters.
3. Today's applications should always be Unicode compatible.
4. NVARCHAR takes twice the amount of space to store it.
5. Point 4 doesn't matter because storage space is extremely inexpensive.
Ergo: When designing SQL Server databases today, one should always use NVARCHAR.
Is this sound reasoning? Does anyone disagree with any of the premises?
Are there any reasons to choose VARCHAR over NVARCHAR today?
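The "twice the amount of space" premise can be checked with a rough sketch. It assumes the VARCHAR column uses a single-byte code page (Python's cp1252 codec stands in for it here) and that NVARCHAR stores two-byte UTF-16 code units; the function name and dictionary keys are made up for illustration, and real columns add per-row overhead that this ignores.

```python
def stored_bytes(text: str) -> dict:
    """Return the encoded size of `text` under two illustrative encodings."""
    return {
        # Stand-in for VARCHAR under a single-byte code page (an assumption).
        "varchar_cp1252": len(text.encode("cp1252")),
        # Stand-in for NVARCHAR: two bytes per UTF-16 code unit.
        "nvarchar_utf16": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    # For plain ASCII text the UTF-16 size is exactly double.
    print(stored_bytes("Hello, world"))
```

For text that fits the code page, the second number is always double the first, which is the premise the question starts from.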
Comments (14)
You match the datatype with the data that will be stored in the column. By a similar argument you could say why not store all data in NVARCHAR columns, because numbers and dates can be represented as strings of digits.
If the best match for the data that will be stored in the column is VARCHAR, then use it.
It is not just storage, but bandwidth: CPU, memory, backup, recovery, transfer. Conserve.
I'd say that there are still valid reasons to not use nvarchar.
really huge.
However, new development should probably use nvarchar, especially since 64-bit systems are becoming the norm. Also, companies (even small ones) are now more commonly global.
You should choose VARCHAR over NVARCHAR for many different types of columns, and the choice would be on a per-column basis.
Typical columns which would not require the extra overhead NVARCHAR incurs would be:
- ID-type columns: license plates, SSNs, patient chart identifiers, etc.
- Code columns: international currency codes (USD, GBP, etc.), ISO country codes (US, GB, etc.), language codes (en-us, etc.), accounting segment codes, etc.
- Postal code and zip code columns.
I believe that comparing nvarchar values is more costly than comparing varchar values, so choosing varchar is perfectly valid, and even preferable, in places where you really don't need Unicode capabilities, e.g. for some internal IDs.
And storage cost still does matter. If you have billions of rows then those "small" differences get big pretty fast.
As others have pointed out, it's not just the cost of the storage.
The length of a column affects the number of rows per page. Fewer rows per page means fewer rows fit into your caches, which hurts performance. I am assuming that in MSSQL an indexed NVARCHAR column also takes up more space in the index. That means fewer index entries per block, therefore more blocks in the index, therefore more seeks when scanning (or searching) the index, which slows down indexed access too.
So it loses you performance on every single front. If you genuinely don't care (or can measure the performance and are happy with it, of course), then that's fine. But if you have a genuine requirement to store unicode characters, of course, use NVARCHAR.
It may be that the maintainability gained by using NVARCHAR throughout your database outweighs any performance cost.
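The rows-per-page argument can be put into rough numbers. This is a back-of-the-envelope sketch, not SQL Server's actual row layout: it assumes about 8096 usable bytes in an 8 KB page, ignores the per-row slot entry, and uses a hypothetical 20 bytes for the rest of the row. The names and figures are illustrative only.

```python
# Approximate usable bytes in an 8 KB page after its header (simplification).
PAGE_DATA_BYTES = 8096


def rows_per_page(fixed_bytes: int, text_chars: int, bytes_per_char: int) -> int:
    """Estimate how many rows fit on one page for a single text column."""
    # Row size = everything else in the row + the text column's payload.
    row_bytes = fixed_bytes + text_chars * bytes_per_char
    return PAGE_DATA_BYTES // row_bytes


if __name__ == "__main__":
    # Hypothetical row: 20 bytes of other data plus a 50-character string.
    print("varchar :", rows_per_page(20, 50, 1))
    print("nvarchar:", rows_per_page(20, 50, 2))
```

Under these assumptions the two-byte column fits noticeably fewer rows per page, which is the effect on caches and indexes described above. The real ratio depends on how much of each row is text.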
These sorts of questions always have the same answer: it depends. There is no magical rule that you should follow blindly. Even the use of GOTO in modern programming languages can be justified; see the question "Is it ever advantageous to use 'goto' in a language that supports loops and functions? If so, why?"
So the answer is: use your head and think about the particular situation. In this particular instance keep in mind that you can always convert from varchar to nvarchar in the database if it turns out your requirements change.
I have seen nvarchar columns converted to varchar for two reasons:
- Application is using MSSQL Express Edition, which has a 4GB database size limit. Switching to MSSQL Standard Edition would be too expensive if there are many database deployments, as would be the case in single-tenant webapps or applications with an embedded DBMS. The cheaper SQL2008 Web Edition could help here.
- nvarchar(4000) is not enough but you don't want an ntext column. So you convert to varchar(8000). However, in most cases you probably should convert to nvarchar(max).
Your point 3 is invalid. Systems designed for use in a single country don't have to worry about Unicode, and some languages and products in use support Unicode only partially or not at all. For example, TurboTax is only for the U.S. (and even a Canadian version with French would still need only Latin-1), so they wouldn't need to worry about Unicode and probably don't support it (I don't know whether they do, but even if they do, it's just an example).
"Today's applications should always be Unicode compatible."
is probably more valid expressed as:
"Today's applications should always be Unicode compatible if nothing special needs to occur to handle Unicode properly, and a previously existing codebase or any other piece of the application does not need to be updated specifically to support it"
Storage is less expensive than it's ever been historically, but still if you can store twice as much data on a given hard drive, that's attractive, isn't it?
Also there's RAM for caching, and solid-state drives, which are both a lot more expensive than hard drives. It's beneficial to use more compact data formats when you have millions of rows.
Is there a way for your database server to use UTF-8 as an encoding? You then get the benefits of low storage for mostly ASCII loads, and the ability to store anything in the range of Unicode so that expansion is possible.
I would ask your database vendor to support UTF-8 as an encoding for the VARCHAR SQL type, as well. I don't know how other DB servers do it, but I do know that you can use UTF-8 in VARCHAR and TEXT fields in at least MySQL and PostgreSQL.
All that having been said, the only reason not to use a UTF-16 encoded field is if you have to interact with applications that will break on UTF-16 input. That would be most legacy applications designed to handle ASCII or ISO-8859 text encodings, which would be better off processing UTF-8.
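The storage trade-off between UTF-8 and UTF-16 can be illustrated with a small sketch. It only measures encoded byte lengths in Python and says nothing about how any particular database lays out its columns; the function name and sample strings are chosen for illustration.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (utf8_bytes, utf16_bytes) for `text`."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    # ASCII-only, Latin with one accented letter, and CJK text.
    for sample in ["ABC-1234", "François", "日本語"]:
        utf8, utf16 = encoded_sizes(sample)
        print(f"{sample!r}: utf-8={utf8} bytes, utf-16={utf16} bytes")
```

Mostly-ASCII data comes out at about half the size in UTF-8, while text dominated by characters such as CJK ideographs can be larger in UTF-8 than in UTF-16, so the saving depends on the data.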
My leaning is "use NVARCHAR" as a default... but @CadeRoux has a good point: if you are SURE the data will never hold anything but ASCII -- like a US license plate -- VARCHAR might save you a tiny bit of cost.
I'd say the flip side of his well-put statement is "DO use NVARCHAR" for anything that will have names (people, streets, places) or natural language text (email, chat, articles, blog postings, photo captions). Otherwise, your "firstname" column will not be able to encode "François" or "José" correctly, and your text columns will not allow text with "foreign" diacritical marks, or -- for that matter -- very common US characters like cent-mark "¢", paragraph mark "¶", a bullet "•". (Because none of those are ASCII characters, and there is no good, standard way to put them into a VARCHAR field. Trust me: you'll hurt yourself.)
On ANY project I've worked on, I've NEVER been scolded for using NVARCHAR because I was "squandering too much company money on disk space". And if I had to rework code or the DB schema (especially on a live, production system), the cost spent in the re-fit would EASILY outweigh the "savings" from buying a disk that was 50% smaller.
To really understand this question, you have to understand ASCII, Unicode, and Unicode's typical encodings (like UCS-2 and UTF-8).
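A small sketch of what "will never hold anything but ASCII" means in practice. It uses Python's ascii codec as a stand-in; what a real VARCHAR column accepts depends on its code page or collation, so treat this as an illustration of the failure mode rather than of SQL Server's exact behaviour. The helper names are hypothetical.

```python
def fits_ascii(text: str) -> bool:
    """True if every character in `text` is plain ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def to_ascii_lossy(text: str) -> str:
    """Force `text` into ASCII, replacing anything unrepresentable with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


if __name__ == "__main__":
    for value in ["ABC-1234", "François", "José", "5¢"]:
        print(value, fits_ascii(value), to_ascii_lossy(value))
```

A license plate passes unchanged, while a name with a diacritic either fails outright or is silently damaged, which is the kind of breakage the answer warns about.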
I'm no expert on the subject. But is there any reason why you couldn't use UTF-8 to get a combination of small space and Unicode?
I've seen some databases where the indices (indexes?...different debate) have been larger than the data. If one can get away with half the storage demands (varchar) within the index, then one assumes that equates to twice the hit density of a given page and more efficient fill-factoring, leading to faster data retrieval/writing/locking and lower storage requirements (already mentioned).