通过存储原始字符串大小写和小写来绕过数据库区分大小写问题是疯狂的吗?

发布于 2024-09-29 08:22:42 字数 1125 浏览 0 评论 0原文

我正在实现一个数据库,其中多个表将字符串数据作为候选键(例如:用户名),并将进行相应的索引。对于这些字段,我想要:

  1. 当有人查询这些键上的表时不区分大小写

  2. 以某种方式保留最初编写的大小写,以便应用程序可以呈现使用原始案例向用户提供数据

我还希望数据库模式与数据库无关这是可能的,因为应用程序代码不(或不应)从属于特定的 RDBMS。

另外值得注意的是,对数据库进行的绝大多数查询将由应用程序代码完成,而不是通过客户端直接访问表来完成。

在实现这个过程中,我遇到了很多烦人的问题。一是并非所有 RDBMS 都以相同的方式实现 COLLATE(区分大小写似乎可以在模式级别进行调整)。另一个问题是排序规则和区分大小写选项可以在多个级别(服务器、数据库、表(?)、列)设置,我无法向应用程序保证它将获得什么设置。另一个问题是 COLLATE 本身可能会变得很复杂,因为其中除了区分大小写之外还有很多其他内容(例如:unicode 选项)。

为了避免所有这些令人头疼的问题,我正在考虑通过为一条数据存储两列来完全避免这个问题。一列保留原始大小写,另一列由应用层改为小写。

例如:表中的两个字段

user_name = "fredflintstone" (a unique index on this one)
orig_name = "FredFlintstone" (just data... no constraints)

我认为这样做的优点和缺点是:

优点:

  1. 没有歧义 - 应用程序代码将管理大小写转换,我永远不需要担心当底层 RDBMS/设置更改时,单元测试“神秘地”失败。

  2. 索引上的搜索将是干净的,并且永远不会因为排序功能或对 LOWER() 或任何调用的调用而减慢(假设此类事情会减慢索引,这似乎合乎逻辑)

缺点:

  1. 加倍需要额外的存储空间data

  2. 看起来有点野蛮

我知道它会起作用,但同时它闻起来不对劲。

这样做是疯狂/毫无意义的吗?是否有一些我不知道的事情使得区分大小写问题不像我现在看来那么棘手?

I'm implementing a database where several tables have string data as candidate keys (eg: username) and will be correspondingly indexed. For these fields I want:

  1. Case insensitivity when someone queries the table on those keys

  2. The initially written case to be preserved somehow so that the application can present the data to the user with the original case used

I also want the database schema to be as database independent as possible, as the application code is (or should not be) not slaved to a particular RDBMS.

Also worth noting is that the vast majority of queries done on the database will be done by the application code, not via direct table access by the client.

In implementing this, I'm running into a lot of annoying issues. One is that not all RDBMS implement COLLATE (which is where cases sensitivity appears to be tunable at schema level) in the same way. Another issue is that the collation and case sensitivity options can be set at multiple levels (server, database, table (?), column) and I can't guarantee to the application what setting it will get. Yet another issue is that COLLATE itself can get hairy because there is a heck of a lot more in there than simply case sensitivity (eg: unicode options).

To avoid all of these headaches, what I'm considering is dodging the issue altogether by storing two columns for one piece of data. One column with the original case, another dropped to lower case by the application layer.

eg: Two of the fields in the table

user_name = "fredflintstone" (a unique index on this one)
orig_name = "FredFlintstone" (just data... no constraints)

The pros and cons of this as I see it are:

Pros:

  1. No ambiguity - the application code will manage the case conversions and I never need to worry about unit tests failing "mysteriously" when the underlying RDBMS/settings changes.

  2. Searches on the index will be clean and never be slowed down by collation features or calls to LOWER() or anything (assuming such things slow down the index, which seems logical)

Cons:

  1. Extra storage space required for the doubled-up data

  2. It seems a bit brutish

I know it will work, but at the same time it smells wrong.

Is it insane/pointless to do this? Is there something I don't know that makes the case sensitivity issue less tricky than it seems to me at the moment?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

仙女 2024-10-06 08:22:42

当然,这样的决定总是需要权衡,但我不认为这一定是“双倍数据”。将字符串转换为小写可能是一项不简单的操作,特别是当您超出 ASCII 范围时,因此字符串的小写版本不仅仅是“重复”。它与原始字符串有些相关,但仅此而已。

如果您将其视为将计算结果存储在数据库中的模拟,那么它会变得更自然。

UPPER(UserName) 上查询的选项是另一个很好的解决方案,它避免了第二列。但是,要使用它,您至少需要一个可靠的 UPPER 函数(特别是您可以控制它用于非 ASCII 字符的区域设置),并且可能还需要基于函数的索引以获得良好的性能。

Of course, decisions like this are always a trade-off, but I don't think this is necessarily "doubled-up data". Lowercasing a string can be a non-trivial operation, in particular if you go beyond ASCII, so the lowercased version of the string is not just "duplicate". It is somewhat related to the original string, but not more than that.

If you think of it as an analog to storing computed results in the DB, it becomes more natural.

The option of querying on UPPER(UserName) is another good solution, which avoids the second column. However, to use it you need at least a reliable UPPER function (where in particular you can control the locale that it uses for non-ASCII characters), and probably function-based indices for decent performance.

恏ㄋ傷疤忘ㄋ疼 2024-10-06 08:22:42

索引上的搜索将是干净的,并且永远不会因排序功能或对 LOWER() 或任何调用的调用而减慢(假设此类事情会减慢索引,这似乎是合乎逻辑的)

不,这不合逻辑。您可以在常量函数上建立索引。

create index users_name on users(name); -- index on name
create index users_name_lower on users(lower(name)); -- index on the function result

您的 RDBMS 应该足够聪明,知道在收到此查询时要使用users_name_lower

select * from users where lower(name) = ?

如果没有 users_name_lower,是的,那将必须遍历表。有了功能索引,它就可以做正确的事情。

Searches on the index will be clean and never be slowed down by collation features or calls to LOWER() or anything (assuming such things slow down the index, which seems logical)

No, that's not logical. You can have indexes on constant functions.

create index users_name on users(name); -- index on name
create index users_name_lower on users(lower(name)); -- index on the function result

Your RDBMS should be smart enough to know to use users_name_lower when it gets this query:

select * from users where lower(name) = ?

Without users_name_lower, yes, that would have to walk the table. With the functional index, it does the right thing.

请远离我 2024-10-06 08:22:42

由于性能原因,我经常看到以这种方式重复数据。它允许您保留原始大小写(您显然需要它,因为您并不总是能够猜测大小写应该是什么,例如您无法确定每个名称都以大写字母开头)。如果数据库不支持其他方法来执行此操作(功能索引),那么这是实用的,而不是疯狂的。您可以使用触发器保持数据一致。

I've often seen data duplicated in this way for performance reasons. It allows you to keep the original casing (which you'll obviously need as you're not always able to guess what the casing should be, you can't be sure that each name begins with a capital letter for example). If the database doesn't support other ways to do this (functional indexes), then this is practical, not crazy. You can keep the data consistent by using triggers.

空心空情空意 2024-10-06 08:22:42

建议您的搜索查询执行以下操作:

  • SELECT * FROM Users WHERE LOWER(UserName) = LOWER('fredFlinstone')
  • 在应忽略/尊重大小写敏感性时,在查询中显式包含 COLLATION 提示

I' d 考虑区分大小写的重复数据过于繁重。

Suggest your search queries do something like this:

  • SELECT * FROM Users WHERE LOWER(UserName) = LOWER('fredFlinstone')
  • explicitly include the COLLATION hint on the query when case sensitivity should be ignored/respected

I'd consider the duplication of data for case sensitivity too onerous.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文