Is there a way to create a "signature" of a string to ensure the string is unique? Or should I just use a unique database index?
I'm building a website. Users can submit a "Title", which is a string of Unicode characters (not just English).
When a user submits a "Title", I want to see if it's already in the database (MySQL). If it is, I'd just update the existing record. If it's a new "Title", I'd create a new record for it.
I guess the standard way to test for uniqueness is to just create an index on the "Title" column. But I'm concerned about the size of such an index, because a "Title" could be quite long.
So I'm wondering if there's a way to create a "signature" of a "Title" and use that to test for uniqueness. Is there some hash function that would map a Unicode string to a unique value?
Any pointers will be greatly appreciated. Thanks.
Comments (1)
The simple answer is to use one of the MySQL hash functions (MD5 or SHA1) to create a hash of each title and store it alongside the title itself.
You can then index the hash column instead of the title. Because the hash has a short, fixed length, the index is smaller and faster.
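As a rough sketch of that idea, the Python below stores a SHA-1 of each title in its own column and puts the unique index on that column only. It uses the standard-library `sqlite3` module as a stand-in for MySQL so the example runs on its own; the table name `titles`, the `hits` counter, and the helper names are assumptions made for illustration, not anything from the question.

```python
import hashlib
import sqlite3


def title_hash(title: str) -> str:
    """Hex SHA-1 of the UTF-8 encoded title: always 40 characters long."""
    return hashlib.sha1(title.encode("utf-8")).hexdigest()


def open_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for MySQL here (an assumption of this sketch).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE titles ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " title_hash TEXT NOT NULL,"
        " hits INTEGER NOT NULL DEFAULT 1)"
    )
    # The unique index covers the short fixed-length hash, not the long title.
    conn.execute("CREATE UNIQUE INDEX idx_title_hash ON titles (title_hash)")
    return conn


def submit(conn: sqlite3.Connection, title: str) -> str:
    """Create a record for a new title, or update the existing one."""
    h = title_hash(title)
    row = conn.execute(
        "SELECT id, title FROM titles WHERE title_hash = ?", (h,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO titles (title, title_hash) VALUES (?, ?)", (title, h)
        )
        return "created"
    if row[1] != title:
        # Two different titles with the same hash: very unlikely, but possible.
        raise ValueError("hash collision between different titles")
    conn.execute("UPDATE titles SET hits = hits + 1 WHERE id = ?", (row[0],))
    return "updated"
```

Comparing the stored title after a hash match is a cheap safeguard, since no hash can strictly guarantee a unique value for every possible string.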
These are basically cryptographic functions and use more CPU than a plain checksum, but your language environment might provide a simpler, faster hash such as crc32. Bear in mind that a 32-bit checksum collides far more often, so compare the full title whenever the hashes match.
It's also worth putting your "Title" through a cleanup before hashing, i.e. coerce multiple spaces to a single space, fold all characters to lower case, remove punctuation, etc.
That way "STACKOVERFLOW IS GREAT ...... " and "stackoverflow is great" result in the same hash.
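The cleanup step might look something like the following minimal sketch. The function names are made up for illustration, and applying Unicode NFKC normalization is an extra assumption on top of the steps listed above; it seemed reasonable because the titles are not limited to English.

```python
import hashlib
import unicodedata


def normalize_title(title: str) -> str:
    """Fold case, strip punctuation, and collapse whitespace."""
    # NFKC merges compatibility forms such as full-width letters (assumption).
    text = unicodedata.normalize("NFKC", title).casefold()
    # Drop characters whose Unicode category starts with "P" (punctuation).
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # split() with no arguments breaks on any whitespace run and trims the ends.
    return " ".join(text.split())


def title_signature(title: str) -> str:
    """SHA-1 hex digest of the normalized title, encoded as UTF-8."""
    return hashlib.sha1(normalize_title(title).encode("utf-8")).hexdigest()
```

With this, `title_signature("STACKOVERFLOW IS GREAT ...... ")` and `title_signature("stackoverflow is great")` return the same digest. How aggressive the cleanup should be is a design choice: stripping punctuation also merges titles that differ only by punctuation, which may or may not be what you want.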