如何解决MySQL存储人名的困境,同时保持可区分性和相似名字的搜索?
我被授予了一项美丽的任务;-) 在 MySQL 数据库中设计一些应该保存人名的表。
标准:
- 我只有全名。 (没有分隔,例如名字、姓氏等)
存储应该区分变音符号。 (以下名称代表不同人)
- “Voss”和“Voß”。
- “乔尔”和“乔尔”。
- “法郎”、“法郎”和“法郎”。
搜索应返回与搜索字符串相似的所有名称:例如:搜索“franc”应返回 [“franc”、“Franc”、“Fránc”] 等等...(如果搜索不仅会返回不区分变音符号的匹配项,而且可能还会返回发音相似的名称或部分与搜索字符串匹配的名称...)
我想对将在其中存储名称的列(声明为unique
)使用COLLATION utf8_bin
。这可以满足第二点。但这会损害第三点。使用 collation utf8_unicode_ci
将列名声明为 unique
可以满足第 3 点,但它会损害第 2 点。
所以我的问题是:有没有办法解决这个任务并尊重所有标准?因为我不想重新发明轮子:是否有一种优雅的方式来处理数据库中的人名(及其搜索)? (遗憾的是,我无法将名称分为名字、姓氏和可选的中间名...)
编辑:
名称数量约为一百万(~1.000.000)个条目。如果重要的话:我使用 python 作为脚本语言来填充数据库并稍后查询数据。
I was granted with the beautiful task ;-) to design some tables in a MySQL Database which should hold human names.
Criteria:
- I have only the full names. (There is no separation for e.g. prename, surname and so on)
The storage should be diacritic sensitive. (The following names stand for different persons)
- "Voss" and "Voß".
- "Joel" and "Joël".
- "franc" and "Franc" and "Fránc".
A search should return all similar names to the search string: E.g: Search for "franc" should return ["franc", "Franc", "Fránc"] and so on... (It would be awesome if the search would return not only the diacritice insensitive matches but perhaps similar sounding names or names that match in parts to the search string, too...)
I thougt of using the COLLATION utf8_bin
for the column (declared as unique
) in which I will store the names. This would satisfy point 2. But this will hurt point three. Declaring the column name as unique
with collation utf8_unicode_ci
satisfys point 3. but it hurts point two.
So my question is: Is there a way to solve this task and respecting all criteria? And since I don't want to reinvent the wheel: Is there an elegant way to handle human names (and their searches) in databases? (Sadly, I do not have the possibility of splitting the names into prename, surnames and optional middlenames...)
Edit:
The amount of names is arount a million (~1.000.000) entrys. And if it matters: I am using python as scripting language to populate the database and query the data later on.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
有用的是,如果您可以将全名分解为组件“名称词”,并为每个组件存储一个语音编码(变音位或许多其他选择之一)。不过,您只需要名称词的概念,而不是专门将其分类为第一个、中间或最后一个,这很好,因为这些类别无论如何都不能很好地跨文化工作)。但是,如果您愿意,您可以稍后在排名中使用位置顺序信息,以便搜索“Paul Carl”比匹配“Carl Paul”更好地匹配“Paul Karl”。您需要注意不明确的标点符号,这可能需要存储某些名称单词的多个版本。例如,Bre-Anna Heim 将被分解为名字单词“bre”、“anna”、“breanna”和“heim”。有时破折号是无关紧要的,就像 Bre-Anna 一样,但有时不像 Sally-June 那样。Bre-Anna 从来不会只使用 Bre 或 Anna,但 Sally-June 有时可能只使用 Sally 或 June。很难知道是哪一个,所以 。
您可以通过类似地分解和语音编码您正在搜索的全名来编写您的查询,例如,您的查询可以返回具有两个或多个组件名称语音匹配的全名(或者如果有的话) 是搜索或源中只有一个名称)。这为您提供了一个完整名称的子集,您可以对它们进行简单的排名,甚至可以对该子集执行诸如距离匹配算法之类的操作。当我说距离匹配时,我指的是 Levenshtein 距离等在线算法
(编辑),其原因是处理类似以下名称的情况:Maria de los。安吉利斯·戈麦斯-罗德里格斯。一名数据输入人员可能只需输入 Maria Gomez。另一个可能是玛丽亚·戈麦斯·罗德里格斯。另一个可能是玛丽亚·安吉利斯·罗德里格斯(Maria Angeles Rodrigus)。
What is useful is if you can decompose the full name into component "name words" and store a phonetic encoding (metaphone or one of the many other choices) for each of them. You just need the notion of name words though, not specifically categorizing it as first or middle or last, which is fine because those categories don't work well across cultures anyway). But you can use positional order information later in ranking if you want so that searching for "Paul Carl" matches "Paul Karl" better than matching "Carl Paul". You need to be aware of ambiguous punctuation that may require storing multiple versions of some name words. For instance Bre-Anna Heim would be broken into the name words "bre" "anna" "breanna" and "heim". Sometimes the dash is irrelevant like Bre-Anna, but sometimes not like in Sally-June". Bre-Anna never uses just Bre or Anna, but Sally-June may just use Sally or just June sometimes. It's hard to know which, so cover both possibilities.
You can write your query against this by similarly decomposing and phonetically encoding the full name you're searching for. Your query can return, say, those full names that have two or more component name phonetic matches (or one if there is only one name in the search or the source). This gives you a subset of full names to consider further. You could come up with a simple ranking of them, or even do something like a distance matching algorithm on this subset, which would be too expensive computationally to do against the entire million names. When I say distance matching, I'm talking on-line algorithms like Levenshtein distance and the like.
(edit) The reasoning for this is handling cases like the following name: Maria de los Angeles Gomez-Rodriguez. One data entry person may just enter Maria Gomez. Another might enter Maria Gomez Rodriguez. Yet another might enter Maria Angeles Rodrigus.
您可以在另一列中使用诸如 Metaphone(或 Double Metaphone)之类的算法,以便您可以尝试找到彼此“相似”的名字。您必须寻找了解德语 esset 字符的国际版本。
You can use an algorithm like Metaphone (or Double Metaphone) in another column so that you can try to find names that are "similar" to each other. You will have to look for an international version that knows about the german esset character.