Database normalization with user input
I am developing a MySQL database that will contain the country, city, and occupation of each user.
While I can use a "country" table and store the country's id in the user table, I still need a good approach for the other two fields.
The problem is that the city and occupation of each user are taken from a free-text input field, so for a single town users can type "NYC", "New York", "New York City", or any number of other variants.
Is it a good idea to disregard this issue, create a separate "town" table containing every town entered by users, and put the id of the town entry into the user table? Or would it be more appropriate to use a VARCHAR column "town" in the user table and not normalize this relation at all?
I want to display the data from the three tables on user profile pages.
I am concerned about normalization because I don't want too much redundant data in my database: it consumes a lot of space, and, as far as I know, queries will be slower if I use a VARCHAR index instead of an integer index.
Thanks
Comments (3)
We had this problem. Our solution was to collect the various synonyms and misspelled versions that people use and explicitly map them to a known canonical city name. This allowed us to correctly guess the name from user input in 99% of cases.
For the remaining 1%, we created a new city entry and marked it as non-canonical. Periodically we looked through the non-canonical entries. For recognizable known cities, we remapped the non-canonical entry to the canonical one (updating the FKs of linked records and adding a synonym). For a genuinely new city name that we didn't know about, we kept the created entry and made it canonical.
So we had something like this:
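A minimal sketch of this kind of setup, not the answerer's actual schema: the table names (`city`, `city_synonym`, `users`), the `is_canonical` flag, and the helpers `resolve_city` and `remap` are all assumptions made for illustration. SQLite (via Python's standard `sqlite3` module) stands in for MySQL so the example runs as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    is_canonical INTEGER NOT NULL DEFAULT 0   -- 1 = reviewed, known city
);
CREATE TABLE city_synonym (
    synonym TEXT PRIMARY KEY,                 -- stored lower-cased
    city_id INTEGER NOT NULL REFERENCES city(id)
);
CREATE TABLE users (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    city_id INTEGER REFERENCES city(id)
);
""")

# Seed one canonical city and the spellings people are known to use for it.
conn.execute("INSERT INTO city (id, name, is_canonical) VALUES (1, 'New York', 1)")
conn.executemany(
    "INSERT INTO city_synonym (synonym, city_id) VALUES (?, 1)",
    [("new york",), ("nyc",), ("new york city",)],
)

def resolve_city(raw):
    """Return the city id for free-text input, creating a
    non-canonical entry when the spelling is unknown."""
    cleaned = " ".join(raw.split())           # trim and collapse whitespace
    key = cleaned.lower()
    row = conn.execute(
        "SELECT city_id FROM city_synonym WHERE synonym = ?", (key,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO city (name, is_canonical) VALUES (?, 0)", (cleaned,)
    )
    conn.execute(
        "INSERT INTO city_synonym (synonym, city_id) VALUES (?, ?)",
        (key, cur.lastrowid),
    )
    return cur.lastrowid

def remap(non_canonical_id, canonical_id):
    """Periodic review step: fold a non-canonical entry into a canonical one."""
    conn.execute("UPDATE users SET city_id = ? WHERE city_id = ?",
                 (canonical_id, non_canonical_id))
    # The unknown spelling is kept as a synonym of the canonical city.
    conn.execute("UPDATE city_synonym SET city_id = ? WHERE city_id = ?",
                 (canonical_id, non_canonical_id))
    conn.execute("DELETE FROM city WHERE id = ? AND is_canonical = 0",
                 (non_canonical_id,))
```

With this layout the user row only ever holds an integer key, and a reviewed typo keeps resolving to the right city because its spelling stays behind as a synonym.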
Usually, data normalization helps you work with data and keep it simple. If a normalized schema does not fit your needs, you can use denormalized data as well, so it depends on the queries you want to run.
There is no good way to group cities without creating a separate table where you keep all the names for each city under a single id. So it would be good to have three tables: user(user_id, city_id), city(city_id, correct name), city_alias(alias_id, city_id, name).
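As a rough illustration of the three-table layout, here is a sketch using Python's standard `sqlite3` module in place of MySQL. The table is named `users` rather than `user`, and the `username` column, the sample rows, and the helper functions are assumptions added for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE city (
    city_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL                -- the correct name
);
CREATE TABLE city_alias (
    alias_id INTEGER PRIMARY KEY,
    city_id  INTEGER NOT NULL REFERENCES city(city_id),
    name     TEXT NOT NULL UNIQUE        -- one spelling users may type
);
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    city_id  INTEGER REFERENCES city(city_id)
);
INSERT INTO city (city_id, name) VALUES (1, 'New York'), (2, 'Boston');
INSERT INTO city_alias (city_id, name) VALUES
    (1, 'NYC'), (1, 'New York'), (1, 'New York City'), (2, 'Boston');
""")

def add_user(username, typed_city):
    """Store the user with the city_id matching the typed alias (or NULL)."""
    row = db.execute(
        "SELECT city_id FROM city_alias WHERE name = ? COLLATE NOCASE",
        (typed_city.strip(),),
    ).fetchone()
    db.execute(
        "INSERT INTO users (username, city_id) VALUES (?, ?)",
        (username, row[0] if row else None),
    )

def profile(username):
    """Data for a profile page: the user plus the correct city name."""
    return db.execute(
        """SELECT u.username, c.name
           FROM users u LEFT JOIN city c ON c.city_id = u.city_id
           WHERE u.username = ?""",
        (username,),
    ).fetchone()

def users_per_city():
    """Group users by city regardless of which alias they typed."""
    return db.execute(
        """SELECT c.name, COUNT(u.user_id)
           FROM city c LEFT JOIN users u ON u.city_id = c.city_id
           GROUP BY c.city_id, c.name
           ORDER BY c.name"""
    ).fetchall()

for name, typed in [("ann", "nyc"), ("bob", "New York City"), ("cid", "Boston")]:
    add_user(name, typed)
```

Because grouping happens on `city_id`, "nyc" and "New York City" land in the same bucket, which a plain VARCHAR column on the user table could not guarantee.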
It would be better to store the data in a normalized design containing the actual, government-recognized city names.
@Varela's suggestion of an 'alias' for the city would probably work well in this situation. But you have to return a message along the lines of "You typed in 'Now Yerk'. Did you perhaps mean 'New York'?". Actually, you want to get these kinds of corrections regardless...
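One way such a "did you mean" prompt might be produced is with fuzzy string matching. The sketch below uses `difflib.get_close_matches` from Python's standard library; the `KNOWN_CITIES` list, the `suggest_city` helper, and the 0.6 cutoff are assumptions for illustration, and a real system would draw candidates from the city or alias table.

```python
import difflib

# Hypothetical sample of canonical names; in practice, load these from the database.
KNOWN_CITIES = ["New York", "Newark", "Boston", "Los Angeles"]

def suggest_city(typed, cutoff=0.6):
    """Return the closest known city name, or None if nothing is similar enough."""
    by_lower = {city.lower(): city for city in KNOWN_CITIES}
    matches = difflib.get_close_matches(
        typed.strip().lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None

typed = "Now Yerk"
guess = suggest_city(typed)
if guess is not None and guess.lower() != typed.strip().lower():
    print(f"You typed in '{typed}'. Did you perhaps mean '{guess}'?")
```

Once the user confirms the suggestion, the typed spelling can be saved as a new alias so the same typo resolves directly next time.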
Of course, what you should probably actually store isn't the city, but the postal/zip code. Table design is along these lines:
And then store a reference to Zip_Code.Id whenever you have an address. You want to know exactly which zip code a user has (claimed) to be a part of. Reasons include:
There are a number of other things that can be done about address verification, including geo-location, but this is a basic design that should help you with most of your needs (and prevent most of the possible 'invalid' anomalies).
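The zip-code idea can be sketched as follows. Only the `Zip_Code.Id` reference comes from the answer; every other column, the `Address` table, the sample rows, and the `add_address` helper are assumptions for illustration, and SQLite (via Python's `sqlite3`) stands in for MySQL. The sketch also assumes one city per postal code, which does not hold everywhere.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")   # SQLite leaves FK checks off by default
db.executescript("""
CREATE TABLE Zip_Code (
    Id           INTEGER PRIMARY KEY,
    Country_Code TEXT NOT NULL,
    Code         TEXT NOT NULL,          -- text, so leading zeros survive
    City_Name    TEXT NOT NULL,
    UNIQUE (Country_Code, Code)
);
CREATE TABLE Address (
    Id          INTEGER PRIMARY KEY,
    User_Name   TEXT NOT NULL,
    Street      TEXT,
    Zip_Code_Id INTEGER NOT NULL REFERENCES Zip_Code(Id)
);
INSERT INTO Zip_Code (Country_Code, Code, City_Name) VALUES
    ('US', '10001', 'New York'),
    ('US', '02108', 'Boston');
""")

def add_address(user_name, street, country_code, code):
    """Store an address only if the postal code is known; return the address id."""
    row = db.execute(
        "SELECT Id FROM Zip_Code WHERE Country_Code = ? AND Code = ?",
        (country_code, code.strip()),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown postal code: {code!r}")
    cur = db.execute(
        "INSERT INTO Address (User_Name, Street, Zip_Code_Id) VALUES (?, ?, ?)",
        (user_name, street, row[0]),
    )
    return cur.lastrowid

def city_of(address_id):
    """The city is derived from the zip code, never typed by the user."""
    return db.execute(
        """SELECT z.City_Name
           FROM Address a JOIN Zip_Code z ON z.Id = a.Zip_Code_Id
           WHERE a.Id = ?""",
        (address_id,),
    ).fetchone()[0]
```

Here the city name never passes through a free-text field, so spelling variants cannot enter the user data in the first place.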