Populating a foreign key table with variable user input
I'm working on a website that will be based on user-contributed data,
submitted using a regular HTML form.
To simplify my question, let's say that there will be two fields in
the form: "User Name" and "Country" (this is just an example, not the
actual site).
There will be two tables in the database: "countries" and "users",
with "users.country_id" being a foreign key to the "countries" table
(one-to-many).
The initial database will be empty. Users from all over the world will
submit their names and the countries they live in and eventually the
"countries" table will get filled out with all of the country names in
the world.
Since one country can have several alternative names, input like
Chile, Chili, Chilli will generate 3 different records in the
countries table, but in fact there is only one country.
When I search for records from Chile, the records stored under Chili and Chilli will not be included.
So my question is - what would be the best way to deal with a
situation like this, with conditions such that the initial database is
empty, no other resources are available and everything is based on
user input?
How can I organize it in such a way that Chile, Chili and Chilli would
be treated as one country, with minimal manual intervention?
What are the best practices when it comes to normalizing
user-submitted data, and is there a scientific term for this? I'm sure
this is a common problem.
Again, I used country names just to simplify my question; it can be
anything that has several possible spellings.
Comments (3)
Search engines, whether file-based (Lucene, Sphinx, etc.) or database-based (Oracle Text, MSSQL Fulltext), solve this problem with a thesaurus. That is, they group words together on the basis of their being synonyms. The qualification for being a synonym is tighter than in Roget's book, but the principle is the same. Synonyms bundle up abbreviations, acronyms and common misspellings. So, for instance, a search thesaurus might identify street and st as being the same thing. Context is everything, though: in the string "St Pancras Road", st is a synonym for saint.
So, does this help you at all? Up to a point. It suggests the sort of thing you want to implement:
The unfortunate thing is that building and maintaining a thesaurus requires human ingenuity and effort. Building a taxonomy requires expertise; tracking new additions requires time. The other thing is that even with a thesaurus the matches remain probabilistic: MoMA might be the same as Museum of Modern Art, but is it the same as SFMOMA or NYMOMA? Not exactly, but maybe 90% the same?
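To make the thesaurus idea concrete for the tables in the question, here is a minimal sketch using Python's built-in `sqlite3`. It is an illustration under assumptions, not a description of what any search engine does internally: the `country_aliases` table and the helpers `normalize`, `resolve_country` and `merge_countries` are hypothetical names introduced here. Every new spelling still creates its own country row; a human later decides that two rows are the same, and from then on both spellings resolve to one record.

```python
import sqlite3

# Hypothetical schema: "countries" and "users" follow the question,
# "country_aliases" is an assumed addition that plays the role of a thesaurus.
SCHEMA = """
CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE country_aliases (
    alias TEXT PRIMARY KEY,  -- normalized spelling as typed by users
    country_id INTEGER NOT NULL REFERENCES countries(id)
);
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    country_id INTEGER NOT NULL REFERENCES countries(id)
);
"""


def normalize(raw):
    """Collapse whitespace and ignore case so ' CHILE ' and 'Chile' match."""
    return " ".join(raw.split()).casefold()


def resolve_country(conn, raw):
    """Return the country id for a spelling, creating a new country if unseen."""
    key = normalize(raw)
    row = conn.execute(
        "SELECT country_id FROM country_aliases WHERE alias = ?", (key,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO countries (name) VALUES (?)", (" ".join(raw.split()),)
    )
    conn.execute(
        "INSERT INTO country_aliases (alias, country_id) VALUES (?, ?)",
        (key, cur.lastrowid),
    )
    return cur.lastrowid


def merge_countries(conn, duplicate_id, canonical_id):
    """Apply a human decision that two country rows are the same country."""
    conn.execute(
        "UPDATE users SET country_id = ? WHERE country_id = ?",
        (canonical_id, duplicate_id),
    )
    conn.execute(
        "UPDATE country_aliases SET country_id = ? WHERE country_id = ?",
        (canonical_id, duplicate_id),
    )
    conn.execute("DELETE FROM countries WHERE id = ?", (duplicate_id,))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
chile = resolve_country(conn, "Chile")
chili = resolve_country(conn, "Chili")
merge_countries(conn, chili, chile)  # the manual step: someone decided these match
```

The manual effort does not go away, but each decision is made once: after the merge, any later submission of "Chili" resolves straight to the Chile row.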
An alternative approach would be to do what SO does with tags. When you tagged your question a dropdown box appeared, suggesting available tags. As you typed more letters the list narrowed. This isn't foolproof (witness the presence of tags like tsql and t-sql), but it is pretty good. SO also has a backup, which is to provide the power users with a list of freshly minted tags so they can investigate these coinages and perhaps quash them. But that still remains a manual process. Alas, there is no algorithm that is going to be able to tell that MoMA is the same as Museum of Modern Art, let alone figure out whether it references the institution in New York or San Francisco.
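The tag-style dropdown can be sketched roughly as follows. This is an assumed, simplified version and not SO's actual implementation: the function `suggest` and the usage-count dictionary are hypothetical. It matches what the user has typed so far against the names already in the table and lists the most-used names first, which nudges people toward the spelling the majority has already chosen.

```python
def suggest(existing, typed, limit=5):
    """Suggest existing names that start with what the user has typed.

    existing: dict mapping a stored name to how many records use it
              (an assumed structure; in practice this would be a query).
    """
    prefix = typed.strip().casefold()
    if not prefix:
        return []
    matches = [name for name in existing if name.casefold().startswith(prefix)]
    # Most-used names first, then alphabetical, so popular spellings win.
    matches.sort(key=lambda name: (-existing[name], name.casefold()))
    return matches[:limit]


counts = {"Chile": 12, "Chili": 1, "China": 30, "Chad": 4}
print(suggest(counts, "chi"))   # ['China', 'Chile', 'Chili']
print(suggest(counts, "chil"))  # ['Chile', 'Chili']
```

A user who sees "Chile" at the top of the list with far more uses is less likely to add a fourth spelling, though nothing here prevents it.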
I would suggest using a dropdown list for the country, which you can easily populate using JavaScript. You can find a list of all countries here: http://openconcept.ca/blog/mgifford/text_list_all_countries
As to your normalization question, I don't see any problem with the design in your example.
You cannot programmatically determine that Chile should be the same as Chili, which is the same as Chilli. With the country example, you can enter a list of countries into your db and offer a dropdown list that users select from.
If the data is all user-entered, you can only match entries that are exactly the same, because only then can you be sure they mean the same thing.
You might come up with an algorithm that associates words that are alike, but IMHO that is just asking for nondeterministic results (disaster). For example (using a different example from your country one), you might programmatically determine that the words fight and sight differ by only one letter and are therefore alike. But are they really? Just because two words are syntactically close does not mean they are semantically close too, and I am guessing that is what you are after.
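The "off by one letter" comparison is usually measured with edit (Levenshtein) distance. The sketch below is a plain textbook implementation, with `edit_distance` as an illustrative name. It shows the point being made: fight/sight and Chile/Chili get the same score, so the number alone cannot tell a misspelling from a different word.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("fight", "sight"))   # 1
print(edit_distance("chile", "chili"))   # 1
print(edit_distance("chile", "chilli"))  # 2
```

If such a score is used at all, it is safer as a way to flag candidate duplicates for a person to review than as a rule that merges records automatically.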