Populating a foreign key table with variable user input

Posted on 2024-08-31 00:34:44

I am developing a website built on user-contributed data, submitted through a regular HTML form.

To simplify my question, assume the form has two fields: "username" and "country" (this is just an example, not the actual site).

The database has two tables, "countries" and "users". "users.country_id" is a foreign key to the "countries" table (one-to-many).

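The database starts out empty. Users from around the world will submit their name and the country they live in, and eventually the "countries" table will contain the names of all the countries in the world.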

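Because a country can have several alternative names, inputs such as Chile, Chili and Chilli will produce three different records in the countries table, even though there is really only one country. When I search for records from Chile, the ones entered as Chili and Chilli will not be included.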

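So my question is: what is the best way to handle a situation like this, given that the database is initially empty, no other resources are available, and everything is based on user input?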

How can I organize it so that Chile, Chili and Chilli are treated as one country, with minimal human intervention?

What is the best practice for normalizing user-submitted data? Is there a scientific term for it? I'm sure this is a common problem.

Again, I am using country names only to simplify my question; it could be anything that may have different spellings.

柠檬色的秋千 2024-09-07 00:34:44

Search engines, whether file-based (Lucene, Sphinx, etc.) or database-based (Oracle Text, MSSQL Fulltext), solve this problem with a thesaurus. That is, they group words together on the basis of their being synonyms. The qualification for being a synonym is tighter than in Roget's book, but the principle is the same. Synonyms bundle up abbreviations, acronyms and common misspellings. So, for instance, a search thesaurus might identify street and st as being the same thing. Context is everything, though: in the string "St Pancras Road", st is a synonym for saint.

So, does this help you at all? Up to a point. It suggests the sort of thing you want to implement:

string      | canonical
------------+----------
street      |
st          | street
strete      | street
Chile       |
chilly      | Chile
chili       | Chile
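
The mapping above could be stored next to the question's two tables as a third, alias table. The following is only a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database, and the `country_aliases` table and the helpers `resolve_country`, `merge_alias` and `add_user` are hypothetical names invented for illustration, not part of the original question or of any library.

```python
import sqlite3

# In-memory database for illustration only (assumption: SQLite is an acceptable stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE countries (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE          -- canonical spelling
);
CREATE TABLE country_aliases (         -- hypothetical "string -> canonical" table
    alias      TEXT PRIMARY KEY,       -- lower-cased variant spelling
    country_id INTEGER NOT NULL REFERENCES countries(id)
);
CREATE TABLE users (
    id         INTEGER PRIMARY KEY,
    username   TEXT NOT NULL,
    country_id INTEGER NOT NULL REFERENCES countries(id)
);
""")

def _lookup(key):
    """Return the country id an alias points at, or None if the alias is unknown."""
    row = conn.execute(
        "SELECT country_id FROM country_aliases WHERE alias = ?", (key,)
    ).fetchone()
    return row[0] if row else None

def resolve_country(raw):
    """Map raw user input to a country id, creating a new country if unseen."""
    key = raw.strip().lower()
    found = _lookup(key)
    if found is not None:
        return found
    cur = conn.execute("INSERT INTO countries (name) VALUES (?)", (raw.strip(),))
    conn.execute(
        "INSERT INTO country_aliases (alias, country_id) VALUES (?, ?)",
        (key, cur.lastrowid),
    )
    return cur.lastrowid

def merge_alias(variant, canonical):
    """Manual curation step: declare that `variant` means `canonical`."""
    target = resolve_country(canonical)
    key = variant.strip().lower()
    old = _lookup(key)
    if old is None:
        conn.execute(
            "INSERT INTO country_aliases (alias, country_id) VALUES (?, ?)",
            (key, target),
        )
    elif old != target:
        # Re-point users and aliases first, then drop the stray country row.
        conn.execute("UPDATE users SET country_id = ? WHERE country_id = ?", (target, old))
        conn.execute(
            "UPDATE country_aliases SET country_id = ? WHERE country_id = ?", (target, old)
        )
        conn.execute("DELETE FROM countries WHERE id = ?", (old,))

def add_user(username, country):
    """Insert a user, resolving the free-text country through the alias table."""
    conn.execute(
        "INSERT INTO users (username, country_id) VALUES (?, ?)",
        (username, resolve_country(country)),
    )
```

With this layout, a moderator who calls `merge_alias("Chili", "Chile")` once folds existing and future "Chili" submissions into Chile. The decision that the two strings mean the same thing is still made by a person; the code only records it.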

The unfortunate thing is that building and maintaining a thesaurus requires human ingenuity and effort. Building a taxonomy requires expertise; tracking new additions requires time. The other thing is that even with a thesaurus the matches remain probabilistic: MoMA might be the same as Museum of Modern Art, but is it the same as SFMOMA or NYMOMA? Not exactly, but maybe 90% the same?

An alternative approach would be to do what SO does with tags. When you tagged your question, a dropdown box appeared, suggesting available tags. As you typed more letters, the list narrowed. This isn't foolproof (witness the presence of tags like tsql and t-sql), but it is pretty good. SO also has a backup, which is to provide the power users with a list of freshly minted tags so they can investigate these coinages and perhaps quash them. But that still remains a manual process.
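
The narrowing dropdown can be approximated on the server side with a simple prefix filter over the values already stored. This is a rough sketch, not how SO implements it; `suggest` and the `known` list are hypothetical names used only for illustration.

```python
def suggest(prefix, known, limit=5):
    """Return up to `limit` known names that start with `prefix`, ignoring case."""
    p = prefix.strip().casefold()
    if not p:
        return []
    return sorted(n for n in known if n.casefold().startswith(p))[:limit]

# Values that earlier users have already submitted (made-up sample data).
known = ["Chile", "China", "Chad", "Canada"]

print(suggest("ch", known))     # ['Chad', 'Chile', 'China']
print(suggest("chi", known))    # ['Chile', 'China']
print(suggest("chill", known))  # []
```

The last call shows the same weakness as the tsql and t-sql tags: a user who types "Chilli" gets no suggestion and can still create a new entry, so a review step for new values remains useful.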

Alas, there is no algorithm that is going to be able to tell that MoMA is the same as Museum of Modern Art, let alone figure out whether it references the institution in New York or San Francisco.

落墨 2024-09-07 00:34:44

I would say use a dropdown list of countries, which you can easily populate with JavaScript. You can find a list of all countries here: http://openconcept.ca/blog/mgifford/text_list_all_countries

As for your normalization question, based on your example I don't see anything wrong with the design.

或十年 2024-09-07 00:34:44

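You can't programmatically determine that Chile should be the same as Chili, and that Chili is the same as Chilli. With the country example, you could have a list of countries loaded into the database and a dropdown list that users select from.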

If the data is all user-entered, you can only match values that are exactly identical and therefore mean the same thing.

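You could probably come up with an algorithm to relate similar words, but IMHO that is just asking for uncertain results (disaster). For example (using a different example from your countries), you could programmatically determine that the words fight and sight differ by only one letter and are therefore similar. But are they really? Just because two words are syntactically close doesn't mean they are semantically close, which I'm guessing is what you're after.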
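
To make the fight/sight point concrete, here is a small sketch of one common similarity measure, Levenshtein edit distance. The answer does not name a specific algorithm, so this choice is an assumption, and the `levenshtein` function is written out here for illustration rather than taken from a library.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]

print(levenshtein("fight", "sight"))   # 1
print(levenshtein("chili", "chile"))   # 1
print(levenshtein("chilli", "chile"))  # 2
```

The metric scores fight/sight and chili/chile identically, so a distance threshold alone cannot tell a misspelling from a different word. At most it can flag candidate pairs for a person to review.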