设计基于 soundex 的搜索的最佳方法
我有一个论坛帖子表,并且希望改进基本搜索功能,因为我们有很多来自世界各地的用户,他们的母语不是英语,并且在拼写错误时很难找到结果。目前的论坛搜索是准确的。
这些设计中哪一个表现最好?假设数据库有50万条记录,并且经常使用搜索。理想情况下,我希望它搜索每条记录。
Design One
除了每个论坛帖子之外,我还存储 soundex_post,其中包含所有 soundex 数据。运行搜索时,它会对所有搜索词进行 soundexes,并对 soundex 字段执行 LIKE 操作。
设计二
我将其标准化。每个 soundex 代码都存储在一个新表 tblSoundexCodes 中。然后有一个表tblForumPostSoundexCodes:
ID|ID帖子 ID | SoundexCode_ID |计数
然后,当搜索 soundex 时,我只需提取所有 Post_ID,其中 SoundexCode_ID = n
我是否正确,方法二会快得多,但维护起来会困难得多(IE ,当人们编辑帖子时)。
I have a table of forum posts, and want to improve the basic search functionality as we get a lot of users from all over the world who are not native English speakers and will have trouble finding results when they spell incorrectly. The current forum search is exact.
Which of these designs will perform best? Assume the database has 500,000 records and the search is used frequently. Ideally I would like it to search every record.
Design One
Along side each forum post, I store soundex_post, which contains all the soundex data. When a search is run, it soundexes all search terms, and does a LIKE operation on the soundex fields.
Design Two
I normalise it. Every soundex code is stored in a new table, tblSoundexCodes. Then there is a table tblForumPostSoundexCodes:
ID | Post_ID | SoundexCode_ID | Count
Then when a soundex is searched for, I simply pull out all the Post_IDs where SoundexCode_ID = n
Am I correct that method two will be considerably faster, but that it will be a lot harder to maintain (IE, when people edit posts).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
设计二更好。
设计二不会更快。数据存储将更加紧凑,当有人编写或更新帖子时,您将必须更新或插入一行到 tblForumPostSoundexCodes,以及插入一行到 tblSoundexCodes。
您必须验证该 soundex 事务处理是否针对帖子的每次更改(创建、更新、删除)进行。
Design Two is better.
Design two won't be faster. The data storage will be more compact, and you will have to update or insert a row into tblForumPostSoundexCodes, as well as insert a row into tblSoundexCodes, when someone writes or updates a post.
You'll have to verify that this soundex transaction processing takes place for every change to a post (create, update, delete).
我对你的想法有些怀疑。
让我们对以下评论之一做出您自己的回答:
“@Frank 是 MSSQL,可能只会是 Express 版本,我使用 soundex 作为示例,我可能会实现更新的算法”
这个简单的评论有 15单词(我丢弃了那些少于 4 个字符的单词)。
因此,您可能会有 15 条不同的记录指向此处,其值为“1”。
如果我(母语非英语)决定搜索(“esample”和“dait”)怎么办?你们会提供搜索多个单词的功能吗?与和-?或者-?不是-?
按原样运行搜索并提供“也许您正在寻找‘示例和日期’”不是更好吗?像谷歌或维基百科那样?
I have some doubts about your idea.
Let's take your own answer to one of the comments:
"@Frank is MSSQL, probably only going to be express edition, and I used soundex as an example I'll probably be implementing a more up to date algorithm"
This simple comments has 15 words (I discarded those shorter than 4 characters).
So you will have potentially 15 different records pointing here with value "1".
What if I, a non-native English speaker, decide to search for ("esample" and "dait")? Will you offer the ability to search for more than one word? With and-? or-? not-?
Wouldn't it be better to run the search as-is and provide a "maybe you are looking for 'Example and Date'?" like Google or Wikipedia do?