编写帖子搜索算法
我正在尝试编写一个免费文本搜索算法,用于查找墙上的特定帖子(与 Facebook 使用的类似墙)。用户应该能够在搜索字段中写下一些单词,并在包含这些单词的帖子上获得点击;最佳匹配位于顶部,然后其他帖子根据匹配分数按降序排列。
我使用编辑距离(Levenshtein)“e(x, y) = e”来计算每个帖子与查询词“x”和帖子词“y”相比的得分,根据:得分(x,y) ) = 2^(2 - e)(1 - min(e, |x|) / |x|),其中“|x|”是查询词中的字母数。
帖子中的每个单词都会影响该特定帖子的总分。当帖子大小大致相同时,这种方法似乎效果很好,但有时某些大型帖子仅凭借其中包含大量单词而获得分数,而实际上与查询无关。
我是否以错误的方式处理这个问题,或者是否有某种方法可以标准化我没有想到的分数?
I'm trying to write a free text search algorithm for finding specific posts on a wall (similar kind of wall as Facebook uses). A user is suppose to be able to write some words in a search field and get hits on posts that contain the words; with the best match on top and then other posts in decreasing order according to match score.
I'm using the edit distance (Levenshtein) "e(x, y) = e" to calculate the score for each post when compared to the query word "x" and post word "y" according to: score(x, y) = 2^(2 - e)(1 - min(e, |x|) / |x|), where "|x|" is the number of letters in the query word.
Each word in a post contributes to the total score for that specific post. This approach seems to work well when the posts are of roughly the same size, but sometime certain large posts manages to rack up score solely on having a lot of words in them while in practice not being relevant to the query.
Am I approaching this problem in the wrong way or is there some way to normalize the score that I haven't thought of?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
是的。您可以使用多种标准化方法。这是一个经过充分研究的领域!
看一下向量空间模型。 TDF/IDF 可能与您正在做的事情相关。它与您使用的方法并不严格相关,但可以为您提供一些标准化线索。
另请注意,比较每个帖子的时间复杂度为 O(N),并且可能会变得非常慢。使用 词干分析 可能会获得更好的结果,而不是字符串距离。然后您可以将其放入 VSM 倒排索引中。
许多数据库(包括 MySQL 和 Postgres)都具有全文搜索功能。这可能比自己做更实用。
Yes. There are many normalization methods you could use. This is a well-researched field!
Take a look at the vector space model . TDF/IDF could be relevant to what you're doing. It's not strictly related to the method you're using but could give you some normalization leads.
Also note that comparing each post will be O(N) and could get very slow. Instead of string-distance, you may have better results with stemmming. You can then put that into a VSM inverted index.
Many databases (including MySQL and Postgres) have full-text search. That's probably more practical than doing it yourself.