Fuzzy runtime search without using a database or index

Posted on 2024-09-09 04:11:46

I need to filter a stream of text articles by checking every entry for fuzzy matches of predefined strings. (I am searching for misspelled product names; sometimes they have a different word order and extra non-letter characters such as ":" or ",".)

I get excellent results by putting these articles into a Sphinx index and searching it, but unfortunately I receive hundreds of articles every second, and updating the index after every article is too slow (and I understand that Sphinx is not designed for such a task). I need a library that can build an in-memory index of a small (~100 KB) text and perform fuzzy search on it. Does anything like this exist?

Comments (2)

油焖大侠 2024-09-16 04:11:46

This problem is almost identical to Bayesian spam filtering, and tools already written for that can simply be trained to recognize articles according to your criteria.

Added in response to a comment:

So how are you partitioning the stream into bins now? If you already have a corpus of separated articles, just feed that into the classifier. Bayesian classifiers are the way to do fuzzy content matching in context and can classify everything from spam to nucleotides to astronomical spectral categories.
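
As a rough illustration of the classifier idea, here is a minimal multinomial naive Bayes sketch in plain Python. It is not a production spam filter: the product name "Acme SuperWidget 3000", the labels "hit" and "miss", and the four training sentences are all made up for the example, and a real corpus would need far more training data.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, text, label):
        words = tokens(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = tokens(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training corpus: articles already sorted into hits and misses.
nb = NaiveBayes()
nb.train("New Acme SuperWidget 3000 review: the widget ships today", "hit")
nb.train("Acme SuprWidget, 3000 price drop announced", "hit")
nb.train("Local council approves new cycling lanes", "miss")
nb.train("Weather forecast: rain expected this weekend", "miss")

print(nb.classify("Widget 3000 by Acme: hands-on"))
print(nb.classify("Rain delays council meeting"))
```

Training is incremental (each `train` call only updates counters), so new labelled articles can be added at runtime without rebuilding anything.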

You could use less stochastic methods (e.g. Levenshtein), but at some point you have to describe the difference between hits and misses. The beauty of Bayesian methods, especially if you already have a segregated corpus in hand, is that you don't actually need to know explicitly how you are classifying.
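
For comparison, the Levenshtein route might look like the sketch below. It assumes a match means "every word of the product name has a near match somewhere in the article", which ignores word order and punctuation; the helper names `mentions` and `max_dist` and the product name are hypothetical, and the threshold is exactly the kind of hit/miss rule you would have to tune yourself.

```python
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def words(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def mentions(article, product, max_dist=1):
    """True if every word of `product` has a near match in `article`."""
    article_words = set(words(article))
    return all(
        any(levenshtein(p, w) <= max_dist for w in article_words)
        for p in words(product)
    )

print(mentions("Review: the 3000, SuprWidget from Acme!", "Acme SuperWidget 3000"))
print(mentions("Council approves new cycling lanes", "Acme SuperWidget 3000"))
```

A fixed `max_dist` is crude: one edit is a large change for a short token such as "3000", so a length-dependent threshold may work better in practice.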

和影子一齐双人舞 2024-09-16 04:11:46

How about using the SQLite FTS3 extension?

CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);

(You may create any number of columns -- all of them will be indexed)

After that you can insert whatever you like and search it without rebuilding the index, matching either a specific column or the whole row.

(http://www.sqlite.org/fts3.html)
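
A minimal sketch of this approach with an in-memory database, using Python's standard `sqlite3` module. It assumes the underlying SQLite build has FTS3 enabled (common, but not guaranteed); the table name `articles`, the helper functions, and the sample texts are made up for the example. Note that `MATCH` works on whole tokens and prefixes (`super*`), so it copes with word order and punctuation but does not tolerate misspellings by itself.

```python
import sqlite3

# ":memory:" keeps the whole database, including the FTS index, in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE articles USING fts3(content TEXT)")

def add_article(text):
    """Insert one article; the full-text index is updated incrementally."""
    conn.execute("INSERT INTO articles(content) VALUES (?)", (text,))

def search(query):
    """Return the content of every article matching the FTS query."""
    rows = conn.execute(
        "SELECT content FROM articles WHERE content MATCH ?", (query,))
    return [row[0] for row in rows]

add_article("Review: the 3000, SuperWidget from Acme!")
add_article("Council approves new cycling lanes")

# Terms are ANDed together and may appear in any order in the article.
print(search("acme superwidget"))
# A trailing * matches any token starting with the given prefix.
print(search("super*"))
```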
