Fast lookups on a large dataset: MySQL MEMORY (HEAP), Memcached, or something else?

Posted on 2024-07-26


Currently working on a project centered around a medical nomenclature known as SNOMED. At the heart of SNOMED are three relational datasets of 350,000, 1.1 million, and 1.3 million records. We want to be able to quickly query this dataset for the data-entry portion, where we would like some form of auto-completion/suggestion.

It's currently in a MySQL MyISAM DB just for dev purposes, but we want to start playing with some in-memory options. It's currently 30MB + 90MB + 70MB in size, including the indexes. The MEMORY MySQL engine and memcached were the obvious ones, so my question is: which of these would you suggest, or is there something better out there?

We're working primarily in Python at the app level, if that makes a difference. Also, we're running on a single small dedicated server, moving to 4GB DDR2 soon.
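For the memcached route, one common pattern is a read-through cache in front of the database, keyed on the typed prefix. Below is a minimal Python sketch of that pattern; the memcached client is stubbed with a dict so the example is self-contained (in practice you would use a real client such as pymemcache, which exposes the same `get`/`set` surface), and `query_db` is a hypothetical stand-in for your MySQL lookup:

```python
# Read-through cache for autocomplete lookups: check the cache first,
# fall back to the (slow) DB query on a miss, then populate the cache.

class FakeMemcache:
    """Stand-in for a memcached client with the same get/set surface."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value, expire=0):
        self._store[key] = value


def make_suggester(cache, query_db):
    """Wrap a slow DB lookup with a cache keyed on the typed prefix."""
    def suggest(prefix):
        key = "ac:" + prefix.lower()
        hit = cache.get(key)
        if hit is not None:
            return hit                    # served from cache
        rows = query_db(prefix)           # e.g. SELECT ... WHERE term LIKE 'prefix%'
        cache.set(key, rows, expire=300)  # keep results for 5 minutes
        return rows
    return suggest


# Usage with a toy "database" of SNOMED-style terms:
terms = ["Entire coiled artery of decidua basalis (body structure)",
         "Entire heart (body structure)"]
db = lambda p: [t for t in terms if t.lower().startswith(p.lower())]
suggest = make_suggester(FakeMemcache(), db)
print(suggest("entire h"))  # first call hits the DB, repeats hit the cache
```

The win over the MEMORY engine here is that memcached caches *query results*, so repeated prefixes (which dominate autocomplete traffic) never touch MySQL at all.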

Edit: Additional Info

We're interested in keeping the suggestions and autocompletion fast. Something that will perform well for these types of queries is desirable. Each term in SNOMED typically has several synonyms, abbreviations, and a preferred name. We will be querying this dataset heavily (90MB in size, including the index). We're also considering building an inverted index to speed things up and return more relevant results (many of the terms are long, e.g. "Entire coiled artery of decidua basalis (body structure)"). Lucene or some other full-text search may be appropriate.
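The inverted-index idea mentioned above can be sketched in a few lines of Python: map each lowercase word to the set of term ids containing it, and answer multi-word queries by intersecting posting sets instead of scanning a million-plus rows. This is only an in-memory illustration of the data structure, not a production implementation:

```python
# Minimal inverted index over SNOMED-style term strings.
import re
from collections import defaultdict

def build_index(terms):
    """terms: dict of term_id -> term text. Returns word -> {term_ids}."""
    index = defaultdict(set)
    for term_id, text in terms.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(term_id)
    return index

def search(index, terms, query):
    """Return terms containing ALL query words, in any order."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    ids = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(terms[i] for i in ids)

terms = {
    1: "Entire coiled artery of decidua basalis (body structure)",
    2: "Entire artery of upper limb (body structure)",
    3: "Myocardial infarction (disorder)",
}
index = build_index(terms)
print(search(index, terms, "artery entire"))  # matches terms 1 and 2
```

Because lookups are set intersections keyed on whole words, "artery entire" matches long terms regardless of word order, which is exactly the relevance win over a `LIKE 'prefix%'` scan.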


2 Answers

落在眉间の轻吻 2024-08-02 15:54:15


From your use case, it sounds like you want full-text searching; I would suggest Sphinx. It's blazing fast, even on large datasets. You can integrate memcached if you need extra speed.

别忘他 2024-08-02 15:54:15


Please see

for how to do this with Lucene. Lucene is the closest thing to an industry-standard full-text search library. It is fast and gives high-quality results. However, it takes time to master Lucene - you have to handle many low-level details. An easier way may be to use Solr, a Lucene sub-project that is much easier to set up and can produce JSON output usable for autocomplete.
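To illustrate the Solr-to-autocomplete step, here is a sketch of extracting suggestions from a Solr-style JSON response in Python. The response shape below is a simplified assumption of Solr's standard JSON output (`response.docs`), and the field name `preferred_name` is hypothetical - it would come from your own schema:

```python
# Turn a Solr-style JSON response into a flat list of suggestions.
import json

# Assumed (simplified) shape of a Solr JSON response for this sketch.
sample = """{
  "responseHeader": {"status": 0},
  "response": {"numFound": 2, "docs": [
    {"preferred_name": "Myocardial infarction (disorder)"},
    {"preferred_name": "Myocardial ischemia (disorder)"}
  ]}
}"""

def suggestions(raw_json, field="preferred_name", limit=10):
    """Pull up to `limit` values of `field` out of the docs list."""
    data = json.loads(raw_json)
    docs = data.get("response", {}).get("docs", [])
    return [d[field] for d in docs if field in d][:limit]

print(suggestions(sample))
```

In a real deployment the `raw_json` would come from an HTTP request to the Solr endpoint; the parsing on the app side stays this small, which is the appeal of the Solr route for autocomplete.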

As Todd said, you can also use Sphinx. I have never used it, but have heard it integrates well with MySQL. I couldn't find how to implement autocomplete with Sphinx - maybe you should post that as a separate question.
