使用 MySQL 搜索大型数据集？

发布于 2024-09-12 18:18:26 字数 780 浏览 11 评论 0原文

现在我是一名非常高级的 PHP 开发人员，并且对小型 MySQL 集非常了解，但是我现在正在为我最近加入的一家初创公司构建大型基础设施，他们的服务器每天使用它们的服务器推送大约 100 万行数据巨大的服务器能力和以前的架构。

我需要知道搜索数据库大小为 394.4 GB 的大型数据集（目前有 8490 万行）的最佳方法是什么。它使用 Amazon RDS 托管，因此没有任何停机时间或速度缓慢，只是我想知道内部访问大型数据集的最佳方式是什么。

例如，如果我想搜索包含 8400 万行的数据库，则需要 6 分钟。现在，如果我直接请求特定的 ID 或标题，它会立即提供服务。那么我将如何搜索大型数据集。

只是提醒您，通过传入一个变量来通过数据库查找信息很快，但在搜索时执行速度非常慢。

MySQL 查询示例：

SELECT u.*, COUNT(*) AS user_count, f.* FROM users u LEFT JOIN friends f ON u.user_id=(f.friend_from||f.friend_to) WHERE u.user_name LIKE ('%james%smith%') GROUP BY u.signed_up LIMIT 0, 100

84m 行以下的查询速度明显慢。具体来说，独立执行此查询需要 47.41 秒，大家有什么想法吗？

我所需要的只是解决这个挑战，然后我就能掌握方向。另外，我知道 MySQL 不太适合大型数据集以及 Oracle 或 MSSQL 之类的数据集，但我被告知目前要在 MySQL 而不是其他数据库解决方案上重建它。

原文

Now I'm a really advanced PHP developer and heavily knowledged on small-scale MySQL sets, however I'm now building a large infrastructure for a startup I've recently joined and their servers push around 1 million rows of data every day using their massive server power and previous architecture.

I need to know what is the best way to search through large data sets (it currently resides at 84.9 million) rows with a database size of 394.4 gigabytes. It is hosted using Amazon RDS so it does not have any downtime or slowness, it's just that I want to know what's the best way to access large data sets internally.

For example, if I wanted to search through a database of 84 million rows it takes me 6 minutes. Now, if I made a direct request to a specific id or title it would serve it instantly. So how would I search through a large data set.

Just to remind you, it's fast to find information through database by passing in one variable but when searching it performs VERY slow.

MySQL query example:

SELECT u.*, COUNT(*) AS user_count, f.* FROM users u LEFT JOIN friends f ON u.user_id=(f.friend_from||f.friend_to) WHERE u.user_name LIKE ('%james%smith%') GROUP BY u.signed_up LIMIT 0, 100

That query under 84m rows is sigificantly slow. Specifically 47.41 seconds to perform this query standalone, any ideas guys?

All I need is that challenge sorted and I'll be able to get the drift. Also, I know MySQL isn't very good for large data sets and something like Oracle or MSSQL however I've been told to rebuild it on MySQL rather than other database solutions at this moment.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

羁客 2024-09-19 18:18:26

由于多种原因，LIKE 非常慢：

除非您的 LIKE 表达式以常量开头，不会使用索引。< /p>
例如，LIKE ('james%smith%') 很好，LIKE ('%james%smith%') 不利于索引。您的示例不会在“user_name”字段上使用任何索引。
与常规运算符相比，字符串匹配是一项复杂（算法上）的业务。

解决方法：

如果您可以使用该字段上的索引，请确保您的 LIKE 表达式以常量而不是通配符开头。
如果您搜索整个单词，请考虑创建一个索引表（在“索引”一词的文献/图书馆上下文中，而不是数据库索引上下文中）。或者如果搜索随机经常重复的子字符串，则使用子字符串查找表。
例如，如果所有用户名的形式为“FN LN”或“LN，FN” - 将它们分开并将名字和/或姓氏存储在字典表中，加入到该表（并执行直接相等）在您的查询中。

回复收藏 0 原文

枕梦 2024-09-19 18:18:26

LIKE ('%james%smith%')

避免这些事情就像避免瘟疫一样。一般的 DBMS 不可能对其进行优化。

正确的方法是在插入或更新数据时计算类似的内容（名字和姓氏），以便在所有读取中分摊成本。这可以通过添加两个新列（索引）并使用插入/更新触发器来完成。

或者，如果您想要列中的所有单词，请让触发器将数据分解为单词，然后使用应用程序级索引表来查找相关记录，例如：

main_table:
    id integer primary key
    blah blah blah
    text varchar(60)
appl_index:
    id index
    word varchar(20)
    primary key (id,word)
    index (word)

然后您可以查询 appl_index 找到那些同时包含 james 和 smith 的 id，比像 '% 这样可恶的 快得多...'。您还可以将实际单词分解到一个单独的表中并使用单词 ID，但这只是一个品味问题 - 它对性能的影响值得怀疑。

您很可能在使用 f.friend_from||f.friend_to 时遇到类似的问题，但我以前没有见过该语法（如果上下文是 u.user_id 可以是其中之一）。

基本上，如果您希望数据库能够扩展，请不要在您的选择中执行任何看起来像每行函数的操作。这是从使用大型机数据库的人那里得到的，其中 8400 万行大约是我们的配置表的大小:-)

并且，与所有优化问题一样，衡量，不要猜测！

LIKE ('%james%smith%')

Avoid these things like the plague. They are impossible for a general DBMS to optimise.

The right way is to calculate things like this (first and last names) at the time where the data is inserted or updated so that the cost is amortised across all reads. This can be done by adding two new columns (indexed) and using insert/update triggers.

Or, if you want all words in the column, have the trigger break the data into words then have an application-level index table to find relevant records, something like:

main_table:
    id integer primary key
    blah blah blah
    text varchar(60)
appl_index:
    id index
    word varchar(20)
    primary key (id,word)
    index (word)

Then you can query appl_index to find those ids that have both james and smith in them, far faster than the abominable like '%...'. You could also break the actual words out to a separate table and use word IDs but that's a matter of taste - it's effect on performance would be questionable.

You may well have a similar problems with f.friend_from||f.friend_to but I've not seen that syntax before (if, as it seems to be, the context is u.user_id can be one or the other).

Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)

And, as with all optimisation questions, measure, don't guess!

回复收藏 0 原文

~没有更多了~