使用 MySQL 搜索大型数据集?
现在我是一名非常高级的 PHP 开发人员,并且对小型 MySQL 集非常了解,但是我现在正在为我最近加入的一家初创公司构建大型基础设施,他们的服务器每天使用它们的服务器推送大约 100 万行数据巨大的服务器能力和以前的架构。
我需要知道搜索数据库大小为 394.4 GB 的大型数据集(目前有 8490 万行)的最佳方法是什么。它使用 Amazon RDS 托管,因此没有任何停机时间或速度缓慢,只是我想知道内部访问大型数据集的最佳方式是什么。
例如,如果我想搜索包含 8400 万行的数据库,则需要 6 分钟。现在,如果我直接请求特定的 ID 或标题,它会立即提供服务。那么我将如何搜索大型数据集。
只是提醒您,通过传入一个变量来通过数据库查找信息很快,但在搜索时执行速度非常慢。
MySQL 查询示例:
SELECT u.*, COUNT(*) AS user_count, f.* FROM users u LEFT JOIN friends f ON u.user_id=(f.friend_from||f.friend_to) WHERE u.user_name LIKE ('%james%smith%') GROUP BY u.signed_up LIMIT 0, 100
84m 行以下的查询速度明显慢。具体来说,独立执行此查询需要 47.41 秒,大家有什么想法吗?
我所需要的只是解决这个挑战,然后我就能掌握方向。另外,我知道 MySQL 不太适合大型数据集以及 Oracle 或 MSSQL 之类的数据集,但我被告知目前要在 MySQL 而不是其他数据库解决方案上重建它。
Now I'm a really advanced PHP developer and heavily knowledged on small-scale MySQL sets, however I'm now building a large infrastructure for a startup I've recently joined and their servers push around 1 million rows of data every day using their massive server power and previous architecture.
I need to know what is the best way to search through large data sets (it currently resides at 84.9 million) rows with a database size of 394.4 gigabytes. It is hosted using Amazon RDS so it does not have any downtime or slowness, it's just that I want to know what's the best way to access large data sets internally.
For example, if I wanted to search through a database of 84 million rows it takes me 6 minutes. Now, if I made a direct request to a specific id or title it would serve it instantly. So how would I search through a large data set.
Just to remind you, it's fast to find information through database by passing in one variable but when searching it performs VERY slow.
MySQL query example:
SELECT u.*, COUNT(*) AS user_count, f.* FROM users u LEFT JOIN friends f ON u.user_id=(f.friend_from||f.friend_to) WHERE u.user_name LIKE ('%james%smith%') GROUP BY u.signed_up LIMIT 0, 100
That query under 84m rows is sigificantly slow. Specifically 47.41 seconds to perform this query standalone, any ideas guys?
All I need is that challenge sorted and I'll be able to get the drift. Also, I know MySQL isn't very good for large data sets and something like Oracle or MSSQL however I've been told to rebuild it on MySQL rather than other database solutions at this moment.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
由于多种原因,
LIKE
非常慢:除非您的
LIKE
表达式以常量开头,不会使用索引。< /p>例如,
LIKE ('james%smith%')
很好,LIKE ('%james%smith%')
不利于索引。您的示例不会在“user_name
”字段上使用任何索引。与常规运算符相比,字符串匹配是一项复杂(算法上)的业务。
解决方法:
如果您可以使用该字段上的索引,请确保您的
LIKE
表达式以常量而不是通配符开头。如果您搜索整个单词,请考虑创建一个索引表(在“索引”一词的文献/图书馆上下文中,而不是数据库索引上下文中)。或者如果搜索随机经常重复的子字符串,则使用子字符串查找表。
例如,如果所有用户名的形式为“FN LN”或“LN,FN” - 将它们分开并将名字和/或姓氏存储在字典表中,加入到该表(并执行直接相等)在您的查询中。
LIKE
is VERY slow for a variety of reasons:Unless your
LIKE
expression starts with a constant, no index will be used.E.g.
LIKE ('james%smith%')
is good,LIKE ('%james%smith%')
is bad for indexing. Your example will NOT use any indexes on "user_name
" field.String matching is complex (algorythmically) business compared to regular operators.
To resolve:
Make sure your
LIKE
expression starts with a constant, not a wildcard, if you have an index on that field you might be able to use.Consider making an index table (in the literature/library context of the word "index", not a database index context) if you search for whole words. Or a substring lookup table if searching for random often repeating substrings.
E.g. if all user names are of the form "FN LN" or "LN, FN" - split them up and store first names and/or last names in a dictionary table, joining to that table (and doing straight equality) in your query.
避免这些事情就像避免瘟疫一样。一般的 DBMS 不可能对其进行优化。
正确的方法是在插入或更新数据时计算类似的内容(名字和姓氏),以便在所有读取中分摊成本。这可以通过添加两个新列(索引)并使用插入/更新触发器来完成。
或者,如果您想要列中的所有单词,请让触发器将数据分解为单词,然后使用应用程序级索引表来查找相关记录,例如:
然后您可以查询 appl_index 找到那些同时包含
james
和smith
的id
,比像 '% 这样可恶的快得多...'
。您还可以将实际单词分解到一个单独的表中并使用单词 ID,但这只是一个品味问题 - 它对性能的影响值得怀疑。您很可能在使用
f.friend_from||f.friend_to
时遇到类似的问题,但我以前没有见过该语法(如果上下文是u.user_id
可以是其中之一)。基本上,如果您希望数据库能够扩展,请不要在您的选择中执行任何看起来像每行函数的操作。这是从使用大型机数据库的人那里得到的,其中 8400 万行大约是我们的配置表的大小:-)
并且,与所有优化问题一样,衡量,不要猜测!
Avoid these things like the plague. They are impossible for a general DBMS to optimise.
The right way is to calculate things like this (first and last names) at the time where the data is inserted or updated so that the cost is amortised across all reads. This can be done by adding two new columns (indexed) and using insert/update triggers.
Or, if you want all words in the column, have the trigger break the data into words then have an application-level index table to find relevant records, something like:
Then you can query
appl_index
to find thoseid
s that have bothjames
andsmith
in them, far faster than the abominablelike '%...'
. You could also break the actual words out to a separate table and use word IDs but that's a matter of taste - it's effect on performance would be questionable.You may well have a similar problems with
f.friend_from||f.friend_to
but I've not seen that syntax before (if, as it seems to be, the context isu.user_id
can be one or the other).Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)
And, as with all optimisation questions, measure, don't guess!