快速处理大量数据

发布于 2024-11-18 23:33:20 字数 544 浏览 2 评论 0原文

我正在开发一个 Web 应用程序,其中用户提供参数,这些参数用于从最多 2000 万行的数据库中生成前 1000 个项目的列表。我需要同时获得所有前 1000 个项目,并且从用户的角度来看,我需要这个排名或多或少是即时发生的。

目前,我使用带有用户定义函数的 MySQL 对数据进行评分和排名,然后 PHP 从那里获取数据。在 1M 行的数据库上进行测试,这大约需要 8 秒,但我需要大约 2 秒的性能,即使对于多达 20M 行的数据库也是如此。优选地,这个数字应该更低,以便保证最多 50 个同时用户的良好吞吐量。

我对任何能够尽可能高效地处理这些数据的软件的流程持开放态度,无论是否是 MySQL。以下是该过程的特点和限制:

  • 与评分过程相关的每行数据约为每项 50 字节。
  • 对数据库的插入和更新可以忽略不计。
  • 每个分数都独立于其他分数,因此可以并行计算分数。
  • 由于参数和参数值较多,无法预先计算分数。
  • 该方法应该能够很好地适应多个同时用户
  • ,就服务器数量而言,所需的计算资源越少越好。

谢谢

I'm working on a web application where the user provides parameters, and these are used to produce a list of the top 1000 items from a database of up to 20 million rows. I need all top 1000 items at once, and I need this ranking to happen more or less instantaneously from the perspective of the user.

Currently, I'm using a MySQL with a user-defined function to score and rank the data, then PHP takes it from there. Tested on a database of 1M rows, this takes about 8 seconds, but I need performance around 2 seconds, even for a database of up to 20M rows. Preferably, this number should be lower still, so that decent throughput is guaranteed for up to 50 simultaneous users.

I am open to any process with any software that can process this data as efficiently as possible, whether it is MySQL or not. Here are the features and constraints of the process:

  • The data for each row that is relevant to the scoring process is about 50 bytes per item.
  • Inserts and updates to the DB are negligible.
  • Each score is independent of the others, so scores can be computed in parallel.
  • Due to the large number of parameters and parameter values, the scores cannot be pre-computed.
  • The method should scale well for multiple simultaneous users
  • The fewer computing resources this requires, in terms of number of servers, the better.

Thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

倾城花音 2024-11-25 23:33:20

一种可行的方法似乎是将所有数据加载(并稍后更新)到大约 1GB RAM 中,并在 MySQL 之外使用 C++ 等语言执行评分和排名。那应该比 MySQL 更快。

对于这种方法,评分必须相对简单,因为您的要求只为每行留下十分之一微秒用于评分和排名,而无需并行化或优化。

A feasible approach seems to be to load (and later update) all data into about 1GB RAM and perform the scoring and ranking outside MySQL in a language like C++. That should be faster than MySQL.

The scoring must be relatively simple for this approache because your requirements only leave a tenth of a microsecond per row for scoring and ranking without parallelization or optimization.

墨落画卷 2024-11-25 23:33:20

如果您可以发布您遇到问题的查询可以提供帮助。

虽然这里有一些事情。
确保您在数据库上创建了索引。
确保使用优化查询并使用联接而不是内部查询。

If you could post query you are having issue with can help.

Although here are some things.
Make sure you have indexes created on database.
Make sure to use optimized queries and using joins instead of inner queries.

桃扇骨 2024-11-25 23:33:20

根据您的条件,提高性能的可能性将取决于您是否可以使用输入条件来预先筛选需要计算分数的行数。即,如果用户提供的参数之一自动取消大部分行的资格,那么首先应用该过滤将提高性能。如果没有一个参数具有该特征,那么您可能需要更多的硬件或具有更高性能的数据库。

Based on your criteria, the possibility of improving performance would depend on whether or not you can use the input criteria to pre-filter the number of rows for which you need to calculate scores. I.e. if one of the user-provided parameters automatically disqualifies a large fraction of the rows, then applying that filtering first would improve performance. If none of the parameters have that characteristic, then you may need either much more hardware or a database with higher performance.

离笑几人歌 2024-11-25 23:33:20

我想说,对于这类问题,如果您已经完成了所有明显的软件优化(我们无法知道这一点,因为您没有提到任何有关软件方法的信息),您应该尝试进行一些认真的硬件优化。最大化 SQL 服务器上的内存,并尽可能将表放入内存中。使用 SSD 作为表/索引存储,以实现快速反序列化。如果您处于集群状态,请将网络速度提高到可行的最高网络速度。

I'd say for this sort of problem, if you've done all the obvious software optimizations (and we can't know that, since you haven't mentioned anything about your software approaches), you should try for some serious hardware optimization. Max out the memory on your SQL servers, and try to fit your tables into memory where possible. Use an SSD for your table / index storage, for speedy deserialization. If you're clustered, crank up the networking to the highest feasible network speeds.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文