Need inspiration: selecting large amounts of data for a highscore
I need some inspiration for a solution...
We are running an online game with around 80.000 active users - we are hoping to expand this and are therefore setting a target of achieving up to 1-500.000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500.000 users we need to load data from the database in the order of 25-30.000.000 rows, totalling around 1.5-2 GB of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 million records every 30 minutes, calculate the values, rank them, and write them into the database - but I'm worried about the strain this would put on the database, the application server and the network - and whether it's even possible.
I'm thinking the solution might be to break up the problem somehow, but I can't see how. So I'm seeking some inspiration on possible alternative solutions based on this information:
- We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
- I'm assuming that there is no way to rank users without having a list of all users values.
- Calculating the value for each team has to be done in code - we can't do it in SQL alone.
- Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load the data and generate the highscore for 25.000 users, which is too slow if this should scale to 500.000.
- I'm assuming that hardware size will not be an issue (within reasonable limits)
- We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
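A minimal sketch of this write-time approach, using SQLite as a stand-in for whatever database the game actually runs on (the `teams` table, its columns, and the `update_score` helper are all invented for illustration; the `ON CONFLICT` upsert assumes SQLite 3.24+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
# Index on the calculated score so top-n queries can walk the index
# instead of sorting the whole table.
conn.execute("CREATE INDEX idx_teams_score ON teams (score)")

def update_score(team_id, name, new_score):
    """Recalculate and store the score whenever the underlying data changes."""
    conn.execute(
        "INSERT INTO teams (id, name, score) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET score = excluded.score",
        (team_id, name, new_score),
    )

for i, s in enumerate([120, 450, 300, 450, 90]):
    update_score(i, f"team{i}", s)

# The "highscore table" becomes a plain indexed query.
top3 = conn.execute(
    "SELECT name, score FROM teams ORDER BY score DESC LIMIT 3"
).fetchall()
print(top3)
```

The key point is that the expensive per-team calculation happens once per write, spread over time, rather than 500.000 times every 30 minutes.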
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
In PostgreSQL you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
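A rough sketch of the incremental idea, assuming each user's value is a running aggregate over event rows with monotonically increasing ids (the event shape and `apply_new_events` function are invented for illustration):

```python
from collections import defaultdict

# Cached totals from the last run, keyed by user id.
cached_totals = defaultdict(int)
last_seen_id = 0  # highest event row id already folded into the cache

def apply_new_events(events):
    """Fold only rows inserted since the last run into the cached totals."""
    global last_seen_id
    for row_id, user_id, value in events:
        if row_id > last_seen_id:
            cached_totals[user_id] += value
            last_seen_id = row_id
    return dict(cached_totals)

# First run processes everything; subsequent runs only touch the delta.
apply_new_events([(1, "a", 10), (2, "b", 5)])
totals = apply_new_events([(3, "a", 7)])  # only row 3 is new
print(totals)  # {'a': 17, 'b': 5}
```

This only works if the per-user value really is an incremental aggregate; if old rows can be updated in place, you would also need to handle corrections, which is where the daily full recalculation comes in.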
First and foremost:
One possible solution is:
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply query the database for the top scores (so that the computation is done on the server side, not on the client side.. and thus there is no need to move the millions of records).
It sounds pretty straightforward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for caching, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.
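The top-100 insertion idea above can be sketched with a bounded min-heap, which gives the same effect as the manual insertion sort: most elements are rejected after a single comparison against the smallest retained score (function name and data are invented for illustration):

```python
import heapq

def top_n(scores, n=100):
    """Keep only the n largest (user, score) pairs seen so far."""
    heap = []  # min-heap of at most n (score, user) pairs
    for user, score in scores:
        if len(heap) < n:
            heapq.heappush(heap, (score, user))
        elif score > heap[0][0]:
            # Only elements beating the current cutoff pay the O(log n) cost;
            # everything else is a single comparison, as the answer describes.
            heapq.heapreplace(heap, (score, user))
    return sorted(heap, reverse=True)  # best first

scores = [(f"user{i}", i) for i in range(1000)]
best = top_n(scores, n=5)
print([s for s, _ in best])  # [999, 998, 997, 996, 995]
```

This is O(total log n) in the worst case, and close to O(total) when the input arrives in no particular order, since almost all elements fail the cutoff check.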