Hundreds of tables vs. a single large table
I’m attempting to solve a problem where we are analyzing a substantial amount of data from a table. We need to pull in certain subsets of this data and analyze them. As it stands, I believe it would be best to multithread the work, bringing in as much data as possible initially and performing various computations on each region. Let’s assume that each subset of data to analyze is denoted S1, S2, …, so there will be one thread per subset. After performing the calculations, some visualizations may be created as well, and the results will need to be stored back into the database, as there may be many gigabytes’ worth of data in the analysis results. Let’s assume that the results are denoted R1, R2, …
Although this is a little vague, I am wondering whether we should create a table for each of R1, R2, etc., or store all of the results in a single table. It is likely that we will want multiple threads storing results at the same time (recall the threads for S1, S2), so if there is a single table, I need to ensure that multiple threads can write to it concurrently. If it helps, when the data for R1, R2, etc. is needed again, all of it will be pulled out in a certain order, which would be easy to maintain if there were a table for each of R1, R2, etc. Also, if we go that route, I was thinking we could have a single object per table that manages requests to that particular results table. Essentially, I would like that object to behave like a bean that only loads data from the database as necessary (there is too much to keep in memory at once). One more point: we are using InnoDB as our storage engine, in case that makes any difference as to whether multiple threads can access a particular table.
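To make the bean-like accessor idea concrete, here is a rough sketch of an object that streams one region's rows from the database in chunks instead of holding everything in memory. This is only an illustration: `sqlite3` stands in for MySQL/InnoDB, and the names (`ResultRegion`, `region_id`, `seq`, `chunk_size`) are invented for the example.

```python
import sqlite3

class ResultRegion:
    """Lazily streams one result region's rows from the database."""

    def __init__(self, conn, region_id, chunk_size=1000):
        self.conn = conn
        self.region_id = region_id
        self.chunk_size = chunk_size

    def rows(self):
        """Yield this region's rows in order, fetching one chunk at a time."""
        cur = self.conn.execute(
            "SELECT seq, value FROM results WHERE region_id = ? ORDER BY seq",
            (self.region_id,),
        )
        while True:
            chunk = cur.fetchmany(self.chunk_size)
            if not chunk:
                return
            yield from chunk

# Tiny demo: populate a shared results table, then stream one region lazily.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (region_id INTEGER, seq INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [(r, s, float(r + s)) for r in range(3) for s in range(5)],
)
region = ResultRegion(conn, region_id=1, chunk_size=2)
streamed = list(region.rows())
```

Because `rows()` is a generator, callers only ever hold `chunk_size` rows in memory at once, which is the behavior described above.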
So, with this bit of information, would it be best to store all of the results in a single table, or to create one table for each region of results (possibly hundreds)?
Thanks
Comments (2)
You could, but then you would have to manage 100 tables, and getting statistics across the whole set would be that much more difficult.
If the data can be easily partitioned into disjoint subsets, the database should not be locking rows, especially if you are only doing reads and processing in your application. In that case you don't need to split the table into hundreds of tables, and each thread in your application can work on its subset independently.
This sounds like a good MapReduce candidate, assuming that you are going to perform the same calculation on the whole set and just want to speed up the process.
Have you considered using something like MongoDB? You can write your own map-reduce aggregations in it.
MapReduce: http://en.wikipedia.org/wiki/MapReduce
MongoDB: http://www.mongodb.org/display/DOCS/MapReduce
Mongo does support updates in place, and it is a lockless, eventually consistent store.
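The map-reduce shape being suggested can be sketched in plain Python: the same "analysis" is mapped over each disjoint subset S_i in parallel, and the per-subset results are then reduced into one aggregate. The thread pool and the toy computation (a sum) are placeholders, not MongoDB's API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def analyze(subset):
    # Map step: the same computation runs on each subset independently.
    return sum(subset)

# Four disjoint subsets standing in for S1..S4.
subsets = [list(range(i, i + 10)) for i in range(0, 40, 10)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(analyze, subsets))  # per-subset results R1..R4

# Reduce step: combine the per-subset results into one aggregate.
total = reduce(lambda a, b: a + b, partials)
```

The point is that because the subsets are disjoint, the map step parallelizes with no coordination, and only the small reduce step touches all of the partial results.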