并行性或 PLINQ 与 Postgres 和 Npgsql
我有一个包含超过 12,000,000 行的 Posrgres 9.04 数据库表。
我需要一个程序来读取每一行,进行一些计算和查找(针对第二个表),然后使用这些计算的结果在第三个表中写入新行。完成后,第三个表将具有与第一个表相同的行数。
在 Core i7 720QM 处理器上串行执行需要超过 24 小时。它只对我的 8 个核心之一(4 个物理核心,但 8 个通过 HTT 对 Windows 7 可见)进行征税。
我想通过并行性来加快速度。我以为我可以使用 PLINQ 和 Npgsql:
NpgsqlDataReader records = new NpgsqlCommand("SELECT * FROM table", conn).ExecuteReader();
var single_record = from row in records.AsParallel()
select row;
但是,我收到 records.AsParallel()
错误:找不到源类型“System.Linq.ParallelQuery”的查询模式的实现'。未找到“选择”。考虑显式指定范围变量“行”的类型。
我已经进行了大量的 Google 搜索,但我只是变得更加困惑。 NpgsqlDataReader 继承自 System.Data.Common.DbDataReader,后者又实现 IEnumerable,后者具有 AsParallel 扩展,看来正确的东西已经到位了?
我不清楚我什至可以做什么来显式指定范围变量的类型。看来最佳实践是不指定这一点。
我愿意切换到 DataSet,假设它与 PLINQ 兼容,但由于有 12,000,000 行,所以宁愿避免。
这甚至可以通过 Npgsql 实现吗?我需要使用 Devart 的 dotConnect for PostgreSQL 吗?
更新:刚刚发现 http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae,这让我尝试这个:
foreach(IDataRecord arrest in
from row in arrests.AsParallel().Cast <IDataRecord>()
select row)
所以IDE 中目前还没有错误,但是这是构建此代码的正确方法吗?
I have a Posrgres 9.04 database table with over 12,000,000 rows.
I need a program to read each row, do some calculations and lookups (against a 2nd table), then write a new row in a 3rd table with the results of these calculations. When done, the 3rd table will have the same number of rows as the 1st table.
Executing serially on a Core i7 720QM processor takes more than 24 hours. It only taxes one of my 8 cores (4 physical cores, but 8 visible to Windows 7 via HTT).
I want to speed this up with parallelism. I thought I could use PLINQ and Npgsql:
NpgsqlDataReader records = new NpgsqlCommand("SELECT * FROM table", conn).ExecuteReader();
var single_record = from row in records.AsParallel()
select row;
However, I get an error for records.AsParallel()
: Could not find an implementation of the query pattern for source type 'System.Linq.ParallelQuery'. 'Select' not found. Consider explicitly specifying the type of the range variable 'row'.
I've done a lot of Google searches, and I'm just coming up more confused. NpgsqlDataReader inherits from System.Data.Common.DbDataReader, which in turn implements IEnumerable, which has the AsParallel extension, so seems like the right stuff is in place to get this working?
It's not clear to me what I could even do to explicitly specify the type of the range variable. It appears that best practice is not to specify this.
I am open to switching to a DataSet, presuming that's PLINQ compatible, but would rather avoid if possible because of the 12,000,000 rows.
Is this even something achievable with Npgsql? Do I need to use Devart's dotConnect for PostgreSQL instead?
UPDATE: Just found http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae, which led me to try this:
foreach(IDataRecord arrest in
from row in arrests.AsParallel().Cast <IDataRecord>()
select row)
So far no errors in the IDE, but is this a proper way of constructing this?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
这确实是解决方案:
该解决方案的灵感来自于我在 http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae#1956768e-9403-4671-a196-8dfb3d7070e3。我不清楚为什么需要强制转换和类型规范,但它有效。
编辑:虽然这不会导致语法或运行时错误,但实际上它不会使事情并行运行。一切仍然是序列化的。请参阅ConcurrentQueue 上的 PLINQ 不是多线程以获得卓越的解决方案。
This is indeed the solution:
This solution was inspired by what I found at http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae#1956768e-9403-4671-a196-8dfb3d7070e3. It's not clear to me why the cast and type specification is needed, but it works.
EDIT: While this doesn't cause syntax or runtime errors, it in fact does not make things run in parallel. Everything is still serialized. See PLINQ on ConcurrentQueue isn't multithreading for a superior solution.
您应该考虑使用 Greenplum。在 Greenplum 数据库中实现这一点很简单。免费版本没有任何限制,它的核心是 postgresql。
You should consider using Greenplum. It's trivial to accomplish this in a Greenplum database. The free version isn't gimped in any way and it's postgresql at its core.