MemSQL: LOAD DATA with the "skip duplicate key errors" option is very slow
I am evaluating the loading performance of SingleStore 7.6.10.
I tested two loading methods, both of which matter in real-world practice:
Loading that skips duplicate primary keys:
load data local infile '/opt/orders.tbl' skip duplicate key errors into table ORDERS fields terminated by '|' lines terminated by '|\n' max_errors 0;
Loading that replaces duplicate primary keys with the latest records:
load data local infile '/opt/orders.tbl' replace into table orders_sf1_col columns terminated by '|';
Before running the tests, I expected both methods to have similar load times, because both need to look up the primary key to find duplicate data. If anything, the REPLACE method should probably take longer, since it has to delete the current record and insert the latest one in its place.
To my surprise, loading with SKIP ran extremely slowly and took almost 8 minutes to load a 163 MB data file, while the REPLACE load of the same file into the same table finished in less than 15 seconds.
Both tests were run in the same test environment (3 VMs) with the same data file, loading into the same target table. To simulate duplicate-key conflicts, I ran two consecutive loads into an empty table and measured only the last one.
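For reference, the measurement sequence looked roughly like the sketch below (a minimal sketch, not the exact script I ran; it assumes orders_sf1_col is the target table for both loads and that timing is taken on the client side):

TRUNCATE TABLE orders_sf1_col;
-- first load: the table is empty, so no duplicate-key conflicts occur
load data local infile '/opt/orders.tbl' skip duplicate key errors into table orders_sf1_col fields terminated by '|' lines terminated by '|\n' max_errors 0;
-- second load: every row now collides with an existing primary key; only this load is timed
load data local infile '/opt/orders.tbl' skip duplicate key errors into table orders_sf1_col fields terminated by '|' lines terminated by '|\n' max_errors 0;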
My question is: why does skip duplicate key errors perform so slowly, and is there a better way to achieve the same effect?
The DDL is here:
CREATE TABLE `orders_sf1_col` (
`O_ORDERKEY` int(11) NOT NULL,
`O_CUSTKEY` int(11) NOT NULL,
`O_ORDERSTATUS` char(1) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_TOTALPRICE` decimal(15,2) NOT NULL,
`O_ORDERDATE` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00.000000',
`O_ORDERPRIORITY` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_CLERK` varchar(15) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_SHIPPRIORITY` int(11) NOT NULL,
`O_COMMENT` varchar(79) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
`O_NOP` varchar(79) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
UNIQUE KEY `PRIMARY` (`O_ORDERKEY`) USING HASH,
KEY `ORDERS_FK1` (`O_CUSTKEY`) USING HASH,
KEY `ORDERS_DT_IDX` (`O_ORDERDATE`) USING HASH,
SHARD KEY `__SHARDKEY` (`O_ORDERKEY`) USING CLUSTERED COLUMNSTORE
) AUTOSTATS_CARDINALITY_MODE=INCREMENTAL AUTOSTATS_HISTOGRAM_MODE=CREATE AUTOSTATS_SAMPLING=ON SQL_MODE='STRICT_ALL_TABLES'
Thanks
SKIP is the more resource-intensive option because it uses a clustered index scan, which is why it takes more time. REPLACE, on the other hand, uses fewer server resources because it uses a clustered index seek, which reduces the execution time by a noticeable margin.
The latest SingleStore version (7.8) gives better results; please go through the official documentation.
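As a quick follow-up to the upgrade suggestion: a minimal way to confirm which engine version a cluster is actually running, before and after upgrading, is a plain variable lookup like the one below (the memsql_version variable name is an assumption carried over from the MemSQL-era naming; SHOW VARIABLES works regardless):

SHOW VARIABLES LIKE '%version%';
-- or, if the legacy variable is exposed on your build:
SELECT @@memsql_version;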