Large-scale ETL string lookup performance problem
I have an ETL process performance problem. I have a table with 4+ billion rows in it. Structure is:
id                 bigint identity(1,1)
raw_url            varchar(2000) not null
md5hash            char(32) not null
job_control_number int not null
Clustered unique index on id and a non-clustered unique index on md5hash.
SQL Server 2008 Enterprise, with page-level compression turned on.
We have to store the raw URLs from our web-server logs as a dimension. Since the raw string can exceed the 900-byte index key limit, we cannot put a unique index on that column. We use an MD5 hash function to create a unique 32-character string for indexing purposes. We cannot allow duplicate raw_url strings in the table.
The problem is poor performance. The md5hash is essentially random, so index fragmentation climbs to 50%, which leads to inefficient I/O.
Looking for advice on how to structure this to allow better insertion and lookup performance as well as less index fragmentation.
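For reference, a minimal T-SQL sketch of the structure and the dedup pattern as described above. The table and staging names (dbo.url_dim, staging.urls) are placeholders, and the HASHBYTES/CONVERT expression is just one assumed way of producing the 32-character hex hash, not the poster's actual ETL code:

```sql
-- Dimension table roughly as described in the question (names are assumed).
CREATE TABLE dbo.url_dim
(
    id                 bigint IDENTITY(1,1) NOT NULL,
    raw_url            varchar(2000)        NOT NULL,
    md5hash            char(32)             NOT NULL,   -- MD5 of raw_url as hex text
    job_control_number int                  NOT NULL,
    CONSTRAINT PK_url_dim PRIMARY KEY CLUSTERED (id)
        WITH (DATA_COMPRESSION = PAGE),
    CONSTRAINT UQ_url_dim_md5hash UNIQUE NONCLUSTERED (md5hash)
        WITH (DATA_COMPRESSION = PAGE)
);

-- Typical ETL insert: only add URLs whose hash is not already in the dimension.
-- Assumes the staging batch has already been de-duplicated within itself.
-- HASHBYTES returns varbinary(16); CONVERT style 2 renders it as hex without '0x'.
INSERT INTO dbo.url_dim (raw_url, md5hash, job_control_number)
SELECT s.raw_url,
       CONVERT(char(32), HASHBYTES('MD5', s.raw_url), 2),
       s.job_control_number
FROM   staging.urls AS s
WHERE  NOT EXISTS
       (SELECT 1
        FROM   dbo.url_dim AS d
        WHERE  d.md5hash = CONVERT(char(32), HASHBYTES('MD5', s.raw_url), 2));
```

With this layout every insert lands at a random point in the md5hash index, which is exactly the fragmentation pattern the question describes.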
2 Answers
I would break up the table into physical files, with the older non-changing data in a read-only file group. Make sure the non-clustered index is also in the filegroup.
Edit (from comment): And while I'm thinking about it, if you turn off page level compression, that'll improve I/O as well.
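A rough sketch of what that could look like if the table is split by id range. The database name, filegroup names, file paths, sizes, and the boundary value below are all made-up examples, not anything from the answer. Note that a unique index on md5hash alone cannot be aligned to an id-based partition scheme, so it would have to stay non-aligned on a filegroup of its own:

```sql
-- Two filegroups: frozen history and the actively loaded current range.
ALTER DATABASE UrlDW ADD FILEGROUP FG_HISTORY;
ALTER DATABASE UrlDW ADD FILEGROUP FG_CURRENT;

ALTER DATABASE UrlDW ADD FILE
    (NAME = N'UrlDW_history', FILENAME = N'D:\Data\UrlDW_history.ndf', SIZE = 100GB)
    TO FILEGROUP FG_HISTORY;
ALTER DATABASE UrlDW ADD FILE
    (NAME = N'UrlDW_current', FILENAME = N'E:\Data\UrlDW_current.ndf', SIZE = 100GB)
    TO FILEGROUP FG_CURRENT;

-- Partition by id: rows at or below the boundary are historical and stop changing.
CREATE PARTITION FUNCTION pf_url_id (bigint)
    AS RANGE LEFT FOR VALUES (3000000000);
CREATE PARTITION SCHEME ps_url_id
    AS PARTITION pf_url_id TO (FG_HISTORY, FG_CURRENT);

-- The clustered index would then be rebuilt on ps_url_id(id); once the historical
-- partition stops receiving writes, its filegroup can be frozen:
ALTER DATABASE UrlDW MODIFY FILEGROUP FG_HISTORY READ_ONLY;
```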
I would argue that it should be a degenerate dimension in the fact table.
And figure some way to do partitioning on the data. Maybe take the first xxx characters and store them as a separate field, and partition by that.
Then when you're doing lookups, you pass both the short column and the long column, so the lookup is narrowed to a single partition first.
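A sketch of that prefix-partitioning idea, with assumed names (dbo.url_dim_partitioned, url_prefix) and purely illustrative split points. The prefix is stored as a persisted computed column so it can serve as the partitioning key, and the lookup supplies both the prefix and the full URL so partition elimination can kick in:

```sql
-- Partition on the first 10 characters of the URL (split points are made up).
CREATE PARTITION FUNCTION pf_url_prefix (char(10))
    AS RANGE RIGHT FOR VALUES ('http://e', 'http://m', 'http://s', 'https://');
CREATE PARTITION SCHEME ps_url_prefix
    AS PARTITION pf_url_prefix ALL TO ([PRIMARY]);

CREATE TABLE dbo.url_dim_partitioned
(
    id                 bigint IDENTITY(1,1) NOT NULL,
    raw_url            varchar(2000)        NOT NULL,
    -- Persisted computed prefix so it can be used as the partitioning column.
    url_prefix         AS CAST(LEFT(raw_url, 10) AS char(10)) PERSISTED NOT NULL,
    md5hash            char(32)             NOT NULL,
    job_control_number int                  NOT NULL
) ON ps_url_prefix (url_prefix);

-- Lookup supplies both columns; the prefix predicate limits it to one partition.
DECLARE @url varchar(2000) = 'http://example.com/some/long/path?with=query';
SELECT d.id
FROM   dbo.url_dim_partitioned AS d
WHERE  d.url_prefix = CAST(LEFT(@url, 10) AS char(10))
  AND  d.raw_url    = @url;
```

How well this distributes rows depends entirely on how the chosen prefix length and split points match the real URL population, which is why the answer leaves "xxx" open.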