Strategies to avoid OutOfMemoryException during ETL in .NET
I have written an ETL process that needs to handle more than 100 million rows overall, about 2 years' worth of records. To avoid out-of-memory issues, we chunk the data loading down to 7-day ranges. For each chunk, the process loads all the required reference data, then opens a SQL connection and reads the source data one row at a time, transforms it, and writes it to the data warehouse.
The drawback of processing the data in chunks is that it is slow.
This process has been working fine for most of the tables, but there is one table where I still run out of memory because the process loads too much reference data. I would like to avoid chunking the data down to 3 days, so that performance stays decent.
Are there any other strategies I can use to avoid an OutOfMemoryException?
For example: a local database, writing the reference data to files, spawning another .NET process to hold more memory in Windows, using a CLR stored procedure to do the ETL...
Environment: Windows 7, 32-bit OS, 4 GB of RAM, SQL Server Standard Edition.
The only solution I can see is to use a stored procedure and let SQL Server handle the ETL. However, I am trying to avoid that because the program needs to support Oracle as well.
Other performance improvements I have tried: adding indexes to speed up the loading queries, and creating a custom data access class that loads only the necessary columns instead of the entire row into memory (see the sketch below).
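For reference, the column-limited read is essentially a streaming SqlDataReader loop along these lines (a simplified sketch; the table and column names here are placeholders, not the real schema):

```csharp
using System;
using System.Data.SqlClient;

class SourceReader
{
    public static void Process(string connectionString, DateTime from, DateTime to)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            // Select only the columns the transformation actually needs.
            var sql = "SELECT Id, Amount FROM SourceTable " +
                      "WHERE CreatedAt >= @from AND CreatedAt < @to";
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@from", from);
                cmd.Parameters.AddWithValue("@to", to);
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Only the current row is held in memory, not the whole result set.
                        var id = reader.GetInt64(0);
                        var amount = reader.GetDecimal(1);
                        // transform + write to the warehouse here
                    }
                }
            }
        }
    }
}
```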
Thanks
Without knowing exactly how you process the data it is hard to say, but a simple solution that can be applied in any case is to use a 64-bit OS and compile your application as 64-bit. In 32-bit mode the .NET heap will only grow to about 1.5 GB, which may be what is limiting you.
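If it is not obvious which mode the process actually ends up running in, a quick runtime check (a small sketch, assuming .NET 4 or later for these Environment properties) is:

```csharp
using System;

class BitnessCheck
{
    static void Main()
    {
        Console.WriteLine("64-bit OS:      " + Environment.Is64BitOperatingSystem);
        Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
        Console.WriteLine("IntPtr.Size:    " + IntPtr.Size); // 8 on 64-bit, 4 on 32-bit
    }
}
```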
I know it's an old post, but this is for people searching for better ways to write data operations in a programming language.
I am not sure if you have considered studying how ETL tools perform their data loading operations and replicating a similar strategy in your code.
One such suggestion is parallel data pipes. Each pipe performs the ETL on a single chunk, based on a partitioning of the source data. For example, you could consider spawning processes for different weeks' data in parallel. This by itself will not solve the memory issues within a single process, but it helps once you hit the heap allocation limit of a single process, and it is also useful for reading the data in parallel with random access. It does, however, require a master process to coordinate the pipes and complete them as a single ETL operation; see the sketch below.
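A minimal in-process sketch of the idea, using a bounded Parallel.ForEach over week-sized partitions; runEtlForRange is a placeholder for whatever your per-chunk ETL already does, and you could just as well launch separate OS processes per partition:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelPipes
{
    public static void Run(DateTime start, DateTime end, Action<DateTime, DateTime> runEtlForRange)
    {
        // Split the overall date range into week-sized partitions.
        var partitions = new List<Tuple<DateTime, DateTime>>();
        for (var d = start; d < end; d = d.AddDays(7))
        {
            var to = d.AddDays(7) < end ? d.AddDays(7) : end;
            partitions.Add(Tuple.Create(d, to));
        }

        // Cap the degree of parallelism so the pipes do not exhaust memory together.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
        Parallel.ForEach(partitions, options, p => runEtlForRange(p.Item1, p.Item2));
    }
}
```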
I assume your transformation performs a lot of lookup operations before finally writing the data to the database, and that the master transaction table is huge while the reference data is small. You need to focus on data structures and algorithms; there are a few tips below. Look at the characteristics of your data before choosing what suits best when writing the algorithm.
Generally, lookup data (reference data) is stored in a cache. Choose a simple data structure that is efficient for read and search operations (say, an array list). If possible, sort this array by the key you will join on, so that your search algorithm can be efficient.
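For example, a small sketch of such a sorted-array reference cache with binary-search lookup (the RefRow type and integer key are made up for illustration):

```csharp
using System;
using System.Linq;

class RefRow
{
    public int Key;
    public string Value;
}

class ReferenceCache
{
    private readonly int[] _keys;
    private readonly RefRow[] _rows;

    public ReferenceCache(RefRow[] rows)
    {
        _rows = rows.OrderBy(r => r.Key).ToArray();   // sort once, up front
        _keys = _rows.Select(r => r.Key).ToArray();
    }

    public RefRow Lookup(int key)
    {
        int i = Array.BinarySearch(_keys, key);       // O(log n) per lookup
        return i >= 0 ? _rows[i] : null;
    }
}
```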
There are different strategies for the lookup operations in your transformation tasks; in the database world you would call them join operations.
Merge join algorithm:
Ideal when the source is already sorted on the join attribute key. The key idea of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans will encounter matching keys at the same time. For sample code, see https://en.wikipedia.org/wiki/Sort-merge_join
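A compact C# sketch of the interleaved scan, assuming both inputs are already sorted ascending by the join key and the reference keys are unique:

```csharp
using System;
using System.Collections.Generic;

class MergeJoin
{
    public static IEnumerable<Tuple<int, string, string>> Join(
        Tuple<int, string>[] transactions,   // sorted by Item1 (the join key)
        Tuple<int, string>[] reference)      // sorted by Item1, keys unique
    {
        int i = 0, j = 0;
        while (i < transactions.Length && j < reference.Length)
        {
            if (transactions[i].Item1 < reference[j].Item1) i++;
            else if (transactions[i].Item1 > reference[j].Item1) j++;
            else
            {
                yield return Tuple.Create(transactions[i].Item1,
                                          transactions[i].Item2,
                                          reference[j].Item2);
                i++;   // keep j so repeated transaction keys still match this reference row
            }
        }
    }
}
```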
Nested loop join:
Works like a nested loop, where each value of the outer loop's index is taken as a limit (or starting point, or whatever is applicable) for the index of the inner loop, and the corresponding actions are performed on the statements inside the inner loop. So basically, if the outer loop executes R times and for each such execution the inner loop executes S times, the total cost or time complexity of the nested loop is O(RS).
Nested-loop joins provide efficient access when the tables are indexed on the join columns. Furthermore, in many small transactions, such as those affecting only a small set of rows, index nested-loop joins are far superior to both sort-merge joins and hash joins.
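A sketch of the indexed form, where a dictionary keyed on the join column plays the role of the index (the types are illustrative):

```csharp
using System;
using System.Collections.Generic;

class NestedLoopJoin
{
    public static IEnumerable<Tuple<int, string, string>> Join(
        IEnumerable<Tuple<int, string>> transactions,
        Dictionary<int, string> referenceByKey)     // the "index" on the join column
    {
        foreach (var t in transactions)             // outer loop: R iterations
        {
            string refValue;
            if (referenceByKey.TryGetValue(t.Item1, out refValue))  // indexed inner probe
                yield return Tuple.Create(t.Item1, t.Item2, refValue);
        }
    }
}
```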
I am only describing two methods that can be considered for your lookup operation. The main idea to remember in ETL is to look up and retrieve the tuples (as a set) for further operation. The search is based on a key, and the resulting transaction keys are used to extract all the matching records (a projection). Take those keys and load the rows from the file in one read operation. This is more of a suggestion in case you don't need all the records for the transformation operations.
Another very costly operation is writing back to the database. There may be a tendency to process the extraction, transformation and loading one row at a time. Think instead of operations that can be vectorized, where you perform them together as a bulk data structure operation: for example, apply a lambda operation over a multi-dimensional vector rather than looping over every row and performing the transformation across all columns one row at a time. You can then write that vector to a file or to the database in one go, which avoids memory pressure; a batched-write sketch follows below.
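On SQL Server, one way to sketch such a batched write is SqlBulkCopy over a buffered DataTable; the destination table name below is a placeholder, and Oracle would need its own bulk/array-binding equivalent:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

class BatchWriter
{
    // Writes one buffered batch of transformed rows in a single bulk operation,
    // instead of issuing one INSERT per row.
    public static void WriteBatch(string connectionString, DataTable batch)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.FactTable";  // placeholder target table
            bulk.BatchSize = 5000;
            bulk.WriteToServer(batch);
        }
        batch.Clear();  // release the rows so the buffer does not grow unbounded
    }
}
```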
This is a very old question, and it is more of a design question; I am sure there are many solutions to it, and the right one depends on more specific details.
Ultimately, I wrote a SQL stored procedure using MERGE to handle the ETL process for the data type that took too long to process through the C# application. In addition, the business requirements changed: we dropped Oracle support and now only support 64-bit servers, which reduced maintenance cost and avoided the ETL out-of-memory issue.
We also added indexes whenever we saw an opportunity to improve query performance.
Instead of chunking only by date range, the ETL process also chunks the data by count (5000 rows) and commits each batch in its own transaction. This reduced the transaction log file size, and if the ETL fails, the process only needs to roll back a subset of the data.
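The batching looks roughly like this (a sketch; writeBatch stands in for the actual warehouse insert logic, which is not shown):

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class CountChunkedLoader
{
    public static void Load(string connectionString,
                            IEnumerable<object[]> transformedRows,
                            Action<SqlTransaction, List<object[]>> writeBatch,
                            int batchSize = 5000)
    {
        var buffer = new List<object[]>(batchSize);
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            foreach (var row in transformedRows)
            {
                buffer.Add(row);
                if (buffer.Count == batchSize)
                    Commit(conn, buffer, writeBatch);
            }
            if (buffer.Count > 0)
                Commit(conn, buffer, writeBatch);   // final partial batch
        }
    }

    private static void Commit(SqlConnection conn, List<object[]> buffer,
                               Action<SqlTransaction, List<object[]>> writeBatch)
    {
        using (var tx = conn.BeginTransaction())
        {
            writeBatch(tx, buffer);   // a failure here only loses this one batch
            tx.Commit();
        }
        buffer.Clear();               // keep the transaction log and memory footprint small
    }
}
```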
Lastly, we implemented key/value caches so that frequently referenced data within the ETL date range is loaded into memory, which reduces database querying.
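The cache itself is nothing fancy, roughly along these lines (the table, columns, and key/value types here are placeholders, not our real schema):

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class ReferenceDataCache
{
    private readonly Dictionary<string, decimal> _cache = new Dictionary<string, decimal>();

    // Load the reference rows for the current ETL date range once, up front.
    public void Load(string connectionString, DateTime from, DateTime to)
    {
        var sql = "SELECT RefKey, Rate FROM ReferenceTable " +
                  "WHERE EffectiveDate >= @from AND EffectiveDate < @to";
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@from", from);
            cmd.Parameters.AddWithValue("@to", to);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    _cache[reader.GetString(0)] = reader.GetDecimal(1);
            }
        }
    }

    // Serve all subsequent lookups from memory, with no database round-trip.
    public bool TryGet(string key, out decimal rate)
    {
        return _cache.TryGetValue(key, out rate);
    }
}
```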