PostgreSQL to data warehouse: best approach for near-real-time ETL / data extraction



Background:

I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.

I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means; the answer is as frequently as I reasonably can, but to be pragmatic, as a benchmark let's say we are hoping for every 15 minutes) and feed it into a data warehouse.

How much data? At peak times we are talking approx 80-100k rows per minute hitting the OLTP side; off-peak this drops significantly to 15-20k. The most frequently updated rows are ~64 bytes each, but there are various tables etc., so the data is quite diverse and can range up to 4000 bytes per row. The OLTP system is active 24x5.5.

Best Solution?

From what I can piece together the most practical solution is as follows:

  • Create a TRIGGER to write all DML activity to a rotating CSV log file (a rough sketch follows this list)
  • Perform whatever transformations are required
  • Use the native DW data pump tool to efficiently pump the transformed CSV into the DW
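
To make steps 1 and 3 concrete, here is a minimal sketch of one way the capture could work, for illustration only: instead of writing CSV directly from the trigger, changes land in a plain changelog table and a 15-minute job flushes them to CSV with COPY. All table, column, and file names here are hypothetical, and the syntax sticks to what 8.3 supports.

    -- Changelog mirroring the columns of interest, plus operation and capture time.
    CREATE TABLE orders_changelog (
        captured_at timestamptz NOT NULL DEFAULT now(),
        operation   char(1)     NOT NULL,   -- 'I', 'U' or 'D'
        order_id    bigint      NOT NULL,
        amount      numeric,
        status      text
    );

    CREATE OR REPLACE FUNCTION orders_capture() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO orders_changelog (operation, order_id, amount, status)
            VALUES ('D', OLD.order_id, OLD.amount, OLD.status);
            RETURN OLD;
        ELSE
            INSERT INTO orders_changelog (operation, order_id, amount, status)
            VALUES (substr(TG_OP, 1, 1), NEW.order_id, NEW.amount, NEW.status);
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_capture_trg
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE PROCEDURE orders_capture();

    -- 15-minute job: export what has accumulated, then clear it. SERIALIZABLE on
    -- 8.3 is snapshot isolation, so the DELETE removes exactly the rows COPY saw;
    -- rows from transactions committing in between stay for the next run.
    -- (COPY TO a server-side file needs superuser; psql's \copy is the client-side
    -- alternative.)
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    COPY orders_changelog TO '/var/tmp/orders_changes.csv' WITH CSV;
    DELETE FROM orders_changelog;
    COMMIT;

Whatever transformations step 2 needs can then run over the CSV before the DW data pump loads it.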

Why this approach?

  • TRIGGERS allow selective tables to be targeted rather than being system-wide, the output is configurable (i.e. into a CSV), and they are relatively easy to write and deploy. SLONY uses a similar approach and the overhead is acceptable
  • CSV is easy and fast to transform
  • It is easy to pump the CSV into the DW

Alternatives considered ....

  • Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). The problem with this is that it looked very verbose relative to what I needed and was a little trickier to parse and transform. However, it could be faster, as I presume there is less overhead compared to a TRIGGER. It would certainly make the admin easier, as it is system-wide, but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log)
  • Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... the problem is the OLTP schema would need to be tweaked to support this, and that has many negative side-effects
  • Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave, so the conceptual framework is there, but the proposed solution just seems easier and cleaner
  • Using the WAL

Has anyone done this before? Want to share your thoughts?


Comments (4)

晚雾 2024-09-02 19:09:30


Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much, much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 if it is the first extraction). This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with the number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked up by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
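
For concreteness, a sketch of that incremental pull as it might look on 8.3, assuming a hypothetical orders table carrying a sequential bigint key extract_seq and a small bookkeeping table for the watermark:

    -- One row per extracted table, remembering the last key already exported.
    CREATE TABLE etl_watermark (
        table_name   text   PRIMARY KEY,
        last_max_key bigint NOT NULL DEFAULT 0
    );
    INSERT INTO etl_watermark VALUES ('orders', 0);

    -- 15-minute job, run in one snapshot so the export and the new watermark
    -- agree (SERIALIZABLE is plain snapshot isolation on 8.3).
    BEGIN ISOLATION LEVEL SERIALIZABLE;

    COPY (
        SELECT o.*
        FROM   orders o, etl_watermark w
        WHERE  w.table_name = 'orders'
        AND    o.extract_seq > w.last_max_key
        ORDER  BY o.extract_seq
    ) TO '/var/tmp/orders_delta.csv' WITH CSV;

    UPDATE etl_watermark
    SET    last_max_key = (SELECT COALESCE(MAX(extract_seq), last_max_key)
                           FROM orders)
    WHERE  table_name = 'orders';

    COMMIT;

    -- Caveat: a transaction still in flight at snapshot time can later commit a
    -- row with a lower extract_seq than the recorded watermark and be skipped;
    -- extracting with a small overlap window (and de-duplicating in the DW) is
    -- one common guard.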

I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.

Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine.) Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.

猫九 2024-09-02 19:09:30


If you can maintain a 'checksum table' that contains only the ids and the 'checksum', you can quickly select not only the new records but also the changed and deleted records.

The checksum could be any crc32-style checksum function you like.
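
A rough sketch of the idea, assuming a hypothetical source table orders with primary key id. Core PostgreSQL has no built-in crc32(), so md5() over the row's text form stands in for it here (on older releases you may need to concatenate the columns explicitly instead of casting the row to text):

    -- Snapshot of ids and row checksums as of the last extraction.
    CREATE TABLE orders_checksum (
        id       bigint PRIMARY KEY,
        checksum text   NOT NULL
    );

    -- New or changed rows: id missing from the snapshot, or checksum differs.
    SELECT o.*
    FROM   orders o
    LEFT   JOIN orders_checksum c ON c.id = o.id
    WHERE  c.id IS NULL
       OR  c.checksum <> md5(o::text);

    -- Deleted rows: present in the snapshot but gone from the source.
    SELECT c.id
    FROM   orders_checksum c
    LEFT   JOIN orders o ON o.id = c.id
    WHERE  o.id IS NULL;

    -- After a successful extraction, rebuild the snapshot for the next run.
    TRUNCATE orders_checksum;
    INSERT INTO orders_checksum
    SELECT o.id, md5(o::text) FROM orders o;

Note that this trades trigger overhead for a full scan of the source table on every run, which may or may not be acceptable on a busy OLTP box.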

桃气十足 2024-09-02 19:09:30


The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table, then in one SQL statement INSERT into the target table with ON CONFLICT ... DO UPDATE. If your target table is partitioned then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to other "UPSERT" patterns (update, then insert if zero rows, etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for its money!
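
A compressed sketch of that flow (PostgreSQL 9.5+, where ON CONFLICT first appeared), with illustrative table and column names; dw_orders needs a unique constraint on order_id for ON CONFLICT to target:

    -- Stage only the rows that changed since the last load.
    CREATE TEMP TABLE stage_orders (LIKE dw_orders INCLUDING DEFAULTS);

    INSERT INTO stage_orders
    SELECT *
    FROM   source_orders
    WHERE  row_update_timestamp > :'last_run_ts';   -- psql variable set by the ETL job

    -- One statement upserts the whole batch into the warehouse table.
    INSERT INTO dw_orders
    SELECT * FROM stage_orders
    ON CONFLICT (order_id) DO UPDATE
    SET    amount               = EXCLUDED.amount,
           status               = EXCLUDED.status,
           row_update_timestamp = EXCLUDED.row_update_timestamp;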

○愚か者の日 2024-09-02 19:09:30


My view on this topic as of 2023...

Option 1 (Batch Approach):

  • Staging using incremental extract: keep an integer or timestamp watermark recording the maximum row transferred per table in each iteration. We can always use an ON CONFLICT statement to avoid unexpected key violations after an unexpected iteration crash. This approach cannot track row deletions, but we can use a deletion flag column to filter them in the data warehouse.
  • Data warehouse using calculated tables: stored procedures perform the complex joins / calculations and insert the results into new precalculated tables.

Option 2 (Pipeline Approach):

  • Staging using logical replication for real-time extraction. Logical replication captures and replicates changes in the same sequence they occurred, so the target database will always be consistent. This approach can also track deletions (a minimal publication/subscription sketch follows this list).
  • Data warehouse using a mix of incremental materialized views (IVM) for real-time precalculation of lightweight joins / calculations, and calculated tables maintained by stored procedures for the heavier joins / calculations, as IVM currently does not support outer joins and all types of aggregations.
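
For the staging side of Option 2, a bare-bones sketch with native logical replication (PostgreSQL 10+); host, database, user, and table names are illustrative, and matching table definitions must already exist on the subscriber:

    -- On the OLTP primary (requires wal_level = logical in postgresql.conf):
    CREATE PUBLICATION dw_feed FOR TABLE orders, customers;

    -- On the staging database that feeds the warehouse:
    CREATE SUBSCRIPTION dw_feed_sub
        CONNECTION 'host=oltp-primary dbname=oltp user=repl_user password=secret'
        PUBLICATION dw_feed;

The incremental materialized views mentioned for the lightweight precalculations are typically provided by an extension (e.g. pg_ivm) rather than by core PostgreSQL.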