Python:与复杂数据仓库交互
我们已经努力为我们的问题建立一个全维数据库模型,现在是时候开始编码了。我们之前的项目使用了通过字符串操作构建的手工查询。
python 和复杂的数据库布局之间的接口是否有最佳/标准实践?
我已经简要评估了 SQLAlchemy、SQLObject 和 Django-ORM,但是(我可能很容易遗漏一些东西)它们似乎针对小型 Web 类型 (OLTP) 事务进行了调整,而我正在其中进行大容量分析 (OLAP) 事务。
我的一些要求可能与平常有所不同:
- 相对快速地加载大量数据
- 快速轻松地更新/插入少量数据轻松
- 处理大量行(5 年内每分钟 300 个条目)
- 允许修改架构,满足未来的需求
编写这些查询很容易,但是编写代码来排列所有数据却很乏味,尤其是随着架构的发展。这似乎是计算机擅长的事情?
We've worked hard to work up a full dimensional database model of our problem, and now it's time to start coding. Our previous projects have used hand-crafted queries constructed by string manipulation.
Is there any best/standard practice for interfacing between python and a complex database layout?
I've briefly evaluated SQLAlchemy, SQLObject, and Django-ORM, but (I may easily be missing something) they seem tuned for tiny web-type (OLTP) transactions, where I'm doing high-volume analytical (OLAP) transactions.
Some of my requirements, that may be somewhat different than usual:
- load large amounts of data relatively quickly
- update/insert small amounts of data quickly and easily
- handle large numbers of rows easily (300 entries per minute over 5 years)
- allow for modifications in the schema, for future requirements
Writing these queries is easy, but writing the code to get the data all lined up is tedious, especially as the schema evolves. This seems like something that a computer might be good at?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
不要对您的要求感到困惑。一种尺寸并不适合所有情况。
为什么不使用数据库的本机加载器来实现此目的?使用Python准备文件,但使用数据库工具加载。你会发现这速度快得惊人。
这开始改变数据仓库的规则。除非您正在谈论主数据管理来更新维度的报告属性。
这就是 ORM 和 Web 框架的用途。
同样,这就是您使用 Python 前端处理管道的原因,但实际的 INSERT 是由数据库工具完成的。不是Python。
您几乎没有必要将其自动化。这当然是“编程”的最低优先级任务。为了正确保存数据,您通常会手动执行此操作。
顺便说一句,“通过字符串操作构建的手工查询”可能是有史以来最大的错误。这些对于 RDBMS 解析器来说很难处理——它们比使用插入了绑定变量的查询要慢。
Don't get confused by your requirements. One size does not fit all.
Why not use the databases's native loaders for this? Use Python to prepare files, but use database tools to load. You'll find that this is amazingly fast.
That starts to bend the rules of a data warehouse. Unless you're talking about Master Data Management to update reporting attributes of a dimension.
That's what ORM's and web frameworks are for.
Again, that's why you use a pipeline of Python front-end processing, but the actual INSERT's are done by database tools. Not Python.
You have almost no use for automating this. It's certainly your lowest priority task for "programming". You'll often do this manually in order to preserve data properly.
BTW, "hand-crafted queries constructed by string manipulation" is probably the biggest mistake ever. These are hard for the RDBMS parser to handle -- they're slower than using queries that have bind variables inserted.
我将 SQLAlchemy 与一个相当大的数据仓库一起使用,并且成功地将它用于完整的 ETL 过程。特别是在某些具有复杂转换规则或异构源(例如 Web 服务)的源中。我没有使用 Sqlalchemy ORM,而是使用它的 SQL 表达式语言,因为我实际上不需要在 ETL 过程中将任何内容与对象进行映射。值得注意的是,当我带来一些源的逐字副本时,我宁愿使用数据库工具(例如 PostgreSQL 转储实用程序)。你无法击败它。
SQL 表达式语言是您使用 SQLAlchemy(或任何 ORM)获得的最接近手写 SQL 的语言,但由于您可以通过 Python 以编程方式生成 SQL,因此您将节省时间,特别是如果您需要遵循一些非常复杂的转换规则。
但有一件事,我宁愿手动修改我的架构。我不相信任何工具可以完成这项工作。
I'm using SQLAlchemy with a pretty big datawarehouse and I'm using it for the full ETL process with success. Specially in certain sources where I have some complex transformation rules or with some heterogeneous sources (such as web services). I'm not using the Sqlalchemy ORM but rather using its SQL Expression Language because I don't really need to map anything with objects in the ETL process. Worth noticing that when I'm bringing a verbatim copy of some of the sources I rather use the db tools for that -such as PostgreSQL dump utility-. You can't beat that.
SQL Expression Language is the closest you will get with SQLAlchemy (or any ORM for the matter) to handwriting SQL but since you can programatically generate the SQL from python you will save time, specially if you have some really complex transformation rules to follow.
One thing though, I rather modify my schema by hand. I don't trust any tool for that job.
当然是 SQLAlchemy。与 SQLAlchemy 相比,所有其他 ORM 看起来都像小孩子的玩具。特别是 Django-ORM。 SQLAlchemy 之于 Python,就像 Hibernate 之于 Java 一样。
SQLAlchemy definitely. Compared to SQLAlchemy, all other ORMs look like child's toy. Especially the Django-ORM. What's Hibernate to Java, SQLAlchemy is to Python.