How to set up a local environment for large databases

Posted on 2024-12-08 11:00:05

I have two data stores (PostgreSQL, MongoDB), and since I need to develop the application locally on my computer (ideally offline), I need the data from those stores copied to my HDD.

In any case, these are massive databases holding several hundred gigabytes of data.

I don't need all the data stored there, just a sample that lets me run my app locally against it. Both stores have capable export tools (pg_dump, mongodump, mongoexport, etc.).

But I don't know how to export a small sample of the data easily and effectively. Even if I took the list of all tables/collections and built a whitelist defining which tables should be limited to a certain number of rows, there would still be trouble with triggers, functions, indexes, and so on.
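
To make the whitelist idea concrete, here is a minimal sketch of what such a sample export could look like. The database names (appdb, mydb), the table/collection whitelist, and the row limit are all hypothetical; pg_dump --schema-only carries the triggers, functions, and indexes along with the table definitions, but nothing here guarantees that rows sampled from different tables still reference each other consistently.

```bash
#!/usr/bin/env bash
# Sketch: export the full schema but only a capped sample of rows.
# Database names, table lists, and limits below are hypothetical.
set -euo pipefail

PGDB=appdb
LIMIT=10000

# 1) Schema only: tables, indexes, triggers, functions, but no data.
pg_dump --schema-only --dbname="$PGDB" > schema.sql

# 2) Data sample: cap the row count for each whitelisted table.
for table in users orders events; do
  psql --dbname="$PGDB" -c \
    "\copy (SELECT * FROM ${table} LIMIT ${LIMIT}) TO '${table}.sample.csv' WITH CSV HEADER"
done

# 3) MongoDB: mongoexport supports --limit directly.
for coll in sessions logs; do
  mongoexport --db=mydb --collection="$coll" --limit="$LIMIT" --out="${coll}.sample.json"
done
```

Restoring locally would then be psql -f schema.sql, a \copy ... FROM for each CSV, and mongoimport for the JSON files, with the caveat already raised above that foreign keys between sampled tables may no longer line up.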

Comments (2)

厌味 2024-12-15 11:00:05

I don't know about testing for MongoDB, but for PostgreSQL here's what I do.

I follow a pattern while developing against databases that separates the DB side from the app side. For testing the DB side, I have a test schema which includes a single stored procedure that resets all the data in the real schema. This reset is done following the MERGE pattern (delete any records with an unrecognized key, update records that have matching keys but which are changed, and insert missing records). This reset is called before running every unit test. This gives me simple, clear test coverage for stored functions.
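
As a rough illustration of that reset (the test schema, fixture table, and column names below are invented for the example, not taken from my real code), the MERGE-pattern function could look roughly like this:

```bash
# Sketch with made-up schema/table/column names. Installs a test-schema
# function that forces one table back to its canonical state using the
# MERGE pattern: delete unknown keys, update drifted rows, insert missing rows.
# Assumes a fixture table test.users_fixture holding the canonical data.
psql --dbname=appdb <<'SQL'
CREATE SCHEMA IF NOT EXISTS test;

CREATE OR REPLACE FUNCTION test.reset_users() RETURNS void AS $$
BEGIN
    -- Delete records whose key is not in the canonical fixture.
    DELETE FROM public.users u
    WHERE NOT EXISTS (SELECT 1 FROM test.users_fixture f WHERE f.id = u.id);

    -- Update records whose key matches but whose data has changed.
    UPDATE public.users u
    SET name = f.name, email = f.email
    FROM test.users_fixture f
    WHERE u.id = f.id
      AND (u.name, u.email) IS DISTINCT FROM (f.name, f.email);

    -- Insert fixture records that are missing entirely.
    INSERT INTO public.users (id, name, email)
    SELECT f.id, f.name, f.email
    FROM test.users_fixture f
    WHERE NOT EXISTS (SELECT 1 FROM public.users u WHERE u.id = f.id);
END;
$$ LANGUAGE plpgsql;
SQL
```

Each unit test then starts by calling SELECT test.reset_users(); (one such call per table, or a single function that resets everything).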

For testing code that calls into the database, the database layer is always mocked, so there are never any calls that actually go to the database.

What you are describing suggests to me that you are attempting to mix unit testing with integration testing, and I rather strongly suggest that you don't do that. Integration testing is what happens when you've already proven base functionality and want to prove integration between components, and probably performance as well. For integration testing, you really need a representative data set on representative hardware. Usually this means a dedicated machine, and using Hudson for CI.

The direction you seem to be going in is going to be difficult because, as you've already noticed, it is hard to handle that volume of data and hard to generate representative data sets (most CI systems actually use production data that has been "cleaned" of sensitive information).

Which is why most of the places I've worked have not gone that way.

悟红尘 2024-12-15 11:00:05

Just copy it all. Several hundred gigabytes is not very much by today's standards; you can buy a 2000 GB disk for $80.

If you test your code on a small sample of data, how will you know whether it is efficient enough for the full database?

Just remember to encrypt it with a strong password if it leaves your company building.
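
As a rough sketch of that approach (database names and file names are made up), a full export plus symmetric encryption could look like this:

```bash
#!/usr/bin/env bash
# Sketch: dump everything and encrypt it before it leaves the building.
# Database names and file names are hypothetical.
set -euo pipefail

# Full PostgreSQL dump in custom format (compressed, restorable with pg_restore).
pg_dump --format=custom --dbname=appdb --file=appdb.dump

# Full MongoDB dump into a directory, then pack it up.
mongodump --db=mydb --out=mongo_dump
tar czf mongo_dump.tar.gz mongo_dump

# Symmetric encryption with a strong passphrase (prompted interactively).
gpg --symmetric --cipher-algo AES256 appdb.dump
gpg --symmetric --cipher-algo AES256 mongo_dump.tar.gz
```

After decrypting on your machine, pg_restore can load the custom-format dump into a local database and mongorestore can load the extracted dump directory.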
