Data pipeline proposal
Our product has been growing steadily over the last few years and we are now at a turning point as far as data size goes for some of our tables: we expect those tables to double or triple in size in the next few months, and grow even more in the years after that. We are in the range of 1.4M rows now, so over 3M by the end of the summer, and (since we expect growth to be exponential) we assume around 10M by the end of the year. (M being million, not mega/1000.)
The table we are talking about is sort of a logging table. The application receives data files (CSV/XLS) on a daily basis and the data is transferred into said table. It is then used in the application for a specific amount of time - a couple of weeks or months - after which it becomes rather redundant. That is: if all goes well. If a problem turns up down the road, the data in those rows can be useful for troubleshooting.
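For context, the daily ingestion step looks roughly like the following. This is only a minimal stand-in using Python's standard library with an in-memory SQLite database in place of our MySQL instance; the table and column names (`import_log`, `received`, `payload`) are made up for illustration:

```python
import csv
import io
import sqlite3

# In-memory SQLite stands in for the real MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE import_log (id INTEGER PRIMARY KEY, received TEXT, payload TEXT)"
)

# A daily CSV file as it might arrive (hypothetical columns).
daily_file = io.StringIO("received,payload\n2024-01-02,foo\n2024-01-02,bar\n")

rows = [(r["received"], r["payload"]) for r in csv.DictReader(daily_file)]
conn.executemany("INSERT INTO import_log (received, payload) VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM import_log").fetchone()[0])  # 2
```

The real pipeline also handles XLS files and runs against MySQL, but the shape is the same: parse the file, bulk-insert into the logging table.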
What we would like to do is periodically clean up the table, removing any number of rows based on certain requirements - but instead of actually deleting the rows, move them 'somewhere else'.
We currently use MySQL as the database, and the 'somewhere else' could be MySQL as well, but it can be anything. For other projects we have a Master/Slave setup where the whole database is involved, but that's not what we want or need here. It would involve just a few tables, where the master table needs to get shorter and the slave only grows - not a one-to-one sync.
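The move-instead-of-delete step we have in mind would be a copy-then-delete pair inside one transaction, something like this sketch (again SQLite stands in for MySQL, and the table/column names and cutoff condition are hypothetical; in MySQL the same `INSERT ... SELECT` plus `DELETE` pattern applies):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE import_log (id INTEGER PRIMARY KEY, received TEXT, payload TEXT);
    CREATE TABLE import_log_archive (id INTEGER PRIMARY KEY, received TEXT, payload TEXT);
    INSERT INTO import_log VALUES
        (1, '2023-10-01', 'old'),
        (2, '2023-10-02', 'old'),
        (3, '2024-01-02', 'recent');
""")

cutoff = "2024-01-01"
with conn:  # one transaction: copy matching rows to the archive, then delete them
    conn.execute(
        "INSERT INTO import_log_archive SELECT * FROM import_log WHERE received < ?",
        (cutoff,),
    )
    conn.execute("DELETE FROM import_log WHERE received < ?", (cutoff,))

print(conn.execute("SELECT COUNT(*) FROM import_log").fetchone()[0])          # 1
print(conn.execute("SELECT COUNT(*) FROM import_log_archive").fetchone()[0])  # 2
```

For MySQL specifically, Percona Toolkit's `pt-archiver` automates exactly this copy-then-delete pattern in small batches (to another table, another server, or a file), which may be worth a look before building anything custom.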
The main requirement for the secondary store is that the data should be easy to inspect/query when needed, either by SQL or another DSL, or just visual tooling. So we are not interested in backing up the data to one or more CSV files or another plain-text format, since that is not as easy to inspect. The logs would then sit somewhere on S3, and we would need to download them and grep/sed/awk through them... We'd much rather have something database-like that we can consult.
I hope the problem is clear?
For the record: while the solution can be anything, we prefer the simplest solution possible. It's not that we don't want Apache Kafka (as an example), but then we'd have to learn it, install it, and maintain it. Every new piece of technology adds to our stack; the lighter it stays, the better we like it ;).
Thanks!
PS: we are not just being lazy here - we have done some research - but we thought it would be a good idea to get some more insight into the problem.