Database design / latency / concurrency, giving me a headache

Published 2024-08-06 03:15:23


I have a client-server application which gets all the data out of a couple of tables, recalculates something and stores it.

Example:

Each Item has a 'Bill of Materials' = the list and quantities of the other items it is made out of. Therefore the cost of an item is the sum of (cost of each item in its BOM) * (its quantity). Ultimately, some "base" items have no BOM and just have their cost set independently. (i.e.: raw materials)

i.e.: A's BOM says it's made out of 2xB and 3xC.

What I do now, and I don't remember why I do it like this, is I get all the items and all the BOMs out of the DB, and go through the items one at a time, calculating each cost recursively. Once I calculate one item, I flag it so I don't redo the cost again. (This also guards against infinite recursion.)
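For reference, the flag-and-recurse approach described above can be sketched in a few lines (purely in-memory, with invented names; this is a reconstruction, not the actual code):

```python
# Sketch of the "load everything, then recurse with a flag" approach.
# base_costs: {item: cost} for raw materials (no BOM).
# boms: {item: [(component, quantity), ...]} for assembled items.

def compute_all_costs(base_costs, boms):
    costs = dict(base_costs)          # presence in `costs` is the "flag"
    in_progress = set()               # guards against cyclic BOMs

    def cost_of(item):
        if item in costs:             # already computed, or a raw material
            return costs[item]
        if item in in_progress:
            raise ValueError(f"cyclic BOM involving {item}")
        in_progress.add(item)
        total = sum(cost_of(comp) * qty for comp, qty in boms[item])
        in_progress.remove(item)
        costs[item] = total
        return total

    for item in boms:
        cost_of(item)
    return costs
```

Note that the flag does double duty exactly as described: it memoizes finished items, while the separate `in_progress` set turns a cyclic BOM into an error instead of infinite recursion.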

Thing is, this is kinda stupid: first, it's slow and recalculates stuff that hasn't changed, and worse, give it a big enough DB and it will run out of memory.

Instead, I could recalculate items on demand: when an Item's BOM changes, I recalculate that BOM, then SELECT all the BOMs which contain this updated Item, and recalculate them as well; rinse and repeat recursively 'till you reach the top, where no BOM in the DB depends on any of the changed items.
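A sketch of that on-demand propagation (again in-memory with invented names; in the real thing each step would be the SELECT/UPDATE described above). One subtlety worth encoding: recompute children before parents, otherwise an item can be recomputed while one of its own affected components is still stale:

```python
from collections import deque

def propagate_cost_change(changed, costs, boms, used_in):
    """Recompute only the ancestors of `changed`.
    costs: {item: cost}; boms: {item: [(component, qty), ...]};
    used_in: reverse index {component: {parent, ...}} -- the in-memory
    stand-in for "SELECT all BOMs which contain this item"."""
    # 1. Collect every item whose cost transitively depends on `changed`.
    affected = set()
    queue = deque(used_in.get(changed, ()))
    while queue:
        item = queue.popleft()
        if item not in affected:
            affected.add(item)
            queue.extend(used_in.get(item, ()))
    # 2. Recompute children-before-parents: an item is ready only when
    #    no affected component in its BOM is still waiting.
    pending = set(affected)
    while pending:
        ready = {i for i in pending
                 if not any(c in pending for c, _ in boms[i])}
        if not ready:
            raise ValueError(f"cyclic BOM among: {sorted(pending)}")
        for item in ready:
            costs[item] = sum(costs[c] * q for c, q in boms[item])
        pending -= ready
```

The "rinse and repeat 'till you reach the top" loop is step 1; step 2 is the recalculation pass over just the affected set.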

What this means in practice: say some of the Items are raw materials, whose cost might be updated often, and some Items are "end-user" stuff, whose BOM will rarely if ever change. When the user changes the cost of one of those materials, it might mean going through thousands of Items, recalculating them. Say a SELECT of 1 Item/BOM takes 15ms (I'm on Postgresql); then merely SELECTing 1000 Items/BOMs will take 15 seconds, and then you have to UPDATE the recalculated cost back into each Item in the DB... oh dear, latency can turn into minutes now.

The ERP software the company I work for uses takes the 1st approach: batch-recalculate the entire DB at once. This literally takes hours, and it seems the problems with this approach have been building up over the 10+ years of usage. The batch recalculation is done weekly.

Now that I've actually "written this out loud", I don't think that it taking a few minutes matters too much. The problem is that I don't understand databases well, and I'm worrying about concurrency: since it will take a long time to update Item A, it is likely someone will update a second Item B during the time Item A is being updated.

Say Item D is made out of the A and B above. User 1 updates A, so the server software starts grinding away at the DB for a couple of minutes, eventually updating D. But in the meantime, User 2 updates B, so the server will eventually update D again.

Will using Postgresql's transactions solve the problem? A transaction begins with the then-current state of the DB, so Transaction 1 sees D being made out of A1 and B1, and it's updating A from A1 to A2; but before it finishes and commits, Transaction 2 begins, also seeing A1 and B1. T1 recalculates and commits: D = A2 + B1. But T2 has already begun, and doesn't see the new A, A2. So it finally commits to the DB that D = A1 + B2, which is incorrect. It should be D = A2 + B2.

Also, some processing will overlap, wasting server time.

If I do T1 and T2 in sequence instead of in parallel, then hooray, the answer is correct, but User 2 will have to wait longer. Also, if a group of transactions has no relation to each other (completely independent dependency trees; i.e.: A=X+Y and B=N+M), then parallel computation will give the correct answer AND will be faster for the user.

Important note: even when processing in sequence, I'd still use transactions, so the rest of the software can still work with that data in parallel, except for the function that recalculates cost.
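One common way to get exactly that behaviour (recalculations strictly in sequence, everything else in parallel) is a single dedicated recalculation worker fed by a queue. A minimal sketch, with invented names; the real `handle_job` would open one transaction, do the propagation, and commit:

```python
import queue
import threading

recalc_queue = queue.Queue()

def recalc_worker(handle_job):
    """Drain recalculation jobs one at a time, in arrival order, so two
    recalculations can never interleave. Everything else keeps reading
    and writing the DB concurrently under its own transactions."""
    while True:
        job = recalc_queue.get()
        if job is None:          # sentinel: shut the worker down
            break
        handle_job(job)

# The server starts exactly one worker thread, e.g.:
#   threading.Thread(target=recalc_worker, args=(do_recalc,), daemon=True).start()
# and request handlers just enqueue: recalc_queue.put(changed_item_id)
```

Because there is only one consumer, jobs are handled in FIFO order; independent dependency trees could later be dispatched to parallel workers keyed by their root, but that's an optimization on top of this.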

Now, this whole "process-in-sequence" thing would not be so bad if the DB latency weren't so "awful". If, say, the entire data were held in RAM, then going through 1000 objects would be a breeze. Ah, but even if I build a system to quickly move chunks of data to/from disk/RAM and do some caching (to replace the DB), that won't do, because I still need transactions so that the rest of the server functionality can work in parallel ('important note' above). So I'd end up building another DB. It might be a bit faster, but it's stupid / a waste of time.

The whole reason I "cache" the cost of each Item is so that I don't recalculate it every time I use it: not only would that waste limited resources, the DB latency is just too big, and the concurrency issues scale even worse.

Now I needn't wonder why "they" did it in big batches... this is making my head hurt.

Q1: How do you guys solve this in an "optimum" way?

From my current understanding (that is, after facing the concurrency problem which before I silently ignored), I would make that function use transactions in sequence, and the rest of the app will still be able to use the data in parallel, which I believe is best for the user. That's the goal: best for the user, but guaranteed correctness for the system.

Maybe later on I could throw hardware at it and use software black magic to reduce that latency, but I'm beginning to lie to myself now.

Also, in the past couple of months I've been completely blind to several dead-obvious things (some not related to programming), so I'm expecting that someone will point out something shamefully obvious that I managed to miss... :|


Comments (2)

还给你自由 2024-08-13 03:15:23


I don't remember why I do it like this...

This jumps out at me, as the first thing you need to tackle!

There shouldn't be any reason you need to fetch data back to your application just to calculate the aggregate cost of each BOM. There are numerous techniques to work with "parts explosion" or hierarchical data sets in SQL.

I cover several solutions in my presentation "SQL Antipatterns Strike Back", or you can read a book like "Joe Celko's Trees and Hierarchies in SQL."

Some solutions are vendor-specific and some can be done with any plain SQL DBMS. I didn't notice at first what brand of database you're on, but Jonathan correctly made me aware that you're using PostgreSQL.

In that case, you should read about "WITH" queries, which are new in PostgreSQL 8.4 and allow you to do some sophisticated recursive queries.

http://www.postgresql.org/docs/current/static/queries-with.html
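For example, a recursive WITH query can explode a BOM and total its cost in one statement, instead of one SELECT per item. A sketch below runs the same `WITH RECURSIVE` syntax against SQLite (which also supports it) just to stay self-contained; the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- base_cost is NULL for assembled items, set for raw materials
    CREATE TABLE item (id TEXT PRIMARY KEY, base_cost REAL);
    CREATE TABLE bom  (parent TEXT, component TEXT, qty REAL);
    INSERT INTO item VALUES ('A', NULL), ('B', 1.0), ('C', NULL);
    INSERT INTO bom  VALUES ('A','B',2), ('A','C',3), ('C','B',4);
""")

(total,) = conn.execute("""
    WITH RECURSIVE explode(component, qty) AS (
        SELECT component, qty FROM bom WHERE parent = 'A'
      UNION ALL
        SELECT b.component, e.qty * b.qty      -- multiply quantities down the tree
        FROM explode AS e JOIN bom AS b ON b.parent = e.component
    )
    SELECT SUM(e.qty * i.base_cost)
    FROM explode AS e JOIN item AS i ON i.id = e.component
    WHERE i.base_cost IS NOT NULL              -- only raw materials carry cost
""").fetchone()
print(total)  # cost of A: 2*1 + 3*(4*1) = 14.0
```

The whole downward walk happens inside the database in one round trip, which is exactly what avoids the 15ms-per-SELECT accumulation described in the question.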

I've implemented a system where BOMs were composed of hierarchies of individual resources, and I didn't have to do any of the batch processing you're describing (admittedly, there were only a few thousand resources in the db while I worked on it).

You should learn how to use aggregate functions in SQL like SUM() with GROUP BY (any book on SQL should include this), and also techniques for storing hierarchical relationships between entities.

Since you say you don't understand databases well, I recommend that you try implementing a "toy" system before you make any changes to your real system. I'm only speaking from personal experience, but I find that I can't learn a new technical skill while I'm simultaneously trying to employ that skill in a real project.

み青杉依旧 2024-08-13 03:15:23


This sounds to me like a calculation that would benefit from being a stored procedure in the database, more or less regardless of which implementation method you use. That cuts down on the traffic between client and server, which almost invariably improves the performance of a complex set of calculations like this.

You say:

What I do now, and I don't remember why I do it like this, is I get all the items and all the BOMs out of the DB, and go for each item at a time calculating its cost recursively. Once I calculate one item, I flag it so I don't redo the cost again. (also guards against infinite recursion).

I'm puzzled about the 'flag it' part of this explanation - and not knowing why you do something the way you do it is bad news. You really need to understand what you are doing.

There are many ways to do BOM processing - and Bill Karwin has pointed you at some interesting info (the SQL Antipatterns link is about 250 slides!). The SQL Antipatterns section discusses 'naïve trees' (such as those outlined below). However, the solutions do not cover the case outlined below, where the same sub-tree can be used by multiple parents (because one sub-assembly can be a component of multiple products).

  • Path enumeration doesn't work: you can't use the same sub-assembly information because you build the containing product information into the path.
  • Nested sets work fine when the sub-assembly is used in one product; not when the sub-assembly is used in many products.
  • The 'closure table' solution can be adapted to cover this - it is more or less the second alternative below.

You need to consider whether it makes sense to be doing a bottom-up scan of the affected parts or whether you would be better off doing some sort of breadth-first or depth-first scan. One driver in this decision will be the nature of the BOM data. If you have a structure where some sub-assembly is used as a component of multiple products, do you record the parts used in the sub-assembly separately for each product, or do you record that the products use the sub-assembly?

To clarify:

  • Sub-assembly A (P001) contains 24 x 8mm nuts (P002), 24 x 8mm x 50 mm bolts (P003), 1 x baseplate (P004), 1 x coverplate (P005).
  • Product B (P006) contains 1 x Sub-assembly A and a number of other parts.
  • Product C (P007) contains 1 x Sub-assembly A and a number of other parts.

Your BOM records could look like this (naïve tree):

Part      Component     Quantity
P001      P002          24
P001      P003          24
P001      P004          1
P001      P005          1
P006      P001          1
P007      P001          1

Or they could look like this (closure table):

Part      Component     Quantity
P001      P002          24
P001      P003          24
P001      P004          1
P001      P005          1
P006      P002          24
P006      P003          24
P006      P004          1
P006      P005          1
P007      P002          24
P007      P003          24
P007      P004          1
P007      P005          1

This second case is much less desirable - it is much harder to get the values right, doubly so if, as in the case of parts like nuts or bolts, multiple sub-assemblies could use the same part, so getting the counts right in a major deliverable product (P006, P007) would be very hard. However, recalculating the cost of any part is much simpler in the second case - you simply count up the sum of the 'cost times quantity' for each component that makes up a part. If you retain the naïve tree to record the part-structure breakdown and (re)compute the closure table when the structure (not price) of some product or sub-assembly changes, then you are probably as close to nirvana as you're likely to get.
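If you go that route, (re)computing the closure table from the naïve tree is a small recursive expansion. A sketch (in-memory, invented names) that expands every part down to its leaf components, accumulating quantities so shared parts like the nuts and bolts are counted correctly:

```python
def closure_from_naive(bom):
    """bom: {part: [(component, qty), ...]} -- the naive tree.
    Returns {(part, leaf_component): total_qty}, i.e. the closure table,
    with all sub-assemblies expanded down to parts that have no BOM."""
    closure = {}

    def explode(part, multiplier, root):
        for comp, qty in bom.get(part, ()):
            if comp in bom:                      # sub-assembly: expand it
                explode(comp, multiplier * qty, root)
            else:                                # leaf part: accumulate
                key = (root, comp)
                closure[key] = closure.get(key, 0) + multiplier * qty

    for part in bom:
        explode(part, 1, part)
    return closure
```

Run on the P001/P006/P007 example above, this reproduces the second table from the first; re-running it only when structure (not price) changes matches the scheme described.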

Somewhere (but on another computer than this one) I have some old code to mess around with this stuff, using fictitious assemblies. The coding was done ... mumble, mumble ... a long time ago, and uses temporary tables (and doesn't mention nested sets or path enumeration; it does compute closure tables) for a specific DBMS - it would have to be adapted to other DBMS. Ask, and I'll dig it out.
