How to share data across an entire organization
What are some good ways for an organization to share key data across many departments and applications?
To give an example, let's say there is one primary application and database to manage customer data. There are ten other applications and databases in the organization that read that data and relate it to their own data. Currently this data sharing is done through a mixture of database (DB) links, materialized views, triggers, staging tables, re-keying information, web services, etc.
Are there any other good approaches for sharing data? And, how do your approaches compare to the ones above with respect to concerns like:
Keep in mind that the shared customer data is used in many ways, from simple, single record queries to complex, multi-predicate, multi-sort, joins with other organization data stored in different databases.
Thanks for your suggestions and advice...
3 Answers
I'm sure you saw this coming, "It Depends".
It depends on everything. The solution for sharing Customer data with department A may be completely different from the solution for sharing Customer data with department B.
My favorite concept to have risen up over the years is "Eventual Consistency". The term came from Amazon, talking about distributed systems.
The premise is that while the state of data across a distributed enterprise may not be perfectly consistent now, it "eventually" will be.
For example, when a customer record gets updated on system A, system B's customer data is now stale and not matching. But, "eventually", the record from A will be sent to B through some process. So, eventually, the two instances will match.
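As a rough sketch of that "some process", assuming an in-process queue standing in for whatever broker or replication job actually carries the change (all names here are invented for illustration):

    import queue
    import threading

    customer_store_a = {}        # system A: the authoritative copy
    customer_store_b = {}        # system B: a copy that is allowed to go stale
    change_feed = queue.Queue()  # stands in for a message broker / replication log

    def update_customer_on_a(customer_id, record):
        # A commits locally first; B is briefly out of date from this moment on.
        customer_store_a[customer_id] = record
        change_feed.put((customer_id, record))

    def replicate_to_b():
        # Background worker: drains the feed so B "eventually" matches A.
        while True:
            customer_id, record = change_feed.get()
            customer_store_b[customer_id] = record
            change_feed.task_done()

    threading.Thread(target=replicate_to_b, daemon=True).start()

The window between the put and the get is exactly the staleness you have to decide whether you can live with.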
When you work with a single system, you don't have "EC", rather you have instant updates, a single "source of truth", and, typically, a locking mechanism to handle race conditions and conflicts.
The more your operations are able to work with "EC" data, the easier it is to separate these systems. A simple example is a Data Warehouse used by sales. They use the DW to run their daily reports, but they don't run their reports until the early morning, and they always look at "yesterday's" (or earlier) data. So there's no real-time need for the DW to be perfectly consistent with the daily operations system. It's perfectly acceptable for a process to run at, say, close of business and move over the day's transactions and activities en masse in a large, single update operation.
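A rough sketch of such an end-of-day move, assuming two SQLite files standing in for the operations system and the DW, and made-up table and column names:

    import sqlite3
    from datetime import date

    def nightly_load(ops_path="ops.db", dw_path="dw.db", business_date=None):
        # Copy the whole day's transactions to the warehouse in one bulk operation.
        business_date = business_date or date.today().isoformat()
        ops = sqlite3.connect(ops_path)
        dw = sqlite3.connect(dw_path)
        rows = ops.execute(
            "SELECT id, customer_id, amount, created_at FROM transactions "
            "WHERE date(created_at) = ?",
            (business_date,),
        ).fetchall()
        dw.executemany(
            "INSERT INTO fact_transactions (id, customer_id, amount, created_at) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
        dw.commit()  # the DW is now consistent "as of yesterday", which is all sales needs
        ops.close()
        dw.close()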
You can see how this requirement can solve a lot of issues. There's no contention for the transactional data, no worry that some report's data is going to change in the middle of accumulating a statistic because the report made two separate queries to the live database, and no need for high-detail chatter to suck up network and CPU processing, etc., during the day.
Now, that's an extreme, simplified, and very coarse example of EC.
But consider a large system like Google. As consumers of Search, we have no idea when, or how long it takes for, a search result that Google harvests to show up on a search page. 1 ms? 1 s? 10 s? 10 hrs? It's easy to imagine how, if you're hitting Google's West Coast servers, you may very well get a different search result than if you hit their East Coast servers. At no point are these two instances completely consistent. But by and large, they are mostly consistent. And for their use case, their consumers aren't really affected by the lag and delay.
Consider email. A wants to send a message to B, but in the process the message is routed through systems C, D, and E. Each system accepts the message, assumes complete responsibility for it, and then hands it off to the next. The sender sees the email go on its way. The receiver doesn't really miss it because they don't necessarily know it's coming. So there is a big window of time that the message can take to move through the system without anyone concerned knowing or caring how fast it is.
On the other hand, A could have been on the phone with B. "I just sent it, did you get it yet? Now? Now? Get it now?"
Thus, there is some kind of underlying, implied level of performance and response. In the end, "eventually", A's outbox matches B's inbox.
These delays, this acceptance of stale data, whether it's a day old or 1-5 s old, are what control the ultimate coupling of your systems. The looser this requirement, the looser the coupling, and the more flexibility you have at your disposal in terms of design.
This is true down to the cores in your CPU. Modern, multi-core, multi-threaded applications running on the same system can have different views of the "same" data, only microseconds out of date. If your code can work correctly with data that is potentially inconsistent, then happy day, it zips along. If not, you need to pay special attention to ensuring your data is completely consistent, using techniques like volatile memory qualifiers or locking constructs, etc. All of which, in their way, cost performance.
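In Python terms, for example (a lock playing the role of those qualifiers and constructs; the counter is just a stand-in for shared data):

    import threading

    counter = 0                      # the "same" data seen by many threads
    counter_lock = threading.Lock()

    def add_unsafe(n):
        global counter
        for _ in range(n):
            counter += 1             # read-modify-write race: updates can be lost

    def add_safe(n):
        global counter
        for _ in range(n):
            with counter_lock:       # forces a consistent view, at the cost of contention
                counter += 1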
So, this is the base consideration. All of the other decisions start here. Answering this can tell you how to partition applications across machines, what resources are shared and how they are shared, what protocols and techniques are available to move the data, and how much the transfer will cost in terms of processing. Replication, load balancing, data sharing, etc.: all of it is based on this concept.
Edit, in response to first comment.
Correct, exactly. The game here, for example: if B can't change customer data, then what is the harm when the customer data changes? Can you "risk" it being out of date for a short time? Perhaps your customer data comes in slowly enough that you can replicate it from A to B immediately. Say the change is put on a queue that, because of low volume, gets picked up readily (< 1 s); even so, it would be "out of transaction" with the original change, so there's a small window where A would have data that B does not.
Now the mind really starts spinning. What happens during that 1 s of "lag"? What's the worst possible scenario? And can you engineer around it? If you can engineer around a 1 s lag, you may be able to engineer around a 5 s, 1 m, or even longer lag. How much of the customer data do you actually use on B? Maybe B is a system designed to facilitate order picking from inventory. It's hard to imagine anything more being necessary than a Customer ID and perhaps a name; just something to grossly identify who the order is for while it's being assembled.
The picking system doesn't necessarily need to print out all of the customer information until the very end of the picking process, and by then the order may have moved on to another system that is perhaps more current with, especially, shipping information; so in the end the picking system hardly needs any customer data at all. In fact, you could embed and denormalize the customer information within the picking order, so there's no need or expectation of synchronizing it later. As long as the Customer ID is correct (it will never change anyway) and the name is correct (it changes so rarely it's not worth discussing), that's the only real reference you need, and all of your pick slips are perfectly accurate at the time of creation.
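A sketch of what that embedding might look like, with invented field names:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class PickOrder:
        order_id: str
        # Customer data is denormalized into the order at creation time.
        # No later synchronization is expected: the ID never changes, the name rarely does.
        customer_id: str
        customer_name: str
        lines: List[str] = field(default_factory=list)

    def create_pick_order(order_id, customer, items):
        # "customer" is whatever snapshot system B holds right now; for picking, that's enough.
        return PickOrder(order_id, customer["id"], customer["name"], list(items))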
The trick is the mindset, of breaking the systems up and focusing on the essential data that's necessary for the task. Data you don't need doesn't need to be replicated or synchronized. Folks chafe at things like denormalization and data reduction, especially when they're from the relational data modeling world. And with good reason, it should be considered with caution. But once you go distributed, you have implicitly denormalized. Heck, you're copying it wholesale now. So, you may as well be smarter about it.
All this can be mitigated through solid procedures and a thorough understanding of workflow. Identify the risks and work up policies and procedures to handle them.
But the hard part is breaking the chain to the central DB at the beginning, and instructing folks that they can't "have it all" the way they may expect when there is a single, central, perfect store of information.
This is definitely not a comprehensive reply. Sorry for my long post; I hope it adds to the thoughts presented here.
I have a few observations on some of the aspects that you mentioned.
It has been my experience that this is usually a side effect of departmentalization or specialization. A department pioneers the collection of certain data that is seen as useful by other specialized groups. Since those groups don't have ready access to this data, as it is intermingled with the rest of that department's data collection, in order to utilize it they too start collecting and storing it, inherently making it a duplicate. This issue never goes away, and just as there is a continuous effort to refactor code and remove duplication, there is a need to continuously bring duplicate data back under centralized access, storage, and modification.
Most interfaces are defined with good intentions, keeping other constraints in mind. However, we simply have a habit of growing out of the constraints placed by previously defined interfaces. Again, a case for continuous refactoring.
If anything, most software is plagued by this issue. Tight coupling is usually the result of expedient solutions given the time constraints we face. Loose coupling incurs a certain degree of complexity, which we dislike when we want to get things done. The web services mantra has been going around for a number of years, and I have yet to see a good example of a solution that completely alleviates this.
To me this is the key to fighting all the issues you have mentioned in your question. The SIP vs. H.323 VoIP story comes to mind. SIP is very simple and easy to build with, while H.323, like a typical telecom standard, tried to envisage every issue on the planet concerning VoIP and provide a solution for it. The end result: SIP grew much more quickly. It is a pain to build an H.323-compliant solution; in fact, H.323 compliance is a mega-buck industry.
Over the years, I have come to like the REST architecture for its simplicity. It provides simple, unique access to data and makes it easy to build applications around it. I have seen enterprise solutions suffer more from duplication, isolation, and access to data than from any other issue, like performance. REST, to me, provides a panacea for some of those ills.
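For example, the kind of uniform, addressable access REST gives you can be as small as this sketch (the service URL and fields are hypothetical):

    import requests  # assumes the third-party "requests" package is installed

    BASE = "https://customers.example.com"  # hypothetical customer service

    def get_customer(customer_id):
        # Every customer is a resource with one well-known URL; any application can read it.
        resp = requests.get(f"{BASE}/customers/{customer_id}", timeout=5)
        resp.raise_for_status()
        return resp.json()  # e.g. {"id": "42", "name": "Acme Corp", ...}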
To solve a number of those issues, I like the concept of central "Data Hubs". A Data Hub represents a "single source of truth" for a particular entity, but it stores only IDs, no information like names, etc. In fact, it stores only ID maps: for example, maps that relate the Customer ID in system A to the Client Number in system B and to the Customer Number in system C. Interfaces between the systems use the hub to know how to relate information in one system to information in another.
It's like a central translator; instead of having to write specific code to map A->B, A->C, and B->C, with the attendant combinatorial increase in mappings as you add more systems, you only need to convert to and from the hub: A->Hub, B->Hub, C->Hub, D->Hub, etc.
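A rough sketch of the hub as nothing more than an ID map (system names and ID formats invented for illustration):

    # The hub stores only identifier mappings, keyed by a hub-wide ID.
    hub = {
        "HUB-0001": {"system_a": "CUST-17", "system_b": "CLI-9042", "system_c": "0000123"},
    }

    def translate(source_system, source_id, target_system):
        # Relate an entity ID in one system to the same entity's ID in another,
        # going through the hub so each system needs only one mapping.
        for ids in hub.values():
            if ids.get(source_system) == source_id:
                return ids.get(target_system)
        return None

    # Example: what does system C call system A's customer CUST-17?
    print(translate("system_a", "CUST-17", "system_c"))  # -> 0000123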