How does shared memory vs. message passing handle large data structures?
In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.
This approach obviously alleviates the need for complex locks because there is no shared state.
However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.
My questions:
1. Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only and only needs to exist in a single location?

2. How would this problem be approached in a message-passing context? Would there be a single process with access to the data structure, so that clients would simply have to request data from it sequentially? Or, if possible, would the data be chunked across several processes, each holding a chunk?

3. Given the architecture of modern CPUs and memory, is there much difference between the two solutions? That is, can shared memory be read in parallel by multiple cores, meaning there is no hardware bottleneck that would otherwise make both implementations perform roughly the same?
Comments (10)
One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes; it states only that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, an implementation may very well not copy the data but just send a reference to it, or may use a combination of both methods. As always, there is no best solution, and there are trade-offs to be made when choosing how to do it.

The BEAM uses copying, except for large binaries, where it sends a reference.
Yes, shared state could be faster in this case. But only if you can forgo the locks, and that is only doable if the data is absolutely read-only. If it's "mostly read-only" then you need a lock (unless you manage to write lock-free structures; be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.

Yes, you could write a "server process" to share it. With really lightweight processes, it's no heavier than writing a small API to access the data. Think of it as an object (in the OOP sense) that "owns" the data. Splitting the data into chunks to enhance parallelism (called "sharding" in DB circles) helps in big cases (or if the data lives on slow storage).
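As an illustration, here is a minimal Go sketch of such a server process: one goroutine is the sole owner of the data, and clients only read it by sending requests over a channel. The `query` type and the map standing in for a suffix array are inventions for the example, not anything from the answer above.

```go
package main

import "fmt"

// query is a read request; the reply channel lets the owning
// goroutine answer without exposing any shared mutable state.
type query struct {
	key   string
	reply chan []int
}

// serve "owns" the index; it is the only goroutine that touches it,
// so no locks are needed even if the data were mutable.
func serve(index map[string][]int, requests <-chan query) {
	for q := range requests {
		q.reply <- index[q.key] // read-only access
	}
}

func main() {
	index := map[string][]int{"banana": {0, 2, 4}} // stand-in for a suffix array
	requests := make(chan query)
	go serve(index, requests)

	q := query{key: "banana", reply: make(chan []int)}
	requests <- q
	fmt.Println(<-q.reply) // [0 2 4]
}
```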
Even if NUMA is getting mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from the cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared state/locks are getting more and more infeasible.

In short: get used to message passing and server processes; it's all the rage.

Edit: revisiting this answer, I want to add a note about a phrase found in Go's documentation:

The idea is: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message with the reference; a thread only accesses the memory when receiving the message. It relies on some measure of programmer discipline, but results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.

The advantage is that you don't have to copy big blocks of data on every message, and you don't have to effectively flush caches as with some lock implementations. It's still somewhat early to say whether the style leads to higher-performance designs or not. (Especially since the current Go runtime is somewhat naive about thread scheduling.)
In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.
In Go, message passing is by convention: nothing prevents you from sending someone a pointer over a channel and then modifying the data it points to; only convention stops you. So once again, there's no need to copy the message.
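A minimal sketch of what that convention looks like in Go: only the slice header crosses the channel, not the megabyte behind it, and nothing but discipline stops the sender from touching the buffer afterwards.

```go
package main

import "fmt"

func main() {
	big := make([]byte, 1<<20) // a 1 MiB buffer; only the slice header is sent
	ch := make(chan []byte)
	done := make(chan struct{})

	go func() {
		buf := <-ch // by convention, the receiver now "owns" the buffer...
		buf[0] = 42 // ...and may read or write it freely
		fmt.Println(len(buf), buf[0])
		close(done)
	}()

	ch <- big
	// By convention the sender must not touch `big` from here on;
	// nothing in the language enforces this.
	<-done
}
```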
Most modern processors use variants of the MESI protocol. Because of the protocol's Shared state, passing read-only data between different threads is very cheap: every core can hold the same cache line at once. Modified shared data is very expensive, though, because all other caches that store the cache line must invalidate it.

So if you have read-only data, it is very cheap to share it between threads instead of copying it with messages. If you have read-mostly data, sharing between threads can be expensive, partly because of the need to synchronize access, and partly because writes destroy the cache-friendly behavior of the shared data.

Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the parts you need changed replaced. Sharing a single version of it is cheap, since all the data is immutable, and you can still update to a new version efficiently.
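As a hedged sketch of that idea in Go, consider an immutable singly linked list: "updating" it allocates one new cell that shares the entire old tail, so readers holding the old version are never disturbed.

```go
package main

import "fmt"

// node is an immutable list cell: its fields are never mutated after creation.
type node struct {
	value int
	next  *node
}

// push returns a new version of the list; the old version stays valid,
// and the two versions share every cell except the new head.
func push(head *node, v int) *node {
	return &node{value: v, next: head}
}

func main() {
	v1 := push(push(nil, 1), 2) // list: 2 -> 1
	v2 := push(v1, 3)           // list: 3 -> 2 -> 1, sharing 2 -> 1 with v1

	fmt.Println(v1.value, v2.value) // 2 3; v1 is unchanged by the "update"
}
```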
Note that your questions are technically nonsensical, because message passing can use shared state, so I shall assume you mean message passing with deep copying to avoid shared state (as Erlang currently does).
Using shared state will be a lot faster.
Either approach can be used.
Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.
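A crude way to observe this, as a sketch only (the size and the single-shot timing are arbitrary choices for the demo, not a rigorous benchmark): sum a large slice once through direct shared access, and once after deep-copying it into a "message".

```go
package main

import (
	"fmt"
	"time"
)

func sum(data []int) int {
	total := 0
	for _, v := range data {
		total += v
	}
	return total
}

func main() {
	data := make([]int, 1<<22) // ~32 MB of ints; size is arbitrary for the demo

	start := time.Now()
	sum(data) // shared, read-only access: no copy at all
	shared := time.Since(start)

	ch := make(chan []int, 1)
	start = time.Now()
	cp := make([]int, len(data))
	copy(cp, data) // "message passing with deep copy": pay for the copy...
	ch <- cp
	sum(<-ch) // ...before the receiver can even start reading
	copied := time.Since(start)

	fmt.Println("shared:", shared, "copied:", copied)
}
```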
Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.
What is a large data structure?

One person's large is another person's small.

Last week I talked to two people. One person was making embedded devices and used the word "large"; I asked him what it meant, and he said over 256 KBytes. Later in the same week a guy was talking about media distribution; he used the word "large", and when I asked what he meant, he thought for a bit and said "won't fit on one machine", say 20-100 TBytes.

In Erlang terms, "large" could mean "won't fit into RAM". So with 4 GBytes of RAM, data structures over 100 MBytes might be considered large, and copying a 500 MByte data structure might be a problem. Copying small data structures (say, under 10 MBytes) is never a problem in Erlang.

Really large data structures (i.e. ones that won't fit on one machine) have to be copied and "striped" over several machines.

So I guess you have the following:

Small data structures are no problem: since they are small, processing is fast, copying is fast, and so on (just because they are small).

Big data structures are a problem: because they don't fit on one machine, copying is essential.
One solution that has not been presented here is master-slave replication. If you have a large data structure, you can replicate changes to it out to all slaves, each of which applies the update to its own copy.

This is especially interesting if one wants to scale to several machines that cannot share memory at all without very artificial setups (an mmap of a block device that reads/writes a remote computer's memory?).

A variant is to have a transaction manager that one asks nicely to update the replicated data structure; it will make sure only one update request is served at a time. This is closer to the mnesia model for master-master replication of mnesia table data, which qualifies as a "large data structure".
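A hedged Go sketch of the master-slave idea within a single process (a real setup would replicate over a network, as described above; the `update` and `replica` names are inventions for the example): the master broadcasts every change, and each replica applies it to its own private copy.

```go
package main

import (
	"fmt"
	"sync"
)

// update describes a change to be applied to every replica's copy.
type update struct {
	key   string
	value int
}

// replica owns a private copy of the data and applies updates in order.
func replica(id int, updates <-chan update, wg *sync.WaitGroup) {
	defer wg.Done()
	local := map[string]int{} // this replica's private copy
	for u := range updates {
		local[u.key] = u.value
	}
	fmt.Printf("replica %d final state: %v\n", id, local)
}

func main() {
	const nReplicas = 3
	var wg sync.WaitGroup
	channels := make([]chan update, nReplicas)
	for i := range channels {
		channels[i] = make(chan update, 8)
		wg.Add(1)
		go replica(i, channels[i], &wg)
	}

	// The master is the single writer: it broadcasts every change.
	for _, u := range []update{{"x", 1}, {"y", 2}, {"x", 3}} {
		for _, ch := range channels {
			ch <- u
		}
	}
	for _, ch := range channels {
		close(ch)
	}
	wg.Wait()
}
```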
The problem at the moment is indeed that locking and cache-line coherency can be as expensive as copying a simpler data structure (e.g. a few hundred bytes).

Most of the time, a cleverly written multi-threaded algorithm that tries to eliminate most of the locking will be faster, and much faster still with modern lock-free data structures. Especially when you have well-designed cache systems like Sun's Niagara chip-level multi-threading.

If your system/problem cannot easily be broken down into a few simple data accesses, then you have a problem. And not all problems can be solved by message passing. This is why some Itanium-based supercomputers are still sold: they have terabytes of shared RAM and up to 128 CPUs working on the same shared memory. They are an order of magnitude more expensive than a mainstream x86 cluster with the same CPU power, but you don't need to break down your data.

Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading. Message passing and the shared-nothing approach make them even more maintainable.

As an example, Erlang was never designed to make things faster, but instead to use a large number of processes to structure complex data and event flows.

I guess this was one of the main points in the design. In the Google-style web world, you usually don't care about performance, as long as it can run in parallel in the cloud. And with message passing, you can ideally just add more computers without changing the source code.
Usually, message-passing languages optimise away the actual data copying between processes (this is especially easy in Erlang, since it has immutable variables). This applies to local processes only, of course: you'll want to think through your network distribution pattern carefully. So this isn't much of an issue.
The other concurrency paradigm is STM, software transactional memory. Clojure's refs are getting a lot of attention. Tim Bray has a good series exploring Erlang's and Clojure's concurrency mechanisms:
http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next
http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses