Distributed computing and threading
How similar are distributed computing and threading? I've found two papers coming to quite opposite conclusions:
"Multi-Threading is Easier Than Networking. How threading is easy and similar to network code"
http://software.intel.com/file/14723
(this gives me the impression that they're so similar that, after encapsulation, these two approaches could be done with the same code - but maybe I'm wrong)
"A note on distributed computing"
http://research.sun.com/techrep/1994/abstract-29.html
(and this puts a strong distinction)
I'm sure the truth is somewhere in between. What's the golden mean? Are there any technologies that unify those two paradigms? Or have such attempts failed because of fundamental differences between networking and concurrency?
Comments (6)
I've never found them to be very similar. Let me define for the purposes of this post a "node" to be one hardware thread running on one machine. So a quad core machine is four nodes, as is a cluster of four single processor boxes.
Each node will typically be running some processing, and there will need to be some type of cross-node communication. Usually the first instance of this communication is telling the node what to do. For this communication, I can use shared memory, semaphores, shared files, named pipes, sockets, remote procedure calls, distributed COM, etc. But the easiest ones to use, shared memory and semaphores, are not typically available across a network. Shared files may be available, but performance is typically poor. Sockets tend to be the most common and most flexible choice over a network, rather than the more sophisticated mechanisms. At that point you have to deal with the details of network architecture, including latency, bandwidth, packet loss, network topology, and more.
If you start with a queue of work, nodes on the same machine can use simple shared memory to get things to do. You can even write it up lockless and it will work seamlessly. With nodes over a network, where do you put the queue? If you centralize it, that machine may suffer very high bandwidth costs. Try to distribute it and things get very complicated very quickly.
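To make the local case concrete, here is a minimal Python sketch (my own illustration, not code from the answer; all names are made up) of threads on one machine sharing a single in-memory work queue. Moving the same pattern across a network would mean replacing `queue.Queue` with a queue server or a distributed queue, which is exactly where the complications described above begin.

```python
import queue
import threading

# One in-memory queue shared by all worker threads on this machine.
# "Distributing" work is just putting items into it.
work_queue = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        item = work_queue.get()
        if item is None:              # sentinel: no more work
            break
        results.put(item * item)      # stand-in for real processing
        work_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for n in range(100):
    work_queue.put(n)
for _ in threads:
    work_queue.put(None)              # one sentinel per worker
for t in threads:
    t.join()

print(results.qsize(), "results computed")
```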
What I've found, in general, is that the people tackling this type of parallel architecture tend to choose embarrassingly parallel problems to solve. Raytracing comes to mind. There's not much cross-node communication required, apart from job distribution. There are many problems like this, to be sure, but I find it a bit disingenuous to suggest that distributed computing is essentially the same as threading.
Now if you're going to write threading that behaves identically to a distributed system, using pure message passing and not assuming any thread to be the "main" one and such, then yes, they're going to be very similar. But what you've done is pretend you have a distributed architecture and implement it in threads. The thing is that threading is a much simpler case of parallelism than true distributed computing is. You can abstract the two into a single problem, but only by choosing the harder version and sticking strictly to it. And the results won't be as good as they could be when all of the nodes are local to one machine. You're not taking advantage of the special case.
Distributed computing is done over multiple independent machines, sometimes with specialized OSes. It's harder because the interconnectedness of the machines is much lower, and therefore problems which require a lot of quick, random access to the entire dataset are very difficult to solve.
Generally speaking, you need specialized libraries to do distributed computing problems, libraries that figure out how to assign nodes to problems and cart around the data.
I really wonder if they are coming to different conclusions because they are trying to solve the wrong problems on each platform. Some problems map very nicely onto highly interconnected machines and can benefit from really powerful supercomputers. Other problems can be dealt with on simple distributed models. In general, supercomputers can solve a wider range of problems, but are much, much more specialized and expensive.
The difference seems to come back to: threads share state, processes pass messages.
You need to decide how you want to maintain state in your app before choosing one.
Shared state is easy to get started with: all the data and variables are just there. But once deadlocks/race conditions enter, it's hard to modify/scale.
Message passing (e.g. Erlang) requires a different approach to design: you have to think about opportunities for concurrency from the beginning, but the state of each distributed process is isolated, making locking/race problems easier to deal with.
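As a small illustration of that contrast (a sketch of my own in Python, not from the answer): the first half protects one shared counter with a lock, while the second half gives the state to a single message-receiving thread, roughly in the Erlang spirit.

```python
import threading
import queue

# Shared-state style: every thread mutates the same counter, so a lock is required.
counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:                  # drop this lock and the count becomes a race
            counter += 1

workers = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print("shared-state counter:", counter)

# Message-passing style (Erlang-like): one thread owns the count and only ever
# sees messages; nothing else touches its state, so no lock is needed.
inbox = queue.Queue()

def counting_process():
    total = 0
    while True:
        msg = inbox.get()
        if msg is None:             # sentinel: report and stop
            print("message-passing counter:", total)
            return
        total += msg

owner = threading.Thread(target=counting_process)
owner.start()
for _ in range(40_000):
    inbox.put(1)
inbox.put(None)
owner.join()
```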
I think it's a lot more useful to compare processes with distributed computing approaches than it is to compare threads with them. Threads exist inside a single process and share the same data and the same memory. This isn't possible over several machines. Processes, on the other hand, have their own memory, although in some cases it contains exactly the same data as another process (after a fork(), for example). This could be achieved over a network.
Something that adds extra weight to this analogy is the fact that many tools used for inter-process communication are network transparent. A good example would be unix sockets, which use the same interface as network sockets (except for the connection code).
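In Python, for instance, the difference really is just the address family and the address; the read/write code is identical. The paths and hosts below are made up, and a server would need to be listening at them for the commented-out calls to work.

```python
import socket

def send_message(sock, address, payload):
    # Identical logic whether the peer is local or across the network;
    # only the address passed to connect() differs.
    sock.connect(address)
    sock.sendall(payload)
    reply = sock.recv(1024)
    sock.close()
    return reply

# Local IPC over a unix domain socket (Unix-only; path is hypothetical):
unix_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
# send_message(unix_sock, "/tmp/myapp.sock", b"hello")

# The same logic over TCP to another machine (host/port hypothetical):
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# send_message(tcp_sock, ("192.0.2.10", 9000), b"hello")
```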
Yes, at development time the approaches are very similar, but the use of each is very different. I don't quite get your idea, so let me know if I'm wrong: when talking about distributed computing we assume more than one computer or server processing code in the same application, but when we talk about multi-threading we mean running different threads of the same application at the same time on the same computer.
As an example of distributed computing, think of an application accessing a web service located on the Internet: there are two different computers working in the same app.
If you want an example of multi-threading, just think of an application trying to find one big prime number. If you don't use multi-threading in it, you won't be able to see or do anything else in the application while it's calculating the next prime number (which can take a lifetime or more), because the application is not responsive while it is working on the calculation.
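A toy version of that situation in Python (my own sketch; the prime search is deliberately naive and the numbers are arbitrary): the long computation runs on a background thread, so the main thread remains free to handle input, redraw a UI, and so on.

```python
import threading

def next_prime_after(n):
    """Very naive prime search; stands in for a long-running computation."""
    candidate = n + 1
    while True:
        if candidate > 1 and all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            return candidate
        candidate += 1

result = {}

def search():
    result["prime"] = next_prime_after(10_000_000)

# Run the search in the background; the main thread is not blocked.
worker = threading.Thread(target=search, daemon=True)
worker.start()
print("still responsive while the prime search runs...")
worker.join()
print("found:", result["prime"])
```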
You can mix them too: as a more complex example, you can always use multi-threading to access different web services at the same time from the same application, in order to keep your application responsive even when one of the servers is not responding.
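A sketch of that mix in Python, with made-up service URLs (a real application would use its own endpoints): each request gets its own thread, so one slow or dead server does not freeze the application or delay the other calls.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical service endpoints; any of them may be slow or unreachable.
SERVICES = [
    "https://example.com/api/a",
    "https://example.com/api/b",
    "https://example.com/api/c",
]

def call_service(url):
    try:
        with urlopen(url, timeout=5) as response:
            return url, response.status
    except OSError as exc:          # timeouts, refused connections, DNS errors
        return url, f"failed: {exc}"

# One thread per service call; map() yields the results in input order.
with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
    for url, outcome in pool.map(call_service, SERVICES):
        print(url, "->", outcome)
```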
I think those two documents cannot be easily compared. Intel's document is a sort of introduction to threading, and they try to explain it by finding analogies to network computing, which seems a bit strange and misleading to me. I'm not sure why they chose such a way of presenting threading; maybe they were aiming at people familiar with networking, which is probably better known, or at least more widely recognized, than threading.
Sun's document, on the other hand, is a serious article, depicting all the difficulties related to distributed programming. All I can do is to simply confirm what they say therein.
In my opinion, an abstraction that attempts to hide the fact that an object is remote is harmful, as it usually leads to very bad performance. The programmer must be aware of the remoteness of an object to be able to invoke it in an efficient way.