Good question. Microsoft Azure is attempting to address this by allowing you to put applications "in the cloud" so you don't have to be as concerned with scaling up/down, redundancy, data storage, etc. But this is not accomplished at the hypervisor level.
http://www.microsoft.com/windowsazure/
Hardware-wise, there are some downsides to having everything be one big VM rather than many smaller ones. For one thing, software doesn't always know how to use all the resources. For example, some applications still can't take advantage of multiple processor cores. I've seen informal benchmarks showing that IIS performs better when the same resources are spread over multiple instances rather than given to one giant instance.
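As a toy illustration of that point, here is a short sketch (plain Python, with the interpreter's single-threaded execution standing in for any application that can't use more than one core) that compares running CPU-bound work in one process against fanning the same work out over a process pool. The chunk sizes and worker counts are arbitrary placeholders, not a real benchmark.

```python
import multiprocessing as mp
import os
import time

def burn(n):
    """CPU-bound busy work standing in for a single-threaded application."""
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    chunks = [5_000_000] * (os.cpu_count() or 4)

    # "One giant instance": a single process works through every chunk
    # serially, so only one core is busy no matter how many the VM exposes.
    t0 = time.perf_counter()
    for n in chunks:
        burn(n)
    serial = time.perf_counter() - t0

    # "Many small instances": one worker process per chunk can keep
    # every core the host actually has busy.
    t0 = time.perf_counter()
    with mp.Pool() as pool:
        pool.map(burn, chunks)
    parallel = time.perf_counter() - t0

    print(f"single process: {serial:.2f}s   process pool: {parallel:.2f}s")
```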
From a management perspective, it is probably better to have multiple VMs in certain cases. Imagine that a bad deployment corrupts a node. If that were your one and only (albeit giant) node, now your whole application is down.
You're probably talking about the concept of a Single System Image (SSI).
There used to be a Linux implementation, openMosix, which has since closed down, and I don't know of any replacements. openMosix made it very easy to create and use SSI on a standard Linux kernel; too bad it got overtaken by events.
I do not know enough about Xen to know if it is possible there, but with VMware you can create pools of resources that come from many physical hosts. You can then assign those resources to your VMs, whether that is many VMs or just one VM.
Aggregation: Transform Isolated Resources into Shared Pools
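For what it's worth, here is a rough sketch of that VMware-side idea using the pyVmomi SDK: it carves a resource pool out of a cluster's aggregated CPU and memory, which can then be the parent for one VM or many. The vCenter address, credentials, cluster name, pool name, and reservation numbers are all placeholders, and error handling is omitted; treat it as an outline, not a recipe.

```python
# pip install pyvmomi -- sketch only; host, credentials and names are placeholders
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab use only; verify certs in production
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

# Find the cluster whose hosts contribute the physical resources.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "MyCluster")

# Describe the slice of the aggregated CPU/RAM this pool may use.
spec = vim.ResourceConfigSpec()
spec.cpuAllocation = vim.ResourceAllocationInfo(
    reservation=4000,                 # MHz guaranteed to the pool
    limit=-1,                         # -1 = no upper cap
    expandableReservation=True,
    shares=vim.SharesInfo(level=vim.SharesInfo.Level.normal))
spec.memoryAllocation = vim.ResourceAllocationInfo(
    reservation=8192,                 # MB guaranteed to the pool
    limit=-1,
    expandableReservation=True,
    shares=vim.SharesInfo(level=vim.SharesInfo.Level.normal))

# Pools hang off the cluster's root pool; VMs (one or many) are then placed into it.
pool = cluster.resourcePool.CreateResourcePool(name="app-pool", spec=spec)
print("created pool:", pool.name)

Disconnect(si)
```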
Simulating a single core over multiple physical cores is very inefficient. You can do it, but it will be slower than a cluster. Two physical cores in the same machine can talk to each other in near real time; if those cores (and their RAM) are in separate machines communicating even over a fibre-optic network, it is like clocking your motherboard down by a factor of ten or more.
Dual cores can communicate faster than two distinct CPUs on the same motherboard; if they are on separate machines, that's slower again, and across multiple machines, slower still.
Basically you can do it, but there is a net performance loss compared to the net performance gain you would be hoping to achieve.
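To put a rough number on that gap, here is a small, self-contained Python sketch that measures request/response latency over a TCP socket. Run it as-is to see the loopback figure, or run the echo-server half on another box and point TARGET at it to measure a real network hop; the port and addresses are placeholders. Cores sharing a cache exchange data in tens of nanoseconds, so even the loopback number is already orders of magnitude worse.

```python
import socket
import threading
import time

LISTEN = ("", 50007)                 # the in-process echo server binds here
TARGET = ("127.0.0.1", 50007)        # point this at a remote machine (running the
                                     # same echo server) to measure a real network hop

def echo_server():
    """Tiny echo server so the example is self-contained on loopback."""
    with socket.create_server(LISTEN) as srv:
        conn, _ = srv.accept()
        with conn:
            while (data := conn.recv(64)):
                conn.sendall(data)

threading.Thread(target=echo_server, daemon=True).start()
time.sleep(0.2)                      # give the listener a moment to start

with socket.create_connection(TARGET) as sock:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    rounds = 10_000
    t0 = time.perf_counter()
    for _ in range(rounds):
        sock.sendall(b"x")
        sock.recv(64)
    rtt_us = (time.perf_counter() - t0) / rounds * 1e6
    print(f"average round trip: {rtt_us:.1f} microseconds")
```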
Real-life example: I had a bunch of VMs on a dual quad-core server (~2.5 GHz per core) performing way, way below what they should have been. On closer inspection, it turned out that the hypervisor was emulating a single 3.5-4 GHz core whenever the load on an individual VM exceeded 2.5 GHz; after limiting each VM to 2.5 GHz, performance went back to what was expected.
I agree with saidimu: you are talking about the Single System Image concept. In addition to the OpenMosix project, there have been several commercial implementations of the same idea (one contemporary example is ScaleMP). It's not a new idea.
I just wanted to elaborate on some of the technical points of SSI.
Basically, the reason it's not done is because the performance is generally unpredictable or outright terrible. There is a concept in computer systems known as NUMA (non-uniform memory access), which basically means that the cost of accessing different pieces of memory is not uniform. This applies to huge systems where a CPU may have some memory accesses routed to different chips, and to cases where memory is accessed remotely over a network (such as in SSI). Typically, the operating system will try to compensate by laying out programs and data in memory in such a way that a program can run as quickly as possible, i.e., the code and data will all be placed in the same NUMA "region" and be scheduled on the closest possible CPU.
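A rough way to see that non-uniformity for yourself (Linux-only sketch, only meaningful on a multi-socket/NUMA box): pin the process to a core on one node, touch a large buffer there so Linux's first-touch policy places the pages on that node, then re-pin to a core on the other node and walk the same buffer again. REMOTE_CPU is a placeholder; check /sys/devices/system/node/node*/cpulist for a core that actually lives on the other node. Interpreter overhead blunts the effect, but the remote pass is typically measurably slower.

```python
import array
import os
import time

REMOTE_CPU = 8                       # placeholder: a core on the *other* NUMA node
N = 25_000_000                       # ~200 MB of doubles

def walk(buf, step=512):
    """Read one double per 4 KiB page so every page is actually touched."""
    s = 0.0
    for i in range(0, len(buf), step):
        s += buf[i]
    return s

os.sched_setaffinity(0, {0})         # run on node A while the pages are first touched
buf = array.array("d", [0.0]) * N    # first touch happens here, so pages land on node A
walk(buf)                            # warm-up pass

t0 = time.perf_counter(); walk(buf); local = time.perf_counter() - t0

os.sched_setaffinity(0, {REMOTE_CPU})  # now execute on node B; the pages stay on node A
t0 = time.perf_counter(); walk(buf); remote = time.perf_counter() - t0

print(f"near node: {local:.3f}s   far node: {remote:.3f}s")
```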
However, when you are running big applications (trying to use all the memory in your SSI), there is little the operating system can do to reduce the impact of remote memory fetches. MySQL is not aware that accessing page 0x1f3c will cost 8 nanoseconds, while accessing page 0x7f46 will stall it for hundreds of microseconds, possibly milliseconds, while the memory is fetched over the network. This means that non-NUMA-aware applications will run like crap (seriously, very badly) in this kind of environment. As far as I know, most contemporary SSI products rely on the fastest possible interconnects (such as InfiniBand) between machines to achieve even passable performance.
This is also why frameworks that expose the true cost of accessing data to the programmer (such as MPI, the Message Passing Interface) have achieved more traction than SSI or DSM (distributed shared memory) approaches. In fact, there is basically no way for a programmer to optimize an application to run well in an SSI environment, which just sucks.
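To make that contrast concrete, here is a minimal mpi4py sketch (assuming mpi4py, NumPy, and an MPI runtime such as mpirun are installed; the buffer size and tags are arbitrary). The transfer of the 8 MB buffer is an explicit, visible call the programmer can time and minimize, rather than a remote page fault that silently stalls the process.

```python
# pip install mpi4py numpy    run with:  mpirun -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

payload = np.zeros(1_000_000, dtype=np.float64)   # ~8 MB buffer

if rank == 0:
    t0 = MPI.Wtime()
    comm.Send(payload, dest=1, tag=0)    # the data movement is an explicit call...
    comm.Recv(payload, source=1, tag=1)  # ...so its cost is visible and tunable
    print(f"round trip: {MPI.Wtime() - t0:.4f}s for {payload.nbytes} bytes")
elif rank == 1:
    comm.Recv(payload, source=0, tag=0)
    comm.Send(payload, dest=0, tag=1)
```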