Why is the TSC larger in VirtualBox when I enable "TiedToExecution"?

Posted 2024-12-05 12:07:31

Background (my understanding of how rdtsc is virtualized): I am experimenting with TSC values in VirtualBox. My current understanding of how VirtualBox virtualizes rdtsc is that, in guest mode, the result of every rdtsc is offset by a predetermined value held in another register. That value would be the host's TSC at the moment the virtual machine started.

An advantage to this strategy is that rdtsc will advance with wall clock time in an expected manner, but the disadvantage is that a process may perceive rdtsc to take longer than expected. For instance, in simple code like this:

x = rdtsc();
y = rdtsc();
z = y - x;
print z

executed on the guest, z may be larger than expected because of the wall-clock cost of trapping rdtsc. It would be even worse if the host OS scheduled the VirtualBox process out between these two calls.

From reading the VirtualBox manual (Change TSC Mode), I learned that there is an alternative virtualization technique which is supposed to simulate the TSC directly. As I understand it, the offset value will only take into account time that the guest OS actually spends on the CPU. The advantage is that, with respect to cycles available, the TSC will behave exactly as if it were on a host machine. The downside is that the TSC will drift away from wall-clock time, because there are "missing cycles" that the guest OS is not aware of.

My goal: I am trying to set up VirtualBox to do the second option. I want to emulate the short-term behavior of rdtsc as precisely as possible, as if it were running on hardware, and I don't care if it doesn't match wall-clock time. I am fully aware that this is not "reliable" on SMP; it's for experimenting, not for enterprise software.

What I did: First I wrote a simple test program that calls rdtsc repeatedly, then prints the results:

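#include <stdint.h>
#include <stdio.h>

/* Read the 64-bit time stamp counter (x86, GCC/Clang inline asm). */
static __inline__ uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

int main(void)
{
    int i;
    uint64_t val[8];

    /* Unrolled on purpose so that no loop overhead sits between reads. */
    val[0] = rdtsc();
    val[1] = rdtsc();
    val[2] = rdtsc();
    val[3] = rdtsc();
    val[4] = rdtsc();
    val[5] = rdtsc();
    val[6] = rdtsc();
    val[7] = rdtsc();

    /* Print each value and its delta from the previous one, in hex. */
    for (i = 0; i < 8; i++) {
        printf("rdtsc (%2d): %llX", i, (unsigned long long)val[i]);
        if (i > 0) {
            printf("\t\t (+%llX)", (unsigned long long)(val[i] - val[i - 1]));
        }
        printf("\n");
    }
    return 0;
}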

I tried this program on my host machine, then ran it in my VirtualBox machine. The deltas between successive rdtsc calls were essentially identical; the only difference was that the values themselves were about 30T higher on my host. Example output (all values in hexadecimal):

rdtsc ( 0): 334F2252A1824
rdtsc ( 1): 334F2252A1836    (+12)
rdtsc ( 2): 334F2252A1853    (+1D)
rdtsc ( 3): 334F2252A1865    (+12)
rdtsc ( 4): 334F2252A1877    (+12)
rdtsc ( 5): 334F2252A1889    (+12)
rdtsc ( 6): 334F2252A18A6    (+1D)
rdtsc ( 7): 334F2252A18B8    (+12)

Then, I changed the TSCTiedToExecution flag in VirtualBox, which I thought was supposed to ignore wall-clock-time in favor of more precise virtual cycle counting. I got this from the manual page I mentioned above:

./VBoxManage setextradata "HelloWorld" "VBoxInternal/TM/TSCTiedToExecution" 1

However, this gave me unexpected results. The program in the guest now returned:

rdtsc ( 0): F2252A1824
rdtsc ( 1): F2252A1836   (+B12)
rdtsc ( 2): F2252A1853   (+B1D)
rdtsc ( 3): F2252A1865   (+AFF)
rdtsc ( 4): F2252A1877   (+B13)
rdtsc ( 5): F2252A1889   (+AF2)
rdtsc ( 6): F2252A18A6   (+B1D)
rdtsc ( 7): F2252A18B8   (+B0C)

With TSCTiedToExecution on, each rdtsc seems to take about 2,800 cycles to execute (the deltas are printed in hex; 0xB12 is 2834 decimal), compared with about 18 cycles (0x12) before.

Question: First, why did I get this behavior? It seems almost the opposite of what I would expect, and it certainly does not match my understanding of how this is implemented.

Second, how can I accomplish my original goal of having the TSC advance for each virtual cycle as if it were on hardware?

My setup: I am running on an 8x Intel(R) Xeon(R) CPU X5550 @ 2.67GHz. VirtualBox has VMX and nested paging enabled. I compiled it from source, version 4.1.2_OSE r38459.

Thanks in advance.

P.S. I started a bounty on this, but still no answers...

Comments (1)

夜唯美灬不弃 2024-12-12 12:07:31

If you want to make yourself cry, try disabling "VBoxInternal/TM/TSCTiedToExecution" and running your test program again. The following code

ULONGLONG x1 = Cpu::Rdtsc();
ULONGLONG x2 = Cpu::Rdtsc();

DbgPrintUlong('D', x2 - x1, 30, 23);

running on VirtualBox with "VBoxInternal/TM/TSCTiedToExecution" disabled shows that x2 - x1 took about 200,000 cycles. In contrast, on a machine with "VBoxInternal/TM/TSCTiedToExecution" enabled it took only about 3,000 cycles. I think this reduction is what the following passage from the VirtualBox manual refers to: "In special circumstances it may be useful however to make the TSC (time stamp counter) in the guest reflect the time actually spent executing the guest."

So, I think we won't have better TSC emulation in VirtualBox for a long time.

The only thing I can advise is to move to VMware Workstation. It has much better TSC emulation.
