Why is there no Z80-like LDIR functionality in C/C++/rtl?

Posted on 2024-07-10 21:27:42

In Z80 machine code, there is a cheap technique for initializing a buffer to a fixed value, say all blanks. A chunk of code might look something like this:

LD HL, DESTINATION             ; point to the source
LD DE, DESTINATION + 1         ; point to the destination
LD BC, DESTINATION_SIZE - 1    ; copying this many bytes
LD (HL), 0X20                  ; put a seed space in the first position
LDIR                           ; move 1 to 2, 2 to 3...

The result is that the chunk of memory at DESTINATION is completely blank filled.
I have experimented with memmove and memcpy, and can't replicate this behavior. I expected memmove to be able to do it correctly.

Why do memmove and memcpy behave this way?

Is there any reasonable way to do this sort of array initialization?

I am already aware of char array[size] = {0} for array initialization.

I am already aware that memset will do the job for single characters.

What other approaches are there to this issue?

如梦初醒的夏天 2024-07-17 21:27:42

There was a quicker way of blanking an area of memory using the stack. Although the use of LDI and LDIR was very common, David Webb (who pushed the ZX Spectrum in all sorts of ways like full screen number countdowns including the border) came up with this technique which is 4 times faster:

saves the Stack Pointer and then
moves it to the end of the screen.
LOADs the HL register pair with
zero,
goes into a massive loop
PUSHing HL onto the Stack.
The Stack moves up the screen and down
through memory and in the process,
clears the screen.

The explanation above was taken from a review of David Webb's game Starion.

The Z80 routine might look a little like this:

  DI              ; disable interrupts which would write to the stack.
  LD HL, 0
  ADD HL, SP      ; save stack pointer
  EX DE, HL       ; in DE register
  LD HL, 0
  LD C, 0x18      ; Screen size in pages
  LD SP, 0x5800   ; One past the end of the 0x18-page pixel area (0x4000-0x57FF)
PAGE_LOOP:
  LD B, 128       ; inner loop iterates 128 times
LOOP:
  PUSH HL         ; effectively *--SP = 0; *--SP = 0;
  DJNZ LOOP       ; loop for 256 bytes
  DEC C
  JP NZ,PAGE_LOOP
  EX DE, HL
  LD SP, HL       ; restore stack pointer
  EI              ; re-enable interrupts

However, that routine is a little under twice as fast. LDIR copies one byte every 21 cycles. The inner loop copies two bytes every 24 cycles -- 11 cycles for PUSH HL and 13 for DJNZ LOOP. To get nearly 4 times as fast simply unroll the inner loop:

LOOP:
   PUSH HL
   PUSH HL
   ...
   PUSH HL         ; repeat 128 times
   DEC C
   JP NZ,LOOP

That is very nearly 11 cycles every two bytes which is about 3.8 times faster than the 21 cycles per byte of LDIR.

Undoubtedly the technique has been reinvented many times. For example, it appeared earlier in sub-Logic's Flight Simulator 1 for the TRS-80 in 1980.
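The speed-up figures quoted above can be checked with a little arithmetic. The sketch below is only an illustration: the function names are made up, and the cycle counts are the ones given in this answer plus two assumed from the standard Z80 timing tables (4 cycles for DEC C, 10 for JP NZ). It ignores the slightly cheaper final DJNZ of each inner loop.

```cpp
#include <cassert>

// Z80 cycle counts used in this sketch (assumptions taken from the answer
// and from standard Z80 timing tables).
const double kLdirPerByte = 21.0;  // LDIR, per byte while repeating
const double kPushHl      = 11.0;  // PUSH HL, writes two bytes
const double kDjnzTaken   = 13.0;  // DJNZ when the branch is taken
const double kDecC        = 4.0;   // DEC C
const double kJpNz        = 10.0;  // JP NZ,nn

// Looped version: each PUSH HL + DJNZ pair writes two bytes.
double looped_cycles_per_byte() {
    return (kPushHl + kDjnzTaken) / 2.0;
}

// Unrolled version: `pushes` PUSH HL instructions, then DEC C and JP NZ,
// together write 2 * pushes bytes.
double unrolled_cycles_per_byte(int pushes) {
    assert(pushes > 0);
    return (pushes * kPushHl + kDecC + kJpNz) / (2.0 * pushes);
}

// How many times faster than LDIR a routine with the given cost is.
double speedup_over_ldir(double cycles_per_byte) {
    return kLdirPerByte / cycles_per_byte;
}
```

With these numbers the looped routine comes out at 12 cycles per byte (1.75 times faster than LDIR), and the version unrolled to 128 pushes at roughly 5.55 cycles per byte, which is where the "about 3.8 times" figure comes from.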

神也荒唐 2024-07-17 21:27:42

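memmove and memcpy don't work that way because it's not a useful semantic for moving or copying memory. It's handy on the Z80 to be able to fill memory this way, but why would you expect a function named "memmove" to fill memory with a single byte? It's for moving blocks of memory around. It's implemented to get the right answer (the source bytes end up at the destination) regardless of how the blocks overlap, and that is exactly what makes it useful for moving memory blocks.

If you want to fill memory, use memset, which is designed to do just what you want.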
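As a minimal sketch of that advice, the helper below blank-fills a buffer with a single memset call. The name blank_fill is made up for this illustration; only std::memset itself comes from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Fill `size` bytes at `buffer` with ASCII spaces (0x20), the same value
// the Z80 example uses as its seed.
void blank_fill(char* buffer, std::size_t size) {
    assert(buffer != nullptr || size == 0);
    if (size == 0) {
        return;  // nothing to do, and avoids handing memset a null pointer
    }
    std::memset(buffer, ' ', size);
}
```

Calling blank_fill(destination, sizeof destination) on a char array does the job of the whole five-instruction Z80 sequence.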

怎会甘心 2024-07-17 21:27:42

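I believe this goes to the design philosophy of C and C++. As Bjarne Stroustrup once said, one of the major guiding principles of the design of C++ is "What you don't use, you don't pay for". While Dennis Ritchie may not have said it in exactly those words, I believe the same principle informed his design of C (and the work of those who developed C after him). Now, you may think that memory should be automatically initialized to zero when you allocate it, and I'd tend to agree with you. But that takes machine cycles, and if you're coding in a situation where every cycle is critical, that may not be an acceptable trade-off. Basically, C and C++ try to stay out of your way, so if you want something initialized, you have to do it yourself.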

高冷爸爸 2024-07-17 21:27:42

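The Z80 sequence you show was the fastest way to do that, in 1978. That was 30 years ago. Processors have progressed a lot since then, and today that's just about the slowest way to do it.

memmove is designed to work when the source and destination ranges overlap, so you can move a chunk of memory up by one byte. That's part of its behavior as specified by the C and C++ standards. memcpy makes no such promise: its behavior is undefined when the ranges overlap. It might work identically to memmove, or it might behave differently, depending on how your implementation chooses to do the copy. That freedom lets the implementation pick a method that is more efficient than memmove.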
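To make the difference from LDIR concrete, here is a small sketch of moving a block up by one byte with memmove. The helper name shift_up_by_one is hypothetical; the point is only that the original bytes survive the overlapping move instead of being smeared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Shift the first `count` bytes of `buffer` up by one position.
// The caller must guarantee that `buffer` holds at least count + 1 bytes.
// memmove behaves as if it copied through a temporary buffer, so the
// overlapping source bytes are not overwritten before they are read.
void shift_up_by_one(char* buffer, std::size_t count) {
    assert(buffer != nullptr);
    std::memmove(buffer + 1, buffer, count);
}
```

Applied to the string "ABCDE" with a count of 4, this yields "AABCD". The Z80 LDIR sequence would instead turn the same bytes into "AAAAA", which is why memmove cannot be used to reproduce the fill trick.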

北陌 2024-07-17 21:27:42

This can be accomplished in x86 assembly just as easily. In fact, it boils down to nearly identical code to your example.

mov esi, source      ; esi = start of the buffer (source of the copy)
lea edi, [esi + 1]   ; edi = source + 1 (destination of the copy)
mov byte [esi], 0    ; initialize the first byte with the "seed"
mov ecx, 100h - 1    ; the seed is already in place, so copy size - 1 bytes
cld                  ; make sure the copy runs toward higher addresses
rep movsb            ; do the fill

However, it is simply more efficient to set more than one byte at a time if you can.

Finally, memcpy/memmove aren't what you are looking for; they make copies of blocks of memory from one area to another (memmove allows the source and destination to be part of the same buffer). memset fills a block with a byte of your choosing.
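If the byte-by-byte smearing behavior itself is what you want in C or C++, it has to be written out explicitly. The following is a sketch only, with made-up names (copy_forward, ldir_style_fill); it mirrors what LDIR and REP MOVSB do rather than anything the standard library provides.

```cpp
#include <cassert>
#include <cstddef>

// Copy `count` bytes from src to dst strictly one byte at a time, lowest
// address first, the way LDIR and REP MOVSB (direction flag clear) do.
// Because these are ordinary element assignments, overlapping ranges are
// well defined here, unlike with memcpy.
void copy_forward(unsigned char* dst, const unsigned char* src, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
}

// Seed the first byte, then let the overlapping forward copy propagate it
// through the rest of the buffer.
void ldir_style_fill(unsigned char* buffer, std::size_t size, unsigned char value) {
    assert(buffer != nullptr || size == 0);
    if (size == 0) {
        return;
    }
    buffer[0] = value;
    copy_forward(buffer + 1, buffer, size - 1);
}
```

This reproduces the result of the assembly above, but it is shown for comparison rather than as a recommendation: a plain memset is simpler and normally much faster.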

十年不长 2024-07-17 21:27:42

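Why do memmove and memcpy behave this way?

Probably because there's no specific, modern C++ compiler that targets the Z80 hardware? Write one. ;-)

The languages don't specify how a given piece of hardware implements anything. That is entirely up to the programmers of the compiler and libraries. Of course, writing a dedicated, highly specialized version for every imaginable hardware configuration is a lot of work. That will be the reason.

Is there any reasonable way to do this sort of array initialization?

Well, if all else fails you could always use inline assembly. Other than that, I expect std::fill to perform best in a good STL implementation. And yes, I'm fully aware that my expectations are too high and that std::memset often performs better in practice.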
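For illustration, here is a minimal sketch of the std::fill route. The wrapper name fill_ints is an assumption made for this example; std::fill is the only library call involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Set every element of an int array to `value`. Unlike memset, std::fill
// assigns whole elements, so the value is not limited to a single byte.
void fill_ints(int* data, std::size_t count, int value) {
    assert(data != nullptr || count == 0);
    std::fill(data, data + count, value);
}
```

For a char buffer the call looks the same, for example std::fill(buffer, buffer + size, ' '), and compilers commonly turn that into the same code they would emit for memset.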

冷默言语 2024-07-17 21:27:42

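If you're fiddling at the hardware level, some CPUs have DMA controllers that can fill blocks of memory very quickly (much faster than the CPU itself could). I've done this on a Freescale i.MX21 CPU.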

人间☆小暴躁 2024-07-17 21:27:42

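If this is the most efficient way to set a block of memory to a given value on the Z80, then it's quite possible that memset() is implemented as you describe on a compiler that targets the Z80.

memcpy() might also use a similar sequence on that compiler.

But why would compilers targeting CPUs with instruction sets completely different from the Z80's be expected to use a Z80 idiom for this kind of thing?

Remember that the x86 architecture has a similar set of instructions that can be prefixed with REP to make them execute repeatedly, for things like copying, filling or comparing blocks of memory. However, by the time Intel came out with the 386 (or maybe it was the 486), the CPU would actually run those instructions slower than simpler instructions in a loop. So compilers often stopped using the REP-oriented instructions.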

妳是的陽光 2024-07-17 21:27:42

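There's also calloc, which allocates memory and initializes it to 0 before returning the pointer. Of course, calloc only initializes to 0, not to a value the user specifies.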
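A short sketch of the two cases follows. The helper names alloc_zeroed and alloc_filled are made up for this example; it simply contrasts calloc with the malloc-plus-memset combination needed for any other fill value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Allocate `size` bytes that are already set to zero.
// Returns nullptr on failure; release the block with std::free.
char* alloc_zeroed(std::size_t size) {
    assert(size > 0);
    return static_cast<char*>(std::calloc(size, 1));
}

// Allocate `size` bytes set to an arbitrary byte value. calloc cannot do
// this, so allocate first and fill afterwards.
char* alloc_filled(std::size_t size, char value) {
    assert(size > 0);
    char* block = static_cast<char*>(std::malloc(size));
    if (block != nullptr) {
        std::memset(block, value, size);
    }
    return block;
}
```

Both helpers return memory that the caller owns and must pass to std::free when finished.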

一个人的旅程 2024-07-17 21:27:42

Seriously, if you're writing C/C++, just write a simple for-loop and let the compiler worry about it for you. As an example, here's some code VS2005 generated for this exact case (using a templated size):

template <int S>
class A
{
  char s_[S];
public:
  A()
  {
    for(int i = 0; i < S; ++i)
    {
      s_[i] = 'A';
    }
  }
  int MaxLength() const
  {
    return S;
  }
};

extern void useA(A<5> &a, int n); // fool the optimizer into generating any code at all

void test()
{
  A<5> a5;
  useA(a5, a5.MaxLength());
}

The assembler output is the following:

test PROC

[snip]

; 25   :    A<5> a5;

mov eax, 41414141H              ;"AAAA"
mov DWORD PTR a5[esp+40], eax
mov BYTE PTR a5[esp+44], al

; 26   :    useA(a5, a5.MaxLength());

lea eax, DWORD PTR a5[esp+40]
push    5               ; MaxLength()
push    eax
call    useA

It does not get any more efficient than that. Stop worrying and trust your compiler or at least have a look at what your compiler produces before trying to find ways to optimize. For comparison I also compiled the code using std::fill(s_, s_ + S, 'A') and std::memset(s_, 'A', S) instead of the for-loop and the compiler produced the identical output.

迷迭香的记忆 2024-07-17 21:27:42

If you're on the PowerPC, use _dcbz().

左秋 2024-07-17 21:27:42

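There are a number of situations where it would be useful to have a "memspread" function whose defined behavior is to copy the starting portion of a memory range throughout the whole range. Although memset() does just fine if the goal is to spread a single byte value, there are times when one may want, for example, to fill an array of integers with the same value. On many processors, copying one byte at a time from source to destination would be a pretty crummy way to implement it, but a well-designed function could yield good results. For example, start by checking whether the amount of data is less than 32 bytes or so; if it is, just do a bytewise copy. Otherwise check the source and destination alignment; if they are aligned, round the size down to the nearest word (if necessary), then copy the first word everywhere it goes, copy the next word everywhere it goes, and so on.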
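A rough sketch of such a function is shown below. Note that it is not the word-by-word scheme outlined above: it swaps in a simpler doubling strategy built on memcpy, and both the name memspread and its signature are hypothetical, since no such function exists in the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical "memspread": repeat the first `pattern_size` bytes of
// `buffer` until `total_size` bytes are filled. Each memcpy copies from the
// already filled prefix into the untouched region right after it, so the
// two ranges never overlap and the filled prefix doubles on every pass.
void memspread(void* buffer, std::size_t pattern_size, std::size_t total_size) {
    assert(buffer != nullptr && pattern_size > 0);
    unsigned char* bytes = static_cast<unsigned char*>(buffer);
    std::size_t filled = pattern_size;
    while (filled < total_size) {
        std::size_t chunk = filled;
        if (chunk > total_size - filled) {
            chunk = total_size - filled;  // last pass may be a partial copy
        }
        std::memcpy(bytes + filled, bytes, chunk);
        filled += chunk;
    }
}
```

To fill an int array with one value, store the value in the first element and then call memspread(values, sizeof values[0], sizeof values). Because the copied length doubles each time, only about log2(n) memcpy calls are needed.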

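I too have at times wished for a function that was specified to work as a bottom-up memcpy, intended for use with overlapping ranges. As for why there isn't a standard one, I guess nobody thought it important.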

日裸衫吸 2024-07-17 21:27:42

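memcpy() might show that behavior on an implementation that copies forward one byte at a time, but overlapping ranges are undefined behavior for memcpy(), so you cannot rely on it. memmove() doesn't, by design: if the blocks of memory overlap, it copies the contents starting from the ends of the buffers precisely to avoid that sort of behavior. For filling a buffer with a specific value you should be using memset() in C or std::fill() in C++, which most modern compilers will optimize into the appropriate block fill instruction (such as REP STOSB on x86 architectures).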

半衾梦 2024-07-17 21:27:42

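As said before, memset() offers the desired functionality.

memcpy() is for moving blocks of memory around in all cases where the source and destination buffers do not overlap, or where dest < source (strictly speaking, the C standard leaves any overlap undefined for memcpy(), so the second case is not guaranteed).

memmove() solves the case of overlapping buffers with dest > source.

On x86 architectures, good compilers replace memset calls directly with inline assembly instructions that set the destination buffer's memory very efficiently, even applying further optimizations such as filling with 4-byte values for as long as possible (if the following code isn't totally syntactically correct, blame it on my not having used x86 assembly for a long time):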

lea edi,dest
;copy the fill byte to all 4 bytes of eax
mov al,fill
mov ah,al
mov dx,ax
shl eax,16
mov ax,dx
mov ecx,count
mov edx,ecx
shr ecx,2
cld
rep stosd
test edx,2
jz moveByte
stosw
moveByte:
test edx,1
jz fillDone
stosb
fillDone:
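The same idea can be sketched in C++ for readers who don't read x86 assembly. This is only an illustration of the 4-byte, 2-byte, 1-byte structure of the routine above, with a made-up function name (fill_by_words); in real code you would simply call memset and let the compiler pick the instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fill `count` bytes at `dest` with `value`, four bytes at a time where
// possible, then finish with a 2-byte and a 1-byte store as needed.
void fill_by_words(unsigned char* dest, std::size_t count, unsigned char value) {
    assert(dest != nullptr || count == 0);
    // Replicate the fill byte into all four bytes of a 32-bit word.
    const std::uint32_t word = value * 0x01010101u;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < count / 4; ++i) {  // corresponds to rep stosd
        std::memcpy(dest + offset, &word, 4);
        offset += 4;
    }
    if (count & 2) {                               // corresponds to stosw
        std::memcpy(dest + offset, &word, 2);
        offset += 2;
    }
    if (count & 1) {                               // corresponds to stosb
        dest[offset] = value;
    }
}
```

The stores go through memcpy so that the sketch stays valid for unaligned destinations; optimizing compilers typically reduce each fixed-size memcpy to a single store instruction.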