Why is there no Z80-style LDIR functionality in C/C++/rtl?

Posted on 2024-07-10 21:27:42

In Z80 machine code, there is a cheap technique to initialize a buffer to a fixed value, say all blanks. A chunk of code might look something like this.

LD HL, DESTINATION             ; point to the source
LD DE, DESTINATION + 1         ; point to the destination
LD BC, DESTINATION_SIZE - 1    ; copying this many bytes
LD (HL), 0X20                  ; put a seed space in the first position
LDIR                           ; move 1 to 2, 2 to 3...

The result being that the chunk of memory at DESTINATION is completely blank filled.
I have experimented with memmove, and memcpy, and can't replicate this behavior. I expected memmove to be able to do it correctly.

Why do memmove and memcpy behave this way?

Is there any reasonable way to do this sort of array initialization?

I am already aware of char array[size] = {0} for array initialization

I am already aware that memset will do the job for single characters.

What other approaches are there to this issue?

Comments (14)

如梦初醒的夏天 2024-07-17 21:27:42

There was a quicker way of blanking an area of memory using the stack. Although the use of LDI and LDIR was very common, David Webb (who pushed the ZX Spectrum in all sorts of ways like full screen number countdowns including the border) came up with this technique which is 4 times faster:

  • saves the Stack Pointer and then
    moves it to the end of the screen.
  • LOADs the HL register pair with
    zero,
  • goes into a massive loop
    PUSHing HL onto the Stack.
  • The Stack moves up the screen and down
    through memory and in the process,
    clears the screen.

The explanation above was taken from a review of David Webb's game Starion.

The Z80 routine might look a little like this:

  DI              ; disable interrupts which would write to the stack.
  LD HL, 0
  ADD HL, SP      ; save stack pointer
  EX DE, HL       ; in DE register
  LD HL, 0
  LD C, 0x18      ; Screen size in pages
  LD SP, 0x5800   ; End of screen bitmap on the ZX Spectrum (0x4000 + 0x18 pages); PUSH writes below SP
PAGE_LOOP:
  LD B, 128       ; inner loop iterates 128 times
LOOP:
  PUSH HL         ; effectively *--SP = 0; *--SP = 0;
  DJNZ LOOP       ; loop for 256 bytes
  DEC C
  JP NZ,PAGE_LOOP
  EX DE, HL
  LD SP, HL       ; restore stack pointer
  EI              ; re-enable interrupts

However, that routine is a little under twice as fast. LDIR copies one byte every 21 cycles. The inner loop copies two bytes every 24 cycles -- 11 cycles for PUSH HL and 13 for DJNZ LOOP. To get nearly 4 times as fast simply unroll the inner loop:

LOOP:
   PUSH HL
   PUSH HL
   ...
   PUSH HL         ; repeat 128 times
   DEC C
   JP NZ,LOOP

That is very nearly 11 cycles every two bytes which is about 3.8 times faster than the 21 cycles per byte of LDIR.
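To make the arithmetic explicit, here is a small sketch in C. The 21, 11 and 13 cycle figures are the ones quoted above; the 4 cycles for DEC C and 10 for JP NZ are taken from the usual Z80 timing tables, and the function names are made up for this example.

```c
/* Z80 cycle counts: the first three are quoted in the text above;
   DEC C and JP NZ come from the standard Z80 timing tables. */
enum {
    LDIR_PER_BYTE = 21,
    PUSH_HL       = 11,
    DJNZ_TAKEN    = 13,
    DEC_C         = 4,
    JP_NZ         = 10
};

/* DJNZ version: one PUSH HL plus one DJNZ for every 2 bytes written. */
static double looped_cycles_per_byte(void) {
    return (PUSH_HL + DJNZ_TAKEN) / 2.0;
}

/* Unrolled version: 128 PUSH HL plus one DEC C / JP NZ pair per 256 bytes. */
static double unrolled_cycles_per_byte(void) {
    return (128 * PUSH_HL + DEC_C + JP_NZ) / 256.0;
}

/* How many times faster than LDIR a routine is, given its cycles per byte. */
static double speedup_over_ldir(double cycles_per_byte) {
    return LDIR_PER_BYTE / cycles_per_byte;
}
```

The looped routine costs 12 cycles per byte (21 / 12 = 1.75 times faster than LDIR), and the unrolled one about 5.55 cycles per byte (roughly 3.78 times faster), which matches the "a little under twice" and "about 3.8 times" figures.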

Undoubtedly the technique has been reinvented many times. For example, it appeared earlier in sub-Logic's Flight Simulator 1 for the TRS-80 in 1980.

神也荒唐 2024-07-17 21:27:42

memmove and memcpy don't work that way because it's not a useful semantic for moving or copying memory. It's handy on the Z80 to be able to fill memory this way, but why would you expect a function named "memmove" to fill memory with a single byte? It's for moving blocks of memory around. It's implemented to get the right answer (the source bytes are moved to the destination) regardless of how the blocks overlap, and that is exactly what makes it useful for moving memory blocks.

If you want to fill memory, use memset, which is designed to do just what you want.
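As a rough illustration of the difference, here is a sketch in C. ldir_copy, ldir_fill and memmove_shift are hypothetical helper names for this example, not standard functions. The first copies strictly low to high, one byte at a time, as LDIR does; the last makes the same overlapping call through memmove.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies low-to-high, one byte at a time, like LDIR.
   With dst == src + 1, every byte read was written by the previous step,
   so the seed byte ripples through the whole buffer. */
static void ldir_copy(unsigned char *dst, const unsigned char *src, size_t n) {
    while (n--) {
        *dst++ = *src++;
    }
}

/* Fill buf[0..size-1] with `seed` using the overlapping-copy trick. */
static void ldir_fill(unsigned char *buf, size_t size, unsigned char seed) {
    if (size == 0) {
        return;
    }
    buf[0] = seed;
    ldir_copy(buf + 1, buf, size - 1);
}

/* Same call shape with memmove: the old contents shift up by one byte
   instead of being overwritten by the seed. */
static void memmove_shift(unsigned char *buf, size_t size, unsigned char seed) {
    if (size == 0) {
        return;
    }
    buf[0] = seed;
    memmove(buf + 1, buf, size - 1);
}
```

Starting from "ABCDE" with an 'x' seed, ldir_fill leaves "xxxxx" while memmove_shift leaves "xxBCD". Calling memcpy with these overlapping arguments would be undefined behavior, so it is left out of the sketch.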

怎会甘心 2024-07-17 21:27:42

I believe this goes to the design philosophy of C and C++. As Bjarne Stroustrup once said, one of the major guiding principles of the design of C++ is "What you don’t use, you don’t pay for". And while Dennis Ritchie may not have said it in exactly those same words, I believe that was a guiding principle informing his design of C (and the design of C by subsequent people) as well. Now you may think that if you allocate memory it should automatically be initialized to NULL's and I'd tend to agree with you. But that takes machine cycles and if you're coding in a situation where every cycle is critical, that may not be an acceptable trade-off. Basically C and C++ try to stay out of your way--hence if you want something initialized you have to do it yourself.

高冷爸爸 2024-07-17 21:27:42

The Z80 sequence you show was the fastest way to do that - in 1978. That was 30 years ago. Processors have progressed a lot since then, and today that's just about the slowest way to do it.

memmove is designed to work when the source and destination ranges overlap, so you can move a chunk of memory up by one byte. That's part of its specified behavior in the C and C++ standards. memcpy's behavior is undefined when the ranges overlap; it might work identically to memmove, or it might be different, depending on how your compiler and library decide to implement it. That freedom lets the implementation choose a method that is more efficient than memmove.

北陌 2024-07-17 21:27:42

This can be accomplished in x86 assembly just as easily. In fact, it boils down to nearly identical code to your example.

mov esi, source    ; set esi to be the source
lea edi, [esi + 1] ; set edi to be the source + 1
mov byte [esi], 0  ; initialize the first byte with the "seed"
mov ecx, 0FFh      ; buffer size (100h) minus the seed byte
cld                ; make sure the copy runs low-to-high
rep movsb          ; do the fill

However, it is simply more efficient to set more than one byte at a time if you can.

Finally, memcpy/memmove aren't what you are looking for; those are for copying blocks of memory from one area to another (memmove allows source and dest to be part of the same buffer). memset fills a block with a byte of your choosing.

十年不长 2024-07-17 21:27:42

Why do memmove and memcpy behave this way?

Probably because there’s no specific, modern C++ compiler that targets the Z80 hardware? Write one. ;-)

The languages don't specify how any given hardware implements anything. That is entirely up to the programmers of the compiler and libraries. Of course, writing a custom, highly specialized version for every imaginable hardware configuration is a lot of work. That'll be the reason.

Is there any reasonable way to do this sort of array initialization?

Well, if all else fails you could always use inline assembly. Other than that, I expect std::fill to perform best in a good STL implementation. And yes, I’m fully aware that my expectations are too high and that std::memset often performs better in practice.

冷默言语 2024-07-17 21:27:42

If you're fiddling at the hardware level, then some CPUs have DMA controllers that can fill blocks of memory exceedingly quickly (much faster than the CPU could ever do). I've done this on a Freescale i.MX21 CPU.

人间☆小暴躁 2024-07-17 21:27:42

If this is the most efficient way to set a block of memory to a given value on the Z80, then it's quite possible that memset() is implemented as you describe on a compiler that targets Z80s.

memcpy() might also use a similar sequence on that compiler.

But why would compilers targeting CPUs with completely different instruction sets from the Z80 be expected to use a Z80 idiom for these types of things?

Remember that the x86 architecture has a similar set of instructions that could be prefixed with a REP opcode to have them execute repeatedly to do things like copy, fill or compare blocks of memory. However, by the time Intel came out with the 386 (or maybe it was the 486) the CPU would actually run those instructions slower than simpler instructions in a loop. So compilers often stopped using the REP-oriented instructions.

妳是的陽光 2024-07-17 21:27:42

There's also calloc that allocates and initializes the memory to 0 before returning the pointer. Of course, calloc only initializes to 0, not something the user specifies.
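A minimal sketch of the two options in C. make_zeroed and make_filled are hypothetical names used only for this example, and the caller is expected to free() the result.

```c
#include <stdlib.h>
#include <string.h>

/* Returns n zero-filled bytes, or NULL if the allocation fails. */
static char *make_zeroed(size_t n) {
    return calloc(n, 1);
}

/* Returns n bytes filled with `fill`, or NULL if the allocation fails.
   calloc cannot do this, so it takes malloc followed by memset. */
static char *make_filled(size_t n, char fill) {
    char *p = malloc(n);
    if (p != NULL) {
        memset(p, fill, n);
    }
    return p;
}
```

So for the blank-filled buffer in the question, calloc alone is not enough and a memset after malloc is still needed.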

一个人的旅程 2024-07-17 21:27:42

Seriously, if you're writing C/C++, just write a simple for-loop and let the compiler worry about it for you. As an example, here's some code VS2005 generated for this exact case (using a templated size):

template <int S>
class A
{
  char s_[S];
public:
  A()
  {
    for(int i = 0; i < S; ++i)
    {
      s_[i] = 'A';
    }
  }
  int MaxLength() const
  {
    return S;
  }
};

extern void useA(A<5> &a, int n); // fool the optimizer into generating any code at all

void test()
{
  A<5> a5;
  useA(a5, a5.MaxLength());
}

The assembler output is the following:

test PROC

[snip]

; 25   :    A<5> a5;

mov eax, 41414141H              ;"AAAA"
mov DWORD PTR a5[esp+40], eax
mov BYTE PTR a5[esp+44], al

; 26   :    useA(a5, a5.MaxLength());

lea eax, DWORD PTR a5[esp+40]
push    5               ; MaxLength()
push    eax
call    useA

It does not get any more efficient than that. Stop worrying and trust your compiler or at least have a look at what your compiler produces before trying to find ways to optimize. For comparison I also compiled the code using std::fill(s_, s_ + S, 'A') and std::memset(s_, 'A', S) instead of the for-loop and the compiler produced the identical output.

迷迭香的记忆 2024-07-17 21:27:42

If you're on the PowerPC, use _dcbz().

左秋 2024-07-17 21:27:42

There are a number of situations where it would be useful to have a "memspread" function whose defined behavior was to copy the starting portion of a memory range throughout the whole thing. Although memset() does just fine if the goal is to spread a single byte value, there are times when e.g. one may want to fill an array of integers with the same value. On many processor implementations, copying a byte at a time from the source to the destination would be a pretty crummy way to implement it, but a well-designed function could yield good results. For example, start by seeing if the amount of data is less than 32 bytes or so; if so, just do a bytewise copy; otherwise check the source and destination alignment; if they are aligned, round the size down to the nearest word (if necessary), then copy the first word everywhere it goes, copy the next word everywhere it goes, etc.
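For what it's worth, here is one way such a function could be sketched in portable C. The name memspread and its signature are hypothetical, and this uses a different strategy from the word-by-word one outlined above: it repeatedly copies the already-filled prefix with memcpy, so the filled region doubles on each pass and the source and destination of every memcpy stay disjoint.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical memspread: repeats the first `seed_len` bytes of `buf`
   across all `total` bytes. Each memcpy copies from the already-filled
   prefix into the region right after it, so the two ranges never overlap
   and the filled prefix doubles on every pass. */
static void memspread(void *buf, size_t seed_len, size_t total) {
    unsigned char *p = buf;
    size_t filled = seed_len;

    if (seed_len == 0 || seed_len >= total) {
        return; /* nothing to spread */
    }
    while (filled < total) {
        size_t chunk = total - filled;
        if (chunk > filled) {
            chunk = filled; /* never copy more than is already filled */
        }
        memcpy(p + filled, p, chunk);
        filled += chunk;
    }
}
```

To fill an int array with one value, set the first element and call memspread(array, sizeof array[0], sizeof array). It takes only about log2(total / seed_len) memcpy calls, and each one can use whatever wide copy the library provides.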

I too have at times wished for a function that was specified to work as a bottom-up memcpy, intended for use with overlapping ranges. As to why there isn't a standard one, I guess nobody thought it important.

日裸衫吸 2024-07-17 21:27:42

memcpy() may show that behavior if it is implemented as a simple forward byte copy, although the C standard leaves overlapping copies undefined, so you can't rely on it. memmove() doesn't, by design: if the blocks of memory overlap, it copies the contents starting at the ends of the buffers to avoid that sort of behavior. But to fill a buffer with a specific value you should be using memset() in C or std::fill() in C++, which most modern compilers will optimize to the appropriate block fill instruction (such as REP STOSB on x86 architectures).

半衾梦 2024-07-17 21:27:42

As said before, memset() offers the desired functionality.

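memcpy() is for moving blocks of memory around when the source and destination buffers do not overlap. A simple forward-copying implementation also happens to work where dest < source, but the C standard leaves any overlapping memcpy() undefined.

memmove() handles overlapping buffers, including the case where dest > source.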

On x86 architectures, good compilers directly replace memset calls with inline assembly instructions that set the destination buffer's memory very efficiently, even applying further optimizations like filling with 4-byte values for as long as possible. (If the following code isn't totally syntactically correct, blame it on my not having used x86 assembly for a long time.)

lea edi,dest
;copy the fill byte to all 4 bytes of eax
mov al,fill
mov ah,al
mov dx,ax
shl eax,16
mov ax,dx
mov ecx,count
mov edx,ecx
shr ecx,2
cld
rep stosd
test edx,2
jz moveByte
stosw
moveByte:
test edx,1
jz fillDone
stosb
fillDone:

Actually this code is far more efficient than your Z80 version, as it doesn't do memory to memory, but only register to memory moves. Your Z80 code is in fact quite a hack as it relies on each copy operation having filled the source of the subsequent copy.
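The same idea can be sketched in C. This is only an illustration (fill_words is a hypothetical name, and a real memset is typically far more elaborate): it broadcasts the fill byte into a 32-bit word, stores whole words, and then finishes the 0 to 3 leftover bytes one at a time rather than with the stosw/stosb pair above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill `count` bytes at `dest` with `fill`, four bytes at a time. */
static void fill_words(void *dest, unsigned char fill, size_t count) {
    unsigned char *p = dest;
    /* Multiplying by 0x01010101 copies the byte into all four lanes,
       like the mov ah,al / shl eax,16 sequence in the assembly. */
    uint32_t pattern = (uint32_t)fill * 0x01010101u;
    size_t words = count / 4;
    size_t tail = count % 4;

    while (words--) {
        memcpy(p, &pattern, sizeof pattern); /* alignment-safe 4-byte store */
        p += sizeof pattern;
    }
    while (tail--) {
        *p++ = fill; /* remaining 0 to 3 bytes */
    }
}
```

The memcpy of a fixed 4-byte size is there to stay within the language rules on alignment and aliasing; compilers commonly turn it into a single store.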

If the compiler is halfway good, it might be able to detect more complicated C++ code that can be broken down to memset (see the post below), but I doubt that this actually happens for nested loops, probably even invoking initialization functions.
