如何改进这些大小写转换功能?

发布于 2024-09-26 16:51:21 字数 817 浏览 10 评论 0原文

作为学习练习,我的三个函数(ToggleCase、LowerCase 和 UpperCase)均期望有一个指向 ASCII 字符字符串的指针,并以空字符结尾;他们按预期工作。是否有更有效或更快的方法来完成这项任务?我是否违反了良好 C 编码的任何潜规则?我使用宏是因为我认为它使代码看起来更好并且比函数调用更有效。这是典型的还是矫枉过正的?

请随意挑剔和批评代码(但一定要友善)。

case_conversion.h

#define CASE_FLAG 32
#define a_z(c) (c >= 'a' && c <= 'z')
#define A_Z(c) (c >= 'A' && c <= 'Z')

void ToggleCase(char* c);
void LowerCase(char* c);
void UpperCase(char* c);

case_conversion.c

#include "case_conversion.h"

void ToggleCase(char* c)
{
 while (*c)
 {
  *c ^= a_z(*c) || A_Z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
void LowerCase(char* c)
{
 while (*c)
 {
  *c ^= A_Z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
void UpperCase(char* c)
{
 while (*c)
 {
  *c ^= a_z(*c) ? CASE_FLAG : 0;
  c++;
 }
}

As a learning exercise, my three functions—ToggleCase, LowerCase and UpperCase—each expect a pointer to an ASCII char string, terminated by the null character; they work as expected. Are there more efficient or faster methods of accomplishing this task? Am I breaking any unspoken rules of good C coding? I've made use of macros because, I think, it makes the code look better and it is more efficient than function calls. Is this typical or overkill?

Please feel free to nit-pick and critique the code (but do be nice).

case_conversion.h

#define CASE_FLAG 32
#define a_z(c) (c >= 'a' && c <= 'z')
#define A_Z(c) (c >= 'A' && c <= 'Z')

void ToggleCase(char* c);
void LowerCase(char* c);
void UpperCase(char* c);

case_conversion.c

#include "case_conversion.h"

void ToggleCase(char* c)
{
 while (*c)
 {
  *c ^= a_z(*c) || A_Z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
void LowerCase(char* c)
{
 while (*c)
 {
  *c ^= A_Z(*c) ? CASE_FLAG : 0;
  c++;
 }
}
void UpperCase(char* c)
{
 while (*c)
 {
  *c ^= a_z(*c) ? CASE_FLAG : 0;
  c++;
 }
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(12

前事休说 2024-10-03 16:51:21

我的最爱:

*c += (*c-'A'<26U)<<5; /* lowercase */
*c -= (*c-'a'<26U)<<5; /* uppercase */
*c ^= ((*c|32U)-'a'<26)<<5; /* toggle case */

由于您的目标将是嵌入式系统,因此您应该学习如何消除不必要的代码膨胀、分支等。确定 ascii 字符是否按字母顺序排列的条件是 4 个比较/分支操作;我的是 1。我建议查找一些关于算术和位操作技巧的好资源。

注意:在发布我的答案后,我将 *32 操作更改为 <<5,因为许多嵌入式系统编译器太差,无法为您执行此操作。当为好的编译器编写代码时,*32 可能会更好地说明您的意图。

编辑:关于我的代码有太多隐式编译器生成的操作的指控,我认为这是完全错误的。下面是任何半点像样的编译器应该为第一行生成的伪汇编:

  1. 加载 *c 并将其零或符号扩展以填充 int 大小的单词(取决于普通 char 是有符号还是无符号)。
  2. 使用无符号(非溢出捕获)sub 指令减去常量 26。
  3. 如果未设置进位标志,则有条件地跳转到代码的其余部分。
  4. 否则,将 32 添加到 *c 处的值。

步骤 2 和 3 可以在使用比较跳转操作而不是标志的架构上组合。我能看到任何显着的幕后成本突然出现的唯一方法是,如果机器无法直接寻址字符,或者它使用令人讨厌的(符号/幅度或补码)有符号值表示,在这种情况下转换为无符号这将是不平凡的。据我所知,现代嵌入式架构都不存在这些问题;它们大多与传统大型机(以及较小程度上的 DSP)隔离。

如果有人担心糟糕的编译器实际上对 <<5 执行算术,您可以尝试:

if (*c-'A'<26U) *c+=32;

而不是我的代码。无论如何,这可能更干净,但我通常喜欢避免语句,这样我就可以将代码放入循环条件或类似函数的宏中。

编辑2:根据请求,第一行的无分支版本:

*c += (64U-*c & *c-91U)>>( CHAR_BIT*sizeof(无符号)-5);

*c += (64U-*c & *c-91U)>> CHAR_BIT*sizeof(无符号)-1 << 5;

为了使其可靠地工作,c 应具有类型 unsigned char *unsigned int 应严格宽于无符号字符

My favorites:

*c += (*c-'A'<26U)<<5; /* lowercase */
*c -= (*c-'a'<26U)<<5; /* uppercase */
*c ^= ((*c|32U)-'a'<26)<<5; /* toggle case */

Since your target will be embedded systems, you should be learning how to eliminate unnecessary code bloat, branches, etc. Your conditional for determining if an ascii character is alphabetical is 4 comparison/branch operations; mine is 1. I would recommend looking up some good resources on arithmetic and bit manipulation tricks.

Note: I changed the *32 operations to <<5 after posting my answer, because a number of embedded systems compilers are too poor to do this for you. When writing code for a good compiler, *32 would probably illustrate your intent better.

Edit: Regarding the charge that my code has too many implicit compiler-generated operations, I believe that's completely false. Here is the pseudo-asm any half-decent compiler should generate for the first line:

  1. Load *c and zero- or sign-extend it to fill an int-sized word (depending on whether plain char is signed or unsigned).
  2. Subtract the constant 26 using an unsigned (non-overflow-trapping) sub instruction.
  3. Conditional-jump past the rest of the code if the carry flag is not set.
  4. Else, add 32 to the value at *c.

Steps 2 and 3 may be combined on architectures that use a comparing-jump operation instead of flags. The only way I can see any significant behind-the-scenes costs cropping up is if the machine cannot directly address chars, or if it uses a nasty (sign/magnitude or ones complement) signed value representation, in which case the conversion to unsigned would be nontrivial. As far as I know, no modern embedded architecture has these problems; they're mostly isolated to legacy mainframes (and to a lesser extent, DSPs).

If anyone is worried about bad compilers actually performing arithmetic for the <<5, you might try:

if (*c-'A'<26U) *c+=32;

instead of my code. That's probably cleaner anyway, but I generally like avoiding statements so I can shove the code inside a loop condition or function-like macro.

Edit 2: By request, a branch-free version of the first line:

*c += (64U-*c & *c-91U)>>(CHAR_BIT*sizeof(unsigned)-5);

*c += (64U-*c & *c-91U) >> CHAR_BIT*sizeof(unsigned)-1 << 5;

In order for this to work reliably, c should have type unsigned char * and unsigned int should be strictly wider than unsigned char.

桃扇骨 2024-10-03 16:51:21

您的宏至少存在两个主要问题。考虑一下如果我像这样调用其中之一会发生什么

a_z('a' + 1);

由于运算符优先级,调用将不会给出正确的结果。使用括号很容易修复:

#define a_z(c) ((c) >= 'a' && (c) <= 'z')

但也可以像这样调用它们:

a_z(i++);

此调用将递增 i 两次!这在宏中不容易修复(如果有的话)。我建议改用内联函数(如果需要 - 见下文)。

据我所知,在大写/小写之间进行转换的最快方法是使用查找表。当然,这会以内存换取速度 - 选择您的偏好,了解您的特定平台:-)

您需要两个阵列,一个用于任一方向。像这样初始化它们

char toUpper[128]; // we care only about standard ASCII
for (int i = 0; i < 128; i++)
  toUpper[i] = i;
toUpper['a'] = 'A';
...
toUpper['z'] = 'Z';

并且转换很简单:(

char toUpperCase(char c)
{
  return toUpper[c];
}

对于生产代码,应该改进这一点以将数组扩展到给定平台上所有可能的 char 值(或者将其缩小到仅合法值并进行参数检查),但为了说明,这样就可以了。)

There are at least two major problems with your macros. Consider what happens if I call one of them like

a_z('a' + 1);

The call will not give correct results due to operator precedence. This is easy to fix using brackets:

#define a_z(c) ((c) >= 'a' && (c) <= 'z')

But they can also be called like this:

a_z(i++);

This call will increment i twice! And that is not easily fixable (if at all) in a macro. I would recommend using inline functions instead (if needed - see below).

The fastest way to convert between upper/lowercase I know of is using lookup tables. Of course, this trades memory for speed - pick your preference knowing your specific platform :-)

You need two arrays, one for either direction. Initialize them like

char toUpper[128]; // we care only about standard ASCII
for (int i = 0; i < 128; i++)
  toUpper[i] = i;
toUpper['a'] = 'A';
...
toUpper['z'] = 'Z';

And the conversion is trivial:

char toUpperCase(char c)
{
  return toUpper[c];
}

(for production code, this should be improved to extend the array to all possible char values on given platform (or shrink it to only the legal values and do parameter checking), but for illustration, this will do.)

捂风挽笑 2024-10-03 16:51:21

注意:问题标题已编辑 - 原始标题是关于优化“请批评——在 C 中转换字符串大小写的最佳函数”,这解释了为什么我的答案仅涉及优化而不是一般性的“改进”功能。

如果您确实正在寻找绝对最快的方法,那么从长远来看,无分支版本将是最佳选择,因为它可以使用 SIMD。另外,它避免了使用表格(如果内存确实很有限,那么在嵌入式系统上表格可能太大)。

这是一个简单的非 SIMD 无分支示例,ToLower 是对此的一个微不足道的更改。

char BranchFree_AsciiToUpper(char inchar) 
{ 
        // Branch-Free / No-Lookup 
        // toupper() for ASCII-only 
        const int ConvertVal = 'A' - 'a'; 
        // Bits to Shift Arithmetic to Right : 9 == (char-bits + 1) 
        const int AsrBits = 9; 

        int c=(int)inchar; 
        //if( (('a'-1)<c) && (c<('z'+1)) ) { c += 'A'-'a'; } 
        int LowerBound = ('a'-1) - c; 
        int UpperBound = c - ('z' + 1); 
        int BranchFreeMask = (LowerBound & UpperBound)>>AsrBits;
        c = c + (BranchFreeMask & ConvertVal); 
        return((char)c); 
}

为了清楚起见,我的函数进行了扩展并使用非硬编码常量。您可以使用硬编码值在一行中执行相同的操作,但我喜欢可读的代码;然而,这是我的算法的“压缩”版本。它并没有更快,因为它完全相同将同样的事情“压缩”到一行

c+=(((96-(int)c)&((int)c-123))>>9)&(-32);

您可以在此处进行许多优化以使其更快。您可以对 ASCII 进行硬编码,因为该示例不假设除 az 和 AZ 之外的任何编码映射是连续范围。例如,对于 ASCII,如果您没有桶形移位器,则实际上可以将 AsrBits 更改为 4 (9-5),因为 ConvertVal 将为 +/-32,具体取决于 toupper 或 tolower 操作。

一旦您有了可用的branchfree版本,您就可以使用SIMD或位旋转SWAR(寄存器内的SIMD)技术一次转换4-16个字节(甚至可能更多取决于您的寄存器的宽度以及您是否展开以隐藏延迟)。这将比任何查找方法快得多,后者几乎仅限于单字节转换,除非您有非常大的表,并且同时处理的每个字节呈指数增长。

另外,您可以在不使用 int 向上转换的情况下生成branchfree 谓词,但随后您必须执行更多操作(使用向上转换时,每个范围只需减去一次)。您可能需要对 SWAR 进行扩展操作,但大多数 SIMD 实现都有一个比较操作,可以免费为您生成掩码。

SWAR/SIMD 操作还可以受益于更少的内存读/写操作,并且可以对齐确实发生的写操作。这对于具有加载命中存储惩罚的处理器(例如 PS3 Cell 处理器)来说要快得多。将其与展开版本中的简单预取相结合,您几乎可以完全避免内存停顿。

我知道我的示例中似乎有很多代码,但有分支(隐式或显式),因此没有分支错误预测。如果您所在的平台具有严重的分支错误预测惩罚(对于许多流水线嵌入式处理器来说都是如此),那么即使没有 SIMD,您的优化版本构建的上述代码也应该比看起来不那么复杂但创建的代码运行得更快隐式分支。

即使没有 SIMD/SWAR,智能编译器也可以展开并交错上述实现,以隐藏延迟并产生非常快的版本 - 特别是在每个周期可以发出多个非相关指令的现代超标量处理器上。对于任何分支版本来说,这通常是不可能的。

如果您手动展开,我将对加载进行分组并收集存储,以使编译器更容易在其间交错非分支非依赖指令。示例:

// Unrolled inner loop where 'char *c' is the string we're converting
char c0=c[0],c1=c[1],c2=c[2],c3=c[3];  // Grouped-Loads
c[0]=BranchFree_AsciiToUpper(c0);
c[1]=BranchFree_AsciiToUpper(c1);
c[2]=BranchFree_AsciiToUpper(c2);
c[3]=BranchFree_AsciiToUpper(c3);
c+=4;

一个像样的编译器应该能够内联 ToUpper 并完全交错上述代码,因为每个代码之间没有分支,没有内存别名,也没有相互依赖的指令。只是为了好玩,我决定实际编译它,一个针对 PowerPC 的编译器为双问题超标量核心生成完美的交错,这将轻松超越任何具有分支的代码

mr               r31,r3
mr               r13,r13
lbz              r11,0(r31)
lbz              r10,1(r31)
extsb            r11,r11
lbz              r9,2(r31)
extsb            r10,r10
lbz              r8,3(r31)
subfic           r7,r11,96
addi             r6,r11,-123
srawi            r5,r7,9
srawi            r4,r6,9
subfic           r3,r10,96
addi             r7,r10,-123
extsb            r9,r9
srawi            r6,r3,9
srawi            r3,r7,9
subfic           r7,r9,96
addi             r30,r9,-123
extsb            r8,r8
srawi            r7,r7,9
srawi            r30,r30,9
subfic           r29,r8,96
addi             r28,r8,-123
srawi            r29,r29,9
srawi            r28,r28,9
and              r5,r5,r4
and              r3,r6,r3
and              r7,r7,r30
and              r30,r29,r28
clrrwi           r4,r5,5
clrrwi           r6,r7,5
clrrwi           r5,r3,5
clrrwi           r7,r30,5
add              r4,r4,r11
add              r3,r5,r10
add              r11,r6,r9
stb              r4,0(r31)
add              r10,r7,r8
stb              r3,1(r31)
stb              r11,2(r31)
stb              r10,3(r31)

证据就在布丁中,即使在进入 SWAR 或 SIMD 之前,与分支版本相比,上面编译的代码也会非常快。

总而言之,这应该是最快方法的原因:

  1. 没有分支错误预测惩罚
  2. 能够一次针对 4-16(或更多)字节进行 SIMD 化算法
  3. 编译器(或程序员)可以展开和交错消除延迟并利用超标量(多问题)处理器
  4. 无内存延迟(即表查找)

NOTE: The Question Title was edited - original title was about optimization "Please Critique-- An optimal function for converting string cases in C" which explains why my answer deals with optimization only rather than generically "improving" the functions.

If you are really looking for the absolute fastest way to do it, a branch-free version is going to be the way to go in the long run because it can use SIMD. Plus it avoids having tables (which may be too large on an embedded system if memory is really cramped).

Here is a simple non-SIMD branch-free example and ToLower is a trivial change from this.

char BranchFree_AsciiToUpper(char inchar) 
{ 
        // Branch-Free / No-Lookup 
        // toupper() for ASCII-only 
        const int ConvertVal = 'A' - 'a'; 
        // Bits to Shift Arithmetic to Right : 9 == (char-bits + 1) 
        const int AsrBits = 9; 

        int c=(int)inchar; 
        //if( (('a'-1)<c) && (c<('z'+1)) ) { c += 'A'-'a'; } 
        int LowerBound = ('a'-1) - c; 
        int UpperBound = c - ('z' + 1); 
        int BranchFreeMask = (LowerBound & UpperBound)>>AsrBits;
        c = c + (BranchFreeMask & ConvertVal); 
        return((char)c); 
}

My function is expanded for clarity and uses non-hardcoded constants. You can do the same thing in a single line with hardcoded values but I like readable code; however, here is the "compressed" version of my algorithm. It's not any faster since it does the EXACT same thing "compressed" to one line.

c+=(((96-(int)c)&((int)c-123))>>9)&(-32);

There are a number of optimizations you can make here to make it even faster. You can hardcode more optimal numbers for ASCII because the example doesn't assume any encoding mapping other than a-z and A-Z are contiguous ranges. For example with ASCII, if you don't have a barrel shifter, you can actually change the AsrBits to 4 (9-5) since ConvertVal will be +/-32 depending on the toupper or tolower operation.

Once you have working branchfree versions, you can use SIMD or bit-twiddling SWAR (SIMD Within A Register) techniques to convert 4-16 bytes at a time (or even possibly more depending how wide your registers go and if you unroll to hide latency). This will be much faster than any lookup method which is pretty much limited to single-byte conversion unless you have immensely large tables which grow exponential per byte processed simultaneously.

Also, you can generate the branchfree predicate without using int upcasting but then you have to do a couple more operations (with upcasting it's just one subtract per range). You may need to do the expanded operations for SWAR but most SIMD implementations have a compare operation that will generate a mask for you for free.

The SWAR/SIMD operations also can benefit from fewer reads/writes to memory and the writes that do occur can be aligned. This is much faster on processors that have load-hit-store penalties (such as the PS3 Cell Processor). Combine this with simple prefetching in an unrolled version and you can avoid memory stalls nearly altogether.

I know it seems like there is a lot of code in my example there but there are ZERO branches (implicit or explicit) and no branch mispredictions as a result. If you are on a platform with significant branch mispredict penalties (which is true for many pipelined embedded processor), then even without SIMD, your optimized release build of the above code should run faster than something that seems much less complicated but creates implicit branches.

Even without SIMD/SWAR, it is possible for a smart compiler to unroll and interleave the above implementation to hide latencies and result in a very fast version - especially on modern superscalar processors that can issue more than one non-dependent instruction per cycle. This is not usually possible with any of the branching versions.

If you manually unroll, I would group loads and gather stores to make it easier for the compiler to interleave the non-branching non-dependent instructions in between. Example:

// Unrolled inner loop where 'char *c' is the string we're converting
char c0=c[0],c1=c[1],c2=c[2],c3=c[3];  // Grouped-Loads
c[0]=BranchFree_AsciiToUpper(c0);
c[1]=BranchFree_AsciiToUpper(c1);
c[2]=BranchFree_AsciiToUpper(c2);
c[3]=BranchFree_AsciiToUpper(c3);
c+=4;

A decent compiler should be able to inline the ToUpper and fully interleave the above code since there are no branches, no memory aliasing, and no codependent instructions between each one. Just for kicks, I decided to actually compile this and a compiler that targetted PowerPC generated perfect interleaving for dual-issue superscalar core that will easily outperform any code with branches.

mr               r31,r3
mr               r13,r13
lbz              r11,0(r31)
lbz              r10,1(r31)
extsb            r11,r11
lbz              r9,2(r31)
extsb            r10,r10
lbz              r8,3(r31)
subfic           r7,r11,96
addi             r6,r11,-123
srawi            r5,r7,9
srawi            r4,r6,9
subfic           r3,r10,96
addi             r7,r10,-123
extsb            r9,r9
srawi            r6,r3,9
srawi            r3,r7,9
subfic           r7,r9,96
addi             r30,r9,-123
extsb            r8,r8
srawi            r7,r7,9
srawi            r30,r30,9
subfic           r29,r8,96
addi             r28,r8,-123
srawi            r29,r29,9
srawi            r28,r28,9
and              r5,r5,r4
and              r3,r6,r3
and              r7,r7,r30
and              r30,r29,r28
clrrwi           r4,r5,5
clrrwi           r6,r7,5
clrrwi           r5,r3,5
clrrwi           r7,r30,5
add              r4,r4,r11
add              r3,r5,r10
add              r11,r6,r9
stb              r4,0(r31)
add              r10,r7,r8
stb              r3,1(r31)
stb              r11,2(r31)
stb              r10,3(r31)

The proof is in the pudding and the above compiled code gonna be really fast compared to branching versions even before going to SWAR or SIMD.

In summary, reasons why this should be the fastest method:

  1. No branch misprediction penalties
  2. Ability to SIMD-ify algorithm for 4-16 (or more) bytes at a time
  3. Compiler (or programmer) can unroll and interleave to eliminate latencies and to take advantage of superscalar (multi-issue) processors
  4. No memory latencies (i.e. table lookups)
梨涡 2024-10-03 16:51:21

我犹豫是否要回答这个问题,因为我已经 20 多年没有使用小型设备了。然而,我认为规则几乎是相同的(可能有一个补充):

  1. 最小化内存访问
  2. 最小化 CPU 周期
  3. 最小化代码大小

当我开发低级代码时,规则 #1 掩盖了所有其他规则。没有任何板载缓存,并且内存相对于处理器来说慢得令人难以置信;这就是 C 语言中存在“寄存器”存储类的原因。如今情况有所改变,但它仍然是最重要的两个问题之一。正如我在一篇文章中评论的那样,查找表是一个好主意,但要认识到这意味着每次测试都需要额外的内存访问。一旦它进入缓存,这可能不是问题,但每次进入该函数时,您都将付出几次缓存命中的代价(除非您经常调用它,以致查找表可以保留在缓存中) 。

第 2 条规则看起来像是“呃,你当然想这样做,为什么不是第 1 条规则呢?”但推理实际上更深入。事实上,在某些方面,它是规则#1 的重述,因为每条指令在执行之前都必须从内存中获取。这里有一个微妙的权衡:在纯整数处理器上,使用查找表来计算三角函数显然是有利的;在具有嵌入式浮点的芯片上,也许不是。

我不确定规则#3 是否仍然适用。根据我的经验,总是需要争先恐后地削减代码,将众所周知的 20 磅装入 10 磅的袋子中。但今天最小的麻袋似乎有 50 磅。然而,即使使用 50 磅的麻袋(或许多兆字节的 ROM)来保存代码/数据,您仍然需要将其拉入缓存(如果有的话)。

新规则#1:保持管道满

现代处理器具有深度指令管道(如果您不熟悉这个术语,请参阅这篇文章:http://arstechnica.com/old/content/2004/09/pipelined-1.ars/1 )。深层管道的一般经验法则是分支(“if”测试)的成本很高,因为这意味着可能必须刷新管道才能加载新代码。因此,您编写代码以在不太可能的情况下进行分支(请参阅 Adisak 的帖子,了解可能合理的无分支实现;如果可以的话,+1)。

比我更有经验的人可能会评论说,“现代处理器用两个分支加载管道,因此没有成本损失。”这一切都很好,但它提出了一个总体规则:

规则 0 :优化取决于您的架构和工作负载

我的洗碗机内的微处理器可能没有管道,也可能没有缓存。当然,它可能也不会进行大量的文本处理。或者也许两者都有;市场上似乎只有少数几种主要的嵌入式处理器,因此电路板上可能有 Pentium,而不是 8051 衍生产品。即便如此,即使在基于奔腾的嵌入式处理器中,也有很广泛的范围(http://en.wikipedia.org/wiki/List_of_Intel_Pentium_microprocessors#Embedded_processors)。对一个人来说最好的事情可能对另一个人来说不一定是最好的。

然后就是您正在处理什么类型的数据的问题。如果您正在处理文本,那么很可能(但不能保证)您的大部分数据都是字母,而不是数字或标点符号;这样你就可以对此进行优化。

然而,还有更多:我评论道“仅限 ASCII,嗯?”在OP上;另一位评论者的说法更明确:如果您在 2010 年处理文本,那么您可能不会处理 ASCII。至少,您将处理 ISO-8859-1 或类似的 8 位字符集。在这种情况下,也许无分支或智能分支(注意管道)解决方案仍然比查找表更快(是的,这是我的假设)。但是,如果您正在处理 Unicode BMP(16 位),则几乎必须使用该表,无论其内存成本如何,因为没有简单的规则来确定什么是小写,什么是大写。如果您正在处理 Unicode 的更高层面......那么,也许“旧斜体”的大写并不那么重要(特别是因为它没有大小写)。

最终,确定的唯一方法是分析给定的实际工作负载。

最后:清除代码 FTW

这篇文章开始于我向 OP 写了一条评论,说他/她使用宏是一个坏主意(并且无法输入它,因为 SO 进入了维护模式)。 Peter Torok(抱歉,我不支持 Unicode,甚至不支持 ISO-8859-1)给出了一个原因,但还有另一个原因:它们是黑匣子。

OP 看起来漂亮干净:代码短,大量使用按位和三元运算符,如果您了解该语言,则很容易理解。但如果您看到 A_Z 的扩展形式,那么理解实际工作就会容易得多。这可能会让您思考自己做了多少分支,特别是在 ToggleCase 方法中。然后您可能已经考虑过如何重新​​安排这些分支以最大程度地减少您正在执行的实际测试的数量。也许还考虑过维护管道。

I hesitated to answer this because it's been over 20 years since I worked with small devices. However, I think the rules are pretty much the same (with one possible addition):

  1. Minimize memory accesses
  2. Minimize CPU cycles
  3. Minimize code size

When I was developing low-level code, rule #1 overshadowed all the others. There wasn't any on-board cache, and memory was incredible slow relative to the processor; that's the reason that the "register" storage class exists in C. Today the situation has changed somewhat, but it's still one of the top two concerns. As I commented on one post, a lookup table is a good idea, but recognize that it means an extra memory access for each test. Once it gets into cache that may not be an issue, but you're going to pay the price of a several cache hits each time you enter the function (unless you're calling it so often that the lookup table can remain in cache).

Rule number 2 seems like "duh, of course you want to do that, why isn't it rule #1?" but the reasoning actually goes deeper. In fact, in some ways it's a restatement of rule #1, since each instruction has to be fetched from memory before it can be executed. There's a delicate tradeoff: on an integer-only processor, it's a clear win to use a lookup table to compute trigonometric functions; on a chip with embedded floating point, maybe not.

I'm not sure that rule #3 still applies. In my experience, there was always the scramble to cut code, fit the proverbial 20 pounds into a 10 pound sack. But it seems that today the smallest sack is 50 pounds. However, even with a 50 pound sack (or many-megabyte ROM) to hold your code/data, you still need to pull it into the cache (if you have one).

The new rule #1: keep the pipeline full

Modern processors have deep instruction pipelines (if you're not familiar with this term, see this article: http://arstechnica.com/old/content/2004/09/pipelining-1.ars/1). The general rule of thumb with deep pipelines is that the branching -- an "if" test -- is expensive, because it means that the pipeline might have to be flushed to load in the new code. So you write your code to branch on the unlikely case (see Adisak's post for a perhaps-justified no-branch implementation; +1 if I could).

Someone with more recent experience than me will probably comment, and say "modern processors load the pipeline with both branches, so there's no cost penalty."Which is all well and good, but it brings up an overarching rule:

Rule 0: optimization depends on your architecture and workload

The microprocessor inside my dishwasher probably doesn't have a pipeline and maybe doesn't have a cache. Of course, it's probably not going to do a lot of text processing either. Or maybe it has both; it seems that there are only a few major embedded processors in the market, so maybe there's a Pentium on that circuit board rather than an 8051 derivative. Even so, there's a wide range even within the Pentium-based embedded processors (http://en.wikipedia.org/wiki/List_of_Intel_Pentium_microprocessors#Embedded_processors). What is best for one might not be best for another.

Then there's the issue of what sort of data you're processing. If you're processing text, then it's likely (but not guaranteed) that most of your data will be letters, versus numbers or punctuation; so you can optimize for that.

However, there's more: I commented "ASCII only, huh?" on the OP; another commenter was more explicit: if you're processing text in 2010, you probably aren't processing ASCII. At the very least, you'll be dealing with ISO-8859-1 or a similar 8-bit character set. And in this case, maybe a no-branch or smart-branch (paying attention to the pipeline) solution will still be faster than a lookup table (yes, that's an assumption on my part). But if you're dealing with Unicode BMP (16 bits), you'll pretty much have to use the table, whatever its cost in terms of memory, because there are no easy rules to determine what's lower- versus upper-case. And if you're dealing with the higher planes of Unicode ... well, maybe capitalization of "Old Italics" isn't that important (especially because it doesn't have upper- and lower-case).

Ultimately, the only way to know for sure is to profile given realistic workloads.

Finally: Clear Code FTW

This post started when I wrote a comment to the OP that his/her use of macros was a bad idea (and couldn't enter it because SO went into maintenance mode). Peter Torok (sorry, I don't support Unicode or even ISO-8859-1) gave one reason, but there's another: they're black boxes.

The OP looks nice and clean: short code, heavy use of bitwise and ternary operators, easy to understand if you understand the language. But it would have been a lot easier to understand the actual work if you saw A_Z in its expanded form. That might have gotten you thinking about how much branching you were doing, particular in the ToggleCase method. And then you might have given some thought to how you could re-arrange those branches to minimize the number of actual tests that you're doing. And perhaps given some thought to maintaining the pipeline.

℉服软 2024-10-03 16:51:21

好的,开始了。在此选项卡上写入...在另一个选项卡上滚动浏览代码:-)

header

  1. #define a_z(c) (c >= 'a' && c <= 'z' )

    • 宏等函数的名称应全部大写(可能是 IS_LOWERCASE),以提醒用户这是一个宏
    • 扩展中的

    • c应该在括号内,以防止奇怪的副作用
    • 个人选择:我喜欢重新排序条件,使其读起来更像英语 'a' <= c <= 'z' as (('a' <= (c)) && ((c) <= 'z'))
  2. 我会让函数 void ToggleCase(char* c) 返回一个 char* (与发送的相同)为了能够按顺序使用它们: printf("%s\n", UpperCase(LowerCase("FooBar")));

源代码

  1. 三元运算符不会使您的代码更快或更容易阅读。我会写一个简单的 if

就是这样。

哦!还有一件事:您的代码假定 ASCII(您自己这么说的),但没有记录这一点。我会在头文件中添加对此的注释。

Ok, here goes. Writing on this tab ... scrolling through your code on the other tab :-)

header

  1. #define a_z(c) (c >= 'a' && c <= 'z')

    • the name of the function like macro should be in ALL CAPS (maybe IS_LOWERCASE) to alert users it's a macro
    • c in the expansion should be inside parenthesis to prevent strange side-effects
    • personal choice: I like to reorder the conditions to read more like the english 'a' <= c <= 'z' as (('a' <= (c)) && ((c) <= 'z'))
  2. I'd make the functions void ToggleCase(char* c) return a char* (the same that was sent in) to be able to use them in sequence: printf("%s\n", UpperCase(LowerCase("FooBar")));

source code

  1. The ternary operator does not make your code faster or easier to read. I'd write a simple if

That's it.

Oh! One more thing: Your code assumes ASCII (you said so yourself) but doesn't document that. I'd add a note about that in the header file.

你是年少的欢喜 2024-10-03 16:51:21

也许我是一个扫兴的人,因为这被认为是一个学习练习,但学习的关键部分应该是学会有效地使用你的工具。

ANSI C 在标准库中包含了必要的函数,并且编译器供应商可能已针对您的体系结构对它们进行了大量优化。

标准头文件 ctype.h 包括函数 tolower() 和 toupper()。

Maybe I'm being a party pooper, since this was said to be a learning exercise, but a key part of learning should be learning to use your tools efficiently.

ANSI C includes the necessary functions in the standard library, and presumably they have been heavily optimized for your architecture by the compiler vendor.

Standard header ctype.h includes the functions tolower() and toupper().

瑾兮 2024-10-03 16:51:21

首先,我建议将 a_zA_Z 重命名为 is_ASCII_Lowercaseis_ASCII_Uppercase 。它不像C-ish,但更容易理解。

另外,使用 ^=?: 是有效的,但我再次发现它比简单的 if 语句可读性较差。

First thing is I'd say rename a_z and A_Z to something like is_ASCII_Lowercase and is_ASCII_Uppercase. It's not as C-ish, but it's easier to understand.

Also, the use of ^= and ?: works, but again, I find it less readable than a simple if-statement.

箜明 2024-10-03 16:51:21

也许我在 C++ 上花费了太多时间,而在 C 上花费的时间不够,但我不太喜欢带有参数的宏……正如 Peter Torok 指出的那样,它们可能会导致一些问题。您对 CASE_FLAG 的定义没问题(它不带任何参数),但我会用函数替换宏 a_z 和 A_Z 。

Perhaps I've spent too much time with C++ and not enough with C, but I'm not a big fan of macros that have parameters... as Peter Torok points out they can lead to some problems. Your definition of CASE_FLAG is okay (it doesn't take any parameters), but I would replace the macros a_z and A_Z with functions instead.

浸婚纱 2024-10-03 16:51:21

怎么样(几乎有效):

char slot[] = { 0, 31, 63, 63 };
*c = slot[*c/32] + *c%32;

您可以更改一些事情:

*c += a_z(*c)*CASE_FLAG; // adds either zero or three two
// you could also replace multiplication with the shift (1<<5) trick

字符串实际上是数组:

char upper[] = "ABC..ABC..."; // 
...
*c = upper[*c+offset];

char upper[] = "ABC.."; // 
...
*c = upper[*c%32];

*c = 'A' + *c%32;

或其他...

how about (almost works):

char slot[] = { 0, 31, 63, 63 };
*c = slot[*c/32] + *c%32;

Couple things you could change:

*c += a_z(*c)*CASE_FLAG; // adds either zero or three two
// you could also replace multiplication with the shift (1<<5) trick

strings are actually arrays:

char upper[] = "ABC..ABC..."; // 
...
*c = upper[*c+offset];

or

char upper[] = "ABC.."; // 
...
*c = upper[*c%32];

or

*c = 'A' + *c%32;

or whatever ...

征棹 2024-10-03 16:51:21

我的方法是“只在需要时修剪”。

根据您的系统和 CPU 架构,很多事情可以以不同的方式完成。

我有一些与您的代码相关的设计点。首先,宏。宏有一些残酷的陷阱,应该谨慎使用。其次,使用全局来切换大小写。我会重写成这样 -

 enum CASE {UPPER, LOWER};

void ToggleCase(char* c, CASE newcase)
{
    if(newcase == UPPER)
       UpperCase(c);
    else if(newcase == LOWER)
       LowerCase(c);
    else 
       { ; } //null
}

从微观效率的角度来看,这为每次调用增加了大约 1 条额外的指令。还可能发生一些分支,这可能会导致缓存未命中。

void LowerCase(char* c)
{
  while (*c++)  //standard idiom for moving through a string.
  {
    *c = *c < 'Z' ? *c + 32 : *c;
  }
}


void UpperCase(char* c)
{
  while (*c++)
  {
    *c = *c > 'a' ? *c - 32 : *c;
  }
}

现在,我的代码受到了一些批评。

首先,它是分枝的。其次,假设输入是 [a-zA-Z]+。第三,它仅支持 ASCII(那么 EBDIC 呢?)。第四,它假设空终止(有些字符串在字符串开头有一些字符 - 我认为是 Pascal)。第五,代码的大写/小写并不是 100% 天真明显的。另请注意,ENUM 是一个隐藏得很深的整数。您可以传递 ToggleCase("some string", 1024) ,它将编译。

这些并不是说我的代码非常糟糕。它会发挥作用,并将继续发挥作用——只是在某些条件下。

My approach is "trim only when needed".

Depending on your system and your cpu architecture, a lot of things could be done differently.

There are a few design points I would have in relation to your code. First, macros. Macros have some brutal pitfalls, and should be used with care. Second, the use of a global to toggle case. I would rewrite to look something like this -

 enum CASE {UPPER, LOWER};

void ToggleCase(char* c, CASE newcase)
{
    if(newcase == UPPER)
       UpperCase(c);
    else if(newcase == LOWER)
       LowerCase(c);
    else 
       { ; } //null
}

In the micro-efficiency sense, this adds about 1 extra instruction per call. There is also some branching that can happen, which can potentially cause a cache miss.

void LowerCase(char* c)
{
  while (*c++)  //standard idiom for moving through a string.
  {
    *c = *c < 'Z' ? *c + 32 : *c;
  }
}


void UpperCase(char* c)
{
  while (*c++)
  {
    *c = *c > 'a' ? *c - 32 : *c;
  }
}

Now, there are some criticisms of my code.

First, it's branchy. Second, it's assuming that the input is [a-zA-Z]+. Third, it's ASCII-only (What about EBDIC?). Fourth, it's assuming null termination (Some strings have some characters at the start of the string - Pascal I think). Fifth, it's not 100% naively obvious that the code upper/lowercases. Also note that an ENUM is a badly-veiled integer. You can pass ToggleCase("some string", 1024) and it will compile.

These things aren't to say my code is super bad. It serves and will serve - just under some conditions.

安静被遗忘 2024-10-03 16:51:21

我使用宏是因为我认为它使代码看起来更好并且比函数调用更有效。

是不是更有效率?您对代码大小有什么要求? (对于生成的可执行代码,而不是 C 源代码。)在现代桌面系统上,这很少是问题,速度更重要;但除了“嵌入式系统应用程序”之外,您没有向我们提供任何更多详细信息,因此我们无法为您回答这个问题。然而,这在这里不是问题,因为宏内的代码确实非常小,但您不能认为避免函数调用总是更有效!

如果允许,您可以使用内联函数。自 99 年以来,它们已正式成为 C 的一部分,但在多个编译器中支持的时间要长得多。内联函数比宏干净得多,但是,根据您的具体目标要求,可能很难从源代码中预测生成的代码。然而,更常见的是,人们被困在不支持它们的过时(现在已经十多年了!)的 C 编译器中。

简而言之,您始终必须了解自己的确切要求才能确定最佳方案。然后您必须测试以验证您的性能预测

I've made use of macros because, I think, it makes the code look better and it is more efficient than function calls.

Is it more efficient? What are your requirements on code size? (For the generated executable code, not the C source code.) On modern desktop systems, that's rarely an issue and speed matters much more; but you've not given us any more details beyond "embedded systems applications," so there's no way we can answer this for you. However, it's not an issue here, because the code inside the macros is truly so small—but you can't assume that avoiding function calls is always more efficient!

You can use inline functions, if you're allowed. They've been officially part of C since '99, but supported for far longer in several compilers. Inline functions are much cleaner than macros, but, again depending on your exact target requirements, it can be difficult to predict the generated code from the source. More commonly, however, people are stuck with outdated (now over a decade!) C compilers that don't support them.

In short, you always have to know your exact requirements to determine what's optimal. And then you have to test to verify your performance predictions.

寒冷纷飞旳雪 2024-10-03 16:51:21

如果尝试一次处理多个字节,我认为最好的方法是强制所有值均为 0..127,添加 5 或 37(这将使 'z' 到 'Z' 变为 127),请注意值,然后添加 26,记下该值,然后进行一些修改。类似于:

unsigned long long orig,t1,t2,result;

t1 = (orig & 0x7F7F7F7F7F7F7F7F) + 0x0505050505050505;
t2 = t1 + 0x1A1A1A1A1A1A1A1A;
result = orig ^ ((~(orig | t1) & t2 & 0x8080808080808080) >> 2);

嗯...我想即使适用于 32 位机器,效果也相当不错。如果四个寄存器预加载了适当的常量,则 ARM 可以使用最佳代码来执行需要七个周期的七个指令的操作;我怀疑编译器会找到优化(或者发现将常量保留在寄存器中会有帮助 - 如果常量不保存在寄存器中,单独处理字节会更快)。

If one is trying to process multiple bytes at once, I think the best approach would be to force all values to be 0..127, add 5 or 37 (which would make 'z' to 'Z' become 127), note that value, and then add 26, note that value, and then do some munging. Something like:

unsigned long long orig,t1,t2,result;

t1 = (orig & 0x7F7F7F7F7F7F7F7F) + 0x0505050505050505;
t2 = t1 + 0x1A1A1A1A1A1A1A1A;
result = orig ^ ((~(orig | t1) & t2 & 0x8080808080808080) >> 2);

Hmm... I guess that works out pretty well, even if adapted for a 32-bit machine. If four registers are preloaded with the proper constants, an ARM could with optimal code probably do the operations with seven instructions taking seven cycles; I doubt a compiler would find the optimizations (or figure out that keeping the constants in registers would be helpful--if the constants aren't kept in registers, processing bytes singly would be faster).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文