Best way to convert a whole file to lowercase in C

Posted on 2024-11-26 04:59:58

I was wondering if there is a really good (performant) solution for converting a whole file to lowercase in C.
Currently I use fgetc, convert each character to lowercase, and write it to a temporary file with fputc. At the end I remove the original and rename the temp file to the original's name. But I think there must be a better solution.
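
For context, here is a minimal sketch of the character-by-character approach described above (the function name and the error handling are illustrative additions, not part of the original question):

#include <ctype.h>
#include <stdio.h>

/* Baseline: copy 'in' to 'out', lowercasing one character at a time. */
int lowercase_per_char(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (fputc(tolower(c), out) == EOF)
            return 1;               /* write error */
    }
    return ferror(in) ? 1 : 0;      /* 0 on clean EOF, 1 on read error */
}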

Comments (5)

深居我梦 2024-12-03 04:59:58

This doesn't really answer the question (community wiki), but here's an (over?)-optimized function to convert text to lowercase:

#include <assert.h>
#include <ctype.h>
#include <stdio.h>

int fast_lowercase(FILE *in, FILE *out)
{
    char buffer[65536];
    size_t readlen, wrotelen;
    char *p, *e;
    char conversion_table[256];
    int i;

    /* Precompute tolower() for every possible byte value. */
    for (i = 0; i < 256; i++)
        conversion_table[i] = tolower(i);

    for (;;) {
        readlen = fread(buffer, 1, sizeof(buffer), in);
        if (readlen == 0) {
            if (ferror(in))
                return 1;
            assert(feof(in));
            return 0;
        }

        /* Convert the chunk in place with one table lookup per byte. */
        for (p = buffer, e = buffer + readlen; p < e; p++)
            *p = conversion_table[(unsigned char) *p];

        wrotelen = fwrite(buffer, 1, readlen, out);
        if (wrotelen != readlen)
            return 1;
    }
}

This isn't Unicode-aware, of course.

I benchmarked this on an Intel Core 2 T5500 (1.66GHz), using GCC 4.6.0 and i686 (32-bit) Linux. Some interesting observations:

  • It's about 75% as fast when buffer is allocated with malloc rather than on the stack.
  • It's about 65% as fast using a conditional rather than a conversion table.
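
One possible way to wire this into the temp-file workflow from the question (the driver and the file names are illustrative additions, assuming fast_lowercase above is in the same translation unit):

#include <stdio.h>

int main(void)
{
    FILE *in = fopen("input.txt", "rb");
    FILE *out = fopen("input.txt.tmp", "wb");
    if (!in || !out)
        return 1;

    int failed = fast_lowercase(in, out);
    fclose(in);
    if (fclose(out) != 0)
        failed = 1;

    if (!failed) {
        /* Replace the original only once the new file is known to be complete. */
        remove("input.txt");
        rename("input.txt.tmp", "input.txt");
    }
    return failed;
}
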
万劫不复 2024-12-03 04:59:58

I'd say you've hit the nail on the head. Using a temp file means you don't delete the original until you're sure you're done processing it, so if an error occurs the original remains intact. I'd say that's the correct way of doing it.

As suggested by another answer, if the file size permits you can memory-map the file via the mmap function and have it readily available in memory. (There's no real performance difference if the file is smaller than a page, since it will probably be read into memory on the first read anyway.)
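
A rough sketch of the mmap idea, lowercasing the file in place (my own illustration, not from the original answer; note that in-place editing gives up the keep-the-original-on-error safety discussed above, and error handling is abbreviated):

#include <ctype.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int lowercase_inplace(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return 1;
    }
    if (st.st_size == 0) {      /* empty file: nothing to do */
        close(fd);
        return 0;
    }

    unsigned char *p = mmap(NULL, (size_t) st.st_size,
                            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return 1;
    }

    for (off_t i = 0; i < st.st_size; i++)
        p[i] = (unsigned char) tolower(p[i]);

    munmap(p, (size_t) st.st_size);
    close(fd);
    return 0;
}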

埖埖迣鎅 2024-12-03 04:59:58

You can usually go a little faster on big inputs by using fread and fwrite to read and write big chunks of the input/output. Also, you should probably read a bigger chunk (the whole file, if possible) into memory, convert it there, and then write it all out at once.

Edit: I just remembered one more thing. Sometimes programs can be faster if you select a prime number (at the very least not a power of 2) as the buffer size. I seem to recall this has to do with specifics of the caching mechanism.
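
A rough sketch of the "whole file at once" variant (the function name is an illustrative addition; this assumes the file fits in memory and that its size can be obtained with fseek/ftell, which works for regular files opened in binary mode):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Read all of 'in', lowercase it in memory, write it to 'out' in one call. */
int lowercase_whole_file(FILE *in, FILE *out)
{
    if (fseek(in, 0, SEEK_END) != 0)
        return 1;
    long size = ftell(in);
    if (size < 0 || fseek(in, 0, SEEK_SET) != 0)
        return 1;

    unsigned char *buf = malloc(size ? (size_t) size : 1);
    if (!buf)
        return 1;

    size_t n = fread(buf, 1, (size_t) size, in);
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char) tolower(buf[i]);

    int failed = (fwrite(buf, 1, n, out) != n);
    free(buf);
    return failed;
}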

梦魇绽荼蘼 2024-12-03 04:59:58

If you're processing big files (big as in, say, multi-megabytes) and this operation is absolutely speed-critical, then it might make sense to go beyond what you've inquired about. One thing to consider in particular is that a character-by-character operation will perform less well than using SIMD instructions.

I.e. if you were to use SSE2, you could code the parallel case-conversion loop like this (pseudocode):

for (cur_parallel_word = begin_of_block;
     cur_parallel_word < end_of_block;
     cur_parallel_word += parallel_word_width) {
    /*
     * in SSE2, parallel compares are either about 'greater' or 'equal'
     * so '>=' and '<=' have to be constructed. This would use 'PCMPGTB'.
     * The 'ALL' macro is supposed to replicate into all parallel bytes.
     */
    mask1 = parallel_compare_greater_than(*cur_parallel_word, ALL('A' - 1));
    mask2 = parallel_compare_greater_than(ALL('Z'), *cur_parallel_word);
    /*
     * vector op - and all bytes in two vectors, 'PAND'
     */
    mask = mask1 & mask2;
    /*
     * vector op - add a vector of bytes. Would use 'PADDB'.
     */
    new = parallel_add(cur_parallel_word, ALL('a' - 'A'));
    /*
     * vector op - zero bytes in the original vector that will be replaced
     */
    *cur_parallel_word &= ~mask;           // that'd become 'PANDN'
    /*
     * vector op - extract characters from new that replace old, then or in.
     */
    *cur_parallel_word |= (new & mask);    // PAND / POR
}

I.e. you'd use parallel comparisons to check which bytes are uppercase, and then mask both the original value and the converted (lowercased) version (one with the mask, the other with its inverse) before you OR them together to form the result.

If you use mmap'ed file access, this could even be performed in-place, saving on the bounce buffer, and saving on many function and/or system calls.

There is a lot to optimize when your starting point is a character-by-character 'fgetc' / 'fputc' loop; even shell utilities are highly likely to perform better than that.

But I agree that if your need is very special-purpose (i.e. something as clear-cut as ASCII input to be converted to lowercase), then a handcrafted loop as above, using vector instruction sets (like SSE intrinsics/assembly, ARM NEON, or PPC Altivec), is likely to make a significant speedup possible over existing general-purpose utilities.
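
For concreteness, here is one way the pseudocode above might map onto real SSE2 intrinsics (my own sketch, not part of the original answer; the function name, the unaligned loads, and the scalar tail are illustrative choices):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch: lowercase ASCII 'A'..'Z' in buf[0..len), 16 bytes at a time. */
static void lowercase_sse2(unsigned char *buf, size_t len)
{
    const __m128i lo    = _mm_set1_epi8('A' - 1);    /* 64 */
    const __m128i hi    = _mm_set1_epi8('Z' + 1);    /* 91 */
    const __m128i delta = _mm_set1_epi8('a' - 'A');  /* 32 */
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i v     = _mm_loadu_si128((const __m128i *) (buf + i));
        __m128i mask1 = _mm_cmpgt_epi8(v, lo);        /* v > 'A'-1  (PCMPGTB) */
        __m128i mask2 = _mm_cmplt_epi8(v, hi);        /* v < 'Z'+1            */
        __m128i mask  = _mm_and_si128(mask1, mask2);  /* uppercase bytes (PAND) */
        __m128i conv  = _mm_add_epi8(v, delta);       /* +32 on every byte (PADDB) */
        /* keep old bytes where mask is clear, converted bytes where it is set */
        v = _mm_or_si128(_mm_andnot_si128(mask, v),   /* PANDN */
                         _mm_and_si128(mask, conv));  /* PAND / POR */
        _mm_storeu_si128((__m128i *) (buf + i), v);
    }

    /* scalar tail for the last len % 16 bytes */
    for (; i < len; i++)
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] = (unsigned char) (buf[i] + ('a' - 'A'));
}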

七度光 2024-12-03 04:59:58

Well, you can definitely speed this up a lot, if you know what the character encoding is. Since you're using Linux and C, I'm going to go out on a limb here and assume that you're using ASCII.

In ASCII, we know A-Z and a-z are contiguous and always 32 apart. So what we can do is skip the safety checks and locale handling of tolower() and do something like this:

(pseudocode)
foreach char c in the file:
    c += 32    // only valid if every character is an uppercase letter

Or, if there may be uppercase and lowercase letters (and other characters), do a check like

if (c > 64 && c < 91) // the uppercase ASCII range

and only then do the addition before writing it out to the file.

Also, batch writes are faster, so I would suggest first writing to an array, then all at once writing the contents of the array to the file.

This should be considerably faster.
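
A minimal sketch of that idea over an in-memory buffer (an illustrative addition; ASCII only, as assumed above):

#include <stddef.h>

/* Lowercase ASCII letters in place: no locale, no table, just a range
   check on the uppercase range and the fixed +32 offset. */
static void ascii_lowercase(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 64 && buf[i] < 91)        /* 'A'..'Z' */
            buf[i] = (unsigned char) (buf[i] + 32);
    }
}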
