Best way to convert a whole file to lowercase in C

Posted on 2024-11-26 04:59:58

I was wondering if there is a really good (performant) solution for converting a whole file to lowercase in C.
Currently I use fgetc, convert each character to lowercase, and write it to a temporary file with fputc. At the end I remove the original and rename the temp file to the original's name. But I think there must be a better solution.
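
For context, here is a minimal sketch of the character-by-character approach described above (the function name and the error handling are illustrative additions, not part of the original question):

#include <ctype.h>
#include <stdio.h>

/* Baseline: copy 'in' to 'out', lowercasing one character at a time. */
int lowercase_per_char(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (fputc(tolower(c), out) == EOF)
            return 1;               /* write error */
    }
    return ferror(in) ? 1 : 0;      /* 0 on clean EOF, 1 on read error */
}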

Comments (5)

深居我梦 2024-12-03 04:59:58

This doesn't really answer the question (community wiki), but here's an (over?)-optimized function to convert text to lowercase:

#include <assert.h>
#include <ctype.h>
#include <stdio.h>

int fast_lowercase(FILE *in, FILE *out)
{
    char buffer[65536];
    size_t readlen, wrotelen;
    char *p, *e;
    char conversion_table[256];
    int i;

    /* Precompute tolower() for every possible byte value. */
    for (i = 0; i < 256; i++)
        conversion_table[i] = tolower(i);

    for (;;) {
        readlen = fread(buffer, 1, sizeof(buffer), in);
        if (readlen == 0) {
            if (ferror(in))
                return 1;
            assert(feof(in));
            return 0;
        }

        /* Convert the chunk in place with one table lookup per byte. */
        for (p = buffer, e = buffer + readlen; p < e; p++)
            *p = conversion_table[(unsigned char) *p];

        wrotelen = fwrite(buffer, 1, readlen, out);
        if (wrotelen != readlen)
            return 1;
    }
}

This isn't Unicode-aware, of course.

I benchmarked this on an Intel Core 2 T5500 (1.66GHz), using GCC 4.6.0 and i686 (32-bit) Linux. Some interesting observations:

  • It's about 75% as fast when buffer is allocated with malloc rather than on the stack.
  • It's about 65% as fast using a conditional rather than a conversion table.
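
One possible way to wire this into the temp-file workflow from the question (the driver and the file names are illustrative additions, assuming fast_lowercase above is in the same translation unit):

#include <stdio.h>

int main(void)
{
    FILE *in = fopen("input.txt", "rb");
    FILE *out = fopen("input.txt.tmp", "wb");
    if (!in || !out)
        return 1;

    int failed = fast_lowercase(in, out);
    fclose(in);
    if (fclose(out) != 0)
        failed = 1;

    if (!failed) {
        /* Replace the original only once the new file is known to be complete. */
        remove("input.txt");
        rename("input.txt.tmp", "input.txt");
    }
    return failed;
}
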
万劫不复 2024-12-03 04:59:58

I'd say you've hit the nail on the head. Using a temp file means you don't delete the original until you're sure you're done processing it, so if an error occurs the original remains intact. I'd say that's the correct way of doing it.

As suggested by another answer, if the file size permits you can memory-map the file via the mmap function and have it readily available in memory. (There's no real performance difference if the file is smaller than a page, since it will probably be read into memory on the first read anyway.)
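
A rough sketch of the mmap idea, lowercasing the file in place (my own illustration, not from the original answer; note that in-place editing gives up the keep-the-original-on-error safety discussed above, and error handling is abbreviated):

#include <ctype.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int lowercase_inplace(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return 1;
    }
    if (st.st_size == 0) {      /* empty file: nothing to do */
        close(fd);
        return 0;
    }

    unsigned char *p = mmap(NULL, (size_t) st.st_size,
                            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return 1;
    }

    for (off_t i = 0; i < st.st_size; i++)
        p[i] = (unsigned char) tolower(p[i]);

    munmap(p, (size_t) st.st_size);
    close(fd);
    return 0;
}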

埖埖迣鎅 2024-12-03 04:59:58

You can usually go a little faster on big inputs by using fread and fwrite to read and write big chunks of the input/output. Also, you should probably read a bigger chunk (the whole file, if possible) into memory, convert it there, and then write it all out at once.

Edit: I just remembered one more thing. Sometimes programs can be faster if you select a prime number (at the very least not a power of 2) as the buffer size. I seem to recall this has to do with specifics of the caching mechanism.
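
A rough sketch of the "whole file at once" variant (the function name is an illustrative addition; this assumes the file fits in memory and that its size can be obtained with fseek/ftell, which works for regular files opened in binary mode):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Read all of 'in', lowercase it in memory, write it to 'out' in one call. */
int lowercase_whole_file(FILE *in, FILE *out)
{
    if (fseek(in, 0, SEEK_END) != 0)
        return 1;
    long size = ftell(in);
    if (size < 0 || fseek(in, 0, SEEK_SET) != 0)
        return 1;

    unsigned char *buf = malloc(size ? (size_t) size : 1);
    if (!buf)
        return 1;

    size_t n = fread(buf, 1, (size_t) size, in);
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char) tolower(buf[i]);

    int failed = (fwrite(buf, 1, n, out) != n);
    free(buf);
    return failed;
}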

梦魇绽荼蘼 2024-12-03 04:59:58

If you're processing big files (big as in, say, multi-megabytes) and this operation is absolutely speed-critical, then it might make sense to go beyond what you've inquired about. One thing to consider in particular is that a character-by-character operation will perform less well than using SIMD instructions.

I.e. if you were to use SSE2, you could code the parallel case-conversion loop like this (pseudocode):

for (cur_parallel_word = begin_of_block;
     cur_parallel_word < end_of_block;
     cur_parallel_word += parallel_word_width) {
    /*
     * in SSE2, parallel compares are either about 'greater' or 'equal'
     * so '>=' and '<=' have to be constructed. This would use 'PCMPGTB'.
     * The 'ALL' macro is supposed to replicate into all parallel bytes.
     */
    mask1 = parallel_compare_greater_than(*cur_parallel_word, ALL('A' - 1));
    mask2 = parallel_compare_greater_than(ALL('Z'), *cur_parallel_word);
    /*
     * vector op - and all bytes in two vectors, 'PAND'
     */
    mask = mask1 & mask2;
    /*
     * vector op - add a vector of bytes. Would use 'PADDB'.
     */
    new = parallel_add(cur_parallel_word, ALL('a' - 'A'));
    /*
     * vector op - zero bytes in the original vector that will be replaced
     */
    *cur_parallel_word &= ~mask;           // that'd become 'PANDN'
    /*
     * vector op - extract characters from new that replace old, then or in.
     */
    *cur_parallel_word |= (new & mask);    // PAND / POR
}

I.e. you'd use parallel comparisons to check which bytes are uppercase, and then mask both the original value and the converted (lowercased) version (one with the mask, the other with its inverse) before you OR them together to form the result.

If you use mmap'ed file access, this could even be performed in-place, saving on the bounce buffer, and saving on many function and/or system calls.

There is a lot to optimize when your starting point is a character-by-character 'fgetc' / 'fputc' loop; even shell utilities are highly likely to perform better than that.

But I agree that if your need is very special-purpose (i.e. something as clear-cut as ASCII input to be converted to lowercase), then a handcrafted loop as above, using vector instruction sets (like SSE intrinsics/assembly, ARM NEON, or PPC Altivec), is likely to make a significant speedup possible over existing general-purpose utilities.
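
For concreteness, here is one way the pseudocode above might map onto real SSE2 intrinsics (my own sketch, not part of the original answer; the function name, the unaligned loads, and the scalar tail are illustrative choices):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch: lowercase ASCII 'A'..'Z' in buf[0..len), 16 bytes at a time. */
static void lowercase_sse2(unsigned char *buf, size_t len)
{
    const __m128i lo    = _mm_set1_epi8('A' - 1);    /* 64 */
    const __m128i hi    = _mm_set1_epi8('Z' + 1);    /* 91 */
    const __m128i delta = _mm_set1_epi8('a' - 'A');  /* 32 */
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i v     = _mm_loadu_si128((const __m128i *) (buf + i));
        __m128i mask1 = _mm_cmpgt_epi8(v, lo);        /* v > 'A'-1  (PCMPGTB) */
        __m128i mask2 = _mm_cmplt_epi8(v, hi);        /* v < 'Z'+1            */
        __m128i mask  = _mm_and_si128(mask1, mask2);  /* uppercase bytes (PAND) */
        __m128i conv  = _mm_add_epi8(v, delta);       /* +32 on every byte (PADDB) */
        /* keep old bytes where mask is clear, converted bytes where it is set */
        v = _mm_or_si128(_mm_andnot_si128(mask, v),   /* PANDN */
                         _mm_and_si128(mask, conv));  /* PAND / POR */
        _mm_storeu_si128((__m128i *) (buf + i), v);
    }

    /* scalar tail for the last len % 16 bytes */
    for (; i < len; i++)
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] = (unsigned char) (buf[i] + ('a' - 'A'));
}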

七度光 2024-12-03 04:59:58

Well, you can definitely speed this up a lot, if you know what the character encoding is. Since you're using Linux and C, I'm going to go out on a limb here and assume that you're using ASCII.

In ASCII, we know A-Z and a-z are contiguous and always 32 apart. So what we can do is skip the safety checks and locale handling of tolower() and do something like this:

(pseudocode)
foreach char c in the file:
    c += 32    // only valid if every character is an uppercase letter

Or, if there may be uppercase and lowercase letters (and other characters), do a check like

if (c > 64 && c < 91) // the uppercase ASCII range

and only then do the addition before writing it out to the file.

Also, batch writes are faster, so I would suggest first writing to an array, then all at once writing the contents of the array to the file.

This should be considerably faster.
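
A minimal sketch of that idea over an in-memory buffer (an illustrative addition; ASCII only, as assumed above):

#include <stddef.h>

/* Lowercase ASCII letters in place: no locale, no table, just a range
   check on the uppercase range and the fixed +32 offset. */
static void ascii_lowercase(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 64 && buf[i] < 91)        /* 'A'..'Z' */
            buf[i] = (unsigned char) (buf[i] + 32);
    }
}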
