Does anyone know how to convert a huge char array to float, a very huge array, with better performance than atof/strtod/sscanf?

Posted on 2024-08-17 06:40:46

I have a huge char array, char p[n], read from a text file like this:

//1.txt
194.919 -241.808 234.896
195.569 -246.179 234.482
194.919 -241.808 234.896
...

foo(char *p, float x, float y, float z)
{

}

I tried atof and strtod, but they are very time-consuming when the array is huge, because they call strlen(). sscanf is also very slow.

I stepped into the code in the debugger and found that both atof() and strtod() call strlen() in Visual Studio; this can be checked in the CRT source.

strtod() call:
        answer = _fltin2( &answerstruct, ptr, (int)strlen(ptr), 0, 0, _loc_update.GetLocaleT());


atof() call:
        return( *(double *)&(_fltin2( &fltstruct, nptr, (int)strlen(nptr), 0, 0, _loc_update.GetLocaleT())->dval) );

I also tried strtok, but we must not change any data in 1.txt.

So, does anyone know the best way to convert all of these to float x, y, z?

Visual Studio 2008 + Windows 7
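
To illustrate why the strlen() calls would matter: if the CRT really scans to the terminating NUL on every conversion, as the snippets above suggest, then parsing an n-byte buffer with a token every t bytes touches roughly n*n/(2*t) bytes in total. The sketch below only counts that work; strlen_cost is a hypothetical helper made up for this illustration, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 for the separators used in the sample file. */
static int is_sep(char c)
{
    return c == ' ' || c == '\n' || c == '\r' || c == '\t';
}

/* Hypothetical helper: total number of bytes strlen() would scan if it
   were called once at the start of every token in buf. */
size_t strlen_cost(const char *buf)
{
    const char *p = buf;
    size_t total = 0;

    assert(buf != NULL);
    while (*p != '\0') {
        while (is_sep(*p)) ++p;            /* skip separators */
        if (*p == '\0') break;
        total += strlen(p);                /* what each conversion would pay */
        while (*p != '\0' && !is_sep(*p))  /* step over the token itself */
            ++p;
    }
    return total;
}
```

For the three-token string "1 2 3" this adds up 5 + 3 + 1 bytes; for a multi-megabyte buffer the sum grows quadratically, which matches the slowdown described above.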

Comments (10)

上课铃就是安魂曲 2024-08-24 06:40:46

If you can make additional assumptions about the format of the floating point values, parsing them yourself might increase performance.

Example code for parsing ' ' or '\n'-separated values without exponents and no input validation:

float parsef(const char **str)
{
    const char *cc = *str;

    _Bool neg = (*cc == '-');
    if(neg) ++cc;

    float value = 0, e = 1;

    for(; *cc != '.'; ++cc)
    {
        if(*cc == ' ' || *cc == '\n' || !*cc)
        {
            *str = cc;
            return neg ? -value : value;
        }

        value *= 10;
        value += *cc - '0';
    }

    for(++cc;; ++cc)
    {
        if(*cc == ' ' || *cc == '\n' || !*cc)
        {
            *str = cc;
            return neg ? -value : value;
        }

        e /= 10;
        value += (*cc - '0') * e;
    }
}

Example code:

const char *str = "42 -15.4\n23.001";
do printf("%f\n", parsef(&str));
while(*str++);
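
Applied to the x y z layout from the question, the same idea might look like the sketch below. It makes the same assumptions as parsef (no exponents, no input validation) and additionally treats '\r' and '\t' as separators; parse_simple and parse_triples are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

static int is_sep(char c)
{
    return c == ' ' || c == '\n' || c == '\r' || c == '\t';
}

/* Same technique as parsef: digits only, optional '-', optional '.'. */
static float parse_simple(const char **str)
{
    const char *cc = *str;
    int neg = (*cc == '-');
    float value = 0.0f, scale = 1.0f;

    if (neg) ++cc;
    for (; *cc != '\0' && !is_sep(*cc) && *cc != '.'; ++cc)
        value = value * 10.0f + (float)(*cc - '0');
    if (*cc == '.') {
        for (++cc; *cc != '\0' && !is_sep(*cc); ++cc) {
            scale /= 10.0f;
            value += (float)(*cc - '0') * scale;
        }
    }
    *str = cc;                     /* leave the cursor on the separator */
    return neg ? -value : value;
}

/* Fills x/y/z with up to max rows and returns the number of rows read.
   The input buffer is never modified. */
size_t parse_triples(const char *buf, float *x, float *y, float *z, size_t max)
{
    const char *p = buf;
    size_t n = 0;

    assert(buf != NULL);
    while (n < max) {
        while (is_sep(*p)) ++p;
        if (*p == '\0') break;
        x[n] = parse_simple(&p);
        while (is_sep(*p)) ++p;
        y[n] = parse_simple(&p);
        while (is_sep(*p)) ++p;
        z[n] = parse_simple(&p);
        ++n;
    }
    return n;
}
```

Because each fractional digit is added separately in float arithmetic, the result can differ from strtod in the last bits; that is the price of the simpler loop.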
坦然微笑 2024-08-24 06:40:46

Okay, how about doing the tokenization yourself and then calling strtod?

What I'm thinking is something like this:

char *current = ...;  // initialized to the head of your character array
while (*current != '\0')
{
    char buffer[64];
    unsigned int idx = 0;

    // copy over current number
    // (cast to unsigned char: isspace is undefined for negative char values)
    while (*current != '\0' && !isspace((unsigned char)*current))
    {
        buffer[idx++] = *current++;
    }
    buffer[idx] = '\0';

    // move forward to next number
    while (*current != '\0' && isspace((unsigned char)*current))
    {
        current++;
    }

    // use strtod to convert buffer
}

One issue with this is that the tokenization is very simple. It will work for the format you posted, but if the format varies (say, another line uses : to separate the numbers), it won't work.

Another issue is that the code assumes all numbers have < 64 characters. If they are longer, you'll get a buffer overflow.

Also, copying to a temporary buffer adds some overhead (but hopefully less than the overhead of constantly doing a strlen on the entire buffer). I know you said you can't change the original buffer, but can you make a temporary change (i.e. the buffer can change as long as you return it to its original state before you return)?

char *current = ...;  // initialized to the head of your character array
while (*current != '\0')
{
    char *next_sep = current;
    while (*next_sep != '\0' && !isspace((unsigned char)*next_sep))
    {
        next_sep++;
    }

    // save the separator before overwriting it
    char tmp = *next_sep;
    *next_sep = '\0';

    // use strtod on current

    // restore the separator
    *next_sep = tmp;

    current = next_sep;

    // move forward to next number
    while (*current != '\0' && isspace((unsigned char)*current))
    {
        current++;
    }
}

This technique means no copying and no worries about buffer overflow. You do need to temporarily modify the buffer; hopefully that is acceptable.
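
For the copying variant, a bounds-checked version could look roughly like this. It is only a sketch: convert_tokens is a made-up name, the 64-byte token limit is kept from the snippet above, and overlong tokens are silently truncated rather than reported.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Copies each whitespace-separated token into a small local buffer and
   converts it with strtod, so any strlen inside the CRT only ever sees
   a short string. Returns the number of values written to out. */
size_t convert_tokens(const char *buf, double *out, size_t max_out)
{
    const char *cur = buf;
    size_t count = 0;

    assert(buf != NULL && out != NULL);
    while (*cur != '\0' && count < max_out) {
        char token[64];
        size_t len = 0;
        char *end;

        while (isspace((unsigned char)*cur)) ++cur;
        if (*cur == '\0') break;

        /* Copy the token; characters beyond 63 are dropped. */
        while (*cur != '\0' && !isspace((unsigned char)*cur)) {
            if (len < sizeof token - 1)
                token[len++] = *cur;
            ++cur;
        }
        token[len] = '\0';

        out[count] = strtod(token, &end);
        if (end != token)          /* keep only tokens that converted */
            ++count;
    }
    return count;
}
```

The source buffer is taken as const and never written to, which fits the requirement that the data from 1.txt stay unchanged.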

眼眸印温柔 2024-08-24 06:40:46

Check out this code.

It can be further optimized if there's no need to support scientific representation, '+' sign, or leading tabs.

It doesn't use strlen, or any other standard library string routine.

#include <cmath>    // pow
#include <tchar.h>  // _T

// Convert a floating-point value in string representation to its numerical value.
// Returns false if the input is not a number.
// F is float/double
// T is char or wchar_t
// '1234.567' -> 1234.567
template <class F, class T> inline bool StrToDouble(const T* pczSrc, F& f)
{
    f= 0;

    if (!pczSrc)
        return false;

    while ((32 == *pczSrc) || (9 == *pczSrc))
        pczSrc++;

    bool bNegative= (_T('-') == *pczSrc);

    if ( (_T('-') == *pczSrc) || (_T('+') == *pczSrc) )
        pczSrc++;

    if ( (*pczSrc < _T('0')) || (*pczSrc > _T('9')) )
        return false;

    // todo: return false if number of digits is too large

    while ( (*pczSrc >= _T('0')) && (*pczSrc<=_T('9')) )
    {
        f= f*10. + (*pczSrc-_T('0'));
        pczSrc++;
    }

    if (_T('.') == *pczSrc)
    {
        pczSrc++;

        double e= 0.;
        double g= 1.;

        while ( (*pczSrc >= _T('0')) && (*pczSrc<=_T('9')) )
        {
            e= e*10. + (*pczSrc-_T('0'));
            g= g*10.                    ;
            pczSrc++;
        }

        f+= e/g;
    }

    if ( (_T('e') == *pczSrc) || (_T('E') == *pczSrc) ) // exponent, such as in 7.32e-2
    {
        pczSrc++;

        bool bNegativeExp= (_T('-') == *pczSrc);

        if ( (_T('-') == *pczSrc) || (_T('+') == *pczSrc) )
            pczSrc++;

        int nExp= 0;
        while ( (*pczSrc >= _T('0')) && (*pczSrc <= _T('9')) )
        {
            nExp= nExp*10 + (*pczSrc-_T('0'));
            pczSrc++;
        }

        if (bNegativeExp)
            nExp= -nExp;

        // todo: return false if exponent / number of digits of exponent is too large

        f*= pow(10., nExp);
    }

    if (bNegative)
        f= -f;

    return true;
}
寂寞笑我太脆弱 2024-08-24 06:40:46

As long as you are not using a particularly bad standard library (impossible these days, they are all good), it's not possible to do it faster than atof.

倾城泪 2024-08-24 06:40:46

I don't see any reason why strtod() should call strlen(). Of course it might, but nothing in its specification requires it, and I'd be surprised if it did. And I'd say that strtod() is about as fast as you'll get, short of writing some processor-specific FPU code yourself.

原来分手还会想你 2024-08-24 06:40:46

Why do you think atof, strtod use strlen? I've never implemented them, but I can't imagine why they'd need to know the length of the input string. It would be of no value to them. I'd use strtod as per Jason's answer. That's what it's for.

And yes, if you have a very large amount of text, it's going to take some time to convert. That's just the way it is.

时光清浅 2024-08-24 06:40:46

Use strtod. It almost certainly does not call strlen. Why would it need to know the length of the input? It merely runs past leading whitespace, then consumes as many characters as possible that make sense for a floating point literal, and then returns a pointer just past that. You can see an example implementation. Perhaps you're using it non-optimally? Here's a sample of how to use strtod:

#include <stdio.h>
#include <stdlib.h>
int main() {
    char *p = "1.txt 194.919 -241.808 234.896 195.569 -246.179 234.482 194.919 -241.808 234.896";
    char *end = p;
    char *q;
    double d;
    while(*end++ != ' '); // move past "1.txt"
    do {
        q = end; 
        d = strtod(q, &end);
        printf("%g\n", d);
    } while(*end != '\0');
}

This outputs:

194.919
-241.808
234.896
195.569
-246.179
234.482
194.919
-241.808
234.896

on my machine.
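
The same end-pointer loop can fill x, y, z arrays directly, without copying or modifying the input. The following is a sketch under the assumption that every row holds exactly three numbers; read_triples is a name chosen for this example. It does not avoid whatever strtod does internally, so on a CRT that calls strlen per conversion it would still be slow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Reads rows of three numbers from buf using strtod's end pointer.
   Stops at the first position where nothing can be converted and
   returns the number of complete rows stored. */
size_t read_triples(const char *buf, float *x, float *y, float *z, size_t max_rows)
{
    const char *p = buf;
    size_t rows = 0;

    assert(buf != NULL);
    while (rows < max_rows) {
        double v[3];
        int i;

        for (i = 0; i < 3; ++i) {
            char *end;
            v[i] = strtod(p, &end);   /* skips leading whitespace itself */
            if (end == p)             /* no conversion: end of data or bad input */
                return rows;
            p = end;
        }
        x[rows] = (float)v[0];
        y[rows] = (float)v[1];
        z[rows] = (float)v[2];
        ++rows;
    }
    return rows;
}
```

Checking end == p is what keeps the loop from spinning on trailing whitespace or malformed input; an incomplete final row is dropped.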

罗罗贝儿 2024-08-24 06:40:46

As others have said, I don't think you're going to do much better than the standard library calls. They have been around for a long time and are quite highly optimized (well, they should be, at least in good implementations).

That said, there are some things that aren't clear to me. Are you reading the whole file into memory and then converting the array to another array? If so, you might want to check that the system you are running on has enough memory to do that without swapping. If you are doing this, would it be possible to convert one line at a time as you read them off disk instead of storing them?

You could consider multithreading your program: one thread to read and buffer lines off disk, and n threads to process the lines. Dr. Dobb's Journal published a great single-reader/single-writer lockless queue implementation you could use. I've used this in a similar app. My worker threads each have an input queue, and the reader thread reads data off disk and places it into these queues in round-robin style.
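
If the whole file is already in memory, a simpler variant than the queue design described above is to cut the buffer into chunks that end on line boundaries and hand one chunk to each worker. The sketch below shows only that partitioning step; thread creation is left out, and split_at_newlines is a hypothetical helper, not taken from the Dr. Dobb's article.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits buf[0..len) into at most nchunks pieces that each start at the
   beginning of a line. starts must have room for nchunks + 1 entries;
   chunk i covers [starts[i], starts[i + 1]). Returns the chunk count. */
size_t split_at_newlines(const char *buf, size_t len, size_t nchunks, size_t *starts)
{
    size_t count = 1, i;

    assert(buf != NULL && starts != NULL && nchunks >= 1);
    starts[0] = 0;
    for (i = 1; i < nchunks; ++i) {
        size_t pos = len / nchunks * i;        /* nominal cut point */
        size_t from = pos > 0 ? pos - 1 : 0;   /* keep pos if it already starts a line */
        const char *nl = memchr(buf + from, '\n', len - from);

        if (nl == NULL) break;                 /* no further line break */
        pos = (size_t)(nl - buf) + 1;          /* first byte after the newline */
        if (pos >= len) break;
        if (pos > starts[count - 1])
            starts[count++] = pos;
    }
    starts[count] = len;
    return count;
}
```

Each worker can then run any of the parsers from this thread over its own range, since no number straddles two chunks.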

山人契 2024-08-24 06:40:46

How about something like:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static float frac[] =
{
    0.000,
    0.001,
    0.002,
    ...               // fill in
    0.997,
    0.998,
    0.999,
};

static float exp[] =
{
    1e-38,
    1e-37,
    1e-36,
    ...               // fill in
    1e+36,
    1e+37,
    1e+38,
};

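float cvt(char* p)
{
    char* d = strchr(p, '.');   // Find the decimal point.
    char* e = strchr(p, 'e');   // Find the exponent.
    if (e == NULL)
        e = strchr(p, 'E');

    int neg = (*p == '-');      // atoi() loses the sign of "-0.xxx"
    float num = atoi(p);
    if (d != NULL) {            // assumes exactly three fractional digits
        if (neg)
            num -= frac[atoi(d + 1)];
        else
            num += frac[atoi(d + 1)];
    }
    if (e)
        num *= exp[atoi(e + 1) + 38];   // skip the 'e'; exp[0] holds 1e-38
    return num;
}

int main()
{
    char line[100];
    // fgets instead of gets, which cannot limit the input length
    while (fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\r\n")] = '\0';   // strip the newline fgets keeps
        printf("in %s, out %g\n", line, cvt(line));
    }
}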

Should be good to three significant digits.


Edit: watch out for big mantissas.


Edit again: and negative exponents. :-(
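
Rather than typing the two tables by hand, they could be filled once at startup. This is only a sketch of one way to do it: frac_tab, pow10_tab, init_tables and pow10_lookup are names invented here, and the powers are built by repeated multiplication in double, which is accurate enough for float but not guaranteed to be correctly rounded.

```c
#include <assert.h>

#define FRAC_COUNT 1000                  /* 0.000 .. 0.999 */
#define EXP_MIN    (-38)
#define EXP_MAX    38
#define EXP_COUNT  (EXP_MAX - EXP_MIN + 1)

static float frac_tab[FRAC_COUNT];
static float pow10_tab[EXP_COUNT];       /* pow10_tab[k - EXP_MIN] ~= 10^k */

/* Fills both lookup tables; call once before parsing. */
void init_tables(void)
{
    int i;
    double p = 1.0;

    for (i = 0; i < FRAC_COUNT; ++i)
        frac_tab[i] = (float)(i / 1000.0);

    pow10_tab[-EXP_MIN] = 1.0f;          /* 10^0 */
    for (i = 1; i <= EXP_MAX; ++i) {
        p *= 10.0;
        pow10_tab[i - EXP_MIN] = (float)p;
        pow10_tab[-i - EXP_MIN] = (float)(1.0 / p);
    }
}

/* Bounds-checked access, so an exponent outside [-38, 38] is caught
   instead of indexing past the table. */
float pow10_lookup(int k)
{
    assert(k >= EXP_MIN && k <= EXP_MAX);
    return pow10_tab[k - EXP_MIN];
}
```

Indexing by k - EXP_MIN is what makes negative exponents work, which is the problem the second edit above points at.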

指尖上得阳光 2024-08-24 06:40:46

I doubt if strlen is costing you much.

If you can take advantage of your numbers falling in a relatively restricted range, then what I suggest is to parse it yourself, doing as little computation as possible, such as:

#include <windows.h>  // BOOL, TRUE, FALSE

#define DIGIT(c) ((c)>='0' && (c)<='9')

BOOL parseNum(char* *p0, float *f){
  char* p = *p0;
  int n = 0, frac = 1;
  BOOL bNeg = FALSE;
  while(*p == ' ') p++;
  if (*p == '-'){p++; bNeg = TRUE;}
  if (!(DIGIT(*p) || *p=='.')) return FALSE;
  while(DIGIT(*p)){
    n = n * 10 + (*p++ - '0');
  }
  if (*p == '.'){
    p++;
    while(DIGIT(*p)){
      n = n * 10 + (*p++ - '0');
      frac *= 10;
    }
  }
  *f = (float)n/(float)frac;
  if (bNeg) *f = -*f;
  *p0 = p;
  return TRUE;
}
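
One thing to watch with integer accumulation is overflow when a number has many digits. A portable sketch with a digit cap is shown below; parse_num is a made-up name, the limit of 9 digits is an assumption chosen so the value fits a 32-bit long, and extra fractional digits are skipped rather than rounded.

```c
#include <assert.h>
#include <stddef.h>

#define IS_DIGIT(c) ((c) >= '0' && (c) <= '9')

/* Parses one number at *pp into *out. Returns 1 and advances *pp on
   success; returns 0 and leaves *pp unchanged if no number is found or
   the integer part has more than 9 digits. */
int parse_num(const char **pp, float *out)
{
    const char *p;
    long n = 0, scale = 1;
    int digits = 0, neg = 0;

    assert(pp != NULL && *pp != NULL && out != NULL);
    p = *pp;
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n') ++p;
    if (*p == '-') { neg = 1; ++p; }
    if (!(IS_DIGIT(*p) || (*p == '.' && IS_DIGIT(p[1]))))
        return 0;

    while (IS_DIGIT(*p)) {
        if (++digits > 9) return 0;      /* would overflow a 32-bit long */
        n = n * 10 + (*p++ - '0');
    }
    if (*p == '.') {
        ++p;
        while (IS_DIGIT(*p)) {
            if (digits < 9) {            /* keep at most 9 digits in total */
                n = n * 10 + (*p - '0');
                scale *= 10;
                ++digits;
            }
            ++p;                         /* remaining digits are ignored */
        }
    }
    *out = (float)((double)n / (double)scale);   /* single division at the end */
    if (neg) *out = -*out;
    *pp = p;
    return 1;
}
```

Calling it three times per line yields x, y and z; a return value of 0 signals the end of the data or a malformed field.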