Reading integers from a memory-mapped formatted file

Posted on 2024-10-03 00:25:01

I have memory mapped a large formatted (text) file containing one integer per line like so:

123
345
34324
3232
...

So, I have a pointer to the first byte of the mapped memory and a pointer to the last byte. I am trying to read all those integers into an array as fast as possible. Initially I created a specialized std::streambuf class to work with std::istream to read from that memory, but it seems to be relatively slow.

Do you have any suggestions on how to efficiently parse a string like "1231232\r\n123123\r\n123\r\n1231\r\n2387897..." into an array {1231232, 123123, 123, 1231, 2387897, ...}?

The number of integers in the file is not known beforehand.

Comments (4)

月寒剑心 2024-10-10 00:25:01

This was a really interesting task for me to learn a bit more about C++.

Admittedly, the code is quite large and has a lot of error checking, but that only shows how many different things can go wrong during parsing.

#include <ctype.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>   /* exit, EXIT_FAILURE */

#include <algorithm>  // std::for_each
#include <iterator>
#include <vector>
#include <string>

static void
die(const char *reason)
{
  fprintf(stderr, "aborted (%s)\n", reason);
  exit(EXIT_FAILURE);
}

template <class BytePtr>
static bool
read_uint(BytePtr *begin_ref, BytePtr end, unsigned int *out)
{
  const unsigned int MAX_DIV = UINT_MAX / 10;
  const unsigned int MAX_MOD = UINT_MAX % 10;

  BytePtr begin = *begin_ref;
  unsigned int n = 0;

  while (begin != end && '0' <= *begin && *begin <= '9') {
    unsigned digit = *begin - '0';
    if (n > MAX_DIV || (n == MAX_DIV && digit > MAX_MOD))
      die("unsigned overflow");
    n = 10 * n + digit;
    begin++;
  }

  if (begin == *begin_ref)
    return false;

  *begin_ref = begin;
  *out = n;
  return true;
}

template <class BytePtr, class IntConsumer>
void
parse_ints(BytePtr begin, BytePtr end, IntConsumer out)
{
  while (true) {
    while (begin != end && *begin == (unsigned char) *begin && isspace(*begin))
      begin++;
    if (begin == end)
      return;

    bool negative = *begin == '-';
    if (negative) {
      begin++;
      if (begin == end)
        die("minus at end of input");
    }

    unsigned int un;
    if (!read_uint(&begin, end, &un))
      die("no number found");

    if (!negative && un > INT_MAX)
      die("too large positive");
    if (negative && un > -((unsigned int)INT_MIN))
      die("too small negative");

    int n = negative ? -un : un;
    *out++ = n;
  }
}

static void
print(int x)
{
  printf("%d\n", x);
}

int
main()
{
  std::vector<int> result;
  std::string input("2147483647 -2147483648 0 00000 1 2 32767 4 -17 6");

  parse_ints(input.begin(), input.end(), std::back_inserter(result));

  std::for_each(result.begin(), result.end(), print);
  return 0;
}

I tried hard not to invoke any kind of undefined behavior, which can get quite tricky when converting unsigned numbers to signed numbers or invoking isspace on an unknown data type.
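
The two pitfalls named above can be isolated in small helpers. The following is only a minimal sketch and not part of the answer's code; the names `is_space_byte` and `to_signed` are hypothetical. It converts the byte to `unsigned char` before calling `isspace`, and it builds a negative result without ever converting an out-of-range unsigned value to `int`.

```cpp
#include <cctype>
#include <climits>

// isspace() has undefined behavior for negative arguments other than EOF,
// so convert the byte to unsigned char first.
inline bool is_space_byte(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Combine a magnitude and a sign into an int.
// Returns false if the value does not fit; never relies on an
// out-of-range unsigned-to-signed conversion.
inline bool to_signed(unsigned int magnitude, bool negative, int *out)
{
    const unsigned int max_pos = static_cast<unsigned int>(INT_MAX);
    if (!negative) {
        if (magnitude > max_pos)
            return false;
        *out = static_cast<int>(magnitude);
        return true;
    }
    // Magnitude of INT_MIN, computed in unsigned arithmetic (well defined).
    const unsigned int max_neg = 0u - static_cast<unsigned int>(INT_MIN);
    if (magnitude > max_neg)
        return false;
    if (magnitude == 0) {
        *out = 0;
        return true;
    }
    // -(magnitude - 1) - 1 stays inside the int range even for INT_MIN.
    *out = -static_cast<int>(magnitude - 1) - 1;
    return true;
}
```

A caller would parse the digits into an `unsigned int` first, as `read_uint` does, and then pass the magnitude and the sign flag to `to_signed`, treating a `false` return as a range error.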

草莓味的萝莉 2024-10-10 00:25:01
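std::vector<int> array;
char *p = ...;   // start of memory mapped block
char *end = ...; // one past the last byte of the memory mapped block
while (p != end)
{
    array.push_back(static_cast<int>(strtol(p, &p, 10)));
    // skip the line break (and anything else that is not a digit)
    while (p != end && !isdigit(static_cast<unsigned char>(*p)))
        ++p;
}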

This code is a little unsafe since there's no guarantee that strtol will stop at the end of the memory mapped block, but it's a start. Should go very fast even with additional checking added.
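
One way to add the missing bounds check is to swap `strtol` for `std::from_chars` (C++17), which takes an explicit end pointer and never reads past it. This is a sketch of a different technique than the answer's, under the assumption that a C++17 standard library is available; the function name `parse_block` is hypothetical.

```cpp
#include <charconv>
#include <system_error>
#include <vector>

// Parse every decimal integer in [p, end). std::from_chars never reads
// past `end`, so the mapped block does not need a terminating '\0'.
inline std::vector<int> parse_block(const char *p, const char *end)
{
    std::vector<int> values;
    while (p != end) {
        int value = 0;
        const std::from_chars_result r = std::from_chars(p, end, value);
        if (r.ec == std::errc()) {
            values.push_back(value);
            p = r.ptr;   // continue right after the parsed digits
        } else if (r.ec == std::errc::result_out_of_range) {
            p = r.ptr;   // design choice of this sketch: skip numbers that do not fit in int
        } else {
            ++p;         // no number starts here (e.g. '\r' or '\n'): skip one byte
        }
    }
    return values;
}
```

Calling `parse_block(first, last + 1)` with the two pointers from the question would return the parsed values. If the number of integers can be estimated from the file size, calling `reserve` on the vector beforehand may avoid some reallocations.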

一杯敬自由 2024-10-10 00:25:01

Since this is memory mapped, simply copying the chars of each line to a stack array and running atoi on it, with the results written to an integer array that sits on top of another memory mapped file, would be very efficient. This way the paging file is not used for these big buffers at all.

open memory mapped file for the output int buffer

declare small stack buffer of 20 chars
while not end of char array
  while current char is not line feed
    copy char to stack buffer
  end while
  null terminate the stack buffer, dropping the trailing carriage return
  store the result of atoi on the stack buffer in the output buffer
  increment the output buffer pointer
end while

While this doesn't use a library, it has the advantage of keeping memory usage inside the memory mapped files, so temp buffers are limited to the stack one and whatever atoi uses internally. The output buffer can be thrown away or left saved to the file as needed.
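
The loop described above could look roughly like the following sketch. It makes two assumptions that are not in the answer: the output goes into a `std::vector<int>` rather than a second memory mapped file, and the function name `parse_lines_atoi` is hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Copy each line into a small stack buffer, terminate it, and hand it to atoi.
// Assumption: the output is a std::vector here; the answer proposes a second
// memory mapped file instead.
inline std::vector<int> parse_lines_atoi(const char *p, const char *end)
{
    std::vector<int> out;
    char buf[20];                        // small stack buffer, as in the pseudocode
    while (p != end) {
        std::size_t len = 0;
        while (p != end && *p != '\n') {
            if (len < sizeof(buf) - 1)   // overlong lines are truncated, never overflowed
                buf[len++] = *p;
            ++p;
        }
        if (p != end)
            ++p;                         // step over the line feed
        if (len > 0 && buf[len - 1] == '\r')
            --len;                       // drop the carriage return of a "\r\n" pair
        buf[len] = '\0';
        if (len > 0)                     // ignore empty lines
            out.push_back(std::atoi(buf));
    }
    return out;
}
```

Note that `atoi` has no way to report errors and its behavior is undefined when the value does not fit in an `int`, so `strtol` would be the safer choice for untrusted input.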

浮萍、无处依 2024-10-10 00:25:01

NOTE: This answer has been edited a few times.

Reads memory line by line (based on link and link).

class line 
{
   std::string data;
public:
   friend std::istream &operator>>(std::istream &is, line &l) 
   {
      std::getline(is, l.data);
      return is;
   }
   operator std::string() const { return data; }
};

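struct membuf : std::streambuf   // the base constructor and setg() are protected
{
   membuf(char *first, char *last) { setg(first, first, last); }
};
membuf osrb(ptr, ptr + size);    // ptr, size: start and byte length of the mapped block
std::istream istr(&osrb);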

std::vector<int> ints;

std::istream_iterator<line> begin(istr);
std::istream_iterator<line> end;
std::transform(begin, end, std::back_inserter(ints), &boost::lexical_cast<int, std::string>);
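
For files with "\r\n" line endings, `std::getline` leaves the '\r' at the end of each line, and `boost::lexical_cast` rejects input with trailing characters. The following self-contained sketch shows one way around that without Boost: it strips the '\r' and uses `std::stoi` in place of `lexical_cast`. The names `block_buf` and `parse_with_getline` are hypothetical, and `ptr` and `size` are assumed to describe the mapped block.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <vector>

// Read-only stream buffer over an existing block of memory.
struct block_buf : std::streambuf
{
    block_buf(char *first, char *last) { setg(first, first, last); }
};

inline std::vector<int> parse_with_getline(char *ptr, std::size_t size)
{
    block_buf buf(ptr, ptr + size);
    std::istream in(&buf);
    std::vector<int> ints;
    std::string text;
    while (std::getline(in, text)) {
        if (!text.empty() && text.back() == '\r')
            text.pop_back();                  // strip the '\r' of a "\r\n" line ending
        if (!text.empty())
            ints.push_back(std::stoi(text));  // throws on non-numeric or out-of-range input
    }
    return ints;
}
```

This keeps the stream-based structure of the answer, so it is unlikely to be faster than the approach the question already found slow; it mainly shows the stream buffer setup in a form that compiles.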