如何处理基于 C 的应用程序内部的数据流？

发布于 2024-09-27 02:59:08 字数 2855 浏览 5 评论 0原文

我正在从 C 应用程序中的 bzip2 流中提取数据。当数据块从解压缩器中出来时，可以将它们写入 stdout：

fwrite(buffer, 1, length, stdout);

这非常有效。当数据发送到 stdout 时，我得到了所有数据。

我不想写入 stdout，而是希望在单行块中内部处理该语句的输出：一个以换行符 \n 结尾的字符串。

我是否将解压缩器流的输出写入另一个缓冲区，一次一个字符，直到遇到换行符，然后调用每行处理函数？这很慢吗？有更聪明的方法吗？谢谢你的建议。

编辑

感谢您的建议。我最终创建了一对缓冲区，每次我传递输出缓冲区的数据时，它们将剩余部分（输出缓冲区末尾的“存根”）存储在短行缓冲区的开头。

我逐个字符地循环访问输出缓冲区，并一次处理换行符的数据。不含换行符的余数被分配和指定，并复制到下一个流的行缓冲区。看起来 realloc 比重复的 malloc-free 语句更便宜。

这是我想出的代码：

char bzBuf[BZBUFMAXLEN];
BZFILE *bzFp;
int bzError, bzNBuf;
char bzLineBuf[BZLINEBUFMAXLEN];
char *bzBufRemainder = NULL;
int bzBufPosition, bzLineBufPosition;

bzFp = BZ2_bzReadOpen(&bzError, *fp, 0, 0, NULL, 0); /* http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzcompress-init */ 

if (bzError != BZ_OK) {
    BZ2_bzReadClose(&bzError, bzFp);   
    fprintf(stderr, "\n\t[gchr2] - Error: Bzip2 data could not be retrieved\n\n");
    return -1;          
}

bzError = BZ_OK;
bzLineBufPosition = 0;
while (bzError == BZ_OK) {

    bzNBuf = BZ2_bzRead(&bzError, bzFp, bzBuf, sizeof(bzBuf));

    if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
        if (bzBufRemainder != NULL) {
            /* fprintf(stderr, "copying bzBufRemainder to bzLineBuf...\n"); */
            strncpy(bzLineBuf, bzBufRemainder, strlen(bzBufRemainder)); /* leave out \0 */
            bzLineBufPosition = strlen(bzBufRemainder);
        }

        for (bzBufPosition = 0; bzBufPosition < bzNBuf; bzBufPosition++) {
            bzLineBuf[bzLineBufPosition++] = bzBuf[bzBufPosition];
            if (bzBuf[bzBufPosition] == '\n') {
                bzLineBuf[bzLineBufPosition] = '\0'; /* terminate bzLineBuf */

                /* process the line buffer, e.g. print it out or transform it, etc. */
                fprintf(stdout, "%s", bzLineBuf);

                bzLineBufPosition = 0; /* reset line buffer position */
            }
            else if (bzBufPosition == (bzNBuf - 1)) {
                bzLineBuf[bzLineBufPosition] = '\0';
                if (bzBufRemainder != NULL)
                    bzBufRemainder = (char *)realloc(bzBufRemainder, bzLineBufPosition);
                else
                    bzBufRemainder = (char *)malloc(bzLineBufPosition);
                strncpy(bzBufRemainder, bzLineBuf, bzLineBufPosition);
            }
        }
    }
}

if (bzError != BZ_STREAM_END) {
    BZ2_bzReadClose(&bzError, bzFp);
    fprintf(stderr, "\n\t[gchr2] - Error: Bzip2 data could not be uncompressed\n\n");
    return -1;  
} else {   
    BZ2_bzReadGetUnused(&bzError, bzFp, 0, 0);
    BZ2_bzReadClose(&bzError, bzFp);
}

free(bzBufRemainder);
bzBufRemainder = NULL;

我非常感谢大家的帮助。这工作得很好。

原文

I am pulling data from a bzip2 stream within a C application. As chunks of data come out of the decompressor, they can be written to stdout:

fwrite(buffer, 1, length, stdout);

This works great. I get all the data when it is sent to stdout.

Instead of writing to stdout, I would like to process the output from this statement internally in one-line-chunks: a string that is terminated with a newline character \n.

Do I write the output of the decompressor stream to another buffer, one character at a time, until I hit a newline, and then call the per-line processing function? Is this slow and is there a smarter approach? Thanks for your advice.

EDIT

Thanks for your suggestions. I ended up creating a pair of buffers that store the remainder (the "stub" at the end of an output buffer) at the beginning of a short line buffer, each time I pass through the output buffer's worth of data.

I loop through the output buffer character by character and process a newline-line's worth of data at a time. The newline-less remainder gets allocated and assigned, and copied to the next stream's line buffer. It seems like realloc is less expensive than repeated malloc-free statements.

Here's the code I came up with:

char bzBuf[BZBUFMAXLEN];
BZFILE *bzFp;
int bzError, bzNBuf;
char bzLineBuf[BZLINEBUFMAXLEN];
char *bzBufRemainder = NULL;
int bzBufPosition, bzLineBufPosition;

bzFp = BZ2_bzReadOpen(&bzError, *fp, 0, 0, NULL, 0); /* http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzcompress-init */ 

if (bzError != BZ_OK) {
    BZ2_bzReadClose(&bzError, bzFp);   
    fprintf(stderr, "\n\t[gchr2] - Error: Bzip2 data could not be retrieved\n\n");
    return -1;          
}

bzError = BZ_OK;
bzLineBufPosition = 0;
while (bzError == BZ_OK) {

    bzNBuf = BZ2_bzRead(&bzError, bzFp, bzBuf, sizeof(bzBuf));

    if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
        if (bzBufRemainder != NULL) {
            /* fprintf(stderr, "copying bzBufRemainder to bzLineBuf...\n"); */
            strncpy(bzLineBuf, bzBufRemainder, strlen(bzBufRemainder)); /* leave out \0 */
            bzLineBufPosition = strlen(bzBufRemainder);
        }

        for (bzBufPosition = 0; bzBufPosition < bzNBuf; bzBufPosition++) {
            bzLineBuf[bzLineBufPosition++] = bzBuf[bzBufPosition];
            if (bzBuf[bzBufPosition] == '\n') {
                bzLineBuf[bzLineBufPosition] = '\0'; /* terminate bzLineBuf */

                /* process the line buffer, e.g. print it out or transform it, etc. */
                fprintf(stdout, "%s", bzLineBuf);

                bzLineBufPosition = 0; /* reset line buffer position */
            }
            else if (bzBufPosition == (bzNBuf - 1)) {
                bzLineBuf[bzLineBufPosition] = '\0';
                if (bzBufRemainder != NULL)
                    bzBufRemainder = (char *)realloc(bzBufRemainder, bzLineBufPosition);
                else
                    bzBufRemainder = (char *)malloc(bzLineBufPosition);
                strncpy(bzBufRemainder, bzLineBuf, bzLineBufPosition);
            }
        }
    }
}

if (bzError != BZ_STREAM_END) {
    BZ2_bzReadClose(&bzError, bzFp);
    fprintf(stderr, "\n\t[gchr2] - Error: Bzip2 data could not be uncompressed\n\n");
    return -1;  
} else {   
    BZ2_bzReadGetUnused(&bzError, bzFp, 0, 0);
    BZ2_bzReadClose(&bzError, bzFp);
}

free(bzBufRemainder);
bzBufRemainder = NULL;

I really appreciate everyone's help. This is working nicely.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

你与昨日 2024-10-04 02:59:09

我认为没有更聪明的方法（除了找到一个已经为您完成此操作的自动机库）。为“最后一行”缓冲区分配适当的大小时要小心：如果它无法处理任意长度并且输入来自第三方可访问的内容，那么它就会成为安全风险。

回复收藏 0 原文

爱*していゐ 2024-10-04 02:59:09

我还一直在处理每行 bzip2 数据，我发现一次读取一个字节太慢了。这对我来说效果更好：

#include <stdio.h>
#include <stdlib.h>
#include <bzlib.h>

/* gcc -o bz bz.c -lbz2 */

#define CHUNK 128

struct bzdata {
  FILE *fp;
  BZFILE *bzf;
  int bzeof, bzlen, bzpos;
  char bzbuf[4096];
};

static int bz2_open(struct bzdata *bz, char *file);
static void bz2_close(struct bzdata *bz);
static int bz2_read_line(struct bzdata *bz, char **line, int *li);
static int bz2_buf(struct bzdata *bz, char **line, int *li, int *ll);


static int
bz2_buf(struct bzdata *bz, char **line, int *li, int *ll)
{
  int done = 0;

  for (; bz->bzpos < bz->bzlen && done == 0; bz->bzpos++) {
    if (*ll + 1 >= *li) {
      *li += CHUNK;
      *line = realloc(*line, (*li + 1) * sizeof(*(*line)));
    }
    if ( ((*line)[(*ll)++] = bz->bzbuf[bz->bzpos]) == '\n') {
      done = 1;
    }
  }

  if (bz->bzpos == bz->bzlen) {
    bz->bzpos = bz->bzlen  = 0;
  }

  (*line)[*ll] = '\0';

  return done;
}

static int
bz2_read_line(struct bzdata *bz, char **line, int *li)
{
  int bzerr = BZ_OK, done = 0, ll = 0;

  if (bz->bzpos) {
    done = bz2_buf(bz, line, li, &ll);
  }

  while (done == 0 && bz->bzeof == 0) {
    bz->bzlen = BZ2_bzRead(&bzerr, bz->bzf, bz->bzbuf, sizeof(bz->bzbuf));

    if (bzerr == BZ_OK || bzerr == BZ_STREAM_END) {
      bz->bzpos = 0;

      if (bzerr == BZ_STREAM_END) {
        bz->bzeof = 1;
      }
      done = bz2_buf(bz, line, li, &ll);
    } else { 
      done = -1;
    }
  }

  /* Handle last lines that don't have a line feed */
  if (done == 0 && ll > 0 && bz->bzeof) {
    done = 1;
  }

  return done;
}

static int
bz2_open(struct bzdata *bz, char *file)
{
  int bzerr = BZ_OK;

  if ( (bz->fp = fopen(file, "rb")) &&
       (bz->bzf = BZ2_bzReadOpen(&bzerr, bz->fp, 0, 0, NULL, 0)) &&
       bzerr == BZ_OK) {
    return 1;
  }

  return 0;
}

static void
bz2_close(struct bzdata *bz)
{
  int bzerr;

  if (bz->bzf) {
    BZ2_bzReadClose(&bzerr, bz->bzf);
    bz->bzf = NULL;
  }

  if (bz->fp) {
    fclose(bz->fp);
    bz->fp = NULL;
  }
  bz->bzpos = bz->bzlen = bz->bzeof = 0;
}

int main(int argc, char *argv[]) {
  struct bzdata *bz = NULL;
  int i, lc, li = 0;
  char *line = NULL;

  if (argc < 2) {
    return fprintf(stderr, "usage: %s file [file ...]\n", argv[0]);
  }  

  if ( (bz = calloc(1, sizeof(*bz))) ) {
    for (i = 1; i < argc; i++) {
      if (bz2_open(bz, argv[i])) {
        for (lc = 0; bz2_read_line(bz, &line, &li) > 0; lc++) {
          /* Process line here */
        }
        printf("%s: lines=%d\n", argv[i], lc);
      }
      bz2_close(bz);
    }

    free(bz);
  }

  if (line) {
    free(line);
  }

  return 0;
}

I've also been working with processing bzip2 data per line, and I found that reading one byte at a time was too slow. This worked better for me:

#include <stdio.h>
#include <stdlib.h>
#include <bzlib.h>

/* gcc -o bz bz.c -lbz2 */

#define CHUNK 128

struct bzdata {
  FILE *fp;
  BZFILE *bzf;
  int bzeof, bzlen, bzpos;
  char bzbuf[4096];
};

static int bz2_open(struct bzdata *bz, char *file);
static void bz2_close(struct bzdata *bz);
static int bz2_read_line(struct bzdata *bz, char **line, int *li);
static int bz2_buf(struct bzdata *bz, char **line, int *li, int *ll);


static int
bz2_buf(struct bzdata *bz, char **line, int *li, int *ll)
{
  int done = 0;

  for (; bz->bzpos < bz->bzlen && done == 0; bz->bzpos++) {
    if (*ll + 1 >= *li) {
      *li += CHUNK;
      *line = realloc(*line, (*li + 1) * sizeof(*(*line)));
    }
    if ( ((*line)[(*ll)++] = bz->bzbuf[bz->bzpos]) == '\n') {
      done = 1;
    }
  }

  if (bz->bzpos == bz->bzlen) {
    bz->bzpos = bz->bzlen  = 0;
  }

  (*line)[*ll] = '\0';

  return done;
}

static int
bz2_read_line(struct bzdata *bz, char **line, int *li)
{
  int bzerr = BZ_OK, done = 0, ll = 0;

  if (bz->bzpos) {
    done = bz2_buf(bz, line, li, &ll);
  }

  while (done == 0 && bz->bzeof == 0) {
    bz->bzlen = BZ2_bzRead(&bzerr, bz->bzf, bz->bzbuf, sizeof(bz->bzbuf));

    if (bzerr == BZ_OK || bzerr == BZ_STREAM_END) {
      bz->bzpos = 0;

      if (bzerr == BZ_STREAM_END) {
        bz->bzeof = 1;
      }
      done = bz2_buf(bz, line, li, &ll);
    } else { 
      done = -1;
    }
  }

  /* Handle last lines that don't have a line feed */
  if (done == 0 && ll > 0 && bz->bzeof) {
    done = 1;
  }

  return done;
}

static int
bz2_open(struct bzdata *bz, char *file)
{
  int bzerr = BZ_OK;

  if ( (bz->fp = fopen(file, "rb")) &&
       (bz->bzf = BZ2_bzReadOpen(&bzerr, bz->fp, 0, 0, NULL, 0)) &&
       bzerr == BZ_OK) {
    return 1;
  }

  return 0;
}

static void
bz2_close(struct bzdata *bz)
{
  int bzerr;

  if (bz->bzf) {
    BZ2_bzReadClose(&bzerr, bz->bzf);
    bz->bzf = NULL;
  }

  if (bz->fp) {
    fclose(bz->fp);
    bz->fp = NULL;
  }
  bz->bzpos = bz->bzlen = bz->bzeof = 0;
}

int main(int argc, char *argv[]) {
  struct bzdata *bz = NULL;
  int i, lc, li = 0;
  char *line = NULL;

  if (argc < 2) {
    return fprintf(stderr, "usage: %s file [file ...]\n", argv[0]);
  }  

  if ( (bz = calloc(1, sizeof(*bz))) ) {
    for (i = 1; i < argc; i++) {
      if (bz2_open(bz, argv[i])) {
        for (lc = 0; bz2_read_line(bz, &line, &li) > 0; lc++) {
          /* Process line here */
        }
        printf("%s: lines=%d\n", argv[i], lc);
      }
      bz2_close(bz);
    }

    free(bz);
  }

  if (line) {
    free(line);
  }

  return 0;
}

回复收藏 0 原文

不再见 2024-10-04 02:59:09

使用 C++ 的 std::string 可以很容易地做到这一点，但在 C 中，如果您想有效地做到这一点，则需要一些代码（除非您使用动态字符串库）。

char *bz_read_line(BZFILE *input)
{
    size_t offset = 0;
    size_t len = CHUNK;  // arbitrary
    char *output = (char *)xmalloc(len);
    int bzerror;

    while (BZ2_bzRead(&bzerror, input, output + offset, 1) == 1) {
        if (offset+1 == len) {
            len += CHUNK;
            output = xrealloc(output, len);
        }
        if (output[offset] == '\n')
            break;
        offset++;
    }

    if (output[offset] == '\n')
        output[offset] = '\0';  // strip trailing newline
    else if (bzerror != BZ_STREAM_END) {
        free(output);
        return NULL;
    }

    return output;
}

（其中 xmalloc 和 xrealloc 在内部处理错误。不要忘记释放返回的字符串。）

这几乎比bzcat：

lars@zygmunt:/tmp$ wc foo
 1193  5841 42868 foo
lars@zygmunt:/tmp$ bzip2 foo
lars@zygmunt:/tmp$ time bzcat foo.bz2 > /dev/null

real    0m0.010s
user    0m0.008s
sys     0m0.000s
lars@zygmunt:/tmp$ time ./a.out < foo.bz2 > /dev/null

real    0m0.093s
user    0m0.044s
sys     0m0.020s

自己决定是否可以接受。

This would be easy to do using C++'s std::string, but in C it takes some code if you want to do it efficiently (unless you use a dynamic string library).

char *bz_read_line(BZFILE *input)
{
    size_t offset = 0;
    size_t len = CHUNK;  // arbitrary
    char *output = (char *)xmalloc(len);
    int bzerror;

    while (BZ2_bzRead(&bzerror, input, output + offset, 1) == 1) {
        if (offset+1 == len) {
            len += CHUNK;
            output = xrealloc(output, len);
        }
        if (output[offset] == '\n')
            break;
        offset++;
    }

    if (output[offset] == '\n')
        output[offset] = '\0';  // strip trailing newline
    else if (bzerror != BZ_STREAM_END) {
        free(output);
        return NULL;
    }

    return output;
}

(Where xmalloc and xrealloc handle errors internally. Don't forget to free the returned string.)

This is almost an order of magnitude slower than bzcat:

lars@zygmunt:/tmp$ wc foo
 1193  5841 42868 foo
lars@zygmunt:/tmp$ bzip2 foo
lars@zygmunt:/tmp$ time bzcat foo.bz2 > /dev/null

real    0m0.010s
user    0m0.008s
sys     0m0.000s
lars@zygmunt:/tmp$ time ./a.out < foo.bz2 > /dev/null

real    0m0.093s
user    0m0.044s
sys     0m0.020s

Decide for yourself whether that's acceptable.

回复收藏 0 原文