How do I process a data stream internally in a C-based application?
I am pulling data from a bzip2 stream within a C application. As chunks of data come out of the decompressor, they can be written to stdout:
fwrite(buffer, 1, length, stdout);
This works great. I get all the data when it is sent to stdout.
Instead of writing to stdout, I would like to process the output from this statement internally in one-line chunks: a string that is terminated with a newline character \n.
Do I write the output of the decompressor stream to another buffer, one character at a time, until I hit a newline, and then call the per-line processing function? Is this slow, and is there a smarter approach? Thanks for your advice.
EDIT
Thanks for your suggestions. I ended up creating a pair of buffers: on each pass through an output buffer's worth of data, the remainder (the "stub" at the end of the output buffer) is stored at the beginning of a short line buffer.
I loop through the output buffer character by character and process one newline-terminated line at a time. The newline-less remainder gets allocated and assigned, and copied to the line buffer for the next chunk of the stream. It seems like realloc is less expensive than repeated malloc-free statements.
Here's the code I came up with:
/* Fragment from inside a function; needs <stdio.h>, <stdlib.h>, <string.h> and <bzlib.h>. */
char bzBuf[BZBUFMAXLEN];
BZFILE *bzFp;
int bzError, bzNBuf;
char bzLineBuf[BZLINEBUFMAXLEN];
char *bzBufRemainder = NULL;
char *bzTmp;
int bzBufPosition, bzLineBufPosition;

bzFp = BZ2_bzReadOpen(&bzError, *fp, 0, 0, NULL, 0); /* http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzcompress-init */
if (bzError != BZ_OK) {
    BZ2_bzReadClose(&bzError, bzFp);
    fprintf(stderr, "\n\t[gchr2] - Error: Bzip2 data could not be retrieved\n\n");
    return -1;
}

bzError = BZ_OK;
bzLineBufPosition = 0;
while (bzError == BZ_OK) {
    bzNBuf = BZ2_bzRead(&bzError, bzFp, bzBuf, sizeof(bzBuf));
    if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
        if (bzBufRemainder != NULL) {
            /* restore the partial line saved from the previous chunk */
            bzLineBufPosition = (int)strlen(bzBufRemainder);
            memcpy(bzLineBuf, bzBufRemainder, bzLineBufPosition); /* leave out \0 */
            bzBufRemainder[0] = '\0'; /* mark as consumed so it is not prepended again */
        }
        for (bzBufPosition = 0; bzBufPosition < bzNBuf; bzBufPosition++) {
            if (bzLineBufPosition >= BZLINEBUFMAXLEN - 1) { /* no room for this byte plus \0 */
                free(bzBufRemainder);
                BZ2_bzReadClose(&bzError, bzFp);
                fprintf(stderr, "\n\t[gchr2] - Error: Line exceeds BZLINEBUFMAXLEN\n\n");
                return -1;
            }
            bzLineBuf[bzLineBufPosition++] = bzBuf[bzBufPosition];
            if (bzBuf[bzBufPosition] == '\n') {
                bzLineBuf[bzLineBufPosition] = '\0'; /* terminate bzLineBuf */
                /* process the line buffer, e.g. print it out or transform it, etc. */
                fprintf(stdout, "%s", bzLineBuf);
                bzLineBufPosition = 0; /* reset line buffer position */
            }
            else if (bzBufPosition == (bzNBuf - 1)) {
                bzLineBuf[bzLineBufPosition] = '\0';
                /* realloc(NULL, n) behaves like malloc(n); +1 leaves room for \0 */
                bzTmp = (char *)realloc(bzBufRemainder, bzLineBufPosition + 1);
                if (bzTmp == NULL) {
                    free(bzBufRemainder);
                    BZ2_bzReadClose(&bzError, bzFp);
                    fprintf(stderr, "\n\t[gchr2] - Error: Out of memory\n\n");
                    return -1;
                }
                bzBufRemainder = bzTmp;
                memcpy(bzBufRemainder, bzLineBuf, bzLineBufPosition + 1); /* include \0 */
            }
        }
    }
}

if (bzError != BZ_STREAM_END) {
    free(bzBufRemainder);
    BZ2_bzReadClose(&bzError, bzFp);
    fprintf(stderr, "\n\t[gchr2] - Error: Bzip2 data could not be uncompressed\n\n");
    return -1;
} else {
    if (bzLineBufPosition > 0)
        fprintf(stdout, "%s", bzLineBuf); /* final line had no trailing newline */
    BZ2_bzReadGetUnused(&bzError, bzFp, 0, 0);
    BZ2_bzReadClose(&bzError, bzFp);
}

free(bzBufRemainder);
bzBufRemainder = NULL;
I really appreciate everyone's help. This is working nicely.
Comments (4)
I don't think there's a smarter approach (except finding an automata library that already does this for you). Be careful with allocating proper size for the "last line" buffer: if it cannot handle arbitrary length and the input comes from something accessible to third parties, it becomes a security risk.
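To illustrate the point about arbitrary line lengths, here is a minimal sketch of a line buffer that grows on demand instead of relying on a fixed maximum. The names LineBuf, linebuf_append, linebuf_clear and linebuf_free are hypothetical and do not come from the question or from bzlib. For untrusted input you would probably also want an upper limit on the line length, which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable line buffer; all names here are illustrative. */
typedef struct {
    char  *data;  /* NUL-terminated contents, or NULL before first use */
    size_t len;   /* bytes stored, excluding the terminator */
    size_t cap;   /* bytes allocated */
} LineBuf;

/* Append n bytes from src, doubling the capacity when needed.
   Returns 0 on success, -1 on allocation failure or size overflow
   (the buffer is left unchanged in that case). */
static int linebuf_append(LineBuf *lb, const char *src, size_t n)
{
    assert(lb != NULL);
    if (n > SIZE_MAX - lb->len - 1)
        return -1;                       /* len + n + 1 would overflow */
    size_t need = lb->len + n + 1;
    if (need > lb->cap) {
        size_t cap = lb->cap ? lb->cap : 64;
        while (cap < need) {
            if (cap > SIZE_MAX / 2) {    /* doubling would overflow */
                cap = need;
                break;
            }
            cap *= 2;
        }
        char *p = realloc(lb->data, cap);
        if (p == NULL)
            return -1;
        lb->data = p;
        lb->cap = cap;
    }
    if (n > 0)
        memcpy(lb->data + lb->len, src, n);
    lb->len += n;
    lb->data[lb->len] = '\0';
    return 0;
}

/* Forget the contents but keep the allocation for the next line. */
static void linebuf_clear(LineBuf *lb)
{
    lb->len = 0;
    if (lb->data != NULL)
        lb->data[0] = '\0';
}

/* Release the allocation and return to the initial state. */
static void linebuf_free(LineBuf *lb)
{
    free(lb->data);
    lb->data = NULL;
    lb->len = 0;
    lb->cap = 0;
}
```

A caller would start from LineBuf lb = {0}, append each piece of a line as it arrives, call linebuf_clear after processing a complete line, and call linebuf_free once at the end. Doubling the capacity keeps the number of realloc calls logarithmic in the length of the longest line.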
I've also been working with processing bzip2 data per line, and I found that reading one byte at a time was too slow. This worked better for me:
This would be easy to do using C++'s std::string, but in C it takes some code if you want to do it efficiently (unless you use a dynamic string library).
(Where xmalloc and xrealloc handle errors internally. Don't forget to free the returned string.)
This is almost an order of magnitude slower than bzcat:
Decide for yourself whether that's acceptable.
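For readers unfamiliar with the xmalloc and xrealloc convention mentioned in this answer, here is one common way such wrappers are written. This is a sketch under assumptions, not the answerer's actual code: the exit-on-failure policy and the helper xstrdup_n are illustrative choices.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Allocate n bytes or terminate the program; never returns NULL.
   (Assumed policy: print a message and exit on failure.) */
static void *xmalloc(size_t n)
{
    void *p = malloc(n ? n : 1);   /* avoid the malloc(0) ambiguity */
    if (p == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Resize a block or terminate the program; never returns NULL. */
static void *xrealloc(void *old, size_t n)
{
    void *p = realloc(old, n ? n : 1);
    if (p == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Hypothetical helper: copy the first n bytes of src into a new
   NUL-terminated string. The caller must free() the result. */
static char *xstrdup_n(const char *src, size_t n)
{
    assert(src != NULL);
    char *s = xmalloc(n + 1);
    memcpy(s, src, n);
    s[n] = '\0';
    return s;
}
```

With wrappers like these, line-handling code can skip the NULL check after every allocation, at the cost of giving up any chance to recover from an out-of-memory condition.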
I think you should copy chunks of characters to another buffer until the latest chunk you write contains a newline character. Then you can work on the whole line.
You can save the rest of the buffer (after the '\n') into a temporary and then create a new line from it.
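The chunk-copying idea above could be sketched roughly as follows. It uses memchr to find each newline rather than testing one character at a time, and it keeps only the unfinished tail of a chunk in a temporary carry buffer. LineSplitter, splitter_feed, splitter_finish, LineStats and count_line are hypothetical names made up for this example; the callback receives the line without its trailing newline, and the line is only guaranteed to be valid for len bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Called once per complete line; line is valid for len bytes only. */
typedef void (*line_fn)(const char *line, size_t len, void *ctx);

typedef struct {
    char  *carry;      /* partial line left over from earlier chunks */
    size_t carry_len;  /* bytes currently held in carry */
} LineSplitter;

/* Feed one chunk of decompressed data. Returns 0, or -1 if realloc fails. */
static int splitter_feed(LineSplitter *s, const char *chunk, size_t n,
                         line_fn fn, void *ctx)
{
    assert(s != NULL && fn != NULL);
    const char *p = chunk;
    const char *end = chunk + n;
    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        size_t piece = nl ? (size_t)(nl - p) : (size_t)(end - p);
        if (nl != NULL && s->carry_len == 0) {
            fn(p, piece, ctx);  /* whole line inside this chunk: no copy */
        } else {
            /* extend the carry buffer with this piece */
            char *tmp = realloc(s->carry, s->carry_len + piece + 1);
            if (tmp == NULL)
                return -1;
            memcpy(tmp + s->carry_len, p, piece);
            s->carry = tmp;
            s->carry_len += piece;
            s->carry[s->carry_len] = '\0';
            if (nl != NULL) {   /* the carried line is now complete */
                fn(s->carry, s->carry_len, ctx);
                s->carry_len = 0;
            }
        }
        p = nl ? nl + 1 : end;
    }
    return 0;
}

/* Flush a final line that had no trailing newline, then free the carry. */
static void splitter_finish(LineSplitter *s, line_fn fn, void *ctx)
{
    if (s->carry_len > 0)
        fn(s->carry, s->carry_len, ctx);
    free(s->carry);
    s->carry = NULL;
    s->carry_len = 0;
}

/* Example callback: counts lines and sums their lengths. */
typedef struct { size_t lines, bytes; } LineStats;

static void count_line(const char *line, size_t len, void *ctx)
{
    LineStats *st = ctx;
    (void)line;
    st->lines++;
    st->bytes += len;
}
```

In the question's loop, each successful BZ2_bzRead call would be followed by something like splitter_feed(&s, bzBuf, (size_t)bzNBuf, count_line, &stats), with splitter_finish called once after BZ_STREAM_END. Lines that fall entirely inside one chunk are handed to the callback without being copied, so copying only happens for lines that straddle a chunk boundary.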