Is it safe to pipe the output of multiple parallel processes to a single file using >>?

Posted on 2024-08-25 06:44:20

I'm scraping data from the web, and I have several processes of my scraper running in parallel.

I want the output of each of these processes to end up in the same file. As long as lines of text remain intact and don't get mixed up with each other, the order of the lines does not matter. In UNIX, can I just pipe the output of each process to the same file using the >> operator?
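
Concretely, the setup in question looks roughly like this (the script and file names here are placeholders, not from the original post):

./scraper urls_part1.txt >> results.txt &
./scraper urls_part2.txt >> results.txt &
./scraper urls_part3.txt >> results.txt &
wait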

Comments (9)

风蛊 2024-09-01 06:44:21

No. It is not guaranteed that lines will remain intact. They can become intermingled.

Searching based on liori's answer, I found this:

Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.

So lines longer than {PIPE_BUF} bytes are not guaranteed to remain intact.
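
A rough way to see the intermingling for yourself (a bash sketch with made-up names; exact results vary by system): each writer below emits its line in three separate writes, the way a program does when a line overflows its output buffer, so even though >> appends each individual write atomically, the two streams tear each other's lines apart.

writer() {
  # emit each line of repeated characters in three separate writes
  for i in $(seq 500); do
    printf '%s'   "$1$1$1$1$1$1$1$1$1$1"
    printf '%s'   "$1$1$1$1$1$1$1$1$1$1"
    printf '%s\n' "$1$1$1$1$1$1$1$1$1$1"
  done
}
writer A >> mixed.log &
writer B >> mixed.log &
wait
grep -c 'AB\|BA' mixed.log   # counts lines that contain both writers' characters

On a typical run the final count is well above zero, i.e. many lines were torn.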

皓月长歌 2024-09-01 06:44:21

One possibly interesting thing you could do is use GNU parallel: http://www.gnu.org/s/parallel/ For example, if you were spidering the sites:

stackoverflow.com, stackexchange.com, fogcreek.com 

you could do something like this:

(echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k your_spider_script

The output is buffered by parallel and, because of the -k option, returned to you in the order of the site list above. A real example (basically copied from the second parallel screencast):

 ~ $ (echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k ping -c 1 {}


PING stackoverflow.com (64.34.119.12): 56 data bytes

--- stackoverflow.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING stackexchange.com (64.34.119.12): 56 data bytes

--- stackexchange.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING fogcreek.com (64.34.80.170): 56 data bytes
64 bytes from 64.34.80.170: icmp_seq=0 ttl=250 time=23.961 ms

--- fogcreek.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 23.961/23.961/23.961/0.000 ms

Anyway, ymmv
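
If the goal is simply to get everything into one file, a sketch along the same lines (using parallel's ::: argument syntax; your_spider_script is the placeholder name from above) lets parallel buffer each job's output and write it out whole, so a single redirect stays clean:

parallel -k your_spider_script ::: stackoverflow.com stackexchange.com fogcreek.com >> all_results.txt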

蔚蓝源自深海 2024-09-01 06:44:21

Generally, no.

On Linux this might be possible, as long as two conditions are met: each line is written in one operation, and the line is no longer than PIPE_BUF (on Linux usually 4096 bytes, the same as PAGE_SIZE). But... I wouldn't count on that; this behaviour might change.

It is better to use some kind of real logging mechanism, like syslog.
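
A minimal sketch of the syslog route, assuming the scraper writes one record per line and that logger(1) is available: each line piped into logger becomes its own syslog message, so lines from different processes cannot be torn apart.

your_spider_script stackoverflow.com 2>&1 | logger -t scraper &
your_spider_script stackexchange.com 2>&1 | logger -t scraper &
wait
# the tagged lines end up in the system log, e.g. /var/log/syslog, or via journalctl -t scraper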

难如初 2024-09-01 06:44:21

Definitely not. I had a log-management script where I assumed this worked, and it did work, until I moved it to a production server under load. Not a good day... Basically, you sometimes end up with completely mixed-up lines.

If I'm trying to capture from multiple sources, it is much simpler (and easier to debug) to keep a multiple-file 'paper trail', and if I need an overall log file, I concatenate based on timestamp (you are using timestamps, right?) or, as liori said, use syslog.
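
A sketch of that workflow, assuming each scraper (a placeholder name here) prefixes every line with a sortable timestamp such as ISO 8601:

./scraper siteA > siteA.log &
./scraper siteB > siteB.log &
wait
sort -m siteA.log siteB.log > combined.log   # merge the already-sorted per-source logs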

剧终人散尽 2024-09-01 06:44:21

Use temporary files and concatenate them together. It's the only safe way to do what you want to do, and there will (probably) be negligible performance loss that way. If performance is really a problem, try making sure that your /tmp directory is a RAM-based filesystem and putting your temporary files there. That way the temporary files are stored in RAM instead of on a hard drive, so reading/writing them is near-instant.
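
A sketch of the temporary-file approach (scraper and file names are placeholders); mktemp -d usually lands under /tmp, which on many systems is already a RAM-backed tmpfs:

tmpdir=$(mktemp -d)
./scraper siteA > "$tmpdir/a.out" &
./scraper siteB > "$tmpdir/b.out" &
wait                                 # let every worker finish first
cat "$tmpdir"/*.out >> results.txt   # a single writer, so nothing can interleave
rm -r "$tmpdir"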

泅人 2024-09-01 06:44:21

You'll need to ensure that you're writing whole lines in single write operations (so if you're using some form of stdio, you'll need to set it to line buffering with a buffer at least as long as the longest line you can output). Since the shell uses O_APPEND for the >> redirection, all your writes will then automatically append to the file with no further action on your part.
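
If the scraper is a program that uses C stdio and you cannot change its code, stdbuf(1) from GNU coreutils is one way to ask for line buffering from the outside; this is only a sketch, and it helps only while each line fits in the stdio buffer:

stdbuf -oL ./your_spider_script siteA >> results.txt &
stdbuf -oL ./your_spider_script siteB >> results.txt &
wait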

街角卖回忆 2024-09-01 06:44:21

Briefly, no. >> doesn't respect multiple processes.

潦草背影 2024-09-01 06:44:21

In addition to the idea of using temporary files, you could also use some kind of aggregating process, although you would still need to make sure your writes are atomic.

Think Apache2 with piped logging (with something like spread on the other end of the pipe if you're feeling ambitious). That's the approach it takes, with multiple threads/processes sharing a single logging process.
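
A bash sketch of such an aggregator (all names are placeholders): the workers write into a named pipe, a single cat owns the output file, and since writes of PIPE_BUF bytes or less are not interleaved, short whole-line writes stay intact.

mkfifo scrape.fifo
cat scrape.fifo >> results.txt &     # the single process that writes the file
exec 3> scrape.fifo                  # hold the pipe open while workers come and go
./your_spider_script siteA >&3 &
./your_spider_script siteB >&3 &
wait %2 %3                           # wait for the two scrapers (jobs 2 and 3)
exec 3>&-                            # close the last writer; cat sees EOF and exits
wait
rm scrape.fifo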

俏︾媚 2024-09-01 06:44:21

As mentioned above, it's quite a hack, but it works pretty well =)

( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) | cat

same thing with '>>':

( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) >> log

and with exec on the last one you save one process:

( ping stackoverflow.com & ping stackexchange.com & exec ping fogcreek.com ) | cat
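
Applied to the original question, the same trick might look like this (script and site names are placeholders): all the writers share one pipe into a single cat, and since writes of PIPE_BUF bytes or less are not interleaved, lines stay intact as long as each process emits a whole line per write (line-buffered output) and the lines are short enough.

( ./your_spider_script siteA & ./your_spider_script siteB & exec ./your_spider_script siteC ) | cat >> results.txt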