Is it safe to have multiple processes writing to the same file at the same time? [CentOS 6, ext4]
I'm building a system where multiple slave processes are communicating via unix domain sockets, and they are writing to the same file at the same time. I have never studied filesystems or this specific filesystem (ext4), but it feels like there might be some danger here.
Each process writes to a disjoint subset of the output file (i.e., there is no overlap in the blocks being written). For example, P1 writes to only the first 50% of the file and P2 writes only to the second 50%. Or perhaps P1 writes only the odd-numbered blocks while P2 writes the even-numbered blocks.
Is it safe to have P1 and P2 (running simultaneously on separate threads) writing to the same file without using any locking? In other words, does the filesystem impose some kind of locking implicitly?
Note: I'm unfortunately not at liberty to output multiple files and join them later.
Note: My reading since posting this question does not agree with the only posted answer below. Everything I've read suggests that what I want to do is fine, whereas the respondent below insists what I am doing is unsafe, but I am unable to discern the described danger.
2 Answers
What you're doing seems perfectly OK, provided you're using the POSIX "raw" IO syscalls such as read(), write(), lseek() and so forth.
If you use C stdio (fread(), fwrite() and friends) or some other language runtime library that has its own userspace buffering, then the answer by "Tilo" is relevant: because of that buffering, which is to some extent outside your control, the different processes might overwrite each other's data.
Regarding OS locking: while POSIX states that writes and reads of size less than PIPE_BUF are atomic for certain special files (pipes and FIFOs), there is no such guarantee for regular files. In practice, I think it's likely that IOs within a page are atomic, but there is no such guarantee. The OS does locking internally only to the extent necessary to protect its own internal data structures. One can use file locks, or some other interprocess communication mechanism, to serialize access to files. But all of this is relevant only if you have several processes doing IO to the same region of a file. In your case, since your processes are doing IO to disjoint sections of the file, none of this matters, and you should be fine.
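To make this concrete, here is a minimal sketch of the disjoint-write scheme using the raw pwrite() syscall, which carries its own explicit offset and never touches a shared file position. The filename, sizes, and fork-based setup are illustrative assumptions, not something from the question:

```c
/* Sketch only: two child processes each write their own disjoint half
 * of one pre-sized file. Filename and sizes are made-up examples. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define FILE_SIZE (1024 * 1024)   /* 1 MiB total, split into two halves */
#define HALF      (FILE_SIZE / 2)

static void write_half(const char *path, off_t offset, char fill)
{
    int fd = open(path, O_WRONLY);            /* each process gets its own fd */
    if (fd < 0) { perror("open"); exit(1); }

    char buf[4096];
    memset(buf, fill, sizeof buf);

    for (off_t done = 0; done < HALF; ) {
        /* pwrite() takes an explicit offset, so the two processes share
         * no file-position state at all. */
        ssize_t n = pwrite(fd, buf, sizeof buf, offset + done);
        if (n < 0) { perror("pwrite"); exit(1); }
        done += n;
    }
    close(fd);
}

int main(void)
{
    const char *path = "shared.out";

    /* Pre-size the file once, before the writers start. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || ftruncate(fd, FILE_SIZE) < 0) { perror("setup"); return 1; }
    close(fd);

    if (fork() == 0) { write_half(path, 0,    'A'); _exit(0); }   /* P1 */
    if (fork() == 0) { write_half(path, HALF, 'B'); _exit(0); }   /* P2 */

    wait(NULL);
    wait(NULL);
    return 0;
}
```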
No, generally it is not safe to do this!
You need to obtain an exclusive write lock for each process -- which implies that all the other processes have to wait while one process is writing to the file. The more I/O-intensive processes you have, the longer the wait time.
It is better to have one output file per process, and to format those files with a timestamp and a process identifier at the beginning of each line, so that you can later merge and sort those output files offline.
Tip: check the file format of web-server log files -- these are written with the timestamp at the beginning of the line, so they can later be combined and sorted.
EDIT
UNIX processes use a certain fixed buffer size when they open files (e.g. 4096 bytes) to transfer data to and from the file on disk. Once the write buffer is full, the process flushes it to disk -- that means it writes the complete buffer to disk. Note that this happens when the buffer is full, not when there is an end-of-line! That means that even for a single process writing line-oriented text data to a file, lines are typically cut somewhere in the middle at the moment the buffer is flushed. Only at the end, when the file is closed after writing, can you assume that it contains complete lines.
So depending on when your processes decide to flush their buffers, they write to the file at different times -- i.e. the order is not deterministic or predictable. When a buffer is flushed to the file, you cannot assume that it contains only complete lines -- it will usually end with a partial line, thereby messing up the output if several processes flush their buffers without synchronization.
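To illustrate the flushing behavior described above, here is a small sketch (the 64-byte buffer and the line length are artificial values chosen to make the effect visible; real stdio buffers are typically around 4096 bytes, and the exact split point is implementation-defined):

```c
/* Sketch only: stdio flushes when its userspace buffer fills,
 * not when it sees '\n', so the kernel can receive partial lines. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("log.txt", "w");
    if (!f) return 1;

    /* Force an artificially small 64-byte, fully buffered stream. */
    static char buf[64];
    setvbuf(f, buf, _IOFBF, sizeof buf);

    /* Two 50-byte lines total 100 bytes, so the buffer overflows during
     * the second fwrite() and the resulting write() ends mid-line. */
    const char *line = "0123456789012345678901234567890123456789012345678\n";
    fwrite(line, 1, strlen(line), f);
    fwrite(line, 1, strlen(line), f);   /* flush happens here, mid-line */

    fclose(f);   /* the remaining buffered bytes are written out here */
    return 0;
}
```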
Check this article on Wikipedia: http://en.wikipedia.org/wiki/File_locking#File_locking_in_UNIX
You should use either flock or mutexes to synchronize the processes and make sure that only one of them can write to the file at a time.
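A minimal sketch of that flock() approach follows; the path, message, and error handling are placeholders. Each writer takes an exclusive lock around its write, so writes from different processes are serialized:

```c
/* Sketch only: serialize writers with an exclusive flock(). */
#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

int locked_append(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;

    if (flock(fd, LOCK_EX) < 0) {   /* blocks until this process owns the lock */
        close(fd);
        return -1;
    }

    ssize_t n = write(fd, msg, strlen(msg));

    flock(fd, LOCK_UN);             /* release before closing */
    close(fd);
    return n < 0 ? -1 : 0;
}

int main(void)
{
    return locked_append("shared.log", "one serialized line\n") ? 1 : 0;
}
```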
As I mentioned earlier, it is probably faster, easier, and more straightforward to have one output file for each process, and then later combine those files if needed (offline). This approach is used, for example, by some web servers that need to log to multiple files from multiple threads and need to make sure the different threads all stay fast (e.g. not having to wait for each other on a file lock).
Here's a related post (check Mark Byers's answer! The accepted answer is not correct/relevant):
Is it safe to pipe the output of several parallel processes to one file using >>?
EDIT 2:
In the comments you said that you want to write fixed-size binary data blocks from the different processes to the same file.
Only in the case where your block size is exactly the system's file-buffer size could this work!
Make sure that your fixed block length is exactly the system's file-buffer size. Otherwise you will get into the same situation as with the incomplete lines: if you use 16k blocks, for example, and the system uses 4k blocks, then in general you will see 4k blocks in the file in seemingly random order -- there is no guarantee that you will always see four blocks in a row from the same process.
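For reference, the interleaved-block scheme from the question is usually expressed with pwrite() at offsets computed from the block index; whether that is safe without matching the system buffer size is exactly where the two answers here disagree. The block size, block count, and odd/even split below are assumptions taken from the question:

```c
/* Sketch only: one process writes odd-numbered blocks, the other
 * even-numbered ones. BLOCK_SIZE and NBLOCKS are illustrative. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define NBLOCKS    256

/* parity 0 writes even-numbered blocks, parity 1 writes odd-numbered ones */
int write_blocks(const char *path, int parity, char fill)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;

    char block[BLOCK_SIZE];
    memset(block, fill, sizeof block);

    for (int i = parity; i < NBLOCKS; i += 2) {
        off_t off = (off_t)i * BLOCK_SIZE;    /* offset from block index */
        if (pwrite(fd, block, BLOCK_SIZE, off) != BLOCK_SIZE) {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}
```

P1 would call write_blocks(path, 1, 'A') and P2 write_blocks(path, 0, 'B'); each pwrite() carries its own offset, so no shared file position is involved.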