Python 会在文件写入完成之前打开文件吗?
我正在编写一个脚本,该脚本将轮询目录以查找新文件。
在这种情况下,是否有必要进行某种错误检查以确保在访问文件之前已完全写入文件?
我不想在文件完全写入磁盘之前使用它,但因为我想要从文件中获取的信息接近开头,所以似乎可以在不了解文件的情况下提取我需要的数据还没有写完。
这是我应该担心的事情,还是文件会因为操作系统正在写入硬盘而被锁定?
这是在 Linux 系统上。
I am writing a script that will be polling a directory looking for new files.
In this scenario, is it necessary to do some sort of error checking to make sure the files are completely written prior to accessing them?
I don't want to work with a file before it has been written completely to disk, but because the info I want from the file is near the beginning, it seems like it could be possible to pull the data I need without realizing the file isn't done being written.
Is that something I should worry about, or will the file be locked because the OS is writing to the hard drive?
This is on a Linux system.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
通常在 Linux 上,除非您使用某种锁定,否则两个进程可以很高兴地同时打开同一个文件,甚至用于写入。可通过三种方法避免出现问题:
锁定
通过让写入者对文件应用锁定,可以防止读取者部分读取文件。然而,大多数锁都是建议性的,因此仍然完全有可能看到部分结果。 (强制锁是存在的,但强烈不推荐,因为它们太脆弱了。)编写正确的锁定代码相对困难,将此类任务委托给专业库是很正常的(即数据库引擎!)特别是,您不想在网络文件系统上使用锁定;当它工作时,它会带来巨大的麻烦,并且经常会彻底出错。
约定
可以在同一目录中使用您不会在阅读端自动查找的另一个名称(例如
.foobar.txt.tmp
)创建文件,然后自动重命名为写入完成后,输入正确的名称(例如,foobar.txt
)。只要您注意处理先前运行无法正确写入文件的可能性,这种方法就可以很好地工作。如果一次只能有一名编写者,那么实现起来相当简单。不担心
最常见、最频繁写入的文件类型是日志文件。这些可以很容易地以这样的方式编写:信息严格地只附加到文件中,因此任何读者都可以安全地查看文件的开头,而不必担心其脚下的任何变化。这在实践中非常有效。
Python 在这方面并没有什么特别之处。所有在 Linux 上运行的程序都存在相同的问题。
Typically on Linux, unless you're using locking of some kind, two processes can quite happily have the same file open at once, even for writing. There are three ways of avoiding problems with this:
Locking
By having the writer apply a lock to the file, it is possible to prevent the reader from reading the file partially. However, most locks are advisory so it is still entirely possible to see partial results anyway. (Mandatory locks exist, but a strongly not recommended on the grounds that they're far too fragile.) It's relatively difficult to write correct locking code, and it is normal to delegate such tasks to a specialist library (i.e., to a database engine!) In particular, you don't want to use locking on networked filesystems; it's a source of colossal trouble when it works and can often go thoroughly wrong.
Convention
A file can instead be created in the same directory with another name that you don't automatically look for on the reading side (e.g.,
.foobar.txt.tmp
) and then renamed atomically to the right name (e.g.,foobar.txt
) once the writing is done. This can work quite well, so long as you take care to deal with the possibility of previous runs failing to correctly write the file. If there should only ever be one writer at a time, this is fairly simple to implement.Not Worrying About It
The most common type of file that is frequently written is a log file. These can be easily written in such a way that information is strictly only ever appended to the file, so any reader can safely look at the beginning of the file without having to worry about anything changing under its feet. This works very well in practice.
There's nothing special about Python in any of this. All programs running on Linux have the same issues.
在 Unix 上,除非写入应用程序出现问题,否则文件不会被锁定,您将能够从中读取内容。
当然,读者必须准备好处理不完整的文件(请记住,作者一方可能会发生 I/O 缓冲)。
如果这是行不通的,您将不得不考虑一些方案来同步写入器和读取器,例如:
On Unix, unless the writing application goes out of its way, the file won't be locked and you'll be able to read from it.
The reader will, of course, have to be prepared to deal with an incomplete file (bearing in mind that there may be I/O buffering happening on the writer's side).
If that's a non-starter, you'll have to think of some scheme to synchronize the writer and the reader, for example:
如果您对写入程序有一定的控制权,请让它将文件写入其他位置(例如 /tmp 目录),然后在完成后将其移动到正在监视的目录。
如果您无法控制进行写入的程序(我所说的“控制”是指“编辑源代码”),您可能也无法使其执行文件锁定,所以这可能是不可能的。在这种情况下,您可能需要了解有关文件格式的信息才能知道编写器何时完成。例如,如果编写者总是将“DONE”写入文件中的最后四个字符,则您可以打开文件,查找到末尾,然后读取最后四个字符。
If you have some control over the writing program, have it write the file somewhere else (like the /tmp directory) and then when it's done move it to the directory being watched.
If you don't have control of the program doing the writing (and by 'control' I mean 'edit the source code'), you probably won't be able to make it do file locking either, so that's probably out. In which case you'll likely need to know something about the file format to know when the writer is done. For instance, if the writer always writes "DONE" as the last four characters in the file, you could open the file, seek to the end, and read the last four characters.
是的,会的。
我更喜欢 Donal 描述的“文件命名约定”和重命名解决方案。
Yes it will.
I prefer the "file naming convention" and renaming solution described by Donal.