Perl and reading files with different encodings

Posted on 2024-08-23 16:12:34

I am using a Perl script to read in a file, but I'm not sure what encoding the file is in. My file is a list of book titles, and each book has other info associated with it (author, publication date, etc.), so each title sits inside a discrete chunk of data for that book. I iterate through the file line by line until a line matches the regular expression '/Book Title: (.*)/', and I take what's captured in the parentheses. Then I create a separate .txt file named after the book. However, on my Unix server, when I look at the name of the file, it's not, for example, 'LordOfTheFlies.txt' but rather 'LordOfTheFlies^M.txt'.

What is this '^M'? Is it a weird end-of-line encoding I'm not taking into account? I tried chomp, but it doesn't seem to work. What is the best file encoding for working with Perl?


Comments (3)

策马西风 2024-08-30 16:12:34

It's the additional carriage return character that Windows systems insert before each line feed character (M is the 13th letter of the alphabet, hence ASCII 13 is displayed as ^M).

It has nothing to do with file encoding; it's just the line-ending convention biting you. Perl is usually good at handling line endings correctly, but on Unix chomp() removes only the trailing line feed, so the carriage return from a Windows-style file is left behind and you have to remove it yourself. You can use s/\r?\n\z// instead of chomp() to strip both.
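As a rough sketch of how that could fit into the asker's loop: the helper name `strip_eol` and the sample data below are hypothetical, and an in-memory filehandle stands in for the real file so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: strip a trailing LF or CRLF from a line.
# Unlike chomp on Unix, this also removes the carriage return.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Made-up sample data with Windows (CRLF) line endings.
my $data = "Book Title: LordOfTheFlies\r\nAuthor: Example Author\r\n";

# Read from the string as if it were a file.
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my @filenames;
while (my $line = <$in>) {
    $line = strip_eol($line);
    if ($line =~ /^Book Title: (.*)/) {
        push @filenames, "$1.txt";   # no stray ^M in the name
    }
}
close $in;

print "$_\n" for @filenames;
```

With a real file, the `open` call would take the file's path instead of `\$data`; the rest of the loop stays the same.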

浮光之海 2024-08-30 16:12:34

Before processing a file, you need to know its encoding, which is determined by whoever produced the file.
That "^M" is Control-M, which is a carriage return and is not needed in Unix text files.
It looks like the file was created on Windows and transferred to Unix. The same thing happens when a text file is transferred over FTP in binary mode, because the carriage returns are copied unchanged instead of being converted.

゛清羽墨安 2024-08-30 16:12:34

Try chop instead of chomp. chomp removes only the newline character, whereas chop removes the last character of the string, whatever it is. s/\r// is also good.
For your general question, you might want to use an appropriate module for the type of file you have; it will make your life with Perl easier.
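To compare the three approaches side by side, here is a small sketch. It assumes the default input record separator (`$/` set to "\n", as on Unix) and uses a made-up string with a CRLF ending.

```perl
use strict;
use warnings;

# Made-up sample line with a Windows (CRLF) ending.
my $crlf = "LordOfTheFlies\r\n";

# chomp removes only the trailing newline, so the CR stays.
my $chomped = $crlf;
chomp $chomped;

# chop removes the last character unconditionally; a single call
# takes off the LF and still leaves the CR behind.
my $chopped = $crlf;
chop $chopped;

# A targeted substitution removes trailing CR and LF characters
# and nothing else.
my $cleaned = $crlf;
$cleaned =~ s/[\r\n]+\z//;

printf "chomp leaves %d chars, chop leaves %d, s/// leaves %d\n",
    length($chomped), length($chopped), length($cleaned);
```

Since chop does not check what it removes, it would need to run after chomp to take off the CR, and it would eat a real character on any line that has no CR. The substitution avoids that risk.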
