Reading srt (subtitle) files with Python 3

Posted 2024-12-05 10:48:50

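I would like to be able to read srt files with Python 3.

These files can be found here: http://www.opensubtitles.org/

Background information is here: http://en.wikipedia.org/wiki/SubRip

SubRip supports any encoding: ASCII or Unicode, for example.

If I understand correctly, I need to specify which decoder to use when I call Python's read function. So am I right that I need to know how a file is encoded in order to make that choice? If so, and I have a hundred such files from different sources and with different language support, how do I establish the encoding for each file?

Ultimately, I would prefer to convert the files so that they all start out encoded as UTF-8. But as far as I know, some of these files may use some obscure encoding.

Please help,

Barry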


残花月 2024-12-12 10:48:50

You can use the charade package (formerly chardet) to detect the encoding.
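A minimal sketch of that detection step, assuming the third-party chardet package is installed (`pip install chardet`). The helper name `guess_encoding` is hypothetical, and the result is a statistical guess rather than a guarantee, so it is worth checking the output on a few files by eye.

```python
try:
    import chardet  # third-party; assumed installed via `pip install chardet`
except ImportError:
    chardet = None


def guess_encoding(raw):
    """Return chardet's best guess for the encoding of raw bytes, or None."""
    if chardet is None:
        return None  # detector not available on this machine
    result = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    return result["encoding"]


# Hypothetical sample: a tiny subtitle cue encoded as UTF-8.
sample = "1\n00:00:01,000 --> 00:00:02,000\nHéllo wörld\n".encode("utf-8")
print(guess_encoding(sample))
```

Short inputs give the detector little to work with, so passing the whole file (read in binary mode) is likely to produce better guesses than passing a single line.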


ぃ双果 2024-12-12 10:48:50

You can check for a byte order mark (BOM) at the start of each .srt file to test for its encoding. However, this won't work for all files: a BOM is optional, and it is only defined for the UTF encodings anyway. A check can be performed like this:

testStr = b'\xff\xfeOtherdata'

if testStr[0:2] == b'\xff\xfe':
    print('UTF-16 Little Endian')
elif testStr[0:2] == b'\xfe\xff':
    print('UTF-16 Big Endian')
#...
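The same check can be sketched for the remaining BOMs using the constants in the standard `codecs` module. The function name `encoding_from_bom` and the `BOMS` table are hypothetical; the codec names returned are ones that consume the BOM while decoding. Order matters, because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None  # no BOM: the encoding has to be found some other way


print(encoding_from_bom(b'\xff\xfeOtherdata'))
```

A result of `None` says nothing about the encoding; it only means the file carries no BOM.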

What you probably want to do is simply open the file, decode whatever you read from it into Unicode, work with the Unicode representation until you are ready to output it, and then encode it again. See this talk for more information and code samples that might be relevant.
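The decode, process, encode cycle described above could look roughly like the sketch below, which also covers the goal of rewriting every file as UTF-8. The candidate list and the function names are assumptions for illustration; the right candidates depend on which languages the subtitles are in, and the final latin-1 fallback never fails but may produce the wrong characters.

```python
from pathlib import Path

# Hypothetical candidate list; adjust it to the encodings you expect.
CANDIDATES = ["utf-8-sig", "cp1252"]


def decode_subtitles(raw, candidates=CANDIDATES):
    """Decode raw bytes with the first candidate encoding that succeeds."""
    for name in candidates:
        try:
            return raw.decode(name)
        except UnicodeError:
            continue  # wrong guess, try the next candidate
    # latin-1 maps every byte to a code point, so this cannot raise,
    # but the resulting text may be wrong.
    return raw.decode("latin-1")


def convert_to_utf8(src, dst):
    """Read src in an unknown encoding and write it to dst as UTF-8."""
    text = decode_subtitles(Path(src).read_bytes())
    # Any processing of the str object would go here.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Reading and writing in binary mode keeps the line endings untouched; only the character encoding changes.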
