Finding fragments of non-Latin-1 text in a mostly Latin-1 file?



I believe an English .txt file to be Latin-1, but it may contain fragments in another encoding. Are there libraries or tools to locate these fragments?

I'm aware of things like the Python chardet library, but I'm specifically looking for a tool to test a Latin-1 file and detect anomalies. Even a regular detection library would be fine, if it's able to tell me the point where it detected a non-Latin-1 pattern and give me the index.

Command line tools and Python libraries are especially welcome.


Comments (2)

甚是思念 2024-12-31 23:27:04


Latin-1 (or maybe you mean its ISO-8859-15 variant with the euro sign?) is not that easy to detect.

A simple approach could be to check whether some of the characters that Latin-1 leaves unused are actually being used (see the Latin-1 code table) - if they are, something is wrong. However, to detect more subtle violations, one needs to actually check whether the language is one of those for which Latin-1 is used; otherwise there is no way to distinguish between 8-bit encodings. It's much better never to mix 8-bit encodings in the first place without marking the change of encoding in some way...
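A minimal sketch of that unused-character check, assuming the bytes worth flagging are the 0x80-0x9F range (C1 control codes in Latin-1, which hardly ever appear in real text); the file name is a placeholder:

# Sketch: report offsets of bytes that Latin-1 maps to C1 control codes
# (0x80-0x9F). These are legal Latin-1 code points but almost never occur
# in ordinary text, so their presence suggests another encoding (e.g.
# cp1252 or stray UTF-8) has been mixed in.
data = open('the_file.txt', 'rb').read()    # placeholder file name

suspects = [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]
print(len(suspects), "bytes in the unused C1 range")
for offset, byte in suspects[:20]:          # show the first few hits
    context = data[max(0, offset - 10):offset + 10]
    print("offset %d: 0x%02X  context %r" % (offset, byte, context))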

萌化 2024-12-31 23:27:04


What are the grounds for your beliefs that the file (1) is Latin-1 and (2) may contain fragments in another encoding? How large is the file? What is a "regular detection library"? Have you considered the possibility that it might be a Windows encoding, e.g. cp1252?

Some broad-brush diagnostics:

# preliminaries
text = open('the_file.txt', 'rb').read()
print(len(text), "bytes in file")

# How many non-ASCII bytes?
print(sum(1 for c in text if c > 0x7F), "non-ASCII bytes")

# Will it decode as UTF-8 OK?
try:
    junk = text.decode('utf8')
    print("utf8 decode OK")
except UnicodeDecodeError as e:
    print(e)

# Runs of more than one non-ASCII byte are somewhat rare in single-byte encodings
# of languages written in a Latin script ...
import re
runs = re.findall(rb'[\x80-\xff]+', text)
nruns = len(runs)
print(nruns, "runs of non-ASCII bytes")
if nruns:
    avg_rlen = sum(len(run) for run in runs) / nruns
    print("average run length: %.2f bytes" % avg_rlen)
# then if indicated you could write some code to display runs in context ...
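Following up on that last comment, a rough sketch (same placeholder file name) of displaying each run of non-ASCII bytes in context, together with how it would read under Latin-1 and under cp1252:

# Sketch: show each run of non-ASCII bytes with its byte offset and a
# little surrounding context, so suspect fragments can be inspected by eye.
import re

text = open('the_file.txt', 'rb').read()    # placeholder file name
for m in re.finditer(rb'[\x80-\xff]+', text):
    start, end = m.span()
    snippet = text[max(0, start - 15):end + 15]
    print("offset %d, %d byte(s): %r" % (start, end - start, m.group()))
    print("  as latin-1:", snippet.decode('latin-1'))
    print("  as cp1252: ", snippet.decode('cp1252', errors='replace'))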