Detecting the text encoding of spreadsheet files with Python
What is the most reliable way to detect the encoding of text in a spreadsheet file? I am using Python 3.8 to detect the encoding of CSV and Microsoft Excel (.xlsx and .xls) files that contain text from Western European, Central European, and Eastern European languages (e.g. English, Spanish, French, Polish, Russian, Ukrainian, etc.).
Each spreadsheet file will contain text written in only one of those languages.
I've tried using Beautiful Soup's UnicodeDammit library to detect the encoding of text, and it works most of the time; bs4.UnicodeDammit is more accurate than the chardet library (https://pypi.org/project/chardet/). However, UnicodeDammit fails to detect some encodings, such as Windows-1250:
from bs4 import UnicodeDammit

text = 'Wrocław'
win_1250_bytes = text.encode('windows-1250')  # Polish text encoded to windows-1250
print(win_1250_bytes)
# b'Wroc\xb3aw'
print(UnicodeDammit(win_1250_bytes).original_encoding)
# iso-8859-1
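For what it's worth, UnicodeDammit does decode the bytes correctly when it is given a list of candidate encodings to try first (a minimal sketch; in my real use case I don't know the candidates in advance):

from bs4 import UnicodeDammit

win_1250_bytes = b'Wroc\xb3aw'
# The second argument is a list of encodings UnicodeDammit tries before guessing.
dammit = UnicodeDammit(win_1250_bytes, ['windows-1250'])
print(dammit.original_encoding)  # windows-1250
print(dammit.unicode_markup)     # Wrocław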
It seems that CSV files can be saved with an explicit encoding (such as windows-1250, etc.), but are Microsoft Excel spreadsheets limited in the encodings they can be saved as? It seems that MS-Excel files are saved as UTF-8 by default and that there's no way to save to a non-UTF-8 encoding from within Excel.
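One way I checked the .xlsx side: a .xlsx file is just a ZIP archive of XML parts, and the XML inside declares UTF-8, so there is no file-level text encoding to detect in the raw bytes. A quick sketch, with 'example.xlsx' standing in for any real workbook:

import zipfile

# A .xlsx workbook is a ZIP container of XML parts; the XML text is UTF-8.
with zipfile.ZipFile('example.xlsx') as zf:  # 'example.xlsx' is a placeholder path
    print(zf.namelist()[:3])                 # e.g. ['[Content_Types].xml', '_rels/.rels', ...]
    print(zf.read('xl/workbook.xml')[:60])   # starts with b'<?xml version="1.0" encoding="UTF-8" ...'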
I've written a spreadsheet class in my code but am unsure about how to accurately detect the encoding of the text in CSV and Excel files:
import pandas as pd
from os import path
from bs4 import UnicodeDammit


class SpreadsheetFile:
    def __init__(self, file_path):
        if path.isfile(file_path) and file_path.lower().endswith('.csv'):
            # Detect the encoding first so it can be passed to read_csv;
            # otherwise pandas assumes UTF-8 and may fail on e.g. windows-1250 files.
            self._encoding = self.get_file_encoding(file_path)
            self._dataframe = pd.read_csv(file_path, encoding=self._encoding)
        elif path.isfile(file_path) and file_path.lower().endswith(('.xls', '.xlsx')):
            self._dataframe = pd.read_excel(file_path)
            self._encoding = self.get_file_encoding(file_path)

    def get_file_encoding(self, file_path):
        # Guess the encoding from the raw bytes of the whole file.
        with open(file_path, 'rb') as f:
            content = f.read()
        return UnicodeDammit(content).original_encoding
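Usage would look something like this ('cities_pl.csv' is a hypothetical CSV saved as windows-1250):

sheet = SpreadsheetFile('cities_pl.csv')  # hypothetical windows-1250 CSV
print(sheet._encoding)                    # misreported as iso-8859-1, per the UnicodeDammit issue above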
Thanks for any suggestions.