Detecting the encoding of spreadsheet files with Python

Published 2025-01-25 02:03:57


What is the most reliable way to detect the encoding of text in a spreadsheet file? I am using Python 3.8 to detect the encoding of CSV and Microsoft Excel (.xlsx and .xls) files that contain text from Western, Central, and Eastern European languages (e.g. English, Spanish, French, Polish, Russian, Ukrainian, etc.).
Each spreadsheet file contains text written in only one of those languages.

I've tried using Beautiful Soup's UnicodeDammit library to detect the encoding of text, and it works most of the time; in my experience bs4.UnicodeDammit is more accurate than the chardet library (https://pypi.org/project/chardet/). However, UnicodeDammit fails to detect some encodings, such as Windows-1250:

from bs4 import UnicodeDammit

text = 'Wrocław'
win_1250_bytes = text.encode('windows-1250')  # Polish text encoded to windows-1250
print(win_1250_bytes)
# b'Wroc\xb3aw'
print(UnicodeDammit(win_1250_bytes).original_encoding)
# iso-8859-1  <-- wrong: the bytes are windows-1250
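Since each file is known to contain a single language, one workaround I've considered (a sketch of my own, not a UnicodeDammit feature) is to try a short, ordered list of candidate encodings and take the first one that decodes cleanly:

```python
def guess_encoding(data, candidates=('utf-8', 'windows-1250',
                                     'windows-1251', 'iso-8859-1')):
    """Return the first candidate encoding that decodes `data` cleanly.

    Caveat: single-byte code pages (windows-1250, iso-8859-1, ...) will
    decode almost any byte sequence without error, so the order of
    `candidates` encodes a prior about which languages the files contain.
    """
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


print(guess_encoding('Wrocław'.encode('windows-1250')))  # windows-1250
```

If I recall correctly, UnicodeDammit also accepts a list of candidate encodings as its second argument, which may be enough to fix the Wrocław example above.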

It seems that CSV files can be saved with an explicit encoding (such as windows-1250), but Microsoft Excel spreadsheets are limited in the encodings they can be saved as. MS Excel files appear to be saved as UTF-8 by default, and there seems to be no way to save them in a non-UTF-8 encoding from Excel.

I've written a spreadsheet class in my code but am unsure about how to accurately detect the encoding of the text in CSV and Excel files:

import pandas as pd
from os import path
from bs4 import UnicodeDammit


class SpreadsheetFile:

    def __init__(self, file_path):
        if not path.isfile(file_path):
            raise FileNotFoundError(file_path)
        name = file_path.lower()
        if name.endswith('.csv'):
            # Detect the encoding first so pandas can use it to read the CSV
            self._encoding = self.get_file_encoding(file_path)
            self._dataframe = pd.read_csv(file_path, encoding=self._encoding)
        elif name.endswith(('.xls', '.xlsx')):
            self._dataframe = pd.read_excel(file_path)
            self._encoding = self.get_file_encoding(file_path)

    def get_file_encoding(self, file_path):
        with open(file_path, 'rb') as f:
            content = f.read()
        return UnicodeDammit(content).original_encoding
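One more check that may help for CSVs exported from Excel: the "CSV UTF-8" save option prepends a byte-order mark, so looking for a BOM before falling back to statistical detection is cheap and reliable. A sketch using only the stdlib `codecs` constants:

```python
import codecs

# Order matters: the UTF-16-LE BOM (FF FE) is a prefix of the
# UTF-32-LE BOM (FF FE 00 00), so check the longer BOMs first.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def bom_encoding(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, enc in _BOMS:
        if data.startswith(bom):
            return enc
    return None
```

The returned names ('utf-8-sig', 'utf-16', 'utf-32') are the Python codec names that strip the BOM during decoding, so they can be passed straight to `pd.read_csv(..., encoding=...)`.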

Thanks for any suggestions.
