Losing encoding when opening and saving a file
I'm trying to open a file with regular HTML and special Unicode characters such as "ÖÄÅ öäå" (Swedish), format it and then output it to a file.
So far everything works well: I can open the file, find the parts I need, and write them to a file.
But here is the problem:
I can't save the Unicode data I read in without losing the encoding (e.g. an 'ö' becomes 'Ã¶').
If I type the characters directly into the code, I can both match them with a regex and output them in the correct encoding. It only fails when I read a file, format it, and write it out again.
Example of a working approach using octal escapes (this can be written to the file without the encoding problem):
my $charsSWE = "öäåÅÄÖ";
# \344 = ä
# \345 = å
# \305 = Å
# \304 = Ä
# \326 = Ö
# \366 = ö
my $SwedishLetters = '\344 \345 \305 \304 \326 \366';
if($charsSWE =~ /([$SwedishLetters]+)/){
    print "Output: $1\n";
}
The approach below does not work because the encoding is lost (this is a quick illustration of that part of the code, but the concept is the same: open the file, fetch, and output):
open(FH, 'swedish.htm') or die("File could not be opened");
while(<FH>)
{
    my @List = /([$SwedishLetters]+)/g;
    message($List[0]) if @List;
}
close(FH);
Comments (1)
You may need to use a different encoding.
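To make that concrete, here is a minimal sketch of one way to do it, not necessarily the fix for this exact file. It assumes swedish.htm is UTF-8; a single letter turning into two odd characters is often what you see when UTF-8 bytes are treated as Latin-1, but that is a guess about the file. The sub names `swedish_runs` and `write_lines`, and the command-line usage, are made up for illustration. The idea is to put an explicit `:encoding(...)` layer on both the input and the output handle, so Perl decodes bytes to characters on read and encodes them again on write.

```perl
use strict;
use warnings;

# Assumption: the input file is UTF-8. Change this (e.g. to 'iso-8859-1')
# if the file is actually stored in another encoding.
my $encoding = 'UTF-8';

# The same six Swedish letters as in the question, written as code points:
# a-umlaut, a-ring, A-ring, A-umlaut, O-umlaut, o-umlaut.
my $swedish_letters = qr/[\x{E4}\x{E5}\x{C5}\x{C4}\x{D6}\x{F6}]+/;

# Read $path with an explicit decoding layer and return the first run of
# Swedish letters found on each line (lines without a match are skipped).
sub swedish_runs {
    my ($path, $enc) = @_;
    open(my $in, "<:encoding($enc)", $path)
        or die "Cannot open $path: $!";
    my @runs;
    while (my $line = <$in>) {
        push @runs, $1 if $line =~ /($swedish_letters)/;
    }
    close($in);
    return @runs;
}

# Write each string on its own line, encoding characters back to bytes
# with the given encoding.
sub write_lines {
    my ($path, $enc, @lines) = @_;
    open(my $out, ">:encoding($enc)", $path)
        or die "Cannot open $path: $!";
    print {$out} "$_\n" for @lines;
    close($out) or die "Cannot close $path: $!";
}

# Hypothetical usage: perl script.pl swedish.htm out.txt
if (@ARGV == 2) {
    my ($src, $dst) = @ARGV;
    write_lines($dst, $encoding, swedish_runs($src, $encoding));
}
```

If the output still looks wrong, check how the program you view it in interprets the file; the bytes may be fine and only displayed with the wrong encoding.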