Strange characters in database text: Ã, Ã, ¢, â‚ €

Posted on 2024-12-11 11:13:16

I'm not certain when this first occurred.

I have a new drop-shipping affiliate website, and receive an exported copy of the product catalog from the wholesaler. I format and import this into Prestashop 1.4.4.

The front end of the website contains combinations of strange characters inside product text: Ã, Ã, ¢, â‚ etc. They appear in place of common characters like , - : etc.

These characters are present in about 40% of the database tables, not just product specific tables like ps_product_lang.

Another website thread says this same problem occurs when the database connection string uses an incorrect character encoding type.

In /config/setting.inc, no character encoding string is mentioned, only the MySQL engine, which is set to InnoDB and matches what I see in PHPMyAdmin.

I exported ps_product_lang, replaced all instances of these characters with correct characters, saved the CSV file in UTF-8 format, and reimported them using PHPMyAdmin, specifying UTF-8 as the language.

However, after doing a new search in PHPMyAdmin, I now have about 10 times as many instances of these bad characters in ps_product_lang as I started with.

If the problem is as simple as specifying the correct language attribute in the database connection string, where and how do I set this, and what should I set it to?

Incidentally, I tried running this command, mentioned in this thread, in PHPMyAdmin, but the problem remains:

SET NAMES utf8

UPDATE: PHPMyAdmin says:

MySQL charset: UTF-8 Unicode (utf8)

This is the same character set I used in the last import file, which caused more character corruptions. UTF-8 was specified as the charset of the import file during the import process.

UPDATE2

Here is a sample:

people are truly living untetheredÃƒÆ’Ã‚Â¢ÃƒÂ¢Ã¢â‚¬Å¡Ã‚Â¬ÃƒÂ¯Ã¢â‚¬Â
Ã‚ï† buying and renting movies online, downloading software, and
sharing and storing files on the web.

UPDATE3

I ran an SQL command in PHPMyAdmin to display the character sets:

character_set_client utf8
character_set_connection utf8
character_set_database latin1
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8

So, perhaps my database needs to be converted (or deleted and recreated) to UTF-8. Could this pose a problem if the MySQL server is latin1?

Can MySQL handle the translation of serving content as UTF8 but storing it as latin1? I don't think it can, as UTF8 is a superset of latin1. My web hosting support has not replied in 48 hours. Might be too hard for them.
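
A minimal Python sketch of how text like this typically comes about, assuming the usual cause: UTF-8 bytes are read as Windows-1252 (essentially what MySQL calls latin1) and then saved again as UTF-8. The helper name `misread` is hypothetical. Each round multiplies the damage, which would be consistent with the re-import making things worse.

```python
def misread(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


original = "…"             # one character, three bytes in UTF-8
once = misread(original)   # three characters
twice = misread(once)      # seven characters

for label, value in [("original", original), ("once", once), ("twice", twice)]:
    print(f"{label:>8}: {value!r} ({len(value)} chars)")
```

The fragments quoted in the question (Ã, ¢, â‚) are what the second round produces for a character such as the ellipsis, which suggests the data may already have been mis-decoded twice.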

长亭外，古道边 2024-12-18 11:13:16

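If the character set of the tables is the same as that of their content, try using mysql_set_charset('UTF8', $link_identifier). Note that MySQL uses UTF8 to specify the UTF-8 encoding, rather than the more common UTF-8.

See also my other answers to similar questions.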


葬シ愛 2024-12-18 11:13:16

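This is definitely an encoding problem. Your database and your website use different encodings, and that is the cause of the problem. Also, if you ran that command, you now have to change the records that are already in the table to convert those characters to UTF-8.

Update: Based on your last comment, the core of the problem is that your database and your data source (the CSV file) use different encodings. So you can either convert your database to UTF-8 or, at least, convert the data from UTF-8 to latin1 when you read it from the CSV.

You can follow this article to do the conversion: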


太阳男子 2024-12-18 11:13:16

This appears to be a UTF-8 encoding issue that may have been caused by a double-UTF8-encoding of the database file contents.

This situation could happen due to factors such as the character set that was or was not selected (for instance when a database backup file was created) and the file format and encoding the database file was saved with.

I have seen these strange UTF-8 characters in the following scenario (the description may not be entirely accurate as I no longer have access to the database in question):

As I recall, the database and tables had a "utf8_general_ci" collation.
A backup is made of the database.
The backup file is opened on Windows in UNIX file format and with ANSI encoding.
The database is restored on a new MySQL server by copy-pasting the contents of the backup file into phpMyAdmin.

Looking into the file contents:

Opening the SQL backup file in a text editor shows strange characters such as "sÃƒÂ¥". As a side note, you may get different results if you open the same file in another editor. I used TextPad here, but SublimeText showed "sÃ¥" for the same file because it correctly decoded the file as UTF-8. This is a bit confusing when you start trying to fix the issue in PHP, because at first you don't see the right data in SublimeText. In any case, it can be resolved by taking note of which encoding your text editor uses when presenting the file contents.
The strange characters are double-encoded UTF-8 characters, so in my case "Ãƒ" equals "Ã" and "Â¥" equals "¥" (this is my first "encoding"). The resulting "Ã¥" in turn equals the UTF-8 character "å" (this is my second encoding).

So, the issue is that "false" (UTF8-encoded twice) utf-8 needs to be converted back into "correct" utf-8 (only UTF8-encoded once).

Trying to fix this in PHP turns out to be a bit challenging:

utf8_decode() is not able to process the characters.

// Fails silently (as in - nothing is output)
$str = "sÃƒÂ¥";

$str = utf8_decode($str);
printf("\n%s", $str);

$str = utf8_decode($str);
printf("\n%s", $str);

iconv() fails with "Notice: iconv(): Detected an illegal character in input string".

echo iconv("UTF-8", "ISO-8859-1", "sÃƒÂ¥");

Another otherwise reasonable approach also fails silently in this scenario:

$str = "sÃƒÂ¥";
echo html_entity_decode(htmlentities($str, ENT_QUOTES, 'UTF-8'), ENT_QUOTES , 'ISO-8859-15');

mb_convert_encoding() also fails silently:

$str = "sÃƒÂ¥";
echo mb_convert_encoding($str, 'ISO-8859-15', 'UTF-8');
// (No output)

Trying to fix the encoding in MySQL by converting the database character set and collation to UTF-8 was unsuccessful:

ALTER DATABASE myDatabase CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE myTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

I see a couple of ways to resolve this issue.

The first is to make a backup with correct encoding (the encoding needs to match the actual database and table encoding). You can verify the encoding by simply opening the resulting SQL file in a text editor.
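
As a rough aid for that check, here is a small Python sketch that flags lines which look double-encoded. It assumes the damage came from UTF-8 bytes being read as Windows-1252; the function names and the sample lines are made up for illustration. A line is flagged when its non-ASCII text can be turned back into bytes and decoded as UTF-8 without error, which clean text such as "så" cannot.

```python
def looks_double_encoded(line: str) -> bool:
    """True if the line's non-ASCII text survives a cp1252 -> UTF-8 round trip."""
    if line.isascii():
        return False
    try:
        line.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return True


def suspicious_lines(lines):
    """Return the 1-based numbers of the lines that look double-encoded."""
    return [n for n, line in enumerate(lines, start=1) if looks_double_encoded(line)]


# Hypothetical dump lines: clean, damaged, and pure ASCII.
sample = [
    "INSERT INTO t VALUES ('så');",
    "INSERT INTO t VALUES ('sÃ¥');",
    "-- plain ASCII comment",
]
print(suspicious_lines(sample))
```

The check works on whole lines, so a line that mixes clean and damaged non-ASCII text is not flagged. Treat the result as a hint, not a verdict.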

The other is to replace double-UTF8-encoded characters with single-UTF8-encoded characters. This can be done manually in a text editor. To assist in this process, you can look up the incorrect characters in a UTF-8 Encoding Debugging Chart (it may be a matter of replacing 5-10 errors).

Finally, a script can assist in the process:

    $str = "sÃƒÂ¥";
    // The two arrays can also be generated by double-encoding values in the first array and single-encoding values in the second array.
    $str = str_replace(["Ãƒ","Â¥"], ["Ã","¥"], $str); 
    $str = utf8_decode($str);
    echo $str;
    // Output: "så" (correct)
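
The same idea can be sketched in Python without a replacement table, assuming the damage came from UTF-8 bytes being read as Windows-1252 one or more times. The function name `undo_mojibake` and the `max_rounds` limit are illustrative choices, not part of any library.

```python
def undo_mojibake(text: str, max_rounds: int = 5) -> str:
    """Reverse repeated 'UTF-8 read as cp1252' damage, one round at a time."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no further round can be undone cleanly
        if candidate == text:
            break  # pure ASCII, nothing to undo
        text = candidate
    return text


print(undo_mojibake("sÃƒÂ¥"))
```

For the example string this undoes two rounds and stops at "så", because a third attempt no longer yields valid UTF-8. The conversion is all-or-nothing per string, so text that mixes clean and damaged characters is returned unchanged.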


吃→可爱长大的 2024-12-18 11:13:16

I encountered quite a similar problem today: mysqldump dumped the diacritic characters of my UTF-8 database as two latin1 characters each, although the file itself is regular UTF-8.

For example, "é" was encoded as the two characters "Ã©". These two characters correspond to the two-byte UTF-8 encoding of the letter, but they should be interpreted as a single character.

To solve the problem and correctly import the database on another server, I had to convert the file using the ftfy ("fixes text for you") Python library (https://github.com/LuminosoInsight/python-ftfy). The library does exactly what I expected: it transforms badly encoded UTF-8 into correctly encoded UTF-8.

ftfy comes with a command-line script, but it transforms the file in a way that prevents it from being imported back into MySQL.

I wrote a Python 3 script to do the trick:

#!/usr/bin/python3
# coding: utf-8

import ftfy

# Open both files explicitly as UTF-8 so the result does not depend on
# the platform's default encoding; "with" closes them when done.
with open('mysql.utf8.bad.dump', 'r', encoding='utf-8') as input_file, \
     open('mysql.utf8.good.dump', 'w', encoding='utf-8') as output_file:

    # Create the fixed output stream. Most optional fixes are switched off
    # so that the SQL is changed as little as possible. The keyword names
    # are those of the ftfy release this script was written against;
    # other releases may name them differently.
    stream = ftfy.fix_file(
        input_file,
        encoding=None,
        fix_entities='auto',
        remove_terminal_escapes=False,
        fix_encoding=True,
        fix_latin_ligatures=False,
        fix_character_width=False,
        uncurl_quotes=False,
        fix_line_breaks=False,
        fix_surrogates=False,
        remove_control_chars=False,
        remove_bom=False,
        normalization='NFC'
    )

    # Write each repaired line to the output file
    for line in stream:
        output_file.write(line)
