Form fields copied from MS Word produce invalid characters when entered into the database
I am having a problem with a web form that is submitted to a PHP script, which then inserts the data into a MySQL database.
The problem lies with copy and paste from Microsoft Word or similar word processing software. It mostly affects bullets, but sometimes affects double and single quotes as well. I am not able to sniff the character encoding the person is submitting.
I have the following code (functions) at the top of the file that processes the data:
function init_byte_map(){
global $byte_map;
for($x=128;$x<256;++$x){
$byte_map[chr($x)]=utf8_encode(chr($x));
}
$cp1252_map=array(
"\x80"=>"\xE2\x82\xAC", // EURO SIGN
"\x82" => "\xE2\x80\x9A", // SINGLE LOW-9 QUOTATION MARK
"\x83" => "\xC6\x92", // LATIN SMALL LETTER F WITH HOOK
"\x84" => "\xE2\x80\x9E", // DOUBLE LOW-9 QUOTATION MARK
"\x85" => "\xE2\x80\xA6", // HORIZONTAL ELLIPSIS
"\x86" => "\xE2\x80\xA0", // DAGGER
"\x87" => "\xE2\x80\xA1", // DOUBLE DAGGER
"\x88" => "\xCB\x86", // MODIFIER LETTER CIRCUMFLEX ACCENT
"\x89" => "\xE2\x80\xB0", // PER MILLE SIGN
"\x8A" => "\xC5\xA0", // LATIN CAPITAL LETTER S WITH CARON
"\x8B" => "\xE2\x80\xB9", // SINGLE LEFT-POINTING ANGLE QUOTATION MARK
"\x8C" => "\xC5\x92", // LATIN CAPITAL LIGATURE OE
"\x8E" => "\xC5\xBD", // LATIN CAPITAL LETTER Z WITH CARON
"\x91" => "\xE2\x80\x98", // LEFT SINGLE QUOTATION MARK
"\x92" => "\xE2\x80\x99", // RIGHT SINGLE QUOTATION MARK
"\x93" => "\xE2\x80\x9C", // LEFT DOUBLE QUOTATION MARK
"\x94" => "\xE2\x80\x9D", // RIGHT DOUBLE QUOTATION MARK
"\x95" => "\xE2\x80\xA2", // BULLET
"\x96" => "\xE2\x80\x93", // EN DASH
"\x97" => "\xE2\x80\x94", // EM DASH
"\x98" => "\xCB\x9C", // SMALL TILDE
"\x99" => "\xE2\x84\xA2", // TRADE MARK SIGN
"\x9A" => "\xC5\xA1", // LATIN SMALL LETTER S WITH CARON
"\x9B" => "\xE2\x80\xBA", // SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
"\x9C" => "\xC5\x93", // LATIN SMALL LIGATURE OE
"\x9E" => "\xC5\xBE", // LATIN SMALL LETTER Z WITH CARON
"\x9F" => "\xC5\xB8" // LATIN CAPITAL LETTER Y WITH DIAERESIS
);
foreach($cp1252_map as $k=>$v){
$byte_map[$k]=$v;
}
}
function fix_latin($instr){
if(mb_check_encoding($instr,'UTF-8'))return $instr; // no need for the rest if it's all valid UTF-8 already
global $nibble_good_chars,$byte_map;
$outstr='';
$char='';
$rest='';
while((strlen($instr))>0){
if(1==preg_match($nibble_good_chars,$instr,$match)){ // match against $instr, the function's parameter
$char=$match[1];
$rest=$match[2];
$outstr.=$char;
}elseif(1==preg_match('@^(.)(.*)$@s',$instr,$match)){ // not valid UTF-8: take one byte and map it
$char=$match[1];
$rest=$match[2];
$outstr.=$byte_map[$char];
}
$instr=$rest;
}
return $outstr;
}
$byte_map=array();
init_byte_map();
$ascii_char='[\x00-\x7F]';
$cont_byte='[\x80-\xBF]';
$utf8_2='[\xC0-\xDF]'.$cont_byte;
$utf8_3='[\xE0-\xEF]'.$cont_byte.'{2}';
$utf8_4='[\xF0-\xF7]'.$cont_byte.'{3}';
$utf8_5='[\xF8-\xFB]'.$cont_byte.'{4}';
$nibble_good_chars = "@^($ascii_char+|$utf8_2|$utf8_3|$utf8_4|$utf8_5)(.*)$@s";
I then receive each form field and run the fix_latin function.
foreach ($jobdata AS $field => $string)
{
$string = fix_latin($string);
$jobdata[$field] = addslashes(str_replace("\n", '<br />', htmlspecialchars($string)));
}
The data is entered in the database and also e-mailed to the system admin for approval. Today I received an admin e-mail that had the following for a bullet point:
Job Description: Responsibilities:
· Assist multi-state companies
And when I view the database or edit within the script, the bullet is replaced with a square box, not the • entity.
Forms should submit with the same character encoding as their host document. In theory you can override the character encoding by using
<form accept-charset="UTF-8">
when declaring your form, but this doesn't work in Internet Explorer (surprise surprise).
If you use the same character encoding for the page that contains the form as you want your data to be submitted in, you should get data using the correct character encoding.
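As a rough sketch of that idea, the page holding the form can declare UTF-8 itself so the browser submits in the same encoding. The function name `render_job_form`, the `submit.php` action, and the `description` field are hypothetical and not taken from the question.

```php
<?php
// Hypothetical helper: returns a page that declares UTF-8 and contains the form.
function render_job_form(string $action): string
{
    // Escape the action URL for use in an HTML attribute, treating it as UTF-8.
    $safe_action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return <<<PAGE
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Job submission</title>
</head>
<body>
<form method="post" action="{$safe_action}" accept-charset="UTF-8">
<textarea name="description"></textarea>
<input type="submit" value="Submit">
</form>
</body>
</html>
PAGE;
}

// Typical use in a web request: also declare the encoding in the HTTP header,
// before any output is sent.
// header('Content-Type: text/html; charset=UTF-8');
// echo render_job_form('submit.php');
```

Declaring the encoding in both the HTTP header and the `<meta>` tag is redundant on purpose: the header normally wins, and the tag covers cases where the page is saved or served without it.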
Additionally, if your script is both sending an e-mail and storing the data to a table, then you need to make sure both the e-mail and the table are using the same character encoding. You need to set the appropriate headers in your email to make sure the reader knows what character encoding you're using.
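A minimal sketch of the e-mail side, assuming a plain-text body that is already UTF-8. The helper names `utf8_mail_headers` and `utf8_mail_subject` and the address in the usage comment are made up for illustration.

```php
<?php
// Hypothetical helper: headers that tell the mail client the body is UTF-8.
function utf8_mail_headers(string $from): string
{
    return implode("\r\n", [
        'From: ' . $from,
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Transfer-Encoding: 8bit',
    ]);
}

// Hypothetical helper: wrap a UTF-8 subject as a single RFC 2047 encoded-word.
// This simple form only suits short subjects; long ones must be split into
// several encoded-words, which this sketch does not do.
function utf8_mail_subject(string $subject): string
{
    return '=?UTF-8?B?' . base64_encode($subject) . '?=';
}

// Typical use (not executed here):
// mail($to, utf8_mail_subject($subject), $body, utf8_mail_headers('jobs@example.com'));
```

Without the `charset` parameter the mail client has to guess the encoding, which is one way a pasted bullet can turn into an unrelated character in the admin e-mail.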
I'd recommend using UTF-8 throughout: make sure both your database and your web pages are encoded as UTF-8, and that any emails your scripts send also set a header indicating that they're UTF-8 encoded. That should eliminate the need for cumbersome conversion functions like the one you've been using. I ran into similar problems myself in a project, and at first I tried your approach, but in the end it was simply too much to deal with, as there are thousands of potential inputs you need to catch and handle.
In the meantime, a simple workaround is to not paste directly from Word, but to paste from Word into a simple text editor such as Notepad, then copy and paste from Notepad into the browser.
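If a fallback is still wanted while everything moves to UTF-8, a much smaller sketch than a hand-written byte map might look like the following. It assumes that any input which is not valid UTF-8 was sent as Windows-1252, which is a guess and not something the script can detect. The names `to_utf8` and `mysql_utf8_dsn` are hypothetical, and `utf8mb4` is a choice made here for MySQL's full UTF-8 character set.

```php
<?php
// Hypothetical helper: return $text as UTF-8. Anything that is not already
// valid UTF-8 is ASSUMED to be Windows-1252 (typical for text pasted from Word).
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: PDO DSN that asks MySQL to use utf8mb4 on the connection,
// so the bytes are not re-interpreted between PHP and the database.
function mysql_utf8_dsn(string $host, string $dbname): string
{
    return "mysql:host={$host};dbname={$dbname};charset=utf8mb4";
}

// Typical use (not executed here):
// $pdo = new PDO(mysql_utf8_dsn('localhost', 'jobs'), $user, $password);
// $clean = to_utf8($_POST['description']);
```

The tables and columns themselves also need a UTF-8 character set; the connection setting only covers the link between PHP and MySQL.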