FileUpload server control and Unicode characters
I'm using the FileUpload server control to upload an HTML document previously saved from MS Word (as "Web Page, Filtered"). The charset is windows-1252.
The document has smart (curly) quotation marks as well as regular quotes. It also has what appear to be blank spaces that, on closer inspection, are characters other than the normal TAB or SPACE.
When capturing the file content in a StreamReader, those special characters are translated to question marks. I assume it's because the default encoding is UTF-8 and the file is Unicode.
I went ahead and created the StreamReader using Unicode encoding, then replaced all the unwanted characters with the correct ones (code I actually found on Stack Overflow). This seems to work... except that I can't convert the string back to UTF-8 to display it in an asp:Literal.
The code is there and it's supposed to work... but the output of ConvertToASCII is unreadable.
Please look below:
protected void btnUpload_Click(object sender, EventArgs e)
{
    StreamReader sreader;
    if (uplSOWDoc.HasFile)
    {
        try
        {
            if (uplSOWDoc.PostedFile.ContentType == "text/html" || uplSOWDoc.PostedFile.ContentType == "text/plain")
            {
                sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode);
                string sowText = sreader.ReadToEnd();
                sowLiteral.Text = ConvertToASCII(sowText);
                lblUploadResults.Text = "File loaded successfully.";
            }
            else
                lblUploadResults.Text = "Upload failed. Just text or html files are allowed.";
        }
        catch (Exception ex)
        {
            lblUploadResults.Text = ex.Message;
        }
    }
}
private string ConvertToASCII(string source)
{
    if (source.IndexOf('\u2013') > -1) source = source.Replace('\u2013', '-');
    if (source.IndexOf('\u2014') > -1) source = source.Replace('\u2014', '-');
    if (source.IndexOf('\u2015') > -1) source = source.Replace('\u2015', '-');
    if (source.IndexOf('\u2017') > -1) source = source.Replace('\u2017', '_');
    if (source.IndexOf('\u2018') > -1) source = source.Replace('\u2018', '\'');
    if (source.IndexOf('\u2019') > -1) source = source.Replace('\u2019', '\'');
    if (source.IndexOf('\u201a') > -1) source = source.Replace('\u201a', ',');
    if (source.IndexOf('\u201b') > -1) source = source.Replace('\u201b', '\'');
    if (source.IndexOf('\u201c') > -1) source = source.Replace('\u201c', '\"');
    if (source.IndexOf('\u201d') > -1) source = source.Replace('\u201d', '\"');
    if (source.IndexOf('\u201e') > -1) source = source.Replace('\u201e', '\"');
    if (source.IndexOf('\u2026') > -1) source = source.Replace("\u2026", "...");
    if (source.IndexOf('\u2032') > -1) source = source.Replace('\u2032', '\'');
    if (source.IndexOf('\u2033') > -1) source = source.Replace('\u2033', '\"');
    byte[] sourceBytes = Encoding.Unicode.GetBytes(source);
    byte[] targetBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, sourceBytes);
    char[] asciiChars = new char[Encoding.ASCII.GetCharCount(targetBytes, 0, targetBytes.Length)];
    Encoding.ASCII.GetChars(targetBytes, 0, targetBytes.Length, asciiChars, 0);
    string result = new string(asciiChars);
    return result;
}
Also, as I said before, there are some more "transparent" characters that seem to correspond to where the Word doc has numbered indentation, and I have no idea how to capture their Unicode values in order to replace them... so if you have any tips, please let me know.
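(For that last point, one way to see exactly which code points those "transparent" characters are is a small diagnostic helper along the lines of the sketch below. It is only a sketch: the method name ListNonAsciiCharacters is made up here, and it assumes it sits in the same code-behind as btnUpload_Click so it can be called with the sowText string, with using directives for System.Collections.Generic and System.Text in place.)

// Hypothetical helper, not part of the original code: lists each distinct
// non-ASCII character in a string with its Unicode code point and count,
// so the "invisible" characters can be identified and added to the
// replacement table in ConvertToASCII.
private string ListNonAsciiCharacters(string text)
{
    var counts = new Dictionary<char, int>();
    foreach (char c in text)
    {
        if (c > 127) // anything outside 7-bit ASCII
        {
            int n;
            counts.TryGetValue(c, out n);
            counts[c] = n + 1;
        }
    }
    var report = new StringBuilder();
    foreach (KeyValuePair<char, int> pair in counts)
    {
        report.AppendFormat("U+{0:X4} x{1}\n", (int)pair.Key, pair.Value);
    }
    return report.ToString();
}

// Example debugging usage inside btnUpload_Click:
// lblUploadResults.Text = ListNonAsciiCharacters(sowText);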
Thanks a lot in advance!!
2 Answers
According to the StreamReader documentation on MSDN, the StreamReader(Stream, Encoding, Boolean) constructor overload has a final detectEncodingFromByteOrderMarks parameter that tells it to look for a byte order mark at the beginning of the stream.
Therefore, if your uploaded file's charset is windows-1252, then the line that creates your StreamReader with Encoding.Unicode is incorrect, as the file content is not Unicode (UTF-16) encoded. Instead, construct the StreamReader with the windows-1252 encoding and use that final boolean parameter to detect the BOM; see the sketch below.
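A minimal sketch of that suggested replacement, assuming the same uplSOWDoc FileUpload control from the question (Encoding.GetEncoding(1252) looks up the windows-1252 code page):

// Read the uploaded stream as windows-1252; the trailing true lets a
// byte order mark, if one is present, override that encoding choice.
sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.GetEncoding(1252), true);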
Congratulations, you are the one millionth coder to get bitten by “Encoding.Unicode”.
There is no such thing as the “Unicode encoding”. Unicode is the character set, which has many different encodings.
Encoding.Unicode is actually the specific encoding UTF-16LE, in which characters are encoded as UTF-16 “code units” and then each 16-bit code unit is written to bytes in a little-endian order. This is the native in-memory Unicode string format for Windows NT, but you almost never want to use it for reading or writing files. Being a 2-byte-per-unit encoding, it isn't ASCII-compatible, and it's not very efficient for storage or on the wire.
These days UTF-8 is a much more common encoding used for Unicode text. But Microsoft's misnaming of UTF-16LE as “Unicode” continues to confuse and fool users who just want to “support Unicode”. As Encoding.Unicode is a non-ASCII-compatible encoding, trying to read files in an ASCII-superset encoding (such as UTF-8 or a Windows default code page like 1252 Western European) will make an enormous illegible mess of everything, not just the non-ASCII characters.
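To make the ASCII incompatibility concrete, here is a small illustrative snippet (not part of the original answer) that dumps the bytes a two-character string produces under each encoding:

using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // UTF-16LE ("Encoding.Unicode") spends two bytes per code unit,
        // while UTF-8 stays byte-for-byte identical to ASCII for these characters.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes("Hi"))); // 48-00-69-00
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("Hi")));    // 48-69
    }
}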
In this case the encoding your file is stored in is Windows code page 1252, so read it with a StreamReader created for that code page, along the lines of the sketch below.
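A sketch of that reader, again assuming the uplSOWDoc control from the question:

// windows-1252 is the code page the Word-exported HTML was saved in.
sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.GetEncoding(1252));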
I'd leave it at that. Don't bother trying to “convert to ASCII”. Those smart quotes are perfectly good characters and should be supported like any other Unicode character; if you are having problems displaying smart quotes you are probably mangling all other non-ASCII characters too. Best fix the problem that's causing that to happen, rather than try to avoid it for just a few common cases.