从剪贴板(从 Excel 粘贴)获取包含重音字符的 CSV 数据
场景
- 我的用户将从 Excel 复制单元格(从而将其放入剪贴板)
- 我的应用程序将从剪贴板检索这些单元格
问题
- 我的代码从剪贴板检索 CSV 格式
- 但是,如果原始 Excel 内容包含类似 ä ( a 带变音符号)然后检索到的 CSV 字符串没有正确的字符(ä 最终对我来说显示为“正方形”)
- 相比之下,如果我的代码从剪贴板检索 Unicode 文本格式,则一切正常: ä 被保留在从剪贴板检索到的字符串中
源代码 - 原始 - 问题
[STAThread]
static void Main(string[] args)
{
var fmt_csv = System.Windows.Forms.DataFormats.CommaSeparatedValue;
// read the CSV
var dataobject = System.Windows.Forms.Clipboard.GetDataObject();
var stream = (System.IO.Stream)dataobject.GetData(fmt_csv);
var enc = new System.Text.UTF8Encoding();
var reader = new System.IO.StreamReader(stream,enc);
string data_csv = reader.ReadToEnd();
// read the unicode string
string data_string = System.Windows.Forms.Clipboard.GetText();
}
所在 运行示例代码时的结果
- 重现步骤: 在 Excel 中输入一些文本(我使用了单词“doppelgänger”加上一些数字),然后只需按 Ctrl-C 进行复制将其复制到剪贴板,然后运行上面的代码。
- data_csv 设置为“doppelgänger,1\r\n2,3\r\n\0”
- data_string 设置为“doppelgänger\t1\r\n2\t3\r\n”
问题
- 我该怎么做才能获得正确的字符?
评论
- 是的,我知道我可以通过使用 Unicode 文本来解决这个问题。 但我实际上想了解在
- 检索流时使用或不使用 UTF-8 编码的 CSV 发生了什么对结果没有影响
答案
在查看评论并密切关注 Excel 的内容之后在 CSV 的剪贴板中,Excel 可能会使用“旧版”编码而不是 UTF-8 来放置内容,这似乎是合理的。 所以我尝试使用 Windows 1252 代码页作为编码,它起作用了。 请参阅下面的代码
源代码 - 带答案
[STAThread]
static void Main(string[] args)
{
var fmt_csv = System.Windows.Forms.DataFormats.CommaSeparatedValue;
//read the CSV
var dataobject = System.Windows.Forms.Clipboard.GetDataObject();
var stream = (System.IO.Stream)dataobject.GetData(fmt_csv);
var enc = System.Text.Encoding.GetEncoding(1252);
var reader = new System.IO.StreamReader(stream,enc);
string data_csv= reader.ReadToEnd();
//read the Unicode String
string data_string = System.Windows.Forms.Clipboard.GetText();
}
SCENARIO
- My users will copy cells from Excel (thus placing it into the clipboard)
- And my application will retrieve those cells from the clipboard
THE PROBLEM
- My code retrieves the CSV format from the clipboard
- However, the if the original Excel content contains characters like ä (a with umlaut) then retrieved CSV string doesn't have the correct characters (ä ends up showing as a "square" for me)
- In comparison, if my code retrieves the Unicode text format from the clipboard everything works fine: the ä is preserved in the string retrieved from the clipboard
SOURCE CODE - ORIGINAL - WITH THE PROBLEM
[STAThread]
static void Main(string[] args)
{
var fmt_csv = System.Windows.Forms.DataFormats.CommaSeparatedValue;
// read the CSV
var dataobject = System.Windows.Forms.Clipboard.GetDataObject();
var stream = (System.IO.Stream)dataobject.GetData(fmt_csv);
var enc = new System.Text.UTF8Encoding();
var reader = new System.IO.StreamReader(stream,enc);
string data_csv = reader.ReadToEnd();
// read the unicode string
string data_string = System.Windows.Forms.Clipboard.GetText();
}
THE RESULTS WHEN RUNNING THE SAMPLE CODE
- Repro steps: Enter some text in Excel (I used the word "doppelgänger" plus some numbers) and simply hit Ctrl-C to copy it to the clipboard and then run the code above.
- data_csv is set to "doppelg�nger,1\r\n2,3\r\n\0"
- data_string is set to "doppelgänger\t1\r\n2\t3\r\n"
QUESTION
- What can I do to get the correct characters?
COMMENTS
- Yes, I know I could workaround this problem by using the Unicode text. But I actually want to understand what is going on with the CSV
- using or not using the UTF-8 encoding when retrieving the stream makes no difference in the results
THE ANSWER
After looking at the comments, and paying close attention to what Excel was putting on the clipboard for CSV, it seemed reasonable that Excel might be placing the contents using an "legacy" encoding instead of UTF-8. So I tried the using the Windows 1252 codepage as the encoding and it worked. See the code below
SOURCE CODE - WITH THE ANSWER
[STAThread]
static void Main(string[] args)
{
var fmt_csv = System.Windows.Forms.DataFormats.CommaSeparatedValue;
//read the CSV
var dataobject = System.Windows.Forms.Clipboard.GetDataObject();
var stream = (System.IO.Stream)dataobject.GetData(fmt_csv);
var enc = System.Text.Encoding.GetEncoding(1252);
var reader = new System.IO.StreamReader(stream,enc);
string data_csv= reader.ReadToEnd();
//read the Unicode String
string data_string = System.Windows.Forms.Clipboard.GetText();
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
Excel 使用 Unicode 字符编码将字符串存储在剪贴板上。 当您尝试读取 ANSI 格式的字符串时出现正方形的原因是系统的 ANSI 代码页中没有该字符的表示形式。 你应该只使用 Unicode。 如果您要处理本地化问题,那么 ANSI 只会带来更多的麻烦,而不是它的价值。
编辑: Joel Spolsky 写了一篇关于字符编码的精彩介绍,绝对值得一看:每个软件开发人员绝对必须了解 Unicode 和字符集的绝对最低限度(没有任何借口!)
Excel stores the string on the clipboard using the Unicode character encoding. The reason you get a square when you try to read the string in ANSI is that there is no representation for that character in your system's ANSI codepage. You should just use Unicode. If you're going to be dealing with localization issues, then ANSI is just more trouble than it's worth.
Edit: Joel Spolsky wrote an excellent introduction to character encodings, which is definitely worth checking out: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
您将流编码为 UTF8 不起作用。 元音变音的字节正在转换为“替换字符”unicode 字符。
相反,只需查看流的数据,无需任何额外的编码指令。 数据将采用 Excel 使用的某种设定格式。 您应该能够通过查看字节来判断 unlaut 所在的位置。 然后您应该能够将其转换为 UTF-8。
最坏的情况是 CSV 格式化程序丢弃所有非 Ascii 字符。 在这种情况下,您也许可以编写自己的数据格式化程序。
在某些情况下,Excel 人员认为 CSV 仅表示 Ascii。
请参阅 http://www .tech-archive.net/Archive/Excel/microsoft.public.excel.misc/2008-07/msg02270.html
Your encoding of the stream as UTF8 is not working. The bytes for the umlaut are being converted into the "replacement character" unicode character.
Instead, just look at the stream's data without any extra encoding instructions. The data will be in some set format used by Excel. You should be able to tell by looking at the byte(s) where the unlaut is. You should then be able to convert it to UTF-8.
Worst case is if the CSV Formatter throws out everything that is not Ascii. In that case, you might be able to write your own Data formatter.
In some cases, the Excel folks have figured that CSV means Ascii only.
See http://www.tech-archive.net/Archive/Excel/microsoft.public.excel.misc/2008-07/msg02270.html