Unicode in CSV files?
I need to generate a CSV file. Maybe I am 'doing it wrong' because I am dumping the file with my own code instead of using a library, but anyway.
It looks like I have everything right. Quotes, commas and everything else seem to be escaped correctly. It was rather easy. The problem is that I am testing with Unicode strings and they come out as ????. When I use MS Excel to save a file containing my test string as CSV and then reopen it, I get the same problem (the Unicode letters become ?????). Is Unicode not supported?
I just tried dumping the string to a file like this, instead of outputting it to a web page:
var f = new System.IO.StreamWriter(filename, false, System.Text.Encoding.Unicode);
Now I see the Unicode text, but everything is in one column. What's weird is that everything looks normal in my text editor of choice, and if I copy a few columns out, paste them in and save as .csv, I see the columns fine, although that probably strips the Unicode out.
How do I save this properly?
System.Text.Encoding.Unicode uses UTF-16 (little-endian) encoding. Try telling your text editor to decode with UTF-16; I'm guessing the editor you are using to display the output file is defaulting to UTF-8 or ASCII. If so, an alternative might be to encode the output with System.Text.Encoding.UTF8 instead.
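As a minimal sketch of the UTF-8 alternative, assuming the CSV lines are already escaped: the class name CsvUtf8Writer and its Write method are hypothetical names made up for this illustration, not anything from the original question. UTF8Encoding is constructed with the BOM flag set so that the file starts with the bytes EF BB BF, which many readers use as a hint that the content is UTF-8.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class CsvUtf8Writer
{
    // Writes already-escaped CSV lines as UTF-8 with a leading BOM (EF BB BF).
    public static void Write(string path, string[] lines)
    {
        // Passing true asks the encoding to emit the UTF-8 byte order mark.
        var encoding = new UTF8Encoding(true);

        // append: false overwrites any existing file, so the BOM lands at offset 0.
        using (var writer = new StreamWriter(path, false, encoding))
        {
            foreach (var line in lines)
            {
                writer.WriteLine(line);
            }
        }
    }
}
```

A call such as CsvUtf8Writer.Write("out.csv", new[] { "id,name", "1,Zoë" }) should produce a file that a UTF-8-aware editor displays correctly. Whether a given spreadsheet version then splits the columns on commas may still depend on its locale settings.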
You need to do two things: mark the text file (or HTML page) as containing Unicode characters (either UTF-8 or UTF-16), and make sure that you are using a text editor that supports Unicode text. Notepad is a good choice on Windows.
To mark a text file (such as .csv) as containing Unicode text, you need to write a Byte Order Mark (BOM) as the first character in the text file. For UTF-16 little-endian (Intel), the BOM is the bytes 0xFF, 0xFE. The Byte Order Mark tells the document reader whether the characters in the document are ordered as big-endian or little-endian. The BOM character is a reserved non-printing character in the Unicode character tables. The BOM can also be used to distinguish ASCII text from UTF-8 and other Unicode encodings, because the UTF-8 BOM byte sequence (0xEF, 0xBB, 0xBF) is different from the UTF-16 ones.
Some document writers will write the BOM for you, or have an option to include or exclude the BOM. Use a binary hex dump to view the text file bytes to determine whether you have a BOM or not. Do not use a text editor: the BOM is a non-display character.
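If no hex dump tool is at hand, the same check can be sketched in a few lines of code. This is only an illustration: BomSniffer and Detect are hypothetical names, and the sketch looks at the UTF-8 and UTF-16 marks only (it does not try to tell UTF-32 apart from UTF-16 little-endian).

```csharp
using System.IO;

// Hypothetical helper for illustration only.
public static class BomSniffer
{
    // Returns a label for the byte order mark at the start of the file,
    // or "none" when the first bytes match no mark checked here.
    public static string Detect(string path)
    {
        var head = new byte[3];
        int read;
        using (var stream = File.OpenRead(path))
        {
            // For a local file this normally returns up to three bytes at once.
            read = stream.Read(head, 0, head.Length);
        }

        if (read >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        {
            return "UTF-8";
        }
        if (read >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        {
            return "UTF-16 LE"; // assumption: UTF-32 LE is not distinguished
        }
        if (read >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        {
            return "UTF-16 BE";
        }
        return "none";
    }
}
```

Calling BomSniffer.Detect on the generated .csv shows which mark, if any, the writer actually produced.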
To indicate that an HTML page you are generating contains Unicode characters, you need to set the Content-Type header to indicate a Unicode charset:
Content-Type: text/html; charset=utf-8
indicates UTF-8 encoded Unicode text, for example.
It could also just be that the font Word is using is missing the characters you are trying to display. If I open Word, hold ALT and mash my numpad, it changes the font to a math font, but still displays the missing-character glyph from the font in question.
I ran into something similar.
When I used BCP with the -w option to output Unicode (UTF-16),
every row appeared as a single column when opened in Excel.
I found this post: Opening CSV file with UTF16 encoding in Excel
They mention a "Tabulator" (never heard of that), but I think they mean the tab character "\t".
For BCP I removed the "-t" parameter, so it would default to "\t" (tab) as the delimiter.
The "CSV" file is now tab-delimited, but it opens in Excel with the correct number of columns.
I can't back this up with documentation, but it looks more like a bug in Excel than a feature.
Maybe the CSV standard only supports commas in UTF-8 and, for whatever reason,
parsers like Excel expect tabs instead when the file is UTF-16.
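Applied to the StreamWriter code from the question, the workaround described above could be sketched as follows. ExcelUnicodeText and Write are hypothetical names for this illustration, and the sketch assumes that no field contains a tab or a line break, since it does no quoting or escaping at all.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class ExcelUnicodeText
{
    // Writes rows as tab-delimited UTF-16 little-endian text with a BOM (FF FE).
    // Assumption: fields contain no tabs or line breaks, so nothing is escaped.
    public static void Write(string path, string[][] rows)
    {
        // Encoding.Unicode is UTF-16 LE and emits the BOM at the start of a new file.
        using (var writer = new StreamWriter(path, false, Encoding.Unicode))
        {
            foreach (var row in rows)
            {
                writer.WriteLine(string.Join("\t", row));
            }
        }
    }
}
```

Going by the behaviour reported in this thread, a file written this way should open in Excel with one column per field, although that has not been checked against Excel documentation here.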