Stripping ASCII control characters from Web service output

Posted 2024-11-14 23:58:33

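I have been having some difficulty with text I receive from a Web service I recently started consuming. The Web service sends back XML, which is fine, but we are getting ASCII control characters in the middle of some of the XML. I wanted to paste an example into this post, but because the characters are invalid I can't even paste them into this text area.

I spent some time researching what to do in these cases and found this informative article: http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/. Here is the relevant quote from the article:

"These aren't characters that have any business being in XML data; they're illegal characters that should be removed..."

So, following the article's advice, I wrote some code that takes the raw output from this service and strips out any control character (other than space, tab, CR, or LF).

Here is that code: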

// Requires: using System.Linq;
System.Net.WebClient client = new System.Net.WebClient();

// All C0 control bytes except tab (0x9), LF (0xA) and CR (0xD), plus DEL (0x7F).
byte[] invalidCharacters = { 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB,
                             0xC, 0xE, 0xF, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15,
                             0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D,
                             0x1E, 0x1F, 0x7F };

// Drop the listed bytes from the raw response before decoding it.
byte[] sanitizedResponse = (from a in client.DownloadData(url)
                            where !invalidCharacters.Contains(a)
                            select a).ToArray();

result = System.Text.Encoding.UTF8.GetString(sanitizedResponse);

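This got me thinking, though. If I receive double-byte characters, will I corrupt any of the data I get back? Is it valid in some code pages for a double-byte character to be made up of one or two bytes that, taken on their own, are ASCII control characters? The article's claim that these characters have "no business" being in XML data sounds final, but I want a second opinion.

I'd appreciate any feedback.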

相守太难 2024-11-21 23:58:33

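Well, the code you've shown assumes UTF-8, which by design never uses bytes below 0x80 inside a multi-byte sequence, so a control byte in the data can only be that control character itself. However, I'd encourage a text-driven approach rather than this byte-driven one: I'd probably use DownloadString instead of DownloadData (relying on WebClient to pick the right encoding), and then scrub the data with a regular expression before parsing it.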
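The UTF-8 property this answer relies on can be checked directly. The following is only a small sketch: the class and method names (`Utf8ByteCheck`, `HasControlRangeByte`) and the sample strings are made up for illustration. It also shows why the same byte filter would be unsafe if the response were in an encoding such as UTF-16, where ordinary characters contain 0x00 bytes.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8ByteCheck
{
    // Hypothetical helper: true if the encoded form of `text`
    // contains any byte in the C0 control range (below 0x20).
    public static bool HasControlRangeByte(Encoding encoding, string text)
    {
        return encoding.GetBytes(text).Any(b => b < 0x20);
    }

    public static void Main()
    {
        // Non-ASCII characters only: in UTF-8 every byte of a
        // multi-byte sequence is 0x80 or higher.
        string nonAscii = "\u6587\u6C5F\u00E9\u20AC";
        Console.WriteLine(HasControlRangeByte(Encoding.UTF8, nonAscii));    // False

        // A genuine control character still encodes as the single byte 0x01.
        Console.WriteLine(HasControlRangeByte(Encoding.UTF8, "\u0001"));    // True

        // UTF-16LE encodes 'A' as 0x41 0x00, so a byte-level filter
        // would remove half of a perfectly valid character.
        Console.WriteLine(HasControlRangeByte(Encoding.Unicode, "A"));      // True
    }
}
```

So filtering bytes is safe only as long as the response really is UTF-8 (or a single-byte ASCII-compatible encoding); filtering decoded text avoids having to make that assumption.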

I would also contact the Web service provider to explain that they are serving invalid XML...
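The text-driven approach suggested in this answer might look roughly like the sketch below. It is not the answerer's code: the `XmlTextScrubber` class, the `Scrub` method, and the sample input are assumptions for illustration. The pattern covers only the C0 control characters that XML 1.0 forbids; it leaves tab, LF, and CR alone, and it does not touch DEL (0x7F), which the question's byte list removes but XML 1.0 permits.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlTextScrubber
{
    // C0 control characters that XML 1.0 does not allow.
    // Tab (0x09), LF (0x0A) and CR (0x0D) are deliberately excluded.
    private static readonly Regex IllegalControls =
        new Regex(@"[\x00-\x08\x0B\x0C\x0E-\x1F]", RegexOptions.Compiled);

    // Hypothetical helper: removes the forbidden characters from decoded text.
    public static string Scrub(string xmlText)
    {
        return IllegalControls.Replace(xmlText, string.Empty);
    }

    public static void Main()
    {
        // In the question's setting the input would come from
        // WebClient.DownloadString(url); a literal stands in for it here.
        string raw = "<a>bad\u0001\u001Fdata\tok\r\n</a>";
        Console.WriteLine(Scrub(raw) == "<a>baddata\tok\r\n</a>"); // True
    }
}
```

Because the scrub runs on a decoded string, it does not depend on how the characters were laid out as bytes on the wire.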

记忆里有你的影子 2024-11-21 23:58:33

Try the following:

byte[] byteArray = Encoding.ASCII.GetBytes( test ); 
MemoryStream stream = new MemoryStream( byteArray );    
stream.Position = 0;
StreamReader reader = new StreamReader( stream );            
string text = reader.ReadToEnd();
