How to prevent illegal characters from appearing in XML when retrieving it from SQL Server

Posted 2024-09-16 05:59:20 · 251 words · 15 views · 0 comments

Sometimes the string values of Properties in my Classes become odd. They contain illegal characters and are displayed like this (with boxes):

123[]45[]6789

I'm assuming those are illegal/unrecognized characters. I serialize all my objects to XML and then upload them via Web Service. When I retrieve them again, some characters are replaced with oddities. This happens most often with hyphens and dashes that have been typed using Word. Is that the cause of it?

Is there any way I can check whether the string contains any of these unrecognized characters, via regex or something?

Comments (5)

我们只是彼此的过ke 2024-09-23 05:59:20

The first thing to remember is that there is no such thing as a "special character" or an "illegal character". There are characters that are special in certain circumstances, and there are non-characters, but there are no generally "special" or "illegal" characters.

What you have here is either:

  1. Perfectly normal characters for which your font doesn't have a glyph.
  2. Perfectly normal characters that aren't printable (e.g. control characters).
  3. An artefact of how the debugger works.

The first thing is to find out what that character is. Find the integer value of the character, and then look it up.
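A minimal sketch of that lookup step in C#. The class and method names (`CodePointDump`, `Describe`) are made up for illustration and are not from the original post. It prints the numeric value of every UTF-16 code unit, so a boxed character can be looked up in a Unicode chart:

```csharp
using System.Collections.Generic;

public static class CodePointDump
{
    // Returns one "U+XXXX" entry per UTF-16 code unit in the input.
    // Characters outside the BMP show up as two surrogate values (D800-DFFF).
    public static IEnumerable<string> Describe(string text)
    {
        foreach (char c in text)
        {
            yield return $"U+{(int)c:X4}";
        }
    }
}
```

For example, `string.Join(" ", CodePointDump.Describe(suspectString))` gives a line you can paste into a bug report, where `suspectString` is whatever property value is showing boxes.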

An important one to look out for is U+FFFD (�), which a decoder sometimes emits when it has received a bunch of bytes that make no sense in the encoding it is trying to use. For example, 0x80 followed by 0x20 makes no sense in UTF-8, and one possible response is to use U+FFFD as a "something strange here" marker. Other possible responses are throwing an error, silently ignoring the error, or trying to guess at the intent, though those last two bring security issues.
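To see both decoder behaviours from .NET, here is a small sketch (the `DecoderDemo` class is hypothetical). It assumes the default `Encoding.UTF8` instance, which substitutes U+FFFD for invalid bytes, and a `UTF8Encoding` constructed with `throwOnInvalidBytes: true`, which throws instead:

```csharp
using System.Text;

public static class DecoderDemo
{
    // Lenient decoding: invalid byte sequences are replaced with U+FFFD.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Strict decoding: invalid byte sequences raise DecoderFallbackException.
    public static string DecodeStrict(byte[] bytes)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        return strict.GetString(bytes);
    }
}
```

Feeding the bytes 0x80 0x20 to `DecodeLenient` yields U+FFFD followed by a space, while `DecodeStrict` throws. The strict variant is usually the more helpful one while hunting for where the data goes wrong.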

Once you've figured this out, you can begin to reason about why it's getting in there if it isn't expected. Could it be an encoding issue (the charset written is not the charset read)? Could it actually be intended to be there? Could it be something else? You can't begin to answer that until you have more information on the bug.

Finally, there's the matter of what to do about it. This will hopefully be obvious from the answers you've found in your research above. Possibly the answer will be "nothing, it's fine", possibly something simple, possibly something hard. Can't say yet.

Do not just filter with a regular expression. Maybe that will turn out to be the correct solution, but you don't know yet, so maybe you're making a deeper bug harder to find than it is now, or damaging perfectly good data.

墨落画卷 2024-09-23 05:59:20

Personally I don't think using a Regex to check for these characters is the correct solution. If you aren't storing those characters then there is obviously some sort of encoding issue.

Verify that the XML document itself is stored using an encoding that supports the characters you need to store. Then verify that when you read the file in, you use the same encoding as the document, i.e. if your XML document is stored as UTF-8, you need to make sure you decode it as UTF-8 when you read it.
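As an illustration of what such a mismatch does to a Word-style dash, here is a sketch (the `MismatchDemo` class is hypothetical, and ISO-8859-1 is just one plausible "wrong" encoding, not necessarily the one in the asker's pipeline). An em dash written as UTF-8 and read back as ISO-8859-1 turns into three characters, two of which are unprintable control characters that many fonts render as boxes:

```csharp
using System.Text;

public static class MismatchDemo
{
    // Encodes text as UTF-8, then (incorrectly) decodes the bytes as ISO-8859-1.
    public static string WriteUtf8ReadLatin1(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Encodes and decodes with the same encoding: the text survives unchanged.
    public static string WriteUtf8ReadUtf8(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }
}
```

With the input "\u2014", the mismatched path returns U+00E2, U+0080, U+0094, while the matched path returns the original dash.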

守不住的情 2024-09-23 05:59:20

Take a deeper look at the characters themselves: what are the actual char values?

When a character shows up as a square, it means it can't be represented visually. This is either because it's a non-visual character, or because it's outside of your current char set.

edit, nope

In your example I'd venture a guess that you're seeing embedded newline characters.

最初的梦 2024-09-23 05:59:20

Define the allowed characters and block everything else, i.e.:

// only lowercase letters and digits
if(Regex.IsMatch(yourString, @"^[a-z0-9]*$"))
{
    // allowed
}

But I think your problem may lie somewhere else, because you say it comes from serializing (valid) strings and then deserializing (invalid) strings. It is possible that you use default serialization and that you don't apply a proper ISerializable implementation for your classes (or proper use of the Serializable attributes), resulting in properties or fields being serialized that you don't want to be serialized.

PS: others have mentioned encoding issues, which is a possible cause and might mean you cannot read back the data at all. About encoding there's one simple rule: use the same encoding everywhere (streams, database, xml) and be specific. If you are not, the default encoding is used, which can be different from system to system.


Edit: possible solution

Based on new information (see the thread under the original question), it is pretty clear that the issue has to do with encoding. The OP mentions that it appears with dashes, which are often replaced with pretty dashes like "—" (the em dash, U+2014) when typed in some fancy editing environment. Since it seems unclear how to fix SQL Server to accept properly encoded strings, you can also solve this in your XML.

When you create your XML, simply change the encoding to the most basic possible (US-ASCII). This will automatically force the XML writer to use the proper numerical entities. When you deserialize, this will be properly parsed in your strings without further ado. Something along these lines:

Stream stream = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.ASCII;
XmlWriter writer = XmlWriter.Create(stream, settings);
// make sure to output the xml-prolog header

But beware of writing to a StringBuilder or StringWriter: these are fixed to UTF-16, so the XmlWriter will always write in that encoding (more info on that issue at my blog), which is not compatible with SQL Server.

Note: when using the ASCII encoding, any character higher than 0x7F will be written as a numeric character reference. So é will look like &#xE9; and the dash may look like &#x2014;, but this means just the same and you should not worry about it. Every XML-capable tool will properly interpret this input.
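A fuller, self-contained sketch of the ASCII approach, under some assumptions: the class `AsciiXmlDemo`, its method names, and the element name `value` are invented here for illustration, and a real service would write its own serialized objects rather than a single element. It writes a string through an ASCII-configured XmlWriter and reads it back to show that the text survives the round trip:

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class AsciiXmlDemo
{
    // Writes <value>...</value> with an ASCII-encoded XmlWriter.
    // Characters above 0x7F come out as numeric character references.
    public static string WriteAsAscii(string value)
    {
        using (var stream = new MemoryStream())
        {
            var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument(); // emits the XML prolog
                writer.WriteElementString("value", value);
                writer.WriteEndDocument();
            }
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }

    // Parses the XML again and returns the text of the <value> element.
    public static string ReadBack(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.ReadToFollowing("value");
            return reader.ReadElementContentAsString();
        }
    }
}
```

`ReadBack(WriteAsAscii("caf\u00E9 \u2014 ok"))` should return the original string, while the intermediate XML contains only 7-bit characters, so it no longer matters which 8-bit encoding any later stage assumes.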

Note 2: the location where you want to change the way XML is written is the Web Service you talk of, that receives XML and then stores it into the SQL Server database. Before storing into SQL Server, the change must be applied. Earlier on in the chain is useless.

奢望 2024-09-23 05:59:20
// Requires: using System.IO; using System.Xml; using System.Xml.Serialization;
public static T DeserializeFromXml<T>(string xml)
{
    T result;
    XmlSerializerFactory serializerFactory = new XmlSerializerFactory();
    XmlSerializer serializer = serializerFactory.CreateSerializer(typeof(T));

    using (StringReader sr3 = new StringReader(xml))
    {
        XmlReaderSettings settings = new XmlReaderSettings()
        {
            // Default is true; false turns off the reader's XML character checking.
            CheckCharacters = false
        };

        using (XmlReader xr3 = XmlReader.Create(sr3, settings))
        {
            result = (T)serializer.Deserialize(xr3);
        }
    }

    return result;
}
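Turning off CheckCharacters makes the reader tolerate characters it would otherwise reject, but it does not tell you which ones they are. If you would rather locate them, which is what the question asks for, one possible sketch uses `XmlConvert.IsXmlChar`. The class and method names below are hypothetical, and this only flags characters that XML 1.0 disallows; it will not flag a perfectly valid character that was merely decoded with the wrong encoding:

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XmlCharCheck
{
    // Returns the indexes of characters that are not allowed in XML 1.0.
    public static List<int> FindInvalidXmlChars(string text)
    {
        var positions = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];

            // A well-formed surrogate pair encodes one valid character; skip both halves.
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                i++;
                continue;
            }

            if (!XmlConvert.IsXmlChar(c))
            {
                positions.Add(i);
            }
        }
        return positions;
    }
}
```

For instance, a string containing the control character U+0001 at index 1 is reported at that index, while an em dash or a newline is accepted because both are legal in XML.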