How can I tell different encodings, serialization formats, etc. apart?

Posted on 2024-10-03 16:07:02

There are all kinds of decoders for data formats such as Base64, the ASP EventValidation object, XML serialization, and so on. Is there a simple test I can do to tell which format I'm looking at?

For example, I have a string here that is part of a CGI-based web form. It's obviously hex (the full size is 5 KB): 52616e646f6d49567ef61b360522ae5ae69064f0ecb664a831c4196dad319215013aa8d04726b5d54ed673dad2004726c35e66d8b19c5177a331b24988f3cf11871084f6cc9ff808baf5cdee83f031a56dc42b65ee5309f1f1

I have no idea what that is. Converting the hex to ASCII just gives me more nonsense like Ra_d__IVo6"Odd1_1/G&?sG&OfQw1I1_eS, and it's obviously not a Base64 string...

The question is basically: is there a method other than looking at the different types, trying each one, and guessing?

edit:
I think this string is encrypted data, based on the prepended 52616e646f6d4956, but my question isn't what this particular string is; rather, it's how I can tell these things easily.
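As a quick illustration of why that prefix stands out, here is a minimal sketch that decodes just the eight prepended bytes quoted above as ASCII. The helper name `decode_hex_prefix` is made up for this example; the only library call is the standard `bytes.fromhex`.

```python
def decode_hex_prefix(hex_text: str) -> str:
    """Decode a hex string to ASCII text (hypothetical helper for illustration)."""
    raw = bytes.fromhex(hex_text)
    # "strict" ASCII decoding raises if any byte is outside 0x00-0x7F,
    # which is itself a useful signal that the data is not plain text.
    return raw.decode("ascii")


# The eight bytes the asker identified at the start of the blob.
prefix = decode_hex_prefix("52616e646f6d4956")
print(prefix)  # RandomIV
```

A readable marker followed by bytes that do not decode as text is at least consistent with the asker's guess that a header precedes encrypted data, though on its own it does not prove it.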

Comments (2)

半世蒼涼 2024-10-10 16:07:02

You can develop your own heuristic algorithm, similar to a virus scanner. It won't work 100% of the time, but it should improve over time. For example, you could take the string, note that it contains only characters from the hex alphabet, and flag it as possibly encrypted, zipped, or anything else that is commonly represented in hex.
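The first step of such a heuristic could look like the sketch below. It is only an assumption about how one might implement the alphabet check: the function name `candidate_encodings` and the two regular expressions are invented for this example, and they cover just hex and standard Base64.

```python
import re

# Alphabets for two common text encodings of binary data.
HEX_RE = re.compile(r"[0-9a-fA-F]+")
B64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def candidate_encodings(text: str) -> list[str]:
    """Return the encodings whose alphabet and length rules the text satisfies."""
    text = text.strip()
    candidates = []
    # Hex needs two digits per byte, so the length must be even.
    if HEX_RE.fullmatch(text) and len(text) % 2 == 0:
        candidates.append("hex")
    # Padded Base64 comes in groups of four characters.
    if B64_RE.fullmatch(text) and len(text) % 4 == 0:
        candidates.append("base64")
    return candidates


print(candidate_encodings("52616e646f6d4956"))  # ['hex', 'base64']
print(candidate_encodings("SGVsbG8="))          # ['base64']
print(candidate_encodings("hello world!"))      # []
```

Note that the hex string also passes the Base64 test, because the hex alphabet is a subset of the Base64 alphabet. A check like this only narrows the candidates; it cannot pick a winner.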

You could extend the heuristic to try N different encodings and perform word counts on each result. This could help narrow down the possible encodings, but in the simple case of, say, the standard English alphabet, there's plenty of overlap across encoding tables, so you will certainly get false positives. Still, as long as the text doesn't contain characters outside that overlap, you should get readable content either way.
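A rough sketch of the "try N encodings and count words" idea follows. The word list, the candidate encodings, and the names `score_decoding` and `rank_encodings` are all assumptions chosen for illustration; a real tool would want a much larger dictionary.

```python
# Tiny, hypothetical word list; a real scorer would use a proper dictionary.
COMMON_WORDS = {"the", "and", "is", "of", "to", "in", "a"}


def score_decoding(raw: bytes, encoding: str) -> int:
    """Count common English words after decoding; -1 if decoding fails."""
    try:
        text = raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return -1
    words = text.lower().split()
    return sum(1 for w in words if w.strip(".,!?") in COMMON_WORDS)


def rank_encodings(raw: bytes,
                   encodings=("utf-8", "utf-16-le", "latin-1", "cp1252")):
    """Return (encoding, score) pairs, best score first."""
    scores = {enc: score_decoding(raw, enc) for enc in encodings}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# UTF-16 data: only one candidate produces recognizable words.
print(rank_encodings("the cat is in the hat".encode("utf-16-le")))

# Plain ASCII data: several candidates tie, which is the overlap problem.
print(rank_encodings(b"the cat is in the hat"))
```

In the second call, utf-8, latin-1, and cp1252 all score the same because they agree on the ASCII range, so the scorer cannot distinguish them, yet the decoded text is readable whichever one you pick.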

As Marc pointed out, not all content is necessarily readable content. Pictures, zip files, and plenty of other data will turn into pure nonsense when run through a character encoding table. But even items like these can contain consistent data that the heuristic can detect.
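One kind of consistent data in binary files is a fixed signature at the start of the file. The sketch below checks a handful of well-known signatures; the table is deliberately short, and the name `sniff_magic` is made up for this example.

```python
# A few well-known file signatures ("magic numbers"); far from exhaustive.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
    (b"%PDF-", "pdf"),
]


def sniff_magic(raw: bytes):
    """Return a format name if the data starts with a known signature, else None."""
    for signature, name in MAGIC_SIGNATURES:
        if raw.startswith(signature):
            return name
    return None


print(sniff_magic(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))  # png
print(sniff_magic(b"PK\x03\x04rest-of-archive"))         # zip
print(sniff_magic(bytes.fromhex("52616e646f6d4956")))    # None
```

A match is only a hint, since short signatures such as the two gzip bytes can appear by chance, so a scanner would normally follow up by actually parsing the data.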

This topic can get pretty involved. Just look at the TCP protocol. One doesn't just fire packets across the internet expecting some magical interpretation of data on the client side. There are pre-defined rules (protocols) to define the way and type of data to be transmitted between the client/server. So, to directly answer your question regarding "guessing", you cannot be certain of the data you will receive nor of your interpretation, but you certainly can develop an application that is smarter than a "guess".

与酒说心事 2024-10-10 16:07:02

In the general case that will be hard. Obviously looking for the right character range helps spot things like base-64, but beyond that you'd need a lot of per-type logic. Anything text-based could itself use any Unicode/code-page encoding, for example.

XML and JSON are probably relatively easy to infer (guess based on the starting characters, then try running it through a parser/validator). Of course, non-XHTML HTML complicates matters.
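The "guess from the first characters, then confirm with a parser" approach might be sketched like this, using only the standard library. The function name `sniff_structured_text` is an assumption made for this example.

```python
import json
import xml.etree.ElementTree as ET


def sniff_structured_text(text: str):
    """Guess 'json' or 'xml' from the first character, then confirm by parsing."""
    stripped = text.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass
    if stripped.startswith("<"):
        try:
            ET.fromstring(stripped)
            return "xml"
        except ET.ParseError:
            pass
    return None


print(sniff_structured_text('{"k": [1, 2]}'))       # json
print(sniff_structured_text("<a><b/></a>"))         # xml
print(sniff_structured_text("<p>one<br>two</p>"))   # None
```

The last call shows the HTML complication: the unclosed `<br>` is fine for a browser but makes a strict XML parser reject the input.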

Binary forms are trickier and more numerous; could it be an image? Sound? Zip? Or a binary data format; protobuf perhaps? Or bespoke?

And what endianness are we in?
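To see why byte order matters, here is a small sketch that reads the same four bytes as an unsigned 32-bit integer both ways with the standard `struct` module. The helper name `read_u32_both` and the sample bytes are made up for illustration.

```python
import struct


def read_u32_both(raw: bytes) -> tuple[int, int]:
    """Interpret the first four bytes as (big-endian, little-endian) uint32."""
    big = struct.unpack(">I", raw[:4])[0]
    little = struct.unpack("<I", raw[:4])[0]
    return big, little


# Hypothetical length field taken from the start of some binary record.
big, little = read_u32_both(b"\x00\x00\x01\x00")
print(big, little)  # 256 65536
```

Nothing in the bytes themselves says which reading is right; a heuristic can only prefer the value that is plausible in context, for example a length that fits within the payload.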

Then: is the entire payload gzipped? Deflated? Encrypted?
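One way to probe for those wrappers is simply to try each decompressor and, if none works, measure how random the bytes look. The sketch below does that with the standard `gzip` and `zlib` modules; the names `try_decompress` and `shannon_entropy` are assumptions for this example, and high entropy only suggests compressed or encrypted data, it cannot tell the two apart.

```python
import gzip
import math
import zlib
from collections import Counter


def try_decompress(raw: bytes):
    """Try gzip, zlib, then raw deflate; return (name, data) or None."""
    attempts = (
        ("gzip", gzip.decompress),
        ("zlib", zlib.decompress),
        # Negative wbits tells zlib to expect a headerless deflate stream.
        ("raw-deflate", lambda d: zlib.decompress(d, -zlib.MAX_WBITS)),
    )
    for name, decompress in attempts:
        try:
            return name, decompress(raw)
        except (OSError, EOFError, zlib.error):
            continue
    return None


def shannon_entropy(raw: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not raw:
        return 0.0
    total = len(raw)
    counts = Counter(raw)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


payload = b"hello world " * 20
print(try_decompress(gzip.compress(payload))[0])  # gzip
print(try_decompress(zlib.compress(payload))[0])  # zlib
print(shannon_entropy(b"aaaa"))                   # 0.0 (may print as -0.0)
print(shannon_entropy(bytes(range(256))))         # 8.0
```

Raw deflate has no header, so that last attempt can occasionally "succeed" on unrelated data; treat its result with more suspicion than a gzip or zlib match.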

So yes; it can probably be done. Wireshark tries, for example. But it is a lot of work, with no magic shortcuts.
