Compressing ASCII data to fit within a UTF-32 API?
I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress and obfuscate (or encrypt) the string values that will be persisted as Unicode.
My goal is either to compress this schema data or to encrypt it to keep it from prying eyes. I don't think it's possible to do both well.
Given that I want to restrict my source data to valid, printable ASCII, how can I "compress" the original string value into a value that is smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
- This source code will take a given String as input
- The bytes representation of that string will be taken (UTF8, ASCII, you decide)
- Some magic happens - (this is the part I need your help on)
- The resulting bytes will be converted into an int or long (no decimal points)
- The number will be converted into corresponding characters using this utility: http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(Note that the utility will be used to enforce the constraint that the "final" Unicode name must not include any of the characters '/', '\', '#', '?' or '%'.)
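Steps 2 to 4 above can be sketched in a few lines. This is only an illustration under assumptions: `zlib` stands in for the unspecified "magic" step, and the names `to_number` and `from_number` are hypothetical. A leading `0x01` sentinel byte is an added detail so that the integer round-trips even if the transformed bytes start with zeros.

```python
import zlib

def to_number(name: str) -> int:
    raw = name.encode("ascii")        # step 2: bytes of the string (rejects non-ASCII)
    magic = zlib.compress(raw, 9)     # step 3: stand-in for the "magic" (assumption)
    # step 4: one big integer; the 0x01 sentinel preserves any leading zero bytes
    return int.from_bytes(b"\x01" + magic, "big")

def from_number(n: int) -> str:
    # Undo step 4, drop the sentinel, then undo steps 3 and 2.
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")[1:]
    return zlib.decompress(raw).decode("ascii")

if __name__ == "__main__":
    n = to_number("CustomerFirstName")
    print(n.bit_length(), from_number(n))
```

One caveat with this stand-in: general-purpose compressors add a fixed header and checksum, so short inputs such as column names tend to grow rather than shrink.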
Background
Microsoft Azure Table storage has an API that accepts Unicode data for the storage or property names. It is a schema-free database (columns can be created ad hoc), so the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts and decrypts Azure Table data, but the schema is left unencrypted. I'd like to mask or obfuscate this header information somehow.
Comments (1)
These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also pack the bits before compressing and/or encrypting, so that the data fits into fewer bytes (eight characters fit in seven bytes).
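A minimal sketch of that bit packing, assuming the original character count is stored or known separately (the function names `pack7` and `unpack7` are made up for this example):

```python
def pack7(text: str) -> bytes:
    """Pack 7-bit ASCII characters into a contiguous, zero-padded bit stream."""
    data = text.encode("ascii")      # raises UnicodeEncodeError for non-ASCII input
    acc = 0
    for b in data:
        acc = (acc << 7) | b         # append 7 bits per character
    nbits = 7 * len(data)
    pad = (-nbits) % 8               # zero bits needed to reach a byte boundary
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")

def unpack7(packed: bytes, length: int) -> str:
    """Reverse pack7; `length` is the original number of characters."""
    acc = int.from_bytes(packed, "big") >> (8 * len(packed) - 7 * length)
    chars = []
    for _ in range(length):
        chars.append(acc & 0x7F)     # peel characters off from the end
        acc >>= 7
    return bytes(reversed(chars)).decode("ascii")

if __name__ == "__main__":
    packed = pack7("PartitionKey")
    print(len(packed), unpack7(packed, 12))
```

The length has to travel with the data because, for example, 7 and 8 characters both pack into 7 bytes.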
If you can use UTF-32, UTF-8, and so on in step 5, you have access to nearly every code point in the Unicode Standard, up to 0x10FFFF, with some exceptions. Some code points are noncharacters, such as 0xFFFF (and also 0x10FFFE and 0x10FFFF, which is why the highest usable code point is 0x10FFFD), and others are surrogate code points, such as 0xD800, which are not valid characters on their own.
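As a rough sketch of step 5 under those constraints, the bytes can be treated as one large number and written in a base whose "digits" are the usable code points. Everything here is an assumption for illustration: the alphabet is limited to the Basic Multilingual Plane, the filter is purely structural (it does not check whether a code point is assigned, or whether the target API accepts it or normalizes it), and `is_usable`, `encode`, and `decode` are hypothetical names.

```python
FORBIDDEN = set('/\\#?%')            # characters the question rules out

def is_usable(cp: int) -> bool:
    if cp < 0x21 or 0x7F <= cp <= 0xA0:
        return False                 # controls, space, no-break space
    if 0xD800 <= cp <= 0xDFFF:
        return False                 # surrogate code points
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False                 # noncharacters such as U+FFFE and U+FFFF
    return chr(cp) not in FORBIDDEN

ALPHABET = [chr(cp) for cp in range(0x10000) if is_usable(cp)]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(data: bytes) -> str:
    # The 0x01 sentinel keeps leading zero bytes through the integer round trip.
    n = int.from_bytes(b"\x01" + data, "big")
    out = []
    while n:
        n, r = divmod(n, len(ALPHABET))
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def decode(text: str) -> bytes:
    n = 0
    for ch in text:
        n = n * len(ALPHABET) + INDEX[ch]
    return n.to_bytes((n.bit_length() + 7) // 8, "big")[1:]   # drop the sentinel

if __name__ == "__main__":
    name = encode(b"PartitionKey")
    print(len(name), decode(name))
```

Each output character carries a little under 16 bits, so the result has fewer characters than the ASCII input. Whether it takes fewer bytes depends on how the service stores and transmits the text: in UTF-8, most of these characters occupy three bytes each.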