Making a list of integers more human friendly
This is a bit of a side project I have taken on to solve a no-fix issue for work. Our system outputs a code to represent a combination of things on another thing. Some example codes are:
9-9-0-4-4-5-4-0-2-0-0-0-2-0-0-0-0-0-2-1-2-1-2-2-2-4
9-5-0-7-4-3-5-7-4-0-5-1-4-2-1-5-5-4-6-3-7-9-72
9-15-0-9-1-6-2-1-2-0-0-1-6-0-7
The largest number I've seen in a single slot so far is about 150, but values will likely go higher.
When the system was designed there was no requirement for what this code would look like. But now the client wants to be able to type it in by hand from a sheet of paper, something the code above isn't suited for. We've said we won't do anything about it, but it seems like a fun challenge to take on.
My question is: where is a good place to start losslessly compressing this code? Obvious solutions, such as storing this code under a shorter key, are not an option; our database is read-only. I need to build a two-way method to make this code more human friendly.
Comments (3)
1) I agree that you definitely need a checksum - data entry errors are very common, unless you have really well-trained staff and independent duplicate keying with automatic cross-checking.
2) I suggest Huffman coding (http://en.wikipedia.org/wiki/Huffman_coding) to turn your list of numbers into a stream of bits. To get the probabilities required for this, you need a decent-sized sample of real data, so you can make a count, setting Ni to the number of times number i appears in the data. Then I suggest setting Pi = (Ni + 1) / (Sum_j (Nj + 1)), which smooths the probabilities a bit. Also, with this method, if you see e.g. numbers 0-150, you could add a bit of slack by entering numbers 151-255 and setting them to Ni = 0. Another way round rare large numbers would be to add some sort of escape sequence.
3) Finding a way for people to type the resulting sequence of bits is really an applied psychology problem but here are some suggestions of ideas to pinch.
3a) Software licences - just encode six bits per character in some 64-character alphabet, but group characters in a way that makes it easier for people to keep place e.g. BC017-06777-14871-160C4
3b) UK car license plates. Use a change of alphabet to show people how to group characters e.g. ABCD0123EFGH4567IJKL...
3c) A really large alphabet - get yourself a list of 2^n words for some decent-sized n and encode n bits as a word, e.g. GREEN ENCHANTED LOGICIAN...
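As a rough illustration of suggestion 2), here is a minimal Python sketch of Huffman coding with the add-one smoothing described above. The function names, the symbol range 0-255, and the `SAMPLES` list (just the three codes from the question, far too few for real statistics) are assumptions made for the example, not part of the original answer.

```python
import heapq
from collections import Counter
from itertools import count

# The three example codes from the question; a real table needs far more data.
SAMPLE_CODES = [
    "9-9-0-4-4-5-4-0-2-0-0-0-2-0-0-0-0-0-2-1-2-1-2-2-2-4",
    "9-5-0-7-4-3-5-7-4-0-5-1-4-2-1-5-5-4-6-3-7-9-72",
    "9-15-0-9-1-6-2-1-2-0-0-1-6-0-7",
]
SAMPLES = [[int(x) for x in code.split("-")] for code in SAMPLE_CODES]


def build_huffman_code(samples, max_symbol=255):
    """Return {number: bit string} for every number in 0..max_symbol."""
    counts = Counter(n for code in samples for n in code)
    tiebreak = count()  # unique integers so the heap never compares nodes
    # Weight Ni + 1 for every symbol, so unseen numbers still get a codeword.
    # Dividing by the total would give Pi, but does not change the tree.
    heap = [(counts[s] + 1, next(tiebreak), s) for s in range(max_symbol + 1)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    table = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: the number itself
            table[node] = prefix

    walk(heap[0][2], "")
    return table


def huffman_encode(numbers, table):
    """Concatenate the codewords; raises KeyError for numbers > max_symbol."""
    return "".join(table[n] for n in numbers)


def huffman_decode(bits, table):
    """Read codewords back off a bit string produced by huffman_encode."""
    inverse = {codeword: number for number, codeword in table.items()}
    numbers, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            numbers.append(inverse[current])
            current = ""
    return numbers


table = build_huffman_code(SAMPLES)
bits = huffman_encode(SAMPLES[2], table)
print(len(bits), "bits for", len(SAMPLES[2]), "numbers")
```

The sketch stops at a bit string. To make it typeable, the bits still have to be packed into characters as in 3a) to 3c), and the decoder needs some way to know where the real data ends (a length prefix or an end-of-code symbol, for instance), because padding bits could otherwise decode as extra numbers. The escape sequence for numbers above the chosen range is not implemented here.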
I worried about this problem a while back. It turns out that you can't do much better than base64 - trying to squeeze a few more bits per character isn't really worth the effort (once you get into "strange" numbers of bits, encoding and decoding become more complex). But at the same time, you end up with something that's likely to have errors when entered (confusing a 0 with an O, etc.). One option is to choose a modified set of characters and letters (so it's still base 64, but, say, you substitute ">" for "0"). Another is to add a checksum. Again, for simplicity of implementation, I felt the checksum approach was better.
Unfortunately I never got any further - things changed direction - so I can't offer code or a particular checksum choice.
PS: I realised there's a missing step I didn't explain: I was going to compress the text into some binary form before encoding (using some standard compression algorithm). So to summarize: compress, add checksum, base64 encode; base64 decode, check checksum, decompress.
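The pipeline summarized above could look roughly like the following Python sketch. The answer names neither a compression algorithm nor a checksum, so `zlib` and CRC-32 are assumptions chosen only because they ship with Python, and `pack`/`unpack` are hypothetical names.

```python
import base64
import zlib


def pack(code: str) -> str:
    """Compress, append a checksum, then base64 encode."""
    compressed = zlib.compress(code.encode("ascii"), 9)
    checksum = zlib.crc32(compressed).to_bytes(4, "big")
    return base64.urlsafe_b64encode(compressed + checksum).decode("ascii")


def unpack(text: str) -> str:
    """Base64 decode, verify the checksum, then decompress."""
    raw = base64.urlsafe_b64decode(text.encode("ascii"))
    compressed, checksum = raw[:-4], raw[-4:]
    if zlib.crc32(compressed).to_bytes(4, "big") != checksum:
        raise ValueError("checksum mismatch: the code was probably mistyped")
    return zlib.decompress(compressed).decode("ascii")


original = "9-15-0-9-1-6-2-1-2-0-0-1-6-0-7"
packed = pack(original)
print(packed, unpack(packed) == original)
```

A general-purpose compressor carries fixed overhead, so for inputs as short as these codes the packed form may well come out longer than the original. The sketch shows the order of the steps and the checksum check rather than a real size win; a coding tuned to the data is more likely to shrink such short inputs.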
This is similar to what I have used in the past. There are certainly better ways of doing this, but I used this method because it was easy to mirror in Transact-SQL, which was a requirement at the time. You could certainly modify this to incorporate Huffman encoding if the distribution of your IDs is non-random, but it's probably unnecessary.
You didn't specify a language, so this is in C#, but it should be very easy to port to any language. In the lookup you'll see that commonly confused characters are omitted. This should speed up entry. I also had a requirement for a fixed length, but it would be easy for you to modify this.
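To illustrate the idea of a lookup that leaves out commonly confused characters, here is a short Python sketch. It is not the C# code this answer refers to. The 32-character alphabet (no 0, 1, I or O), the variable-length byte packing, and all function names are assumptions made for the example, and it produces variable-length output rather than the fixed length mentioned above.

```python
# 32 symbols: digits 2-9 and capital letters without I and O (an assumption).
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"


def to_varint_bytes(numbers):
    """Pack each number into 7-bit groups; the high bit means 'more follows'."""
    out = bytearray()
    for n in numbers:
        while True:
            low = n & 0x7F
            n >>= 7
            if n:
                out.append(low | 0x80)
            else:
                out.append(low)
                break
    return bytes(out)


def from_varint_bytes(data):
    """Inverse of to_varint_bytes."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers


def encode_friendly(numbers, group=5):
    """Turn a list of non-negative integers into dash-grouped text."""
    # The leading 0x01 byte keeps leading zero bytes from being lost.
    value = int.from_bytes(b"\x01" + to_varint_bytes(numbers), "big")
    chars = []
    while value:
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    text = "".join(reversed(chars))
    return "-".join(text[i:i + group] for i in range(0, len(text), group))


def decode_friendly(text):
    """Inverse of encode_friendly; ignores dashes and letter case."""
    value = 0
    for ch in text.replace("-", "").upper():
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    data = value.to_bytes((value.bit_length() + 7) // 8, "big")
    return from_varint_bytes(data[1:])


numbers = [int(x) for x in "9-15-0-9-1-6-2-1-2-0-0-1-6-0-7".split("-")]
friendly = encode_friendly(numbers)
print(friendly, decode_friendly(friendly) == numbers)
```

Spending a whole byte on every small number is wasteful, so this mainly buys a safer alphabet and grouping rather than much shortening. Nothing here detects typing errors beyond rejecting characters outside the alphabet, so a check character or checksum would still be needed.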