Is there a good two-way hash to convert email addresses into predictable, readable UNIX usernames?

Posted 2024-12-01 09:44:44

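We're using a number of UNIX-based file systems, all of which have a similar set of restrictions on the characters that cannot be used in the username field. One of the restrictions is no "@", "_", or "." in the name. Being UNIX, there are a number of other restrictions as well.

So the question is whether there is a well-known algorithm that can convert email addresses into predictable UNIX filenames. We would need to reverse this at some point to get the email back.

I've considered doing things like "." -> "DOT", "@" -> "AT", etc. But there are size limitations and other things that are generally problematic. I could also optimize by mapping the @xyz.com part of the email to a special character or something similar. Each implementation would only need to support at most 3 domains. I was hoping someone had found a solution that wouldn't involve a huge number of tradeoffs.

UPDATE:

- The 2 target file systems are AFS and NFS.

- Base64 doesn't work, as it has incompatible characters such as "/".

- Readable is preferable.

It seems like the best answer is to replace the @xyz.com domain with a single non-standard character, and then have a function that shrinks the first part of the name into something that fits within the username length restrictions of the various file systems. But what is a good function for that?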


风苍溪 2024-12-08 09:44:45

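You could try a modified version of the URL percent (%) encoding scheme used for URIs.

If the percent sign isn't allowed on your particular file system, simply replace it with a different, allowed character (and remember to properly encode any occurrences of that character).

Using this method:
mail.address@server.com

would become:
mail%2Eaddress%40server%2Ecom

Or, if you had to substitute (for example) the letter a for the % symbol:
ma61ila2Ea61ddressa40servera2Ecom

Not exactly human-readable, perhaps, but easy to process with an encoding algorithm. For the best space efficiency, the escape character should be one that the file system allows but that is unlikely to appear frequently in addresses.

The advantage of this encoding scheme is that most ordinary characters do not grow in size. The string only gets longer where it contains characters the file system does not support.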
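The scheme above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function names `escape_encode` and `escape_decode` are made up for illustration, and it assumes that only ASCII letters and digits are safe on the target file system. Everything else, including the escape character itself, is written as the escape character followed by two uppercase hex digits.

```python
import string

# Assumption: only ASCII letters and digits are safe in a username.
SAFE = set(string.ascii_letters + string.digits)


def escape_encode(address: str, escape: str = "%") -> str:
    """Replace every unsafe byte (and the escape char itself) with escape + 2 hex digits."""
    out = []
    for byte in address.encode("utf-8"):
        ch = chr(byte)
        if ch in SAFE and ch != escape:
            out.append(ch)
        else:
            out.append(f"{escape}{byte:02X}")
    return "".join(out)


def escape_decode(name: str, escape: str = "%") -> str:
    """Reverse escape_encode: read two hex digits after each escape character."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == escape:
            out.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(name[i]))
            i += 1
    return out.decode("utf-8")


print(escape_encode("mail.address@server.com"))        # '%' as the escape character
print(escape_encode("mail.address@server.com", "a"))   # 'a' as the escape character
```

Because the escape character is always encoded when it occurs in the input, decoding stays unambiguous whichever character is chosen.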


情栀口红 2024-12-08 09:44:45

Look into Base64. The encoding and decoding are well defined.
I'd prefer this over rolling my own format any day.
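For reference, here is a quick check of which characters the standard encodings can emit, using Python's standard `base64` module. The sample inputs are arbitrary and chosen only to trigger the problematic characters.

```python
import base64

# Standard Base64 can emit '/', '+' and '=' depending on the input bytes.
with_slash = base64.b64encode(b"ab?")
with_plus = base64.b64encode(b"ab~")
with_padding = base64.b64encode(b"a@b.c")

# Base32 from the same module only uses A-Z and 2-7 (plus '=' padding).
base32_form = base64.b32encode(b"a@b.c")

for label, value in [("slash", with_slash), ("plus", with_plus),
                     ("padding", with_padding), ("base32", base32_form)]:
    print(label, value.decode("ascii"))
```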


海未深 2024-12-08 09:44:45

Hmm, from your question I'm not totally clear on this point, but since you wanted some conversion I'm assuming that you want something that is at least human readable?

Each OS may have different restrictions, but are you close enough to the platforms that you would be able to find out or test what is acceptable in a username? If you could find three 'special' characters to use as replacements for '@', '.', and '_', you would be good to go. (Is that list comprehensive? If not, you would need to make sure you know all of the disallowed characters, otherwise you could get clashes.) I searched a bit to find out whether there was a POSIX standard, but wasn't able to find anything, which is why I think testing what's valid would be the most direct route.

With even one special character, you could do URL encoding, either with '%' if it's available, or with whatever you choose if not, say '!', giving { '@' -> '!40', '_' -> '!5F', '.' -> '!2E' }. (The spec [RFC1738] http://www.rfc-editor.org/rfc/rfc1738.txt defines the characters as US-ASCII, so you can just find a table, e.g. in Wikipedia's ASCII article, and look up the correct hex digits there.) Or you could do your own simple mapping: since you don't need the whole ASCII set, you could use two characters per escaped character and have, say, '!a', '!u', '!p' for at, underscore, period.
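The two-character mnemonic mapping could look like the following Python sketch. The function names are hypothetical, and the extra entry `'!' -> '!!'` is an assumption added here so that a literal '!' in an address still round-trips; the answer itself only lists the three mnemonics.

```python
# '!a', '!u', '!p' come from the answer; '!!' for a literal '!' is an added assumption.
MNEMONICS = {"@": "!a", "_": "!u", ".": "!p", "!": "!!"}
REVERSE = {code[1]: ch for ch, code in MNEMONICS.items()}


def encode_mnemonic(address: str) -> str:
    """Replace each disallowed character with its two-character code."""
    return "".join(MNEMONICS.get(ch, ch) for ch in address)


def decode_mnemonic(name: str) -> str:
    """After every '!', the next character selects the original character."""
    out = []
    i = 0
    while i < len(name):
        if name[i] == "!":
            out.append(REVERSE[name[i + 1]])
            i += 2
        else:
            out.append(name[i])
            i += 1
    return "".join(out)


print(encode_mnemonic("john_doe@example.com"))
```

Each escaped character costs two output characters instead of three, at the price of a mapping table that has to be kept in sync on both sides.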

If you have two special characters, say '%' and '!', you could use them to delimit text that represents the character, say '%at!', '%us!', and '%pd!'. (This is pretty much HTML-style encoding, but instead of '&' and ';' you are using the available characters, and you're making up your own mnemonics.) Another idea is to use runs of a symbol to determine the translated character, where each new character flips which symbol is being used. (This conveniently stops the run if we need to put two of the disallowed characters next to each other.) So assuming '%' and '!', with period being 1, underscore 2, and at-sign 3, 'mickey._sample_@fake.out' would become 'mickey%!!sample%%!!!fake%out'. There are other variations, but this one is easy to code.
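The run-length idea can be sketched as follows in Python. This is one possible reading of the scheme, with hypothetical function names: it assumes each group of adjacent disallowed characters starts with '%' and alternates from there, and it rejects addresses that already contain '%' or '!', since the scheme as described has no way to escape them.

```python
from itertools import groupby

RUN_LENGTH = {".": 1, "_": 2, "@": 3}            # run lengths from the answer
BY_LENGTH = {n: ch for ch, n in RUN_LENGTH.items()}
SYMBOLS = "%!"


def encode_runs(address: str) -> str:
    """Encode '.', '_' and '@' as runs of 1, 2 or 3 symbols, alternating '%' and '!'."""
    out = []
    toggle = 0  # assumption: every group of adjacent specials starts with '%'
    for ch in address:
        if ch in SYMBOLS:
            raise ValueError("address must not contain '%' or '!'")
        if ch in RUN_LENGTH:
            out.append(SYMBOLS[toggle] * RUN_LENGTH[ch])
            toggle ^= 1
        else:
            out.append(ch)
            toggle = 0
    return "".join(out)


def decode_runs(name: str) -> str:
    """Each maximal run of one symbol decodes to the character with that run length."""
    out = []
    for key, group in groupby(name):
        count = len(list(group))
        if key in SYMBOLS:
            out.append(BY_LENGTH[count])   # KeyError on runs longer than 3
        else:
            out.append(key * count)
    return "".join(out)


print(encode_runs("mickey._sample_@fake.out"))
```

The decoder does not need to know where the alternation restarts, because only the length of each run matters.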

If none of this is an option (e.g. no symbols at all, just [a-zA-Z0-9]), then I really think the Base64 answer sounds about right. Once we get to anything other than a simple replacement (and even with that), the result is already getting hard to type, if that's the goal. But if you really need to keep the email mostly readable, what you do is implement some sort of escaping. I'm thinking of using '0' as the escape character, so now '0' becomes '00', '@' becomes '01', '.' becomes '02', and '_' becomes '03'. So 'mickey01._sample_@fake.out' would become 'mickey0010203sample0301fake02out'. Not beautiful, but it should work: since we escaped any raw 0's, just always make sure you define a mapping for whatever you choose as your escape character and you should be fine.
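A minimal Python sketch of the '0'-escape mapping described above. The function names are hypothetical, and any character outside the four mapped ones is passed through unchanged, so the input is assumed to contain only letters, digits, '@', '.' and '_'.

```python
ESCAPE = "0"
# Mapping from the answer: the escape character itself must also be escaped.
CODES = {"0": "00", "@": "01", ".": "02", "_": "03"}
DECODES = {code[1]: ch for ch, code in CODES.items()}


def encode_alnum(address: str) -> str:
    """Produce a name that uses only letters and digits."""
    return "".join(CODES.get(ch, ch) for ch in address)


def decode_alnum(name: str) -> str:
    """Every '0' starts a two-character escape; everything else is literal."""
    out = []
    i = 0
    while i < len(name):
        if name[i] == ESCAPE:
            out.append(DECODES[name[i + 1]])
            i += 2
        else:
            out.append(name[i])
            i += 1
    return "".join(out)


print(encode_alnum("mickey01._sample_@fake.out"))
```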

That's all I can think of atm. :) If there's no need for these usernames to be readable in the raw, it seems that Base64 won't work after all, since it can produce slashes. Heck, ok, just use the 2-digit US-ASCII hex value for each character and you're done. :)
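The plain hex fallback needs no custom code in Python; a short sketch, with hypothetical helper names and assuming the address is pure ASCII:

```python
def hex_encode(address: str) -> str:
    """Two lowercase hex digits per character; output length is exactly doubled."""
    return address.encode("ascii").hex()


def hex_decode(name: str) -> str:
    return bytes.fromhex(name).decode("ascii")


print(hex_encode("a@b.c"))
```

The result uses only [0-9a-f], but it is not readable and doubles the length, which matters if the file system limits username length.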


猫七 2024-12-08 09:44:45

Given...
- the limited set of characters allowed in various file systems
- the desire to keep the encoded email address short (both for human readability and for possible concerns with file system limitations)
...a possible approach may be a two-step encoding scheme whereby the email is

- first compressed using a lossless compression algorithm such as Lempel-Ziv, effectively turning it into a "binary" form stored in a shorter array of bytes
- then encoded, as that array of bytes, using a Base64-like algorithm

The idea is to minimize the size of the binary representation, so that the expansion caused by the storage inefficiency of the encoding (which can only store roughly 6 bits per character, and probably a bit less) doesn't make the encoded string too long.
Without getting overly sophisticated in either the compression or the encoding, such a system would likely produce encoded strings that are maybe 4/5 of the input string size (the email address): the compression should easily halve the size, but the encoding, say Base32, would grow the binary form by 8/5 (1/2 × 8/5 = 4/5).

Efforts to improve the compression ratio may allow the selection of more "wasteful" encoding schemes (with smaller character sets), which may help make the output more human-readable and more broadly safe across the various flavors of file systems. For example, whereas Base64 seems optimal space-wise, using only uppercase letters (base 26) may ensure portability of the underlying scheme to file systems where file names are not case sensitive.
Another benefit of the initial generic compression is that few, if any, assumptions need to be made about the syntax of valid input keys (email addresses here).

Ideas for compression:
LZ seems like a good choice, though one may consider priming its initial buffer with common patterns found in email addresses (for example ".com", or even "a.com", "b.com", etc.).
This initial buffer would ensure several instances of "citations" per compressed email address, hence a better compression ratio overall. To squeeze out a few more bytes, maybe LZH or other LZ variations could be used.
Aside from the priming of the buffer mentioned above, another customization may be to use a shorter buffer than typical LZ algorithms, since the strings we have to compress (email addresses) are themselves very short and would not benefit from, say, a 512-byte buffer. (Shorter buffer sizes allow shorter codes for the citations.)
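As a rough illustration of the priming idea, Python's standard `zlib` module (DEFLATE, i.e. LZ77 plus Huffman coding) accepts a preset dictionary through the `zdict` argument. This is a sketch under stated assumptions, not the answer's exact algorithm: the contents of `PRIMER` and the function names are hypothetical, and whether a given short address actually shrinks has to be measured.

```python
import zlib

# Hypothetical priming buffer: substrings expected to be common in the addresses.
PRIMER = b".org.net.com@example.com"


def compress_address(address: str) -> bytes:
    """Raw DEFLATE (wbits=-15, no header or checksum) primed with PRIMER."""
    compressor = zlib.compressobj(level=9, wbits=-15, zdict=PRIMER)
    return compressor.compress(address.encode("utf-8")) + compressor.flush()


def decompress_address(blob: bytes) -> str:
    """The same PRIMER must be supplied to reverse the compression."""
    decompressor = zlib.decompressobj(wbits=-15, zdict=PRIMER)
    return (decompressor.decompress(blob) + decompressor.flush()).decode("utf-8")


address = "john.doe@example.com"
blob = compress_address(address)
print(len(address), "characters ->", len(blob), "bytes")
```

Printing the lengths for a sample of real addresses is the quickest way to see whether the priming pays off for a given set of domains.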

Ideas for encoding:
Base64 is not suitable as-is because of the slash (/), plus (+) and equals (=) characters. Alternate characters could be used to replace these; dash (-) comes to mind, but finding three characters allowed by all "flavors" of the targeted file systems may be a stretch.
Nevertheless, Base64, with its ratio of 4 output characters per 3 payload bytes, provides what is probably the barely achievable upper limit of storage efficiency [for an acceptable character set].
At the lower end of this efficiency is maybe an ASCII representation of the hexadecimal values of the bytes in the array. This format, which doubles the payload size, may be acceptable length-wise, and is interesting because of its simplicity (there is a direct and simple relation between each nibble (4 bits) of the input and each character of the encoded string).
Base32, whereby A thru Z encode 0 thru 25 and 0 thru 5 encode 26 thru 31, is essentially a variation of Base64 with a ratio of 8 output characters per 5 payload bytes, and may be a very viable compromise.
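The Base32 variant described above can be sketched in Python as follows. The function names are hypothetical; the alphabet is the one from the answer (A-Z for 0-25, then 0-5 for 26-31), which differs from the RFC 4648 alphabet (A-Z, 2-7) used by `base64.b32encode`. No padding characters are emitted.

```python
# Alphabet from the answer: A-Z encode 0-25, digits 0-5 encode 26-31.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"


def b32_encode(data: bytes) -> str:
    """Emit one character per 5 bits; the last group is zero-padded on the right."""
    bits, nbits, out = 0, 0, []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 5:
            nbits -= 5
            out.append(ALPHABET[(bits >> nbits) & 31])
            bits &= (1 << nbits) - 1
    if nbits:
        out.append(ALPHABET[(bits << (5 - nbits)) & 31])
    return "".join(out)


def b32_decode(text: str) -> bytes:
    """Collect 5 bits per character and emit a byte whenever 8 bits are available."""
    bits, nbits, out = 0, 0, bytearray()
    for ch in text:
        bits = (bits << 5) | ALPHABET.index(ch)
        nbits += 5
        if nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
            bits &= (1 << nbits) - 1
    return bytes(out)


print(b32_encode(b"a@b.c"))  # 5 payload bytes -> 8 output characters
```

Because the output uses only uppercase letters and six digits, it also survives file systems that fold case.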
