Generating a not-quite-globally-unique identifier
I've found a number of different questions on generating UIDs, but as far as I can tell, my requirements here are somewhat unique (ha).
To summarize: I need to generate a very short ID that's "locally" unique, but does not have to be "globally" or "universally" unique. The constraints are not simply based on aesthetic or space concerns; the ID is essentially being used as a hardware tag and is thus subject to the hardware's constraints. Here are the specifications:
Hard Requirements
- The ID must contain only decimal digits (the underlying data is a BCD);
- The maximum length of the ID is 12 characters (digits).
- Must be generated offline - a database/web connection is not always available!
Soft Requirements
- We'd like it to begin with the calendar year and/or month. As this does waste a lot of entropy, I don't mind compromising on this or scrapping it entirely (if necessary).
- IDs generated from a particular machine should appear sequential.
- IDs do not have to sort by machine - for example, it's perfectly fine for machine 1 to spit out [123000, 124000, 125000], and machine 2 to spit out [123500, 123600, 124100].
- However, the more sequential-looking in a collective sense, the better. A set of IDs like [200912000001, 200912000002, 200912000003, ...] would be perfect, although this obviously does not scale across multiple machines.
Usage Scenario:
- IDs within the scope of this scheme will be generated from 10, maybe 100 different machines at most.
- There will not be more than a few million IDs generated, total.
- Concurrency is extremely low. A single machine will not generate IDs more often than every 5 minutes or so. Also, most likely no more than 5 machines at a time will generate IDs within the same hour or even the same day. I expect less than 100 IDs to be generated within one day on a given machine and less than 500 for all machines.
- A small number of machines (3-5) would most likely be responsible for generating more than 80% of the IDs.
I know that it's possible to encode a timestamp down to 100 ms or even 10 ms precision using fewer than 12 decimal digits, which is more than enough to guarantee a "unique enough" ID for this application. The reason I am asking here on SO is that I would really like to incorporate a human-readable year/month, or encode some piece of information about the source machine, or both.
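For reference, the plain timestamp encoding mentioned above can be sketched as follows. This is only an illustration in Python, not part of the original question: the custom epoch, the 10 ms tick size, and the name timestamp_id are all assumptions. With 12 digits and 10 ms ticks the counter lasts roughly 317 years from the epoch; with 11 digits it lasts roughly 31.7 years.

```python
from datetime import datetime, timedelta, timezone

# Assumed values for illustration only: any fixed instant before the
# first generated ID would work as the epoch.
EPOCH = datetime(2009, 1, 1, tzinfo=timezone.utc)
TICK = timedelta(milliseconds=10)
MAX_DIGITS = 12


def timestamp_id(now: datetime) -> str:
    """Encode `now` as the number of 10 ms ticks since EPOCH, zero-padded."""
    ticks = (now - EPOCH) // TICK  # timedelta // timedelta yields an int
    if ticks < 0 or ticks >= 10 ** MAX_DIGITS:
        raise ValueError("timestamp outside the encodable range")
    return f"{ticks:0{MAX_DIGITS}d}"


if __name__ == "__main__":
    print(timestamp_id(datetime.now(timezone.utc)))
```

IDs produced this way sort by time on a single machine, but two machines generating within the same 10 ms tick would collide, which is why the answers below add a machine ID.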
I'm hoping that someone can either help with a compromise on those soft requirements... or explain why none of them are possible given the other requirements.
(P.S. My "native" language is C# but code in any language or even pseudocode is fine if anybody has any brilliant ideas.)
Update:
Now that I've had the chance to sleep on it, I think what I'm actually going to do is use a timestamp encoding by default, and allow individual installations to switch to a machine-sequential ID by defining their own 2- or 3-digit machine ID. That way, customers who want to mess with the ID and pack in human-readable information can sort out their own method of ensuring uniqueness, and we're not responsible for misuse. Maybe we help out by providing a server utility to handle machine IDs if they happen to be doing all online installations.
Comments (10)
Let me start by saying I've dealt with this before and attempting to store useful information into a serial number is a BAD idea long term. A device serial number should be meaningless. Just like the primary key of a database record should be meaningless.
The second you start trying to put real data into your serial number, you've just thrown BUSINESS LOGIC into it and you will be forced to maintain it like any other piece of code. Future you will hate past you. Trust me on this. ;o)
If you attempt to store date/time values, then you'll waste numeric space with invalid time/dates. For instance you'll never have anything greater than 12 in the month field.
A straight epoch / unit time counter would be better, but for a machine that only generates a few IDs per minute you'll still waste a lot of space.
12 digits is not a lot of space. Look at the VIN page on Wikipedia. Space for only a few manufacturers, only a few thousand cars. They are now reusing VINs because they ran out of space by packing meaning into it.
http://en.wikipedia.org/wiki/VIN
That's not to say ALL meaning in a serial number is bad, just keep it strictly limited to making sure the numbers don't collide.
Something like this...
That's ALL you need to avoid collisions. If you add a location digit, then you are screwed when you get to 11 locations.
Sorry if this sounds like a rant. I deal with this a lot when manufacturing electronics and various machined parts. It has never ended well long term unless there's LOTS of space available, or a secondary tag (which -wow- provides the necessary ID space mentioned before).
How about yyMMddhhmmID? Example: 0912113201 from machine with ID = 01.
Alternatively (if you don't like two-digit years (Y2K lol)), how about yyyyMMIDxxxx? Example: 200912010001 from machine with ID = 01.
As you said that each machine will generate at most one identifier every five minutes, a machine needs at most 8,928 (24 * 31 * 60 / 5 = 8928) identifiers per month, which fits in the four digits of xxxx. You could squeeze the year down to a three-digit year yyy (009, e.g.) if you needed an extra digit in the xxxx sequence or the machine ID.
Both of these fit timestamp/machine ID as you requested.
We all like concrete code:
Outputs:
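A rough Python sketch of the two layouts described above. This is not the answerer's own code, and the function names minute_stamp_id and monthly_sequence_id are made up for illustration. It assumes a two-digit machine ID and, for the second layout, that the caller supplies the per-month sequence number.

```python
from datetime import datetime


def minute_stamp_id(now: datetime, machine_id: int) -> str:
    """yyMMddhhmmID: minute-resolution timestamp followed by a 2-digit machine ID."""
    if not 0 <= machine_id <= 99:
        raise ValueError("machine_id must fit in two digits")
    return now.strftime("%y%m%d%H%M") + f"{machine_id:02d}"


def monthly_sequence_id(now: datetime, machine_id: int, sequence: int) -> str:
    """yyyyMMIDxxxx: year, month, 2-digit machine ID, 4-digit monthly sequence."""
    if not 0 <= machine_id <= 99:
        raise ValueError("machine_id must fit in two digits")
    if not 0 <= sequence <= 9999:
        raise ValueError("sequence must fit in four digits")
    return f"{now.year:04d}{now.month:02d}{machine_id:02d}{sequence:04d}"


if __name__ == "__main__":
    when = datetime(2009, 12, 11, 15, 20)
    print(minute_stamp_id(when, 1))         # 091211152001
    print(monthly_sequence_id(when, 1, 1))  # 200912010001
```

The first layout is only collision-free if a machine never generates two IDs within the same minute; the second relies on each machine storing its own counter and resetting it every month.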
When you install your software, also install a machine ID file/registry key which contains a unique numeric ID. As you only have a few machines, this should not take more than 3 or 4 digits. Use these as the most significant digits. Generate the remaining digits sequentially, starting at 1.
I'm gathering you're developing for Windows (re: your comment about "MSI/EXE" in response to Jason's answer). As such, you could use WMI or similar to get some unique hardware attribute (processor or HDD serial number, or NIC's MAC address, for example) to base a unique machine ID upon. An alternative might be to use the unique serial number of the hardware you are developing yourself (if it has one).
That would most likely be longer than you need, so you could potentially truncate or hash it to reduce it to (say) 16 bits or so and use that as your machine ID. Obviously, this may cause collisions, but the small number of machines (~100) means this is unlikely, and using the truncated output of a cryptographic hash (say MD5) makes this even less so.
Then, since you have a (most probably unique) machine ID, you can then generate essentially unique IDs using the approaches listed by the other answers.
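A minimal Python sketch of the truncated-hash idea, with an estimate of how likely a collision actually is. The function names are hypothetical, and uuid.getnode() stands in for the WMI lookup (it returns the MAC address as an integer, or a random 48-bit value if no MAC can be read).

```python
import hashlib
import uuid


def machine_id_from_attribute(attribute: str, bits: int = 16) -> int:
    """Hash a hardware attribute with MD5 and keep only the top `bits` bits."""
    digest = hashlib.md5(attribute.encode("utf-8")).digest()  # 128-bit digest
    return int.from_bytes(digest, "big") >> (128 - bits)


def collision_probability(machines: int, bits: int = 16) -> float:
    """Birthday-problem chance that at least two machines share an ID."""
    space = 2 ** bits
    p_all_unique = 1.0
    for i in range(machines):
        p_all_unique *= (space - i) / space
    return 1.0 - p_all_unique


if __name__ == "__main__":
    mac = f"{uuid.getnode():012x}"  # stand-in for a WMI hardware query
    print(machine_id_from_attribute(mac))
    print(f"{collision_probability(100):.3f}")
```

Two things are worth noting under these assumptions. A 16-bit value needs up to five decimal digits (0 to 65535), which takes a large share of a 12-digit ID. And with 100 machines the chance that some pair collides is roughly 7%, so it may be worth checking for duplicates at install time rather than relying on the hash alone.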
There are 864,000 100 ms ticks in 24 hours, so tacking the tick count onto a date might work (e.g. 09.12.24.86400.0), but you have to drop the century to fit in 12 digits, and you don't have any space left for machine IDs.
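The arithmetic can be checked with a short Python sketch. The name date_tick_id is made up, and the layout assumed here is yyMMdd followed by a six-digit count of 100 ms ticks since midnight.

```python
from datetime import datetime

TICKS_PER_DAY = 24 * 60 * 60 * 10  # 864,000 ticks of 100 ms each


def date_tick_id(now: datetime) -> str:
    """yyMMdd followed by the 100 ms tick of the day (000000 to 863999)."""
    seconds_today = now.hour * 3600 + now.minute * 60 + now.second
    tick = seconds_today * 10 + now.microsecond // 100_000
    return now.strftime("%y%m%d") + f"{tick:06d}"


if __name__ == "__main__":
    print(date_tick_id(datetime(2009, 12, 24, 23, 59, 59, 999_999)))  # 091224863999
```

Six digits for the date plus six for the tick use all 12 digits, which is why nothing is left over for a machine ID in this layout.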
Idea number one:
YYMMDDmmnnnn
where
~~
Idea number two:
mmmmnnnnnnnn
Where
My suggestion would be to combine multiple approaches in a single id. For example: start with the two year digits, the two month digits and then generate a random number with the time as a seed for the next several digits and then a unique machine id for the last couple. Or something like that.
Each machine gets a starting id of DDNNN, where DD is a unique machine identifier and NNN is the current identifier generated by that machine that day. Each machine keeps track of the ids that it has generated on a particular date and allocates the next one when it needs a new one by incrementing the last one by 1. It resets its counter to 0 at the beginning of each day. The date YYYYDOY is prepended to the number generated by each machine (4-digit year, 3-digit day of year). The number is guaranteed unique because the machine identifier is unique.
If you needed more space for more machines, you could drop the millennium digit from the year and add a digit for the machine ID: YYYDOYDDDNNN.
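A small Python sketch of the YYYYDOY + DD + NNN scheme, as one possible reading of the description above. The class name is hypothetical, the counter is assumed to start at 000 each day, and it lives in memory only; a real installation would need to persist it so a restart does not reuse numbers.

```python
from datetime import date


class DailySequenceGenerator:
    """Produces YYYYDOYDDNNN: year, day of year, machine ID, daily counter."""

    def __init__(self, machine_id: int) -> None:
        if not 0 <= machine_id <= 99:
            raise ValueError("machine_id must fit in two digits")
        self.machine_id = machine_id
        self._day = None
        self._counter = 0  # assumption: kept in memory, not persisted

    def next_id(self, today: date) -> str:
        if today != self._day:
            # First ID of a new day: reset the counter to 0.
            self._day = today
            self._counter = 0
        if self._counter > 999:
            raise OverflowError("more than 1000 IDs requested in one day")
        day_of_year = today.timetuple().tm_yday
        result = f"{today.year:04d}{day_of_year:03d}{self.machine_id:02d}{self._counter:03d}"
        self._counter += 1
        return result


if __name__ == "__main__":
    gen = DailySequenceGenerator(machine_id=7)
    print(gen.next_id(date(2009, 12, 11)))  # 200934507000
    print(gen.next_id(date(2009, 12, 11)))  # 200934507001
```

With fewer than 100 IDs per machine per day expected, three counter digits leave a comfortable margin, and the date prefix keeps IDs from all machines roughly in chronological order.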
"A single machine will not generate IDs more often than every 5 minutes or so"
Assuming this is true, then just use the timestamp. (32 bit Unix time has 10 decimal digits but will run out in 2038)
But I think it's rather optimistic to assume there won't be a collision.
"IDs generated from a particular machine should appear sequential."
Then your only option is to use a sequence number.
Which doesn't really seem to match what you say in later constraints?
Concatenate a padded version of the node id to get unique values across the cluster.
Use the MAC address of the machine as a machine ID. You can use this to encode your timestamp (e.g. via XOR), or you can append/prepend it to the generated serial code.