Converting an Int32 to a ushort and back again
I am attempting to devise a system for packing integer values greater than 65535 into a ushort. Let me explain.
We have a system which generates Int32 values using an IDENTITY column from SQL Server and are limited by an in-production client API that overflows our Int32 IDs to ushorts. Fortunately the client only has about 20 or so instances of things with these IDs - let's call them packages - at any given time and it only needs to have them unique amongst local siblings. The generally accepted solution is to translate our Int32 IDs to ushorts (and no I don't mean casting, I mean translating) before transmitting them to the client, however there are barbs with this approach:
- Some IDs less than 65535 may still be in play on a given client at any time due to non-expiration.
- We cannot have any ID collisions - that is, if package ID 1 goes to the client, an algorithm that tracks how many times 65535 is removed from an Int32 to make a ushort would, when applied to 65536, also result in 1, thus causing a collision.
- We must be able to reconstruct the ushort into the Int32 upon return.
What we have available to solve this problem is a single signed byte field that is echoed to us and gives us 127 values to play with (really 117 because we're using 0-9 for something else). I'll refer to this as the "byte field" from here on out.
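To make that overflow concrete, here is a minimal sketch (the names are mine, not from the question) of what an unchecked narrowing cast does to these IDs, and why a plain cast is not a usable translation:

```csharp
using System;

class CastDemo
{
    static void Main()
    {
        int smallId = 1;
        int largeId = 65537;   // 0x00010001

        // An unchecked cast keeps only the low 16 bits of the Int32,
        // so two different database IDs can land on the same ushort.
        ushort castSmall = unchecked((ushort)smallId);   // 1
        ushort castLarge = unchecked((ushort)largeId);   // also 1 -> collision on the client

        Console.WriteLine($"{smallId} -> {castSmall}, {largeId} -> {castLarge}");
    }
}
```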
We have discussed three different translation routines:
- Multiplicative: store in the byte field how many times we remove 65535 from our Int32 to make a ushort (sketched just after this list). This has collision problems as detailed above.
- Serialized Session State: for each client, generate a session ID based on facts about that client. Then store a 1:1 translation table starting from 1 up to the number of packages delivered so when the client accesses our server again the inventory of packages can be translated back to their known database IDs. This has overhead problems since we'd be backing serialized session state to a database and we want to support hundreds to thousands of transactions a second.
- Varied algorithmic approach where the byte field is an ID of a transformative algorithm that takes an Int32 and transforms it into a ushort. Obviously many of these are going to be simple Multiplicative (to increase the ceiling of IDs we can transform), but some will have to be multiplicative with a smaller boundary (like 32768), with a number added to or subtracted from the result to get as close as possible to a number that can be guaranteed unique amongst siblings. This approach is processor intensive but should allow us to avoid collisions while remaining scalable (though with this approach we have a limited ceiling that won't be reached before the ushort problem goes away on its own due to upgrades).
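Here is a rough sketch of the Multiplicative routine from the list above, using names I made up, mainly to show the round trip and the exact collision described earlier (ID 1 and ID 65536 both arrive at the client as ushort 1):

```csharp
using System;

static class MultiplicativeTranslator
{
    // The byte field stores how many times 65535 was removed from the Int32;
    // the ushort carries what is left. (A real implementation would offset the
    // count by 10, since byte-field values 0-9 are reserved for something else.)
    public static (ushort Translated, byte ByteField) Encode(int id)
    {
        byte count = (byte)((id - 1) / 65535);
        ushort translated = (ushort)(((id - 1) % 65535) + 1);
        return (translated, count);
    }

    // Rebuild the original Int32 from the echoed byte field.
    public static int Decode(ushort translated, byte byteField)
        => byteField * 65535 + translated;
}

class Demo
{
    static void Main()
    {
        var a = MultiplicativeTranslator.Encode(1);      // (1, 0)
        var b = MultiplicativeTranslator.Encode(65536);  // (1, 1)

        // The server can tell them apart via the byte field, but the client API
        // only sees the ushort, so both packages show up as "1" on the client.
        Console.WriteLine($"{a.Translated}/{a.ByteField} vs {b.Translated}/{b.ByteField}");
        Console.WriteLine(MultiplicativeTranslator.Decode(b.Translated, b.ByteField)); // 65536
    }
}
```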
So my question is: is there a better way than my approaches above, and if not, what should I be looking for in terms of algorithms (for approach #3) that generate a number between 1 and 65535 from any given number greater than 0, where the transformation must not be a one-way hash?
Clarification: it's not that the ushort ceiling is the greatest problem, it's that the client API uses a ushort, so I cannot combine the byte field on the client to get bigger values (the client API is non-upgradeable but will eventually phase out of existence).
Answers (3)
Regarding approach 2:
Your second approach is pretty much how NAT works. Every TCP/UDP client on the local network has up to 65535 ports available (everything except port 0) and a private IP. The router knows only a single public IP. Since two clients may both use source port 300, it cannot simply replace the private IP with a public one; that would cause collisions. So it replaces the IP and "translates" the port (NAT: Network Address Translation). On return, it translates the port back and replaces the public IP with the private one again before forwarding the packet back. You'd be doing nothing more than that. However, routers keep that information in memory, and they are not too slow when doing NAT (companies with hundreds of computers are sometimes NATed to the Internet and the slowdown is hardly noticeable in most cases). You say you want up to thousands of transactions a second - but how many clients will there be? That is what mainly defines the amount of memory needed to back the mappings. If there are not too many clients, you could keep the mapping in a sorted table in memory, and in that case speed will be the smallest problem (the table growing too big and the server running out of memory is the bigger one).
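To illustrate that NAT-style bookkeeping (all names below are hypothetical, not from any existing API), a per-client bidirectional map that hands out the lowest free ushort might look roughly like this:

```csharp
using System;
using System.Collections.Generic;

// Per-client translation table, much like a NAT router's port mapping:
// hands out the lowest free ushort (1-65535) for each new Int32 ID and
// can translate in both directions. Sketch only; not thread-safe.
class ClientIdMap
{
    private readonly Dictionary<int, ushort> _toShort = new Dictionary<int, ushort>();
    private readonly Dictionary<ushort, int> _toInt = new Dictionary<ushort, int>();
    private ushort _next = 1;

    public ushort Translate(int databaseId)
    {
        if (_toShort.TryGetValue(databaseId, out ushort existing))
            return existing;
        if (_toInt.Count >= 65535)
            throw new InvalidOperationException("No free ushort slots for this client.");

        // Walk to the next unused slot (wraps past entries freed by Release).
        while (_toInt.ContainsKey(_next))
            _next = (ushort)(_next == 65535 ? 1 : _next + 1);

        _toShort[databaseId] = _next;
        _toInt[_next] = databaseId;
        return _next;
    }

    public int TranslateBack(ushort clientId) => _toInt[clientId];

    // Call when a package expires so its slot can be reused.
    public void Release(int databaseId)
    {
        if (_toShort.TryGetValue(databaseId, out ushort slot))
        {
            _toShort.Remove(databaseId);
            _toInt.Remove(slot);
        }
    }
}
```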
What is a bit unclear to me is that you say one thing at one point, but then you say something else.
I guess what you probably meant by the second statement is that if a client requests ID 65536, it might still have IDs below 65535, and these can be as low as (let's say) 20. It's not that the client processes IDs in a straight order, right? So you cannot say that, just because it has now requested 65536, its smaller values are certainly not in the range 1-1000, correct? It might actually keep references to 20, 90, 2005 and 41238 and still go over 65535 - is that what you meant?
I personally like your second approach more than the third one, as it is easier to avoid collisions in any case and translating the number back is a plain, simple operation. I also doubt that your third approach can work in the long run. Okay, you might have a byte to store how often you subtracted 2^16 from the number; however, that only covers numbers up to about 117 * 2^16. What will you do when numbers go above that? Use a different algorithm that does not subtract but does what instead? Divide? Shift bits? In that case you lose granularity, meaning the algorithm can no longer hit every possible number (it will make large jumps). If it were so easy to apply a magic translation function to a 32-bit number to turn it into 16 bits (plus one extra byte) and then transform it back, I guess every compression method in this world would use it, since it could always compress any 32-bit number down to 24 bits (16 bits + one byte). That would be magic. It is not possible to pack 32 bits into 24 bits and also pack in all the logic needed to transform it back. You will need some external storage, which brings us back to your second approach. That is the only approach that will work, and it will work for every number in the 32-bit range.
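To put rough numbers on that ceiling: 117 usable byte-field values times 2^16 is 117 × 65,536 = 7,667,712, while a SQL Server IDENTITY Int32 can grow to 2,147,483,647. More generally, 16 bits plus one byte is 24 bits, giving 2^24 = 16,777,216 distinct codes for 2^32 = 4,294,967,296 possible inputs, which is why a reversible mapping of the full range without external storage cannot exist.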
I can think of a few other options:
Are there globally fewer than 65536 entries in the database? If so, then you could maintain a mapping table that's not associated with session state, but is a persisted part of the application.
Are the majority of entries at indexes less than, say, 50,000? If that's the case you could map such values directly, and use a map associated with the session for the remaining ones (a sketch of this follows at the end of this answer).
If persisting such session data is an issue and the number of clients is reasonably small, you could enable client/session affinity and maintain the map local to the server.
If it's not a web application, you could maintain the map on the client itself.
I don't see any algorithmic way that would avoid collisions - I suspect you could always come up with an example that would collide.
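As a minimal sketch of the second option above (direct-map IDs below a cutoff, session-map the rest), with hypothetical names and an assumed cutoff of 50,000:

```csharp
using System;
using System.Collections.Generic;

// Hybrid translation: database IDs up to the cutoff pass through unchanged,
// larger IDs get a slot in the 50,001-65,535 range that lives in the client's
// session map. Sketch only; not thread-safe.
class HybridIdMap
{
    private const int Cutoff = 50_000;
    private readonly Dictionary<int, ushort> _toShort = new Dictionary<int, ushort>();
    private readonly Dictionary<ushort, int> _toInt = new Dictionary<ushort, int>();
    private ushort _next = Cutoff + 1;

    public ushort Translate(int databaseId)
    {
        if (databaseId > 0 && databaseId <= Cutoff)
            return (ushort)databaseId;                  // direct mapping, nothing stored

        if (_toShort.TryGetValue(databaseId, out ushort existing))
            return existing;
        if (_next == 0)                                 // 50,001-65,535 all used up
            throw new InvalidOperationException("Session map exhausted.");

        ushort slot = _next;
        _next = _next == 65535 ? (ushort)0 : (ushort)(_next + 1);
        _toShort[databaseId] = slot;
        _toInt[slot] = databaseId;
        return slot;
    }

    public int TranslateBack(ushort clientId)
        => clientId <= Cutoff ? clientId : _toInt[clientId];
}
```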
How much "more" than 65535 do you need? You could always just add a few bits from your "byte field" as the high-order bits of the ID. Just 2 bits would get you to 262,143, 3 bits would get you 524,287.