Reconciling protobuf-net bcl.Guid HI/LO with SQL uniqueidentifiers for a correlated subquery?

Posted 2024-11-17 19:39:44


Is there any standard / boilerplate way to convert SQL uniqueidentifiers into the same HI/LO blocks as protobuf-net's BCL.Guids?

UPDATE:

So, given a Guid, "4D1CE8BE-C36B-4FFA-A4C8-9056619E9967", ProtoBuf.NET will serialize it as {"lo":5763133538796628158,"hi":7465171998244653220,"___error___":null}, which is how it subsequently gets stored in Hadoop. So SQL Server has a column with a Guid id, and Hadoop ends up having separate id.lo and id.hi values. What I need is a query to look up a record in Hadoop given the id from SQL Server. The C# code below gives me the translated values I'm after, but it requires me to run an intermediate client app simply to split the Guid. What I want is to be able to get the id.lo and id.hi directly from a SQL query, so I can drop them straight into a filter in a subsequent Hadoop Map-Reduce job:

static void TestGuidSplit() {
    // Protobuf.Net serializes the following Guid as:
    // {"lo":5763133538796628158,"hi":7465171998244653220,"___error___":null}
    Guid testGuid = new Guid("4D1CE8BE-C36B-4FFA-A4C8-9056619E9967");
    Tuple<long, long> loHi = LoHi(testGuid);
    Console.WriteLine("lo: {0}, hi: {1}.", loHi.Item1, loHi.Item2);
    Console.ReadLine();
}

static Tuple<long, long> LoHi(Guid someGuid) {
    byte[] bytes = someGuid.ToByteArray();
    long lo = BitConverter.ToInt64(bytes, 0); // 5763133538796628158
    long hi = BitConverter.ToInt64(bytes, 8); // 7465171998244653220
    return Tuple.Create(lo, hi);
}
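As a quick cross-check of the split outside .NET (an illustrative aside, not part of the original question): Python's `uuid` module exposes the same mixed-endian byte layout that `Guid.ToByteArray()` produces via `UUID.bytes_le`, so the lo/hi values can be reproduced like this:

```python
import uuid

def lo_hi(guid_str):
    # UUID.bytes_le matches .NET Guid.ToByteArray(): the first three
    # GUID fields are stored little-endian, the final 8 bytes in order.
    b = uuid.UUID(guid_str).bytes_le
    lo = int.from_bytes(b[:8], "little", signed=True)
    hi = int.from_bytes(b[8:], "little", signed=True)
    return lo, hi

print(lo_hi("4D1CE8BE-C36B-4FFA-A4C8-9056619E9967"))
# (5763133538796628158, 7465171998244653220)
```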

ORIGINAL QUESTION:

I have a ton of data in SQL Server that I need to join to data that was exported to Hadoop via protobuf-net. We have keys in common, but they are all uniqueidentifiers. With homogeneously-formatted keys, we could just extract a flat file from SQL, create a tabledef over that flat file in Hive, and then use Hive to run a joined query over that and the Hadoop tables. With the disparate uuid formats though, this doesn't seem possible.

Do we need an intermediate process to harmonize the uuid formats? I'm hoping there's a way around this, i.e. to do it more like a simple ETL process. Ultimately, we just want to get at a set of Hadoop data where some_id in (list of sql ids). The Hadoop data is of unmanageable size if extracted without filtering down to the relevant ids from SQL.

The simplest example I can think of to describe what I'm trying to do would be to imagine that I had two tables in SQL Server, 'a' and 'b', where 'a' contained several uuid fields, and 'b' was a copy of 'a' except that the uuids were now 64-bit integer HI/LO pairs, from protobuf-net's bcl.guid. Given that scenario, I want to select * from 'b' where someid in (select someid from 'a' where interesting = true). What I'm missing is a function to get the HI and/or LO of someid from 'a', to supply to the in clause for my query against 'b'.
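One way to avoid a bespoke intermediate client app for this filtering step is to do the split in bulk over the exported id list and emit the (lo, hi) pairs for the Hadoop-side filter. A minimal sketch in Python (the input list and the filter format are hypothetical, just to show the shape of the ETL step):

```python
import uuid

def lo_hi(guid_str):
    # Same mixed-endian layout as .NET Guid.ToByteArray()
    b = uuid.UUID(guid_str).bytes_le
    return (int.from_bytes(b[:8], "little", signed=True),
            int.from_bytes(b[8:], "little", signed=True))

# Hypothetical input: GUID strings exported from the "interesting" rows of 'a'
sql_ids = ["4D1CE8BE-C36B-4FFA-A4C8-9056619E9967"]

# (lo, hi) pairs ready to drop into a Map-Reduce / Hive filter
pairs = [lo_hi(g) for g in sql_ids]
print(pairs)  # [(5763133538796628158, 7465171998244653220)]
```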


Comments (1)

南城旧梦 commented on 2024-11-24 19:39:44:


Conversion in SQL from a SQL uniqueidentifier column to the same hi / lo values generated by protobuf-net:

declare @guid uniqueidentifier = convert(uniqueidentifier, '4D1CE8BE-C36B-4FFA-A4C8-9056619E9967')
select @guid as 'guid' -- writes: 4D1CE8BE-C36B-4FFA-A4C8-9056619E9967, to prove it parsed correctly
declare @blob binary(16) = CONVERT(binary(16), @guid)
select CAST(SUBSTRING(@blob, 8, 1) + SUBSTRING(@blob, 7, 1) + SUBSTRING(@blob, 6, 1) + SUBSTRING(@blob, 5, 1) +
       SUBSTRING(@blob, 4, 1) + SUBSTRING(@blob, 3, 1) + SUBSTRING(@blob, 2, 1) + SUBSTRING(@blob, 1, 1) as bigint) as 'lo',
       CAST(SUBSTRING(@blob, 16, 1) + SUBSTRING(@blob, 15, 1) + SUBSTRING(@blob, 14, 1) + SUBSTRING(@blob, 13, 1) +
       SUBSTRING(@blob, 12, 1) + SUBSTRING(@blob, 11, 1) + SUBSTRING(@blob, 10, 1) + SUBSTRING(@blob, 9, 1) as bigint) as 'hi'
       -- writes: 5763133538796628158, 7465171998244653220

Note you can probably wrap this as a UDF somewhere...
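As a sanity check on the byte order the SUBSTRING shuffle relies on: SQL Server's binary(16) form of a uniqueidentifier uses the same mixed-endian layout as .NET's Guid.ToByteArray(), and concatenating bytes 8..1 (or 16..9) before casting to bigint is just a big-endian read of the reversed bytes, i.e. a little-endian read. A small Python illustration of the same reversal (an aside, not part of the answer):

```python
import uuid

# bytes_le mirrors both Guid.ToByteArray() and SQL Server's binary(16) form
raw = uuid.UUID("4D1CE8BE-C36B-4FFA-A4C8-9056619E9967").bytes_le

# Mirror the T-SQL: bytes 8..1 reversed, read big-endian => lo
lo = int.from_bytes(raw[7::-1], "big", signed=True)
# Bytes 16..9 reversed, read big-endian => hi
hi = int.from_bytes(raw[15:7:-1], "big", signed=True)

print(lo, hi)  # 5763133538796628158 7465171998244653220
```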
