Google Boogtable二级指数

发布于 2025-02-01 21:05:31 字数 281 浏览 4 评论 0原文

在审查Google Boogtable时，我发现它没有定义次要索引的能力。

因此，如果您有十亿美元的交易，对于1000万客户来说，似乎您需要进行全桌扫描才能为一个客户提供所有交易。

由于Google Boogtable似乎正在使用引擎盖下的Apache HBase，因此我的第一个想法是：大概一个人可以将Apache Phoenix放在顶部。

但是，我发现最相关的最相关的是，似乎a

好吧，现在我们更远了几年，尽管我确认似乎仍然没有得到支持，但我想知道是否已经出现任何模式来启用次要索引？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

败给现实 2025-02-08 21:05:31

因此，如果您有十亿美元的交易，对于1000万客户来说，似乎您需要进行完整的桌子扫描才能为一个客户提供所有交易。

这是一个读取访问模式。如果每个客户查询是一种频繁的访问模式，则有效的架构设计将使划分键在客户ID上。这样一来，人均读取查询将转化为对该客户数据的非常快速的查找。

如果您可以识别其他读取和写入访问模式，每个发生的频率，与每种访问模式的延迟和吞吐量有关的要求，则可以设计架构设计的其他改进。如果您提供这些细节，我将很乐意帮助您考虑一下。

在审查Google Bigtable时，我发现它没有定义次要索引的能力。

那是正确的。

因为Google Bigtable似乎正在使用引擎盖下的Apache HBase ...

这不是很正确，但是很接近。 HBASE以宏伟的形式建模。两者都不依赖于另一个项目的实施。但是，与Cloud Boogtable通信的一种方法是通过Java的Cloud Boogtable HBase客户端，这是Apache HBase客户端的自定义版本。

由于Google Boogtable似乎正在使用ham下的Apache HBase，所以我的第一个想法是：大概可以将Apache Phoenix放在顶部。

您可以重构Phoenix与Java的Cloud Bigtable HBase客户端兼容。至少，如果单独进行工作，那将是一个为期一年的项目。至于诸如协处理器之类的缺失功能，您肯定可以找到一种复制HB协作者实现的并行性的方法。

好吧，现在我们更远了几年了，尽管我确认似乎没有支持的处理器，但我想知道是否已经出现任何模式来启用次要索引？

虽然没有什么比Stackoverflow回答的更糟糕的答案，该回答问题是OP已经做出的所有基本技术选择，但基于关于OP的用例的很少的信息，我仍然会继续做类似的事情...我可以问为什么我要问为什么为什么您打算使用笨拙吗？一个不同的存储系统可能是细分数据配置文件的一种有用方法，是按音量，速度，变化，访问和安全性。如果您可以提供这些细节，我很乐意帮助您考虑一下。

卷：需要更多信息，但是如果数据确实是一项十亿笔交易，则假设每个交易都是合理的行尺寸（这是一个重要的假设，请让我更多地了解每次交易的大小），那么您的数据将很容易适合进入我列出的任何解决方案，包括CloudSQL，其最大实例大小为30至 64 tb 。 64 TB除以10亿千克。为了进行比较，SQL Server的最大行长为8KB。在GCP之外的一个示例中，SQL Server允许数据库大小 500,000 trabytes 。

至于速度，我怀疑这些数据的速度很低，原因有两个。其中一种是数据的性质：收集和存储人体输入数据的Web应用程序和移动应用程序通常是较低的速度，至少在单个用户测量时。二，根据您提供的有关数据的详细信息：如果1000万客户有10亿美元的交易，那就是每个客户的100笔交易。同样，听起来像SQL Server或MongoDB/Firebase是合适的。

数据变化：鉴于客户交易的性质，并且鉴于您正在尝试将您的Boogtable包裹在OLTP DB立面（例如Phoenix）中，这听起来像是典型的OLTP关系DB用例，即高度结构化和低变化。但是，如果它是较高的变化，则可以使用Firebase或MongoDB。

访问：您告诉我一个阅读访问模式，尽管可能还有很多其他信息，我没有办法神圣。对于每个访问模式，您可能想回答的问题：

在读取操作中检索了多少数据？（您已经为您在问题中描述的访问模式回答了这一点）
插入操作中写了多少数据？
数据多久写一次？
数据多久读取一次？

就写入访问模式而言，我怀疑有一个单一的写入访问模式，其特征是少量数据以较低的速度编写。

So if you have a billion transactions, for 10 million customers, it would seem you need a full table scan to pull out all transactions for one customer.

This is a read access pattern. If a per-customer query is a frequent access pattern, then an effective schema design would have row-keys prefixed by customer ID. That way, a per-customer read query will translate to an extremely fast lookup of that customer's data.

If you can identify additional read and write access patterns, the frequency that each takes place, the requirements relating to latency and throughput for each access pattern, then additional refinements of schema design can be devised. I will be happy to help you think through it, if you provide those details.

In reviewing Google Bigtable I found that it does not offer the ability to define secondary indices.

That is correct.

As Google Bigtable seems to be using Apache HBase under the hood ...

This is not quite right, but close. HBase is modeled after BigTable. Neither project relies on implementation from the other. However, one way to communicate with Cloud Bigtable is through the Cloud Bigtable HBase client for Java, which is a customized version of the Apache HBase client.

As Google Bigtable seems to be using Apache HBase under the hood, my first thought was: Presumably one can put Apache Phoenix on top.

You could refactor Phoenix to be compatible with the Cloud Bigtable HBase client for Java. It would be a year-long project, at least, if working on it individually. As far as missing features like coprocessors, you could surely find a way to replicate that sort of parallelism achieved by HB co-processors.

Well, now we are quite a few years further and though I confirmed co-processors still do not appear to be supported, I wondered if any pattern had emerged to enable secondary indices?

While there is nothing worse then a StackOverflow answer that questions all underlying technical choices the OP already made, based on very little information about the OP's use-case, I am nonetheless going to proceed to do something like that... May I ask why you are planning to use BigTable? A different storage system might be better One useful way to breakdown the data profile is by Volume, Velocity, Variation, Access, and Security. I will be happy to help you think through it, if you can please provide those details.

Volume: Need more info, but if the data is indeed a billion transaction, then assuming that each transaction is reasonably row sized (this is an important assumption, please let me know more about size of each transaction), then your data would easily fit into any solution i listed, including CloudSQL, which has a max instance size of 30 to 64 TB. 64 TB Divided by one billion is 64 KB. For comparison, SQL Server has a max row length of 8KB. For an example outside of GCP, SQL Server allows a database size of over 500,000 terabytes.

As far as velocity, I suspect this data is low velocity, for two reasons. One, the nature of the data: Web applications and mobile apps that collect and store human-entered data are typically low velocity, at least when measured by individual user. Two, from the details you provided about the data: if 10 million customers have a billion transactions, that is 100 transactions per customer. Again, sounds like a sql server or MongoDB/Firebase would be appropriate.

Variation in data: Given the nature of customer transactions, and given that you are trying to wrap your bigtable in a OLTP DB facade such as Phoenix, it sounds like a typical OLTP Relational DB use case, which is to say highly structured and low variation. If it is higher variation, though, you can use Firebase or MongoDB.

Access: You told me about one read-access pattern, though there could be many others, which I have no way to divine. Questions you may want to answer, for each access pattern:

How much data is retrieved in a read operation? (you already answered this for the access pattern you described in your question)
How much data is written in an insert operation?
How often is data written?
How often is data read?

As far as the write access pattern, I would suspect that there is a single write-access pattern, characterized by a small amount of data is written at a time, at a low rate.

回复收藏 0 原文

~没有更多了~