Sort order of the BigQuery Storage Read API

Posted 2025-02-06 13:21:14, 96 words, 2 views, 0 comments

As the title states, is there any sort order for data read through the read streams of a Storage Read API session? In particular, is there any ordering with respect to partitions and clustering keys? My understanding is that the rows of a partition are colocated and that, if clustering is used, the data in a partition is stored in clustered blocks.

Comments (2)

一花一树开 2025-02-13 13:21:15

For the 1st Question

The Storage Read API operates on storage directly, so you cannot make any assumptions about the order in which you will receive the data.

For the 2nd Question

In a clustered table, the data is automatically reorganized whenever new data is added to the table or to a specific partition. From the partitioned table doc and the clustered table doc:

Partitioned table: A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data.

Clustered table: When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table's schema. The columns you specify are used to colocate related data. When data is written to a clustered table, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. The order of the clustering columns determines the sort order of the data. When new data is added to a table or a specific partition, BigQuery performs automatic re-clustering in the background to restore the sort property of the table or partition.

When you use CLUSTER BY on some columns of an unpartitioned table, clustering is applied to the whole table. If the table is partitioned, clustering is applied within each partition.

You can follow this codelab for a better understanding. From the lab:

Consider the stackoverflow.question_2018 table as an example. Let's assume it has 3 columns:

  1. creation_date
  2. title
  3. tags

If we create a new partitioned table from the main table, using creation_date as the date partitioning column, then the new table will have one partition for every creation date.

Now, if we create a table partitioned by creation_date and apply CLUSTER BY on the tags column, clustering will be applied within each of the partitions. Even if we add new data to this table, BigQuery will take care of reorganizing it.
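
As a rough illustration of "clustering is applied within each partition", here is a toy Python model. It is only a sketch: the `organize` function and the sample rows are made up for this example, and real BigQuery storage works with blocks and background re-clustering rather than a simple in-memory sort. It also says nothing about the order in which the Storage Read API returns rows.

```python
from collections import defaultdict

def organize(rows):
    """Toy model: group rows by creation_date, then sort each group by tags."""
    partitions = defaultdict(list)
    for row in rows:
        # One "partition" per distinct creation_date value.
        partitions[row["creation_date"]].append(row)
    # "Cluster" each partition independently on the tags column.
    return {
        day: sorted(part, key=lambda r: r["tags"])
        for day, part in sorted(partitions.items())
    }

# Hypothetical sample rows with the three columns assumed above.
rows = [
    {"creation_date": "2018-01-02", "title": "t1", "tags": "python"},
    {"creation_date": "2018-01-01", "title": "t2", "tags": "sql"},
    {"creation_date": "2018-01-02", "title": "t3", "tags": "go"},
    {"creation_date": "2018-01-01", "title": "t4", "tags": "java"},
]

table = organize(rows)
for day, part in table.items():
    print(day, [r["tags"] for r in part])
```

Note that the sort never crosses a partition boundary: rows tagged "go" and "java" end up in different partitions even though they are adjacent in tag order.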

Hope this helps you to understand.

分开我的手 2025-02-13 13:21:15

The Storage Read API can return data in multiple streams. Depending on how the data from the streams is combined, the final result may or may not preserve the original order.
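
To make this concrete, here is a small simulation in plain Python. It is a sketch and does not call the real API: `make_streams` and `drain_interleaved` are hypothetical helpers, and the assumption that each stream yields its own rows in a fixed sequence is part of the toy model, not a documented guarantee. It shows that the combined order depends on how the streams are consumed, and that sorting on a key afterwards is the way to get a deterministic order.

```python
import random

def make_streams(rows, stream_count):
    """Split rows into contiguous chunks, one per simulated stream."""
    size = -(-len(rows) // stream_count)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def drain_interleaved(streams, seed):
    """Consume the streams in an arbitrary interleaving, as parallel readers might."""
    rng = random.Random(seed)
    cursors = [iter(s) for s in streams]
    merged = []
    while cursors:
        cursor = rng.choice(cursors)
        try:
            merged.append(next(cursor))
        except StopIteration:
            # This stream is exhausted; stop picking it.
            cursors.remove(cursor)
    return merged

rows = list(range(12))            # stand-in for rows in some "original" order
streams = make_streams(rows, 3)   # three simulated read streams
merged = drain_interleaved(streams, seed=7)

print("as received:", merged)
print("after sort: ", sorted(merged))
```

If a particular order matters downstream, sort the collected rows on an explicit key (as in the last line) or apply ORDER BY in a query instead of relying on arrival order.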
