Parallel SQL queries in BigQuery (mix of legacy and standard) running over a metadata table (each row is a different table/column)


df_table contains metadata for a list of columns, with fields such as table_schema, table_name and column_name. For each column named in column_name, I would like to calculate the entropy (in bits), the Shannon entropy, and the count of values.

The following code works fine in Python, but it is not parallel. I wonder if there is a more efficient way to run this:

from google.cloud import bigquery
import numpy as np

# bqclient is an existing bigquery.Client(); df_table holds the column metadata.
job_config_True = bigquery.QueryJobConfig(use_legacy_sql=True)
job_config_False = bigquery.QueryJobConfig(use_legacy_sql=False)

for i, j in df_table[df_table['shannon_entropy'].isna()].iterrows():
    try:
        table_schema = j['table_schema']
        table_name = j['table_name']
        column_name = j['column_name']

        # Legacy SQL: RATIO_TO_REPORT gives each value's share of the total row count.
        q1 = f'''select -sum(p*log2(p)) as shannon_entropy  from (
             select RATIO_TO_REPORT(c) over() p from (
             select {column_name}, count(*) c  FROM {table_schema}.{table_name}  group by 1))
        '''

        query_job = bqclient.query(q1, job_config=job_config_True)  # Make an API request.
        shannon_entropy = query_job.result().to_dataframe()['shannon_entropy'][0]
    except Exception:
        shannon_entropy = np.nan

    # Standard SQL DML: write the per-column stats back into the metadata table.
    q = f'''UPDATE  `myproject.info_tabels_all`  t1
    set t1.entropy = t2.entropy, t1.values = t2.total, t1.unique = t2.distinct_total, t1.shannon_entropy = {shannon_entropy}
    from (
        SELECT
         LOG(COUNT(DISTINCT {column_name}), 2) as entropy,  -- log base 2 of the distinct count
         count({column_name}) as total,
         COUNT(DISTINCT {column_name}) as distinct_total
         FROM `datateam-248616.{table_schema}.{table_name}`  ) t2
        where  table_schema = '{table_schema}' and table_name = '{table_name}' and column_name = '{column_name}'
    '''

    print(table_name, shannon_entropy)
    query_job = bqclient.query(q, job_config=job_config_False)  # Start the UPDATE job (not waited on).
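
Side note on the dialect mix: the entropy query needs use_legacy_sql=True only because of RATIO_TO_REPORT. A standard-SQL equivalent (an untested sketch meant as a drop-in for q1 inside the loop; q1_standard is just an illustrative name, and it uses the same project-qualified table path as the UPDATE) computes the ratio with a window SUM instead, so both statements could run under job_config_False:

q1_standard = f'''
    SELECT -SUM(p * LOG(p, 2)) AS shannon_entropy   -- LOG(x, 2) is log base 2 in standard SQL
    FROM (
      SELECT c / SUM(c) OVER () AS p                -- each value's share of the total row count
      FROM (
        SELECT {column_name}, COUNT(*) AS c
        FROM `datateam-248616.{table_schema}.{table_name}`
        GROUP BY 1))
'''
# illustrative only: would be submitted with job_config_False instead of job_config_True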

The entropy query is adapted from the code in this post: BigQuery: compute entropy of a column
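
As for the parallelism itself: bqclient.query() only starts a query job and returns; it is .result() that blocks. So one direction I am considering (a rough, untested sketch that reuses the same bqclient, df_table, np and job_config_True as above) is to submit all the entropy queries first and collect their results afterwards, letting BigQuery run them concurrently instead of one at a time:

pending = []
for i, j in df_table[df_table['shannon_entropy'].isna()].iterrows():
    table_schema, table_name, column_name = j['table_schema'], j['table_name'], j['column_name']
    q1 = f'''select -sum(p*log2(p)) as shannon_entropy  from (
         select RATIO_TO_REPORT(c) over() p from (
         select {column_name}, count(*) c  FROM {table_schema}.{table_name}  group by 1))
    '''
    # start the job; query() does not wait for it to finish
    pending.append((i, bqclient.query(q1, job_config=job_config_True)))

# collect the results; the jobs have been running concurrently on BigQuery
for i, job in pending:
    try:
        df_table.loc[i, 'shannon_entropy'] = job.result().to_dataframe()['shannon_entropy'][0]
    except Exception:
        df_table.loc[i, 'shannon_entropy'] = np.nan

The per-column UPDATE statements are independent of each other, so they could be submitted in the same submit-first, wait-later way once the entropy values are filled in.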

