Parallel SQL queries in BigQuery (mixing legacy and standard SQL) over a metadata schema (each row is a different table/column).
df_table contains metadata for a list of columns, with the fields table_schema, table_name and column_name. For each column in column_name, I would like to compute entropy (bits), shannon_entropy, and the count of values.
The following code works well in Python, but it is not parallel. I wonder if there is a more efficient way to run this:
import numpy as np
from google.cloud import bigquery

bqclient = bigquery.Client()
job_config_True = bigquery.QueryJobConfig(use_legacy_sql=True)
job_config_False = bigquery.QueryJobConfig(use_legacy_sql=False)

for i, j in df_table[df_table['shannon_entropy'].isna()].iterrows():
    table_schema = j['table_schema']
    table_name = j['table_name']
    column_name = j['column_name']
    try:
        # RATIO_TO_REPORT is only available in legacy SQL, hence the legacy job config.
        q1 = f'''select -sum(p*log2(p)) as shannon_entropy from (
            select RATIO_TO_REPORT(c) over() p from (
                select {column_name}, count(*) c FROM {table_schema}.{table_name} group by 1))
        '''
        query_job = bqclient.query(q1, job_config=job_config_True)  # Make an API request.
        shannon_entropy = query_job.result().to_dataframe()['shannon_entropy'][0]
    except Exception:
        shannon_entropy = np.nan

    # The UPDATE uses standard SQL. Note: in standard SQL, LOG(X, Y) is the
    # logarithm of X to base Y, so log2 of the distinct count is LOG(n, 2).
    q = f'''UPDATE `myproject.info_tabels_all` t1
    set t1.entropy = t2.entropy, t1.values = t2.total, t1.unique = t2.distinct_total, t1.shannon_entropy = {shannon_entropy}
    from (
        SELECT
            LOG(COUNT(DISTINCT {column_name}), 2) as entropy,
            count({column_name}) as total,
            COUNT(DISTINCT {column_name}) as distinct_total
        FROM `datateam-248616.{table_schema}.{table_name}` ) t2
    where table_schema = '{table_schema}' and table_name = '{table_name}' and column_name = '{column_name}'
    '''
    print(table_name, shannon_entropy)
    query_job = bqclient.query(q, job_config=job_config_False)  # Make an API request.
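Since each iteration mostly waits on BigQuery I/O, one way to parallelize the loop is to submit each row's work to a thread pool. This is only a sketch of the pattern: `compute_entropy_for_column` is a hypothetical stand-in for the per-column body of the loop above (the entropy query followed by the UPDATE), replaced here with a dummy return so the structure is runnable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def compute_entropy_for_column(row):
    # Hypothetical stand-in: in the real version this would run the
    # shannon_entropy query and then issue the UPDATE for this row.
    return row['column_name'], 0.0

# Stand-in for df_table[df_table['shannon_entropy'].isna()] rows.
rows = [{'table_schema': 's', 'table_name': 't', 'column_name': f'c{i}'}
        for i in range(4)]

results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(compute_entropy_for_column, r) for r in rows]
    for fut in as_completed(futures):
        col, value = fut.result()
        results[col] = value
```

Threads are enough here because `bqclient.query()` returns once the job is submitted and the time is spent waiting on `result()`; the pool simply keeps several jobs in flight at once.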
I used this code in the process: BigQuery: compute entropy of a column
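For reference, the quantity the legacy-SQL query computes can be checked locally with NumPy. A minimal sketch, assuming a small hand-picked vector of value counts for one column:

```python
import numpy as np

counts = np.array([5, 3, 2])       # value counts for one column (count(*) per group)
p = counts / counts.sum()          # what RATIO_TO_REPORT(c) OVER() produces
shannon_entropy = -np.sum(p * np.log2(p))  # matches -sum(p*log2(p)) in the query
# shannon_entropy ≈ 1.485 bits for this distribution
```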