Counting the rows of a list of Hive tables with PySpark

Posted 2025-01-30 12:52:14


I have created a Spark program that retrieves the table names from a Hive database and then calculates the row count for a single table. However, I am trying to level this up by getting the row count for multiple tables, starting with 2 tables.

My Spark code is:

from pyspark.sql import SparkSession
import sys

def sql_count_rows(db,table):
    sql_query = """select count(*) from {0}.{1}""".format(db,table)
    return sql_query

db_name = sys.argv[1]

spark = SparkSession \
    .builder \
    .appName("HiveTableRecordCount") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("use {0}".format(db_name))
tables_df=spark.sql("show tables").collect()
tables_df=tables_df[0:2] #filter the first two tables
print("list content: ",tables_df)
print("list length: ",len(tables_df))
queryBuilder=""
#queryBuilder=queryBuilder + """select count(*) from {0}.{1}""".format(tables_df['database'], tables_df['tableName'])
#print("queryBuilder: ",queryBuilder)
loop_length=1
index=0
while loop_length < len(tables_df):
    queryBuilder =  sql_count_rows(tables_df[index]['database'], tables_df[index]['tableName'])#tables_df.foreach(lambda row: sql_count_rows(row,queryBuilder))
    queryBuilder = queryBuilder + "\nunion all \n"
    loop_length+=1
    index+=1

spark.sql(queryBuilder).show()
spark.stop()

To write the code I took inspiration from this article, written in Scala Spark.

When I execute the program, I receive the following error:

pyspark.sql.utils.ParseException: u"\nmismatched input '' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'MAP', 'REDUCE'}(line 3, pos 0)\n\n== SQL ==\nselect count(*) from gr_mapping.active_contracts_stg_v2\nunion all \n^^^\n"

Could you please help me understand what I am doing wrong?


1 Comment

弄潮 2025-02-06 12:52:14


The problem is that while in Scala \n in String means a new line inside a variable, in Python it does not.

Basically, you can concatenate the statements without the \n:

queryBuilder = queryBuilder + " union all "
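Whichever separator is used, note that the loop in the question also reassigns `queryBuilder` with `=` on every pass and always appends a trailing `union all`, which is exactly where the parser's `^^^` caret points. One way to sidestep both issues is to collect the per-table count statements in a list and join them, so `union all` appears only between statements. A minimal sketch (the `rows` list below stands in for the collected result of `spark.sql("show tables").collect()[0:2]`; the table names are just illustrative):

```python
def build_count_query(rows):
    # One "select count(*)" statement per table; tagging each row with the
    # table name keeps the combined result readable.
    counts = [
        "select '{0}.{1}' as table_name, count(*) as row_count from {0}.{1}".format(
            row["database"], row["tableName"]
        )
        for row in rows
    ]
    # join() places "union all" only BETWEEN statements, never at the end,
    # so the query parses no matter how many tables are in the list.
    return " union all ".join(counts)

# Stand-ins for the Row objects returned by spark.sql("show tables").collect();
# pyspark Rows support the same row["column"] access as these dicts.
rows = [
    {"database": "gr_mapping", "tableName": "active_contracts_stg_v2"},
    {"database": "gr_mapping", "tableName": "another_table"},
]
print(build_count_query(rows))
```

The resulting string can then be passed to `spark.sql(...)` exactly as in the question.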