How to insert data into Redshift using the boto3 or psycopg2 Python library

Posted on 2025-01-10 15:35:04

Which library is better to use, "boto3" or "psycopg2", for Redshift operations in Python Lambda functions:

  • Look up a table in the Redshift cluster
  • Create a table in the Redshift cluster
  • Insert data into the Redshift cluster

I would appreciate an answer with the following:

  • Python code for either library that addresses all three of the needs above.

Thanks in advance!

2 Answers

岁月静好 2025-01-17 15:35:04

Connecting directly to Redshift from Lambda with psycopg2 is the simpler, more straightforward way to go, but it comes with a significant limitation. Lambda functions have run-time limits, and even if your SQL commands don't exceed the max run time, you will be paying for the Lambda function to sit and wait for Redshift to complete the SQL. For fast-running SQL commands this isn't a problem, but inserting data can take some time depending on the amount of data.

If all your Redshift actions take less than a few seconds (and won't grow longer over time), then psycopg2 connecting directly to Redshift is likely the way to go. If the data insert takes a minute or two BUT the process doesn't run very often (say, daily), then psycopg2 may still be the way to go, as Lambda isn't very expensive when run infrequently. It is a process-simplicity vs. cost calculation.
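
A minimal sketch of the direct-connection approach, assuming psycopg2 is packaged with the Lambda (the endpoint, database name, and credentials below are placeholder values; in a real function you would read them from environment variables or Secrets Manager):

import psycopg2

# Connect once at module scope so warm Lambda invocations reuse the connection.
# All connection details here are placeholders.
conn = psycopg2.connect(
  host='redshift-cluster-1.xxxxxxxx.us-east-1.redshift.amazonaws.com',
  port=5439,
  dbname='dev',
  user='awsuser',
  password='my-password'
)

def lambda_handler(event, context):
  with conn.cursor() as cur:
    # Create the table if it is missing, then insert one row.
    cur.execute('CREATE TABLE IF NOT EXISTS testtab (col1 integer);')
    cur.execute('INSERT INTO testtab (col1) VALUES (%s);', (11,))
  conn.commit()
  return {'status': 'ok'}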

Using the Redshift Data API is more complicated. This approach lets you fire the SQL at Redshift and terminate the Lambda. A later-running Lambda checks whether the SQL has completed and then checks the results of the SQL. If the SQL has not completed, the Lambda needs to be invoked again later to see if things are done. This polling process is often handled by a Step Function and a set of different Lambda functions. Not super difficult, but a level of complexity above a single Lambda. Since this is a polling process, there is a wait time between checks for results, which if too long leads to latency and if too short leads to over-polling and additional costs.
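
A rough sketch of that fire-and-check pattern with the Data API (the cluster identifier, database, and user are placeholders; in practice the two functions would live in separate Lambdas coordinated by a Step Function):

import boto3

client = boto3.client('redshift-data')

def submit_sql(sql):
  # Fire the statement and return immediately; Redshift runs it asynchronously.
  response = client.execute_statement(
    ClusterIdentifier='redshift-cluster-1',
    Database='dev',
    DbUser='awsuser',
    Sql=sql
  )
  return response['Id']

def check_sql(statement_id):
  # A later invocation polls for completion using the statement id.
  desc = client.describe_statement(Id=statement_id)
  return desc['Status']  # e.g. SUBMITTED, STARTED, FINISHED, FAILED, ABORTED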

If you need the Data API for time-out reasons, then you may want to use both: psycopg2 for short-running queries against the database, like "does this table exist?", and the Data API for long-running steps, like "insert this 1TB set of data into Redshift". A sketch of the short-running half follows below.
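
For that table-existence check, a quick metadata lookup over a psycopg2 connection might look like this (the schema and table names are just examples):

def table_exists(conn, schema, table):
  # information_schema is available in Redshift for quick metadata lookups.
  with conn.cursor() as cur:
    cur.execute(
      'SELECT 1 FROM information_schema.tables '
      'WHERE table_schema = %s AND table_name = %s;',
      (schema, table)
    )
    return cur.fetchone() is not None

# e.g. table_exists(conn, 'public', 'testtab')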

不念旧人 2025-01-17 15:35:04

Sample basic Python code for all three operations using boto3:

import boto3

clientdata = boto3.client('redshift-data')

# Looks up the table and returns True if found.
def lookup_table(table_name):
  response = clientdata.list_tables(
    ClusterIdentifier='redshift-cluster-1',
    Database='dev',
    DbUser='awsuser',
    TablePattern=table_name
  )
  print(response)
  return len(response['Tables']) > 0

# Creates a table with one integer column.
def create_table(table_name):
  sqlstmt = 'CREATE TABLE ' + table_name + ' (col1 integer);'
  print(sqlstmt)
  response = clientdata.execute_statement(
    ClusterIdentifier='redshift-cluster-1',
    Database='dev',
    DbUser='awsuser',
    Sql=sqlstmt,
    StatementName='CreateTable'
  )
  print(response)

# Inserts one row with an integer value for col1.
def insert_data(table_name, dval):
  print(dval)
  sqlstmt = 'INSERT INTO ' + table_name + ' (col1) VALUES (' + str(dval) + ');'
  response = clientdata.execute_statement(
    ClusterIdentifier='redshift-cluster-1',
    Database='dev',
    DbUser='awsuser',
    Sql=sqlstmt,
    StatementName='InsertData'
  )
  print(response)

if lookup_table('date'):
  print("Table exists.")
else:
  print("Table does not exist!")

# Note: execute_statement is asynchronous -- it queues the SQL and returns
# immediately, so the insert below may be submitted before the CREATE TABLE
# has actually finished running.
create_table("testtab")
insert_data("testtab", 11)

I am not using Lambda; I am just executing this from my shell.
Hope this helps. It assumes credentials and a default region are already set up for the boto3 client.
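
If you also need to read query results back through the Data API (the code above only fires DDL/DML), one sketch is to submit a SELECT, poll describe_statement until it finishes, and then fetch rows with get_statement_result, reusing the same clientdata client. The one-second sleep is an arbitrary choice:

import time

def run_query(sql):
  stmt = clientdata.execute_statement(
    ClusterIdentifier='redshift-cluster-1',
    Database='dev',
    DbUser='awsuser',
    Sql=sql
  )
  # Simplified polling loop; production code should add a timeout and
  # surface FAILED/ABORTED errors instead of returning an empty list.
  while True:
    desc = clientdata.describe_statement(Id=stmt['Id'])
    if desc['Status'] in ('FINISHED', 'FAILED', 'ABORTED'):
      break
    time.sleep(1)
  if desc['Status'] == 'FINISHED' and desc.get('HasResultSet'):
    return clientdata.get_statement_result(Id=stmt['Id'])['Records']
  return []

# e.g. run_query('SELECT col1 FROM testtab;')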
