I have a large number of data files to process and store in a remote database. Each line of a data file represents a row in the database, but must be formatted before being inserted into the database.
My first solution was to process the data files with bash scripts and produce SQL data files, and then import the dump SQL files into the database. This solution seems to be too slow and, as you can see, involves the extra step of creating an intermediary SQL file.
My second solution was to write bash scripts that, while processing each line of the data file, create an INSERT INTO ... statement and send the SQL statement to the remote database:
echo sql_statement | psql -h remote_server -U username -d database
i.e. it does not create an SQL file. This solution, however, has one major issue that I am seeking advice on:
Each time, I have to reconnect to the remote database to insert a single row.
Is there a way to connect to the remote database, stay connected, and then "pipe" or "send" the INSERT statements without creating a huge SQL file?
Answer to your actual question
Yes. You can use a named pipe instead of creating a file. Consider the following demo.
Create a schema x in my database event for testing, then create a named pipe (fifo) from the shell, as in the sketch below:
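A minimal sketch of both steps; the test table's column layout is my assumption (four integer columns, matching the three-row demo further down):

    # create a test schema and table in database "event" (column layout assumed)
    psql event -c "CREATE SCHEMA x; CREATE TABLE x.x (a int, b int, c int, d int);"

    # create the named pipe (fifo) on the filesystem
    mkfifo /tmp/myPipe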
Either 1) call the SQL command COPY using a named pipe on the server. This will acquire an exclusive lock on the table x.x in the database. The connection stays open until the fifo gets data. Be careful not to leave this open for too long! You can call this after you have filled the pipe to minimize blocking time. You can choose the sequence of events: the command executes as soon as two processes bind to the pipe, and the first waits for the second.

Or 2) you can execute SQL from the pipe on the client. Both variants are sketched below:
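A sketch of both variants, under the same assumptions as above (fifo at /tmp/myPipe, database event):

    # 1) server-side: COPY reads from the fifo; the table lock is held while it waits
    psql event -c "COPY x.x FROM '/tmp/myPipe'"

    # 2) client-side: psql reads SQL statements from the fifo and executes them
    psql event -f /tmp/myPipe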
This second variant is better suited for your case. Also, the table is not locked until the SQL is executed in one piece.
Bash will appear blocked. It is waiting for input to the pipe. To do it all from one bash instance, you can send the waiting process to the background instead. Like this:
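For example, backgrounding the client-side variant (same assumed pipe path):

    # psql blocks waiting on the fifo; & sends the waiting process to the background
    psql event -f /tmp/myPipe &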
Either way, from the same bash or a different instance, you can fill the pipe now.
Demo with three rows for variant 1). Take care to use tabs as delimiters, or instruct COPY to accept a different delimiter using WITH DELIMITER 'delimiter_character'. Feeding the pipe, as sketched below, will trigger the pending psql with the COPY command to execute and return.
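A sketch of feeding three tab-separated rows into the pipe (echo -e turns \t and \n into real tabs and newlines):

    # write three rows, tab-delimited, into the fifo
    echo -e "1\t1\t1\t1\n2\t2\t2\t2\n3\t3\t3\t3" > /tmp/myPipe

The waiting psql should then report COPY 3.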
Demo for variant 2):
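Here the pipe carries a complete SQL statement instead of raw data, for example:

    # send a full INSERT statement through the fifo to the waiting psql
    echo "INSERT INTO x.x VALUES (4, 4, 4, 4);" > /tmp/myPipe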
Delete the named pipe after you are done:
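Assuming the path from above:

    rm /tmp/myPipe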
Check success:
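For instance, by selecting everything from the test table:

    psql event -c "SELECT * FROM x.x"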
Useful links for the code above
Reading compressed files with postgres using named pipes
Introduction to Named Pipes
Best practice to run bash script in background
Advice you may or may not need
For bulk INSERT you have better solutions than a separate INSERT per row. Use the multi-row VALUES syntax variant sketched below.
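A sketch of that form, one INSERT carrying many rows (reusing the demo table and its assumed columns):

    INSERT INTO x.x (a, b, c, d) VALUES
      (1, 1, 1, 1)
    , (2, 2, 2, 2)
    , (3, 3, 3, 3);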
Write your statements to a file and do one mass INSERT like this (with 5432, or whatever port the db cluster is listening on):
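For example, with the connection parameters from the question:

    psql -h remote_server -U username -d database -p 5432 -f my_insert_file.sql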
my_insert_file.sql can hold multiple SQL statements. In fact, it's common practice to restore / deploy whole databases like that. Consult the manual about the -f parameter, or in bash: man psql. Or, if you can transfer the (compressed) file to the server, you can use COPY to insert the (decompressed) data even faster.
You can also do some or all of the processing inside PostgreSQL. For that you can COPY TO (or INSERT INTO) a temporary table and use plain SQL statements to prepare and finally INSERT / UPDATE your tables. I do that a lot. Be aware that temporary tables live and die with the session. You could use a GUI like pgAdmin for comfortable handling. A session in an SQL Editor window remains open until you close the window. (Therefore, temporary tables live until you close the window.)
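A hypothetical sketch of that staging pattern; the file path, column names, and casts are all placeholders:

    -- load raw data into a temporary table (the path is server-side)
    CREATE TEMP TABLE tmp_import (a text, b text, c text, d text);
    COPY tmp_import FROM '/path/to/datafile' WITH (DELIMITER E'\t');

    -- prepare with plain SQL, then insert into the real table
    INSERT INTO x.x (a, b, c, d)
    SELECT a::int, b::int, c::int, d::int
    FROM   tmp_import;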
I know I'm late to the party, but why couldn't you combine all your INSERT statements into a single string, with a semicolon marking the end of each statement? (Warning! Pseudocode ahead...) Instead of:
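A runnable bash rendering of the per-row version (the file name, table, and formatting step are placeholders):

    # one psql connection per line: this is what makes the current approach slow
    while read -r line; do
        # hypothetical formatting step; build whatever statement your data needs
        sql_statement="INSERT INTO mytable VALUES ($line);"
        echo "$sql_statement" | psql -h remote_server -U username -d database
    done < datafile.txt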
Use:
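And the combined version, accumulating all semicolon-terminated statements into one string and sending them over a single connection (same placeholder names):

    # build one big string of semicolon-terminated statements
    sql_statements=""
    while read -r line; do
        sql_statements+="INSERT INTO mytable VALUES ($line); "
    done < datafile.txt

    # one psql connection for everything
    echo "$sql_statements" | psql -h remote_server -U username -d database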
That way you don't have to create anything on your filesystem, do a bunch of redirection, run any tasks in the background, remember to delete anything on your filesystem afterwards, or even remind yourself what a named pipe is.