Concatenate SQL files with Python while ignoring duplicates
I have a directory containing a large amount of SQL files in the form of .txt files. They are used to fill a SQLite database. Instead of loading them one by one, I would like to concatenate all these files into a single text file, so I can load just one file.
The problem is that I have duplicates here and there. My txt files come from hardware tests, and from time to time I have two identical IDs. In this case, I would like to keep the latest file that was created. I have a UNIQUE
constraint in my database so I have to keep one file per ID, and I want to keep the latest one.
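The kind of failure I am talking about is what Python's sqlite3 module reports as an IntegrityError on a UNIQUE column. The schema below is made up purely for illustration; my real table is different, but it also has a UNIQUE ID column:

import sqlite3

# Made-up schema: the only thing it shares with my real table is a UNIQUE ID column.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Results (id TEXT UNIQUE, value REAL)')
conn.execute("INSERT INTO Results VALUES ('ID_001', 1.0)")
try:
    conn.execute("INSERT INTO Results VALUES ('ID_001', 2.0)")  # same ID from a second file
except sqlite3.IntegrityError as err:
    print(err)  # UNIQUE constraint failed: Results.id
conn.close()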
My SQL files look like this:
INSERT INTO Table...
(~1000 lines)
And I have, say, 230 of them. So I wanted to end up with something like this:
INSERT INTO Table...
INSERT INTO Table...
INSERT INTO Table...
(~200 files × ~1000 lines each)
This is the script I came up with:
import pandas as pd
from pathlib import Path

df = pd.DataFrame(columns=['Filename', 'Truncated filename', 'Content'])

for p in Path(path).glob('**/*.txt'):
    filename = p.name
    content = p.read_text()
    # Remove the last 16 characters from the filename because they correspond
    # to the timestamp and ".txt". Ex: 202206141707.txt
    df.loc[len(df.index)] = [filename, filename[:-16], content]

# Keep the latest value in case of duplicates -- doesn't seem to work
(df.groupby('Truncated filename', as_index=False).apply(lambda x: x if len(x) == 1 else x.iloc[[-1]]).reset_index(level=0, drop=True))

# Sort the dataframe
sort = df.sort_values("Truncated filename", axis=0, ascending=False)

# Concatenate the "Content" column into a single string
output_text = df['Content'].str.cat(sep=' ')

# Write the output file
with open(output_path, 'w') as f:
    f.write(output_text)
My logic was the following:
- Get all files from directory
- Make a dataframe with filename and content as columns (the "Truncated filename" column helps me find duplicates)
- Remove duplicates while keeping the latest value each time (see the sketch after this list)
- Concatenate all of the "Content" column into a text file
- Write that text file
- Import it in my database
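To make the deduplication step concrete, this is the behaviour I expect from step 3, sketched on made-up filenames with pandas' drop_duplicates. It relies on the 12-digit timestamp suffix sorting chronologically as text, which holds for a fixed-width YYYYMMDDHHMM format:

import pandas as pd

# Made-up filenames: the two "test_A_" files share an ID, so only the later one should survive.
toy = pd.DataFrame({
    'Filename': ['test_A_202206141707.txt',
                 'test_A_202206150910.txt',
                 'test_B_202206141200.txt'],
    'Truncated filename': ['test_A_', 'test_A_', 'test_B_'],
    'Content': ['INSERT INTO Table... -- old A',
                'INSERT INTO Table... -- new A',
                'INSERT INTO Table... -- only B'],
})

# Sort so the latest timestamp comes last, then keep the last row per truncated name.
deduped = (toy.sort_values('Filename')
              .drop_duplicates(subset='Truncated filename', keep='last'))
print(deduped['Filename'].tolist())
# ['test_A_202206150910.txt', 'test_B_202206141200.txt']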
However, my script doesn't seem to work: when I load the output file into my database, I still have rows failing the UNIQUE constraint.
If you have an idea to solve this problem with another approach, that is welcome too.
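For completeness, the load step itself is nothing special; with Python's sqlite3 module it would look roughly like this (the file names are placeholders, and this is only a sketch of the idea, not my exact loading code):

import sqlite3

# Execute every statement from the concatenated file in one call.
# 'concatenated.sql.txt' and 'tests.db' are placeholder names.
with open('concatenated.sql.txt') as f:
    sql_text = f.read()

conn = sqlite3.connect('tests.db')
conn.executescript(sql_text)  # raises sqlite3.IntegrityError if a UNIQUE constraint fails
conn.commit()
conn.close()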