Concatenate SQL files with Python while ignoring duplicates
I have a directory containing a large amount of SQL files in the form of .txt files. They are used to fill a SQLite database. Instead of loading them one by one, I would like to concatenate all these files into a single text file, so I can load just one file.
The problem is that I have duplicates here and there. My txt files come from hardware tests, and from time to time I have two identical IDs. In this case, I would like to keep the latest file that was created. I have a UNIQUE
constraint in my database so I have to keep one file per ID, and I want to keep the latest one.
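The kind of failure I am talking about is what Python's sqlite3 module reports as an IntegrityError on a UNIQUE column. The schema below is made up purely for illustration; my real table is different, but it also has a UNIQUE ID column:

import sqlite3

# Made-up schema: the only thing it shares with my real table is a UNIQUE ID column.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Results (id TEXT UNIQUE, value REAL)')
conn.execute("INSERT INTO Results VALUES ('ID_001', 1.0)")
try:
    conn.execute("INSERT INTO Results VALUES ('ID_001', 2.0)")  # same ID from a second file
except sqlite3.IntegrityError as err:
    print(err)  # UNIQUE constraint failed: Results.id
conn.close()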
My SQL files look like this:
INSERT INTO Table...
(~1000 lines)
And I have, say, 230 of them. So I wanted to end up with something like this:
INSERT INTO Table...
INSERT INTO Table...
INSERT INTO Table...
(~200 files × ~1000 lines each)
This is the script I came up with:
import pandas as pd
from pathlib import Path

df = pd.DataFrame(columns=['Filename', 'Truncated filename', 'Content'])

for p in Path(path).glob('**/*.txt'):
    filename = p.name
    content = p.read_text()
    # Remove the last 16 characters from the filename because they correspond
    # to the timestamp and ".txt". Ex: 202206141707.txt
    df.loc[len(df.index)] = [filename, filename[:-16], content]

# Keep the latest value in case of duplicates -- doesn't seem to work
(df.groupby('Truncated filename', as_index=False).apply(lambda x: x if len(x) == 1 else x.iloc[[-1]]).reset_index(level=0, drop=True))

# Sort the dataframe
sort = df.sort_values("Truncated filename", axis=0, ascending=False)

# Concatenate the "Content" column into a single string
output_text = df['Content'].str.cat(sep=' ')

# Write the output file
with open(output_path, 'w') as f:
    f.write(output_text)
My logic was the following:
- Get all files from directory
- Make a dataframe with filename and content as columns (the "Truncated filename" column helps me find duplicates)
- Remove duplicates while keeping the latest value each time (see the sketch after this list)
- Concatenate all of the "Content" column into a text file
- Write that text file
- Import it in my database
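To make the deduplication step concrete, this is the behaviour I expect from step 3, sketched on made-up filenames with pandas' drop_duplicates. It relies on the 12-digit timestamp suffix sorting chronologically as text, which holds for a fixed-width YYYYMMDDHHMM format:

import pandas as pd

# Made-up filenames: the two "test_A_" files share an ID, so only the later one should survive.
toy = pd.DataFrame({
    'Filename': ['test_A_202206141707.txt',
                 'test_A_202206150910.txt',
                 'test_B_202206141200.txt'],
    'Truncated filename': ['test_A_', 'test_A_', 'test_B_'],
    'Content': ['INSERT INTO Table... -- old A',
                'INSERT INTO Table... -- new A',
                'INSERT INTO Table... -- only B'],
})

# Sort so the latest timestamp comes last, then keep the last row per truncated name.
deduped = (toy.sort_values('Filename')
              .drop_duplicates(subset='Truncated filename', keep='last'))
print(deduped['Filename'].tolist())
# ['test_A_202206150910.txt', 'test_B_202206141200.txt']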
However, my script doesn't seem to work: when I load the output file into my database, I still have rows failing the UNIQUE constraint.
If you have an idea to solve this problem with another approach, that is welcome too.
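For completeness, the load step itself is nothing special; with Python's sqlite3 module it would look roughly like this (the file names are placeholders, and this is only a sketch of the idea, not my exact loading code):

import sqlite3

# Execute every statement from the concatenated file in one call.
# 'concatenated.sql.txt' and 'tests.db' are placeholder names.
with open('concatenated.sql.txt') as f:
    sql_text = f.read()

conn = sqlite3.connect('tests.db')
conn.executescript(sql_text)  # raises sqlite3.IntegrityError if a UNIQUE constraint fails
conn.commit()
conn.close()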