Concatenating SQL files with Python while ignoring duplicates


I have a directory containing a large number of SQL files in the form of .txt files. They are used to fill a SQLite database. Instead of loading them one by one, I would like to concatenate all these files into a single text file, so I can load just one file.

The problem is that I have duplicates here and there. My .txt files come from hardware tests, and from time to time I get two files with the same ID. In that case, I would like to keep the latest file that was created. I have a UNIQUE constraint in my database, so I have to keep one file per ID, and I want it to be the latest one.

My SQL files look like this:

INSERT INTO Table...

(~1000 lines)

And I have, say, 230 of them. So I wanted to end up with something like this:

INSERT INTO Table...
INSERT INTO Table...
INSERT INTO Table...

(~200 files * 1000 lines)

This is the script I came up with:

import pandas as pd
from pathlib import Path

df = pd.DataFrame(columns=['Filename', 'Truncated filename', 'Content'])
for p in Path(path).glob('**/*.txt'):
    filename = p.name
    content = p.read_text()
    # Remove the last 16 characters from the filename: a datetime stamp plus ".txt". Ex: 202206141707.txt
    df.loc[len(df.index)] = [filename, filename[:-16], content]
# Sort by the full filename so that, within each ID, the newest timestamp comes last
# (glob order is not guaranteed)
df = df.sort_values('Filename')
# Keep the latest value in case of duplicates -- the result must be assigned back,
# otherwise the deduplicated frame is silently discarded
df = (df.groupby('Truncated filename', as_index=False)
        .apply(lambda x: x if len(x) == 1 else x.iloc[[-1]])
        .reset_index(level=0, drop=True))
# Concatenate the "Content" column; join with newlines so statements
# from different files don't end up on the same line
output_text = df['Content'].str.cat(sep='\n')
# Write the output file
with open(output_path, 'w') as f:
    f.write(output_text)

My logic was the following:

  • Get all files from the directory
  • Make a dataframe with filename and content as columns (the "Truncated filename" column helps me find duplicates)
  • Remove duplicates, keeping the last value each time (see the sketch after this list for a simpler way to do this step)
  • Concatenate the whole "Content" column into one text blob
  • Write that text blob to a file
  • Import it into my database

However, it doesn't seem to work. When I load the output file into my database, I still have rows that fail the UNIQUE constraint.

If you have an idea to solve this problem with another approach, it is welcome too.
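
For reference, here is a pandas-free sketch of the whole task. The paths are hypothetical placeholders, and it again assumes every filename ends with a 12-digit timestamp plus ".txt", so lexicographically larger names are newer:

from pathlib import Path

path = 'sql_files'       # hypothetical input directory
output_path = 'all.sql'  # hypothetical output file

latest = {}  # truncated filename -> Path of the newest matching file
# Iterate in filename order so that, per ID, newer files overwrite older ones
for p in sorted(Path(path).glob('**/*.txt'), key=lambda p: p.name):
    latest[p.name[:-16]] = p

with open(output_path, 'w') as f:
    for p in latest.values():
        f.write(p.read_text())
        f.write('\n')  # keep statements from different files on separate lines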
