Optimizing iteration over large datasets

Posted on 2025-01-27 07:43:53

I have two large datasets, df1 and df2, each with a column that records the time each observation was made. I want to find the time difference between every entry of df1 and every entry of df2.

The code below works, but it runs into memory errors when I try to run it on the entire datasets. How can I optimize it for memory efficiency?

import pandas as pd

df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")

LINE_NUMBER_table0 = []  # Row numbers from table0, one per (i, j) pair
LINE_NUMBER_table1 = []  # Row numbers from table1, one per (i, j) pair
TIME_DIFFERENCE = []  # Time difference between row i of table0 and row j of table1

for i in range(1000):
    for j in range(1000):
        LINE_NUMBER_table0.append(i)  # Record row i of table0
        LINE_NUMBER_table1.append(j)  # Record row j of table1
        timedifference = df1["mjd"][i] - df2["MJD"][j]  # Time difference between row i and row j
        TIME_DIFFERENCE.append(timedifference)  # Store the difference for this pair

千と千尋 2025-02-03 07:43:53

You do not need a loop for that. Python loops are generally inefficient, especially when iterating over pandas DataFrames (see this post). Use vectorized calls instead, such as NumPy or pandas functions. In this case, you can use np.tile and np.repeat. Here is an (untested) example:

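import numpy as np
import pandas as pd

df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")

n = 1000  # Number of rows taken from each table, as in the question's loops

# Row indices of every (i, j) pair, in the same order as the nested loops
tmp = np.arange(n)
LINE_NUMBER_table0 = np.repeat(tmp, n)  # 0, 0, ..., 1, 1, ...
LINE_NUMBER_table1 = np.tile(tmp, n)  # 0, 1, ..., 0, 1, ...

# Slice to the first n rows so that both arrays end up with n * n elements
df1_mjd = np.repeat(df1["mjd"].to_numpy()[:n], n)
df2_MJD = np.tile(df2["MJD"].to_numpy()[:n], n)
TIME_DIFFERENCE = df1_mjd - df2_MJD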
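
To see why np.repeat and np.tile reproduce the loop order, here is a small self-contained check on made-up values. The arrays `a` and `b` are illustrative stand-ins for the two time columns, not data from the question:

```python
import numpy as np

# Made-up stand-ins for df1["mjd"] and df2["MJD"] (illustrative values only)
a = np.array([10.0, 20.0, 30.0])
b = np.array([1.0, 2.0])

# Nested-loop reference, same structure as the code in the question
loop_i, loop_j, loop_diff = [], [], []
for i in range(len(a)):
    for j in range(len(b)):
        loop_i.append(i)
        loop_j.append(j)
        loop_diff.append(a[i] - b[j])

# Vectorized version: repeat each element of a len(b) times, tile b len(a) times
idx_i = np.repeat(np.arange(len(a)), len(b))
idx_j = np.tile(np.arange(len(b)), len(a))
diff = np.repeat(a, len(b)) - np.tile(b, len(a))

# Both versions list the pairs in the same order
assert idx_i.tolist() == loop_i
assert idx_j.tolist() == loop_j
assert diff.tolist() == loop_diff
```

The repeat count is the length of the other array, which matters as soon as the two tables have different numbers of rows.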

Note that you can convert a NumPy array back to a list with your_array.tolist(), but it is better to keep working with NumPy arrays for the sake of performance. pandas uses NumPy arrays internally, so converting between a pandas DataFrame and a NumPy array is cheap, unlike converting to and from lists.
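
On the memory question itself: any method that stores every pair needs len(df1) * len(df2) entries, so vectorizing alone may not be enough to avoid the memory error on the full datasets. One possible approach, sketched here under assumptions, is to compute the differences block by block with broadcasting and to reduce or write out each block before moving on. The helper name `iter_time_differences`, the `chunk_size` parameter, and the sample values are hypothetical and not part of the original answer:

```python
import numpy as np


def iter_time_differences(t1, t2, chunk_size=1000):
    """Yield (start_row, block), where block[i, j] = t1[start_row + i] - t2[j].

    Hypothetical helper: only one block of shape (<= chunk_size, len(t2))
    is held in memory at a time.
    """
    t1 = np.asarray(t1, dtype=float)
    t2 = np.asarray(t2, dtype=float)
    for start in range(0, len(t1), chunk_size):
        # Broadcasting a column vector against a row vector gives all pairs
        block = t1[start:start + chunk_size, None] - t2[None, :]
        yield start, block


# Made-up values standing in for df1["mjd"].to_numpy() and df2["MJD"].to_numpy()
t1 = np.array([10.0, 20.0, 30.0])
t2 = np.array([1.0, 2.0])

# Example reduction: for each entry of t1, the index of the closest entry of t2
closest = np.empty(len(t1), dtype=int)
for start, block in iter_time_differences(t1, t2, chunk_size=2):
    closest[start:start + len(block)] = np.abs(block).argmin(axis=1)
```

What to do with each block depends on the goal, which the question does not state; the closest-match reduction is only an illustration. If every pair really has to be kept, each block could instead be appended to a file on disk.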
