Delete all rows that have a non-unique timestamp

Posted on 2025-01-20 22:00:24

I am stuck on a bit of code for my dissertation which is due in a few days so help would be really appreciated.

I have a NumPy array that looks like this:

[['2017-01-30T06:00:00.000000000', 48.67, 55.04],
['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]...

I am trying to check the timestamp of each row, and if any timestamp appears more than once, I want to remove all rows which have that timestamp. So the resulting array would look like this:

[['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]...

i.e. the rows at 6am and 8am are deleted because they appear more than once.

I have tried using np.unique but cannot get it to work. I have also tried looping through the array, checking whether each timestamp equals the previous one, and deleting both rows, but that does not work when there are three or more instances of the same timestamp.

I am really stuck for time so any help would be greatly, greatly appreciated.

The code I have tried so far is this:

def del_duplicate_rows(data):
  date_times = []
  for d in data:
    date_times.append(d)
    if len(date_times) > 1:
      if date_times[d] == date_times[d-1]:
        data = np.delete(data, d, axis=0)
        data = np.delete(data, d-1, axis=0)
  return data

猥琐帝 2025-01-27 22:00:24

No need for pandas or numpy. All you need is Python's built-in itertools:

import itertools

groups = list(list(group) for key, group in itertools.groupby(data, lambda x: x[0]))

result = list(itertools.chain.from_iterable(group for group in groups if len(group) == 1))
print(result)

With the given data, this outputs:

[['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]]

Explanation:

Since the timestamps are sorted, we can use .groupby() to group entries with the same timestamp together into one list.

We then iterate over these groups, keep only those with a single entry, and flatten the remaining one-entry groups with chain.from_iterable() to obtain the desired result.
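
Note that itertools.groupby only groups consecutive items, so the approach above relies on the rows being sorted by timestamp. If that ordering is not guaranteed, a minimal sketch along the following lines counts the timestamps first and then filters. The function name `drop_repeated_timestamps` and the shuffled sample rows are illustrative assumptions, not part of the original answer:

```python
from collections import Counter

def drop_repeated_timestamps(rows):
    """Keep only rows whose timestamp (first element) occurs exactly once."""
    # Count how often each timestamp appears, regardless of row order
    counts = Counter(row[0] for row in rows)
    return [row for row in rows if counts[row[0]] == 1]

# Sample rows from the question, deliberately out of order
rows = [
    ['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T06:00:00.000000000', 48.67, 55.04],
    ['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
    ['2017-01-30T09:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
]

result = drop_repeated_timestamps(rows)
print(result)
```

The surviving rows keep the order they had in the input, and the input does not need to be sorted.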

独享拥抱 2025-01-27 22:00:24

If the data is already in a pandas DataFrame, you can use drop_duplicates with keep=False, which drops every row whose key appears more than once:

df.drop_duplicates(subset='date', # or whatever the first column is called
                   keep=False)
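
As a sketch of how that call behaves on the sample data from the question: the column names `date`, `value_a`, and `value_b` below are assumptions chosen for illustration, since the question does not name its columns.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['2017-01-30T06:00:00.000000000', 48.67, 55.04],
        ['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
        ['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
        ['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
        ['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
        ['2017-01-30T09:00:00.000000000', 48.67262295, 55.04],
    ],
    columns=['date', 'value_a', 'value_b'],  # hypothetical column names
)

# keep=False drops every row whose 'date' appears more than once,
# instead of keeping the first or last occurrence
unique_rows = df.drop_duplicates(subset='date', keep=False)
print(unique_rows)
```

Only the 07:00 and 09:00 rows should remain, with their original index labels.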

在你怀里撒娇 2025-01-27 22:00:24

To find the unique values and count them, you can use np.unique(..., return_index=True, return_counts=True). Select the entries whose count equals 1, take their indices, and use those to index the original array:

import numpy as np

a = np.array([['2017-01-30T06:00:00.000000000', 48.67, 55.04],
['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]], dtype='object')

# idx holds the first index of each unique timestamp, cnt how often it occurs
unq, idx, cnt = np.unique(a[:,0], return_index=True, return_counts=True)
# keep only the rows whose timestamp occurs exactly once
out = a[idx[cnt==1]]
print(out)

Output:

[['2017-01-30T07:00:00.000000000' 48.67262295 55.04]
 ['2017-01-30T09:00:00.000000000' 48.67262295 55.04]]
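
Because np.unique returns its results sorted, indexing with idx yields the rows in sorted timestamp order. If the rows are not already sorted and their original order should be preserved, one possible variant builds a boolean mask from return_inverse instead. This is only a sketch, assuming the first column holds strings; the helper name `rows_with_unique_timestamp` and the shuffled sample array are hypothetical:

```python
import numpy as np

def rows_with_unique_timestamp(a):
    """Return the rows of `a` whose first-column value occurs exactly once."""
    timestamps = a[:, 0].astype(str)
    _, inverse, counts = np.unique(
        timestamps, return_inverse=True, return_counts=True
    )
    # counts[inverse] gives, for every row, how often its timestamp occurs
    mask = counts[inverse] == 1
    return a[mask]

# Sample rows from the question, deliberately out of order
a = np.array([
    ['2017-01-30T09:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T06:00:00.000000000', 48.67, 55.04],
    ['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
    ['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
], dtype=object)

out = rows_with_unique_timestamp(a)
print(out)
```

Here the 09:00 row comes before the 07:00 row in the result, matching their order in the input.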

梦一生花开无言 2025-01-27 22:00:24

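Here is an example using numpy.unique. Note that it indexes column 1 (the first value column) rather than the timestamp column, so it keeps one row per distinct value in that column: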

import numpy
data = numpy.array([
    ['2017-01-30T06:00:00.000000000', 48.67, 55.04],
    ['2017-01-30T06:00:00.000000000', 49.55249735, 55.04],
    ['2017-01-30T07:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T08:00:00.000000000', 48.67262295, 55.04],
    ['2017-01-30T08:00:00.000000000', 48.55544345, 55.04],
    ['2017-01-30T09:00:00.000000000', 48.67262295, 55.04]
], dtype='object')
result = data[numpy.unique(data[::, 1], return_index=True)[1]].tolist()

Result:

[
    ['2017-01-30T08:00:00.000000000', 48.55544345, 55.04], 
    ['2017-01-30T06:00:00.000000000', 48.67, 55.04], 
    ['2017-01-30T07:00:00.000000000', 48.67262295, 55.04], 
    ['2017-01-30T06:00:00.000000000', 49.55249735, 55.04]
]
