Python, PyTables - taking advantage of in-kernel searching



I have HDF5 files with multiple groups, where each group contains a data set with >= 25 million rows. At each time step of simulation, each agent outputs the other agents he/she sensed at that time step. There are ~2000 agents in the scenario and thousands of time steps; the O(n^2) nature of the output explains the huge number of rows.

What I'm interested in calculating is the number of unique sightings by category. For instance, agents belong to a side, red, blue, or green. I want to make a two-dimensional table where row i, column j is the number of agents in category j that were sensed by at least one agent in category i. (I'm using the Sides in this code example, but we could classify the agents in other ways as well, such as by the weapon they have, or the sensors they carry.)

Here's a sample output table; note that the simulation does not output blue/blue sensations because they would take a ton of room and we aren't interested in them. (Same for green/green.)

        blue   green   red
blue       0     492   186
green   1075       0   186
red      451     498    26

The columns are (a possible PyTables description of this layout is sketched after the list):

  1. tick - time step
  2. sensingAgentId - id of agent doing sensing
  3. sensedAgentId - id of agent being sensed
  4. detRange - range in meters between two agents
  5. senseType - an enumerated type for what type of sensing was done
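
For reference, a data set with that layout could be described to PyTables roughly as below; the question does not show the actual column types, so the Int32/Float32 choices here are assumptions.

# Hypothetical PyTables description matching the columns listed above;
# the real column types in the file are not shown in the question.
from tables import IsDescription, Int32Col, Float32Col

class Detection(IsDescription):
  tick           = Int32Col()    # time step
  sensingAgentId = Int32Col()    # id of the agent doing the sensing
  sensedAgentId  = Int32Col()    # id of the agent being sensed
  detRange       = Float32Col()  # range in meters between the two agents
  senseType      = Int32Col()    # enumerated type for the kind of sensing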

Here's the code I am currently using to accomplish this:

from tables import openFile  # PyTables (2.x API)

def createHeatmap():
  h5file = openFile("someFile.h5")
  run0 = h5file.root.run0.detections

  # A dictionary of dictionaries: {'blue': {'blue': 0, 'red': 0, ...}, ...}
  # emptyDict and sides are helpers defined elsewhere; sides holds the side names.
  classHeat = emptyDict(sides)

  # Interested in Per Category Unique Detections
  seenClass = {}

  # Initially each side has seen no one    
  for theSide in sides:
    seenClass[theSide] = []

  # In-kernel search filtering out many rows in file; in this instance 25,789,825 rows
  # are filtered to 4,409,176  
  classifications = run0.where('senseType == 3')

  # Iterate and filter 
  for row in classifications:
    sensedId = row['sensedAgentId']
    # side is a function that returns the string representation of the side of agent
    # with that id.
    sensedSide = side(sensedId)
    sensingSide = side(row['sensingAgentId'])

    # The side has already seen this agent before; ignore it
    if sensedId in seenClass[sensingSide]:
      continue
    else:
      classHeat[sensingSide][sensedSide] += 1
      seenClass[sensingSide].append(sensedId)


  return classHeat

Note: I have a Java background, so I apologize if this is not Pythonic. Please point this out and suggest ways to improve this code; I would love to become more proficient with Python.

Now, this is very slow: it takes approximately 50 seconds to do this iteration and membership checking, and this is with the most restrictive set of membership criteria (other detection types have many more rows to iterate over).

My question is, is it possible to move the work out of python and into the in-kernel search query? If so, how? Is there some glaringly obvious speedup I am missing? I need to be able to run this function for each run in a set of runs (~30), and for multiple sets of criteria (~5), so it would be great if this could be sped up.
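
One candidate speedup, sketched below under the assumption that the ~4.4 million matching rows fit in memory: readWhere still evaluates the condition in-kernel, but returns all matching rows as a NumPy structured array in a single call, and a set makes the already-seen check O(1) instead of a linear scan of a list. (side and sides are the same helpers assumed in the code above.)

from tables import openFile

def createHeatmapFaster():
  h5file = openFile("someFile.h5")
  run0 = h5file.root.run0.detections

  # One in-kernel query; the matching rows come back as a structured array.
  hits = run0.readWhere('senseType == 3')

  # Nested dict of counts, e.g. {'blue': {'blue': 0, 'green': 0, 'red': 0}, ...}
  classHeat = dict((a, dict((b, 0) for b in sides)) for a in sides)

  # (sensingSide, sensedAgentId) pairs that have already been counted.
  seen = set()

  for sensingId, sensedId in zip(hits['sensingAgentId'], hits['sensedAgentId']):
    sensingSide = side(sensingId)
    if (sensingSide, sensedId) not in seen:
      seen.add((sensingSide, sensedId))
      classHeat[sensingSide][side(sensedId)] += 1

  h5file.close()
  return classHeat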

Final note: I tried using psyco but that barely made a difference.


Answer by 心头的小情儿 (2024-08-23 23:29:41):


If you have N=~2k agents, I suggest putting all sightings into a numpy array of size NxN. This easily fits in memory (around 16 meg for integers). Just store a 1 wherever a sighting occurred.

Assume that you have an array sightings. The first coordinate is Sensing, the second is Sensed. Assume you also have 1-d index arrays listing which agents are on which side. You can get the number of sightings of side B by side A this way:

sideAseesB = sightings[sideAindices, sideBindices]
sideAseesBcount = numpy.logical_or.reduce(sideAseesB, axis=0).sum()

It's possible you'd need to use sightings.take(sideAindices, axis=0).take(sideBindices, axis=1) in the first step, but I doubt it.
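
Putting that together, here is a minimal sketch of the NxN approach. It assumes agent ids run from 0 to N-1 and that hits is the structured array returned by the in-kernel query (neither is given above). Note that plain sightings[sideAindices, sideBindices] pairs the two index arrays element-wise rather than selecting the full sub-block, so the sketch uses numpy.ix_, which performs the same row/column cross-selection as the .take chain mentioned above.

import numpy

N = 2000                                        # approximate number of agents
sightings = numpy.zeros((N, N), dtype=numpy.uint8)

# hits is assumed to come from the in-kernel query, e.g.
# hits = run0.readWhere('senseType == 3'); store a 1 for every
# (sensing, sensed) pair that ever occurred.
sightings[hits['sensingAgentId'], hits['sensedAgentId']] = 1

# Group agent ids by side, using the side() helper from the question.
sideIndices = {}
for s in ('blue', 'green', 'red'):
  sideIndices[s] = numpy.array([a for a in range(N) if side(a) == s])

classHeat = {}
for sideA, aIdx in sideIndices.items():
  classHeat[sideA] = {}
  for sideB, bIdx in sideIndices.items():
    # numpy.ix_ selects the full |A| x |B| sub-block (rows: side A sensors,
    # columns: side B agents), unlike paired indexing with two 1-d arrays.
    block = sightings[numpy.ix_(aIdx, bIdx)]
    # A side-B agent counts once if at least one side-A agent ever sensed it.
    classHeat[sideA][sideB] = int(block.any(axis=0).sum())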
