Performance bottleneck in a Python mapping data structure

Posted on 2024-12-07 11:04:56

I am facing a performance problem with one of the data structures I use in a bigger Python project.

Basically, I am importing a tab-delimited file. Using the normal Python open(...) file iterator, I split each line with line.split("\t"). Now I want to insert the actual value of a column into some sort of dictionary that returns an ID for the value. And that is where it gets slow:

In general, the dictionary class looks like this:

import math

class Dictionary(list):
    def getBitLength(self):
        # log2 of the number of stored values (0 for an empty dictionary)
        if len(self) == 0:
            return 0
        else:
            return math.log(len(self), 2)

    def insertValue(self, value):
        # The ID of a value is its index in the list
        self.append(value)
        return len(self) - 1

    def getValueForValueId(self, valueId):
        return self[valueId]

    def getValueIdForValue(self, value):
        # Both `in` and `index` scan the list linearly
        if value in self:
            return self.index(value)
        else:
            return self.insertValue(value)

The basic idea is that the valueId is the index of the value in the dictionary list.

Profiling the program tells me that more than 50% of the runtime is spent in getValueIdForValue(...).

1566562 function calls in 23.218 seconds

Ordered by: cumulative time
List reduced from 93 to 10 due to restriction <10>

240000   13.341    0.000   16.953    0.000 Dictionary.py:22(getValueIdForValue)
206997    3.196    0.000    3.196    0.000 :0(index)

The problem is that this is just a small test. In the real application this function would be called several million times, which would increase the runtime considerably.

Of course I could inherit from the Python dict, but then the performance problem is quite similar, since I need to get the key for a given value (in case the value has already been inserted into the dictionary).

Since I am not a Python pro yet, can you give me any tips on how to make this more efficient?

Best & thanks for the help,

n3otec

===

Thanks guys!

Performance of bidict is much better:

  240000    2.458    0.000    8.546    0.000 Dictionary.py:34(getValueIdForValue)
  230990    1.678    0.000    5.134    0.000 Dictionary.py:27(insertValue)

Best,
n3otec


Comments (1)

拥抱影子 2024-12-14 11:04:57

If keys and values are unique, you can use a bidirectional dictionary. There is a Python package for that here
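
For illustration, here is a minimal sketch of the same idea using only the standard library rather than a package. It keeps the method names from the question, but the internal attributes `_values` and `_ids` are hypothetical names chosen for this example, and it assumes the values are hashable (strings from `line.split("\t")` are). The list answers "ID to value" and the dict answers "value to ID", so neither direction has to scan.

```python
import math


class Dictionary:
    """Maps values to integer IDs and back without linear scans."""

    def __init__(self):
        self._values = []  # ID -> value (the ID is the list index)
        self._ids = {}     # value -> ID (hash-based reverse lookup)

    def getBitLength(self):
        # Same formula as in the question: log2 of the number of entries.
        if not self._values:
            return 0
        return math.log(len(self._values), 2)

    def insertValue(self, value):
        valueId = len(self._values)
        self._values.append(value)
        # setdefault keeps the first ID if the same value is inserted twice,
        # which matches what list.index() would have returned.
        self._ids.setdefault(value, valueId)
        return valueId

    def getValueForValueId(self, valueId):
        return self._values[valueId]

    def getValueIdForValue(self, value):
        # One dict lookup replaces the `in` scan plus the `index` scan.
        valueId = self._ids.get(value)
        if valueId is None:
            valueId = self.insertValue(value)
        return valueId
```

With this sketch, calling `getValueIdForValue` twice with the same value returns the same ID, and `getValueForValueId` maps that ID back to the value. The trade-off is that every value is referenced from two containers, so memory use is somewhat higher than with the plain list.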
