PHP loop performance optimization

Posted 2024-08-14 20:09:01

I am writing a PHP function that will need to loop over an array of pointers and, for each item, pull in that data (be it from a MySQL database or a flat file). Does anyone have any ideas for optimizing this, as there could potentially be thousands and thousands of iterations?

My first idea was to have a static array of cached data that I work on, where any modifications just change that cached array, and at the end I can flush it to disk. However, in a loop of over 1,000 items, this would be useless if I only keep around 30 in the array. Each item isn't too big, but 1,000+ of them in memory is way too much, hence the need for disk storage.

The data is just gzipped, serialized objects. Currently I am using a database to store the data, but I am thinking maybe flat files would be quicker (I don't care about concurrency issues, and I don't need to parse it, just unzip and unserialize). I already have a custom iterator that pulls in five items at a time (to cut down on DB connections) and stores them in this cache. But again, using a cache of 30 when I need to iterate over thousands is fairly useless.

Basically, I just need a way to iterate over this many items quickly.
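For reference, the poster's custom iterator isn't shown, but a minimal sketch of that batching pattern might look like the following. Everything here is illustrative rather than the poster's actual code: it assumes PDO, a hypothetical `items(id, data)` table, and blobs written with `gzcompress(serialize($obj))`.

```php
<?php
// A minimal sketch of a batching iterator: walks a list of ids, pulls
// `batchSize` rows per query to cut down on round trips, and inflates
// each blob with gzuncompress() + unserialize().
class BatchedItemIterator implements Iterator
{
    private array $cache = [];     // small window of inflated items
    private int $pos = 0;

    public function __construct(
        private PDO $pdo,
        private array $ids,        // the "array of pointers" (item ids)
        private int $batchSize = 5,
    ) {}

    private function fillCache(): void
    {
        // Fetch the next batch of rows in a single query.
        $batch = array_slice($this->ids, $this->pos, $this->batchSize);
        $placeholders = implode(',', array_fill(0, count($batch), '?'));
        $stmt = $this->pdo->prepare(
            "SELECT id, data FROM items WHERE id IN ($placeholders)"
        );
        $stmt->execute($batch);

        $this->cache = [];
        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            // Undo the storage encoding: gunzip, then unserialize.
            $this->cache[$row['id']] = unserialize(gzuncompress($row['data']));
        }
    }

    public function current(): mixed
    {
        $id = $this->ids[$this->pos];
        if (!array_key_exists($id, $this->cache)) {
            $this->fillCache();
        }
        return $this->cache[$id] ?? null; // null if the row is missing
    }

    public function key(): mixed   { return $this->ids[$this->pos]; }
    public function next(): void   { $this->pos++; }
    public function rewind(): void { $this->pos = 0; $this->cache = []; }
    public function valid(): bool  { return $this->pos < count($this->ids); }
}
```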

Comments (4)

凡间太子 2024-08-21 20:09:01

Well, you haven't given a whole lot to go on. You don't describe your data, and you don't describe what your data is doing or when you need one object as opposed to another, and how those objects get released temporarily, and under what circumstances you need them back, and...

So anything anybody says here is going to be a complete shot in the dark.

...so along those lines, here's a shot in the dark.

If you are only comfortable holding x items in memory at any one time, set aside space for x items. Then, every time you access the object, make a note of the time (this might not mean clock time so much as it may mean the order in which you access them). Keep each item in a list (it may not be implemented in a list, but rather as a heap-like structure) so that the most recently used items appear sooner in the list. When you need to put a new one into memory, you replace the one that was used the longest time ago and then you move that item to the front of the list. You may need to keep another index of the items so that you know where exactly they are in the list when you need them. What you do then is look up where the item is located, link its parent and child pointers as appropriate, then move it to the front of the list. There are probably other ways to optimize lookup time, too.

This is called the LRU algorithm. It's a page replacement scheme from virtual memory. What it does is delay your bottleneck (the disk I/O) until it is probably impossible to avoid. It is worth noting that this algorithm does not guarantee optimal replacement, but it performs pretty well nonetheless.
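In PHP specifically, the insertion order of an associative array can stand in for the list-plus-index structure described above. Here is a minimal LRU sketch, assuming the expensive disk/DB read is wrapped in a callback; `$loadItem` and `loadItemFromDb()` are hypothetical stand-ins, not anything from the question.

```php
<?php
// A minimal LRU cache sketch. PHP arrays remember insertion order, so
// the array itself acts as the recency list: oldest entry first.
class LruCache
{
    private array $items = [];  // key => item, least recently used first

    public function __construct(private int $capacity) {}

    public function get(string $key, callable $loadItem): mixed
    {
        if (array_key_exists($key, $this->items)) {
            // Hit: re-insert to mark this key as most recently used.
            $item = $this->items[$key];
            unset($this->items[$key]);
            return $this->items[$key] = $item;
        }

        // Miss: evict the least recently used entry once we're full.
        if (count($this->items) >= $this->capacity) {
            unset($this->items[array_key_first($this->items)]);
        }
        // $loadItem is a hypothetical callback doing the slow disk/DB read.
        return $this->items[$key] = $loadItem($key);
    }
}

// Usage: keep ~30 items in memory, as in the question.
// $cache = new LruCache(30);
// $item  = $cache->get((string) $id, fn ($id) => loadItemFromDb($id));
```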

Beyond that, I would recommend parallelizing your code to a large degree (if possible) so that when one item needs to hit the hard disk to load or to dump, you can keep that processor busy doing real work.

Edit: Based on your comment, you are working on a neural network. In the case of your initial feeding of the data (before the correction stage), or when you are actively using it to classify, I don't see why the algorithm would be a bad idea, unless there is just no possible way to fit the most commonly used nodes in memory.

In the correction stage (perhaps back-prop?), it should be apparent what nodes you MUST keep in memory... because you've already visited them!

If your network is large, you aren't going to get away without disk I/O. The trick is to find a way to minimize it.

木槿暧夏七纪年 2024-08-21 20:09:01

Clearly, keeping it in memory is faster than anything else. How big is each item? Even if they are 1 KB each, ten thousand of them is only 10 MB.

温柔女人霸气范 2024-08-21 20:09:01

You can always break out of the loop once you get the data you need, so that it does not keep looping. If it is flat files you are storing, your server's HDD will suffer from holding thousands or millions of files of different sizes. But if you are talking about storing the whole actual file in a DB, then it is much better to store it in a folder and just save the path of that file in the DB. Also try putting the pulled items in an XML file, so that it is much easier to access and can carry many attributes for the details of the pulled item, e.g. name, date uploaded, etc.
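A small sketch of the first two suggestions combined, breaking out early and keeping only the file's path in the DB; the `items(id, path)` schema and the `isTheOneWeNeed()` predicate are hypothetical.

```php
<?php
// Hypothetical schema: items(id, path), where `path` points to a gzipped,
// serialized object on disk. isTheOneWeNeed() stands in for whatever
// condition ends the search.
$found = null;
foreach ($pdo->query("SELECT id, path FROM items") as $row) {
    $item = unserialize(gzuncompress(file_get_contents($row['path'])));
    if (isTheOneWeNeed($item)) {
        $found = $item;
        break; // stop looping as soon as we have the data we need
    }
}
```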

你是暖光i 2024-08-21 20:09:01

You could use memcached to store objects the first time they are read, then use the cached version on subsequent calls. Memcached uses RAM to store objects, so as long as you have enough memory, you will get a great acceleration. There is a PHP API for memcached.
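A minimal read-through sketch with PHP's Memcached extension; `loadItemFromDb()` is a hypothetical stand-in for the poster's existing fetch, and the server address and TTL are illustrative.

```php
<?php
// Read-through caching with the Memcached extension: serve from RAM when
// possible, fall back to the slow store on a miss and populate the cache.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function getItem(Memcached $memcached, string $id): mixed
{
    $item = $memcached->get("item:$id");
    if ($memcached->getResultCode() === Memcached::RES_NOTFOUND) {
        $item = loadItemFromDb($id);               // hypothetical slow path
        $memcached->set("item:$id", $item, 3600);  // cache for an hour
    }
    return $item;
}
```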
