Handling memory usage for large computations in Python
I am trying to do some calculations with Python, where I run out of memory. Therefore, I want to read/write a file in order to free memory. I need something like a very big list object, so I thought of writing one line per object in a file and reading/writing those lines instead of keeping everything in memory. Line ordering is important for me, since I will use line numbers as an index. So I was wondering how I can replace lines in Python without moving the other lines around (actually, it is fine to move lines, as long as they return to where I expect them to be).
Edit
I am trying to help a friend whose Python is worse than or equal to mine. This code is supposed to find the biggest prime number that divides a given non-prime number. The code works for numbers up to around 1 million, but beyond that my memory gets exhausted while trying to build the numbers list.
# a comes from a user input
primes_upper_limit = (a+1) / 2
counter = 3L
numbers = list()
# collect every odd number up to the limit
while counter <= primes_upper_limit:
    numbers.append(counter)
    counter += 2L

# sieve: numbers[i] holds 2*i + 3, and is zeroed out when composite
counter = 3
i = 0
half = (primes_upper_limit + 1) / 2 - 1
root = primes_upper_limit ** 0.5
while counter < root:
    if numbers[i]:
        j = int((counter*counter - 3) / 2)
        numbers[j] = 0
        while j < half:
            numbers[j] = 0
            j += counter
    i += 1
    counter = 2*i + 3
primes = [2] + [num for num in numbers if num]
for numb in reversed(primes):
    if a % numb == 0:
        print numb
        break
Another Edit
What about writing a different file for each index? For example, a billion files with long-integer filenames, each containing just a single number?
You want to find the largest prime divisor of a. (Project Euler Question 3.)

Your current choice of algorithm and implementation do this by:

- generating a list numbers of all candidate primes in range (3 <= n <= sqrt(a), or (a+1)/2 as you currently do)
- sieving the numbers list to get a list of primes {p} <= sqrt(a)

My comments on this algorithm are below. Sieving and trial division are seriously not scalable algorithms, as Owen and I comment. For large a (billions, or trillions) you really should use NumPy. Anyway, some comments on implementing this algorithm:

- Did you realize you only need to sieve up to int(math.sqrt(a)), not (a+1)/2 as you do?
- Why build the list numbers, then sieve it for primeness - the numbers list is not scalable. Just construct the list primes directly. You can use while/for loops and xrange(3, sqrt(a)+2, 2) (which gives you an iterator). As you mention, xrange() overflows at 2**31L, but combined with the sqrt observation you can still successfully factor up to 2**62.

So I would separate out generate_primes(a) from find_largest_prime_divisor(a). Decomposition helps greatly.

Here is my rewrite of your code, but performance still falls off in the billions (a > 10**11 + 1) due to keeping the sieved list. We can use collections.deque instead of list for primes, to get a faster O(1) append() operation, but that's a minor optimization.
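A sketch of what that rewrite could look like (the answer's original code block is not preserved here, so this is an illustrative reconstruction following the comments above, assuming an odd-only sieve kept as a plain list):

import math

def generate_primes(upper):
    # Odd-only sieve: numbers[i] holds 2*i + 3, zeroed out once crossed off.
    numbers = range(3, upper + 1, 2)
    i = 0
    counter = 3
    while counter * counter <= upper:
        if numbers[i]:
            # cross off counter**2, counter**2 + 2*counter, ...
            j = (counter * counter - 3) // 2
            while j < len(numbers):
                numbers[j] = 0
                j += counter
        i += 1
        counter = 2 * i + 3
    return [2] + [n for n in numbers if n]

def find_largest_prime_divisor(a):
    # Only primes <= sqrt(a) need sieving: if none of them divides
    # the remaining cofactor, that cofactor is itself prime.
    largest = 1
    remaining = a
    for p in generate_primes(int(math.sqrt(a))):
        while remaining % p == 0:
            largest = p
            remaining //= p
    if remaining > 1:       # leftover cofactor is prime
        largest = remaining
    return largest

print find_largest_prime_divisor(600851475143)    # 6857, per Project Euler #3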
If I'm understanding you correctly, this is not an easy task. The way I interpreted it, you want to keep a file handle open and use the file as a place to store character data.

Say you had a file like

a
b
c

and you wanted to replace 'b' with 'bb'. That's going to be a pain, because the file actually looks like a\nb\nc -- you can't just overwrite the b, you need another byte.

My advice would be to try to find a way to make your algorithm work without using a file for extra storage. If you got a stack overflow, chances are you didn't really run out of memory; you overran the call stack, which is much smaller.

You could try reworking your algorithm to not be recursive. Sometimes you can use a list to substitute for the call stack -- but there are many things you could do, and I don't think I can give much general advice without seeing your algorithm.

edit

Ah, I see what you mean... when the list grows really big, you could run out of memory. So I guess you're basically doing a sieve, and that's why you have the big list numbers? It makes sense. If you want to keep doing it this way, you could try a numpy bool array, because it will use substantially less memory per cell:
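For instance, a minimal sketch (it reuses primes_upper_limit from the question; the one-flag-per-odd-number layout is an assumption):

import numpy

# One bool per odd candidate 3, 5, 7, ...: is_prime[i] covers 2*i + 3.
# numpy stores these as single bytes, not full Python integer objects.
size = (primes_upper_limit - 1) // 2
is_prime = numpy.ones(size, dtype=numpy.bool_)

i = 0
counter = 3
while counter * counter <= primes_upper_limit:
    if is_prime[i]:
        # cross off counter**2, counter**2 + 2*counter, ... in one slice
        is_prime[(counter * counter - 3) // 2 :: counter] = False
    i += 1
    counter = 2 * i + 3

primes = [2] + [2 * k + 3 for k in numpy.nonzero(is_prime)[0]]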
Or (and maybe this is not appealing) you could go with an entirely different approach that doesn't use a big list, such as factoring the number entirely and picking the biggest factor. Something like:
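A sketch of that approach (the function below is illustrative; tail is the still-unfactored part of the number, matching the walkthrough that follows):

def largest_factor(n):
    tail = n        # the part of n we have not factored yet
    biggest = 1
    d = 2
    while d * d <= tail:
        if tail % d == 0:
            biggest = d     # found a factor
            tail //= d      # shrink the tail and keep dividing
        else:
            d += 1
    # whatever is left of the tail is prime
    return tail if tail > 1 else biggest

print largest_factor(20)    # prints 5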
ie say you were factoring 20: tail starts out as 20, then you find 2 and tail becomes 10, then it becomes 5. This is not terribly efficient and will become way too slow for a large (billions) prime number, but it's ok for numbers with small factors.

I mean your sieve is good too, until you start running out of memory ;). You could give numpy a shot.
pytables is excellent for working with and storing huge amounts of data. But first start with implementing the comments in smci's answer to minimize the amount of numbers you need to store.
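A minimal sketch of that idea (assuming the PyTables 3 API; the file name, array name, and sizes are illustrative):

import numpy
import tables

n = 10**9                                        # number of sieve flags
h5 = tables.open_file('sieve.h5', mode='w')      # backing store on disk
flags = h5.create_carray(h5.root, 'flags', tables.BoolAtom(), shape=(n,))

# A fresh carray reads back as all False. Slices behave like numpy arrays,
# but only the chunks you touch are loaded into memory.
flags[100:200] = numpy.ones(100, dtype=bool)
print flags[95:105]

h5.close()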
For a number with only twelve digits, as in Project Euler #3, no fancy integer factorization method is needed, and there is no need to store intermediate results on disk. Use this algorithm to find the factors of n:
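One way to write it (a sketch; the exact loop structure is an assumption, but it matches the description below):

def print_factors(n):
    # Trial division: strip off each factor as soon as it is found.
    f = 2
    while f * f <= n:
        if n % f == 0:
            print f         # f is a (prime) factor
            n //= f
        else:
            f += 1
    print n                 # the remaining cofactor is prime

print_factors(600851475143)    # prints 71, 839, 1471, 6857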
This just does trial division by every integer until it reaches the square root, which indicates that the remaining cofactor is prime. Each factor is printed as it is found.