Handling memory usage for large computations in Python
I am trying to do some calculations with Python, where I run out of memory. Therefore, I want to read/write a file in order to free memory. I need something like a very big list object, so I thought of writing one line per object in a file and reading/writing those lines instead of keeping everything in memory. Line ordering is important for me, since I will use line numbers as an index. So I was wondering how I can replace lines in Python without moving the other lines around (actually, it is fine to move lines, as long as they return to where I expect them to be).
Edit
I am trying to help a friend whose Python is worse than or equal to mine. This code is supposed to find the biggest prime number that divides a given non-prime number. The code works for numbers up to around 1 million, but beyond that my memory gets exhausted while trying to build the numbers list.
# a comes from a user input
primes_upper_limit = (a+1) / 2
counter = 3L
numbers = list()
# collect every odd number up to the limit
while counter <= primes_upper_limit:
    numbers.append(counter)
    counter += 2L

# sieve: numbers[i] holds 2*i + 3, and is zeroed out when composite
counter = 3
i = 0
half = (primes_upper_limit + 1) / 2 - 1
root = primes_upper_limit ** 0.5
while counter < root:
    if numbers[i]:
        j = int((counter*counter - 3) / 2)
        numbers[j] = 0
        while j < half:
            numbers[j] = 0
            j += counter
    i += 1
    counter = 2*i + 3
primes = [2] + [num for num in numbers if num]
for numb in reversed(primes):
    if a % numb == 0:
        print numb
        break
Another Edit
What about writing a different file for each index? For example, a billion files with long-integer filenames, each containing just a single number?
You want to find the largest prime divisor of a. (Project Euler Question 3.)

Your current choice of algorithm and implementation do this by:

- generating a list numbers of all candidate primes in range (3 <= n <= sqrt(a), or (a+1)/2 as you currently do)
- sieving the numbers list to get a list of primes {p} <= sqrt(a)

My comments on this algorithm are below. Sieving and trial division are seriously not scalable algorithms, as Owen and I comment. For large a (billions, or trillions) you really should use NumPy. Anyway, some comments on implementing this algorithm:

- Did you realize you only need to sieve up to int(math.sqrt(a)), not (a+1)/2 as you do?
- Why build the list numbers, then sieve it for primeness - the numbers list is not scalable. Just construct the list primes directly. You can use while/for loops and xrange(3, sqrt(a)+2, 2) (which gives you an iterator). As you mention, xrange() overflows at 2**31L, but combined with the sqrt observation you can still successfully factor up to 2**62.

So I would separate out generate_primes(a) from find_largest_prime_divisor(a). Decomposition helps greatly.

Here is my rewrite of your code, but performance still falls off in the billions (a > 10**11 + 1) due to keeping the sieved list. We can use collections.deque instead of list for primes, to get a faster O(1) append() operation, but that's a minor optimization.
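A sketch of what that rewrite could look like (the answer's original code block is not preserved here, so this is an illustrative reconstruction following the comments above, assuming an odd-only sieve kept as a plain list):

import math

def generate_primes(upper):
    # Odd-only sieve: numbers[i] holds 2*i + 3, zeroed out once crossed off.
    numbers = range(3, upper + 1, 2)
    i = 0
    counter = 3
    while counter * counter <= upper:
        if numbers[i]:
            # cross off counter**2, counter**2 + 2*counter, ...
            j = (counter * counter - 3) // 2
            while j < len(numbers):
                numbers[j] = 0
                j += counter
        i += 1
        counter = 2 * i + 3
    return [2] + [n for n in numbers if n]

def find_largest_prime_divisor(a):
    # Only primes <= sqrt(a) need sieving: if none of them divides
    # the remaining cofactor, that cofactor is itself prime.
    largest = 1
    remaining = a
    for p in generate_primes(int(math.sqrt(a))):
        while remaining % p == 0:
            largest = p
            remaining //= p
    if remaining > 1:       # leftover cofactor is prime
        largest = remaining
    return largest

print find_largest_prime_divisor(600851475143)    # 6857, per Project Euler #3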
If I'm understanding you correctly, this is not an easy task. The way I interpreted it, you want to keep a file handle open and use the file as a place to store character data.

Say you had a file like

a
b
c

and you wanted to replace 'b' with 'bb'. That's going to be a pain, because the file actually looks like a\nb\nc -- you can't just overwrite the b, you need another byte.

My advice would be to try to find a way to make your algorithm work without using a file for extra storage. If you got a stack overflow, chances are you didn't really run out of memory; you overran the call stack, which is much smaller.

You could try reworking your algorithm to not be recursive. Sometimes you can use a list to substitute for the call stack -- but there are many things you could do, and I don't think I can give much general advice without seeing your algorithm.

edit

Ah, I see what you mean... when the list grows really big, you could run out of memory. So I guess you're basically doing a sieve, and that's why you have the big list numbers? It makes sense. If you want to keep doing it this way, you could try a numpy bool array, because it will use substantially less memory per cell:
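For instance, a minimal sketch (it reuses primes_upper_limit from the question; the one-flag-per-odd-number layout is an assumption):

import numpy

# One bool per odd candidate 3, 5, 7, ...: is_prime[i] covers 2*i + 3.
# numpy stores these as single bytes, not full Python integer objects.
size = (primes_upper_limit - 1) // 2
is_prime = numpy.ones(size, dtype=numpy.bool_)

i = 0
counter = 3
while counter * counter <= primes_upper_limit:
    if is_prime[i]:
        # cross off counter**2, counter**2 + 2*counter, ... in one slice
        is_prime[(counter * counter - 3) // 2 :: counter] = False
    i += 1
    counter = 2 * i + 3

primes = [2] + [2 * k + 3 for k in numpy.nonzero(is_prime)[0]]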
Or (and maybe this is not appealing) you could go with an entirely different approach that doesn't use a big list, such as factoring the number entirely and picking the biggest factor. Something like:
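A sketch of that approach (the function below is illustrative; tail is the still-unfactored part of the number, matching the walkthrough that follows):

def largest_factor(n):
    tail = n        # the part of n we have not factored yet
    biggest = 1
    d = 2
    while d * d <= tail:
        if tail % d == 0:
            biggest = d     # found a factor
            tail //= d      # shrink the tail and keep dividing
        else:
            d += 1
    # whatever is left of the tail is prime
    return tail if tail > 1 else biggest

print largest_factor(20)    # prints 5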
ie say you were factoring 20: tail starts out as 20, then you find 2 and tail becomes 10, then it becomes 5. This is not terribly efficient and will become way too slow for a large (billions) prime number, but it's ok for numbers with small factors.

I mean your sieve is good too, until you start running out of memory ;). You could give numpy a shot.
pytables is excellent for working with and storing huge amounts of data. But first start with implementing the comments in smci's answer to minimize the amount of numbers you need to store.
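A minimal sketch of that idea (assuming the PyTables 3 API; the file name, array name, and sizes are illustrative):

import numpy
import tables

n = 10**9                                        # number of sieve flags
h5 = tables.open_file('sieve.h5', mode='w')      # backing store on disk
flags = h5.create_carray(h5.root, 'flags', tables.BoolAtom(), shape=(n,))

# A fresh carray reads back as all False. Slices behave like numpy arrays,
# but only the chunks you touch are loaded into memory.
flags[100:200] = numpy.ones(100, dtype=bool)
print flags[95:105]

h5.close()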
For a number with only twelve digits, as in Project Euler #3, no fancy integer factorization method is needed, and there is no need to store intermediate results on disk. Use this algorithm to find the factors of n:
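One way to write it (a sketch; the exact loop structure is an assumption, but it matches the description below):

def print_factors(n):
    # Trial division: strip off each factor as soon as it is found.
    f = 2
    while f * f <= n:
        if n % f == 0:
            print f         # f is a (prime) factor
            n //= f
        else:
            f += 1
    print n                 # the remaining cofactor is prime

print_factors(600851475143)    # prints 71, 839, 1471, 6857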
This just does trial division by every integer until it reaches the square root, which indicates that the remaining cofactor is prime. Each factor is printed as it is found.