Python - How to build and access large amounts of data without running out of memory or slowing down processing

Posted on 2024-12-11 04:12:54

I am writing a script in Python to QC data in a proprietary ESRI database table. The purpose of the script is not to modify invalid data, but simply to report invalid data to the user via a csv file. I am using ESRI's ArcPy package to access each individual record with arcpy.SearchCursor. The SearchCursor is the only way to access each individual record in the ESRI formats.

As I scroll through each record of the tables, I do multiple QC checks to validate specific business logic. One of those checks is looking for duplicate data in particular fields. One of those fields may be geometry. I have done this by creating an empty container object for each of those fields and as I check each record I use the following logic.

for field in dupCheckFields:
    if row.getValue(field) in fieldValues[field]: dupValues.add(row.getValue(idField))
    else: fieldValues[field].append(row.getValue(field))

The above code is an example of the basic logic I use. The trouble is that each of these tables may contain anywhere from 5,000 to 10 million records, and I either run out of memory or performance grinds to a halt.
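
For reference, here is a minimal sketch of how that check sits inside the cursor loop. The table path, ID field, and checked field names below are placeholders, and the set-based containers are just one of the in-memory variants I describe next; this is illustrative rather than my production code.

import arcpy

# Placeholder values -- the real table path and field names differ.
fc = r"path\to\feature\class"
idField = "OBJECTID"
dupCheckFields = ["GUID_FIELD", "NAME_FIELD"]

# One container of previously seen values per checked field, plus a set
# of record IDs flagged as duplicates. A geometry field would first need
# a hashable representation (e.g. its WKT text) before it could go in a set.
fieldValues = dict((field, set()) for field in dupCheckFields)
dupValues = set()

rows = arcpy.SearchCursor(fc)
for row in rows:
    for field in dupCheckFields:
        value = row.getValue(field)
        if value in fieldValues[field]:
            dupValues.add(row.getValue(idField))
        else:
            fieldValues[field].add(value)

# Everything above stays in memory, which is exactly what breaks down on
# tables with millions of records.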

I have tried the following container types: sets, lists, dictionaries, ZODB + BList, and Shelve.

With the in-memory types (sets, lists, dictionaries) the process is very fast at the start, but it gets much slower as it progresses; with these types, if the table has many records, I run out of memory. With the persistent data types I don't run out of memory, but they take a very long time to process.

I only need the data while the script is running and any persistent data files will be deleted upon completion.

Question: Is there a better container type out there to provide low-memory storage of lots of data without a large cost in performance when accessing the data?

System: Win7 64-bit, Python 2.6.5 32-bit, 4 GB RAM

Thanks in advance for your help.

EDIT:

Sample SQLite code:

import sqlite3, os, arcpy, timeit

fc = r"path\to\feature\class"

# test feature class was in ESRI ArcSDE format and contained "." characters separating database name, owner, and feature class name
fcName = fc.split(".")[-1]

# convert ESRI data types to SQLite data types
dataTypes = {"String":"text","Guid":"text","Double":"real","SmallInteger":"integer"}

fields = [(field.name,dataTypes[field.type]) for field in arcpy.ListFields(fc) if field.name != arcpy.Describe(fc).OIDFieldName]

# SQL string to create table in SQLite with same schema as feature class
createTableString = """create table %s(%s,primary key(%s))""" % (fcName,",\n".join('%s %s' % field for field in fields),fields[0][0])

# SQL string to insert data into SQLite table
insertString = """insert into %s values(%s)""" % (fcName, ",".join(["?" for i in xrange(len(fields))]))

# location to save SQLite database
loc = r'C:\TEMPORARY_QC_DATA'

def createDB():
    conn = sqlite3.connect(os.path.join(loc,'database.db'))
    cur = conn.cursor()

    cur.execute(createTableString)

    conn.commit()

    rows = arcpy.SearchCursor(fc)

    i = 0
    for row in rows:
        try:
            cur.execute(insertString, [row.getValue(field[0]) for field in fields])
            if i % 10000 == 0:
                print i, "records"
                conn.commit()
            i += 1
        except sqlite3.IntegrityError: pass
    print i, "records"

t1 = timeit.Timer("createDB()","from __main__ import createDB")

print t1.timeit(1)

Unfortunately I cannot share the test data used with this code; it was an ESRI ArcSDE geodatabase table with approximately 10 fields and roughly 7 million records.

I tried to use timeit to determine how long the process took, but after 2 hours of processing only 120,000 records had been completed.
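
For comparison, here is a sketch of a batched variant of the same load that I have not timed: it collects rows and pushes them through cursor.executemany() with one commit per batch, and uses "insert or ignore" so duplicate keys are dropped by SQLite itself instead of via per-row IntegrityError handling. The function name and batch size are arbitrary; fc, fields, fcName, createTableString, and loc are reused from the code above.

def createDBBatched(batchSize=10000):
    conn = sqlite3.connect(os.path.join(loc, 'database.db'))
    cur = conn.cursor()

    cur.execute(createTableString)
    conn.commit()

    # "insert or ignore" silently skips rows that violate the primary key,
    # standing in for the per-row IntegrityError handling above.
    ignoreInsertString = """insert or ignore into %s values(%s)""" % (
        fcName, ",".join(["?" for i in xrange(len(fields))]))

    rows = arcpy.SearchCursor(fc)
    batch = []
    i = 0
    for row in rows:
        batch.append([row.getValue(field[0]) for field in fields])
        i += 1
        if len(batch) == batchSize:
            cur.executemany(ignoreInsertString, batch)
            conn.commit()
            print i, "records"
            batch = []

    # Flush whatever is left in the final, partial batch.
    if batch:
        cur.executemany(ignoreInsertString, batch)
    conn.commit()
    conn.close()
    print i, "records"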

Comments (2)

残月青衣踏尘吟 2024-12-18 04:12:54

If you store hashes in (compressed) files, you could stream through them to compare hashes to look for duplicates. Streaming usually has very low memory requirements — you can set the buffer you want, say, one line per hashed record. The tradeoff is generally time, particularly if you add compression, but if you order the files by some criteria, then you may be able to walk through the uncompressed streams to more quickly compare records.
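
A rough sketch of that idea, assuming the hashes are written out during the cursor pass as one "hash<TAB>record id" line per record, the file is then sorted (an external sort, for example the OS sort utility, keeps memory flat), and the sorted stream is scanned so that equal neighbouring hashes flag duplicates. All function, field, and path names here are illustrative.

import hashlib

def writeHashes(rows, field, idField, path):
    # Pass 1: one line per record -- the hash of the checked value plus the
    # record ID. str(value) is a placeholder; a geometry field would need a
    # stable text or binary form (e.g. WKT) before hashing.
    f = open(path, "w")
    try:
        for row in rows:
            digest = hashlib.md5(str(row.getValue(field))).hexdigest()
            f.write("%s\t%s\n" % (digest, row.getValue(idField)))
    finally:
        f.close()

def findDuplicates(sortedPath):
    # Pass 2: stream the *sorted* file and collect IDs whose hash equals the
    # previous line's hash. Only two lines are held in memory at a time.
    dupIds = set()
    prevHash, prevId = None, None
    for line in open(sortedPath):
        h, recId = line.rstrip("\n").split("\t")
        if h == prevHash:
            dupIds.add(prevId)
            dupIds.add(recId)
        prevHash, prevId = h, recId
    return dupIds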

莳間冲淡了誓言ζ 2024-12-18 04:12:54

I think I'd evaluate storing persistent data (such as the known field values and counts) in a SQLite database. It is of course a trade-off between memory usage and performance.

If you use a persistence mechanism that supports concurrent access, you can probably parallelise the processing of your data, using multiprocessing. Once complete, a summary of errors can be generated from the database.
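
A minimal sketch of pulling that summary from SQLite once a table like the one in the question's edit is populated. The function, table, and column names are assumptions; the query simply reports every value of a checked column that occurs more than once, which can then feed the csv report.

import sqlite3, csv

def writeDuplicateSummary(dbPath, tableName, column, csvPath):
    # Report each value of one checked column that occurs more than once,
    # together with its count, as a csv file for the user.
    conn = sqlite3.connect(dbPath)
    cur = conn.cursor()
    query = ("select %s, count(*) from %s group by %s having count(*) > 1"
             % (column, tableName, column))
    f = open(csvPath, "wb")
    try:
        writer = csv.writer(f)
        writer.writerow([column, "count"])
        for value, count in cur.execute(query):
            writer.writerow([value, count])
    finally:
        f.close()
        conn.close()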
