Python - How to build and access large amounts of data without running out of memory or slowing down processing

Posted on 2024-12-11 04:12:54

I am writing a script in Python to QC data in a proprietary ESRI database table. The purpose of the script is not to modify invalid data, but simply to report invalid data to the user via a csv file. I am using ESRI's ArcPy package to access each individual record with arcpy.SearchCursor. The SearchCursor is the only way to access each individual record in the ESRI formats.

As I scroll through each record of the tables, I do multiple QC checks to validate specific business logic. One of those checks is looking for duplicate data in particular fields. One of those fields may be geometry. I have done this by creating an empty container object for each of those fields and as I check each record I use the following logic.

for field in dupCheckFields:
    if row.getValue(field) in fieldValues[field]: dupValues.add(row.getValue(idField))
    else: fieldValues[field].append(row.getValue(field))

The above code is an example of the basic logic I use. The trouble is that each of these tables may contain anywhere from 5,000 to 10 million records, and I either run out of memory or performance grinds to a halt.
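
For reference, here is a minimal sketch of how that check sits inside the cursor loop. The table path, ID field, and checked field names below are placeholders, and the set-based containers are just one of the in-memory variants I describe next; this is illustrative rather than my production code.

import arcpy

# Placeholder values -- the real table path and field names differ.
fc = r"path\to\feature\class"
idField = "OBJECTID"
dupCheckFields = ["GUID_FIELD", "NAME_FIELD"]

# One container of previously seen values per checked field, plus a set
# of record IDs flagged as duplicates. A geometry field would first need
# a hashable representation (e.g. its WKT text) before it could go in a set.
fieldValues = dict((field, set()) for field in dupCheckFields)
dupValues = set()

rows = arcpy.SearchCursor(fc)
for row in rows:
    for field in dupCheckFields:
        value = row.getValue(field)
        if value in fieldValues[field]:
            dupValues.add(row.getValue(idField))
        else:
            fieldValues[field].add(value)

# Everything above stays in memory, which is exactly what breaks down on
# tables with millions of records.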

I have tried the following container types: sets, lists, dictionaries, ZODB + BList, and Shelve.

With the in-memory types (sets, lists, dictionaries) the process is very fast at the start, but it gets much slower as it progresses; with these types, if the table has many records, I run out of memory. With the persistent data types I don't run out of memory, but they take a very long time to process.

I only need the data while the script is running and any persistent data files will be deleted upon completion.

Question: Is there a better container type out there to provide low-memory storage of lots of data without a large cost in performance when accessing the data?

System: Win7 64-bit, Python 2.6.5 32-bit, 4 GB RAM

Thanks in advance for your help.

EDIT:

Sample SQLite code:

import sqlite3, os, arcpy, timeit

fc = r"path\to\feature\class"

# test feature class was in ESRI ArcSDE format and contained "." characters separating database name, owner, and feature class name
fcName = fc.split(".")[-1]

# convert ESRI data types to SQLite data types
dataTypes = {"String":"text","Guid":"text","Double":"real","SmallInteger":"integer"}

fields = [(field.name,dataTypes[field.type]) for field in arcpy.ListFields(fc) if field.name != arcpy.Describe(fc).OIDFieldName]

# SQL string to create table in SQLite with same schema as feature class
createTableString = """create table %s(%s,primary key(%s))""" % (fcName,",\n".join('%s %s' % field for field in fields),fields[0][0])

# SQL string to insert data into SQLite table
insertString = """insert into %s values(%s)""" % (fcName, ",".join(["?" for i in xrange(len(fields))]))

# location to save SQLite database
loc = r'C:\TEMPORARY_QC_DATA'

def createDB():
    conn = sqlite3.connect(os.path.join(loc,'database.db'))
    cur = conn.cursor()

    cur.execute(createTableString)

    conn.commit()

    rows = arcpy.SearchCursor(fc)

    i = 0
    for row in rows:
        try:
            cur.execute(insertString, [row.getValue(field[0]) for field in fields])
            if i % 10000 == 0:
                print i, "records"
                conn.commit()
            i += 1
        except sqlite3.IntegrityError: pass
    print i, "records"

t1 = timeit.Timer("createDB()","from __main__ import createDB")

print t1.timeit(1)

Unfortunately I cannot share the test data used with this code; it was an ESRI ArcSDE geodatabase table with approximately 10 fields and roughly 7 million records.

I tried to use timeit to determine how long the process took, but after 2 hours of processing only 120,000 records had been completed.
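
For comparison, here is a sketch of a batched variant of the same load that I have not timed: it collects rows and pushes them through cursor.executemany() with one commit per batch, and uses "insert or ignore" so duplicate keys are dropped by SQLite itself instead of via per-row IntegrityError handling. The function name and batch size are arbitrary; fc, fields, fcName, createTableString, and loc are reused from the code above.

def createDBBatched(batchSize=10000):
    conn = sqlite3.connect(os.path.join(loc, 'database.db'))
    cur = conn.cursor()

    cur.execute(createTableString)
    conn.commit()

    # "insert or ignore" silently skips rows that violate the primary key,
    # standing in for the per-row IntegrityError handling above.
    ignoreInsertString = """insert or ignore into %s values(%s)""" % (
        fcName, ",".join(["?" for i in xrange(len(fields))]))

    rows = arcpy.SearchCursor(fc)
    batch = []
    i = 0
    for row in rows:
        batch.append([row.getValue(field[0]) for field in fields])
        i += 1
        if len(batch) == batchSize:
            cur.executemany(ignoreInsertString, batch)
            conn.commit()
            print i, "records"
            batch = []

    # Flush whatever is left in the final, partial batch.
    if batch:
        cur.executemany(ignoreInsertString, batch)
    conn.commit()
    conn.close()
    print i, "records"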

Comments (2)

残月青衣踏尘吟 2024-12-18 04:12:54

If you store hashes in (compressed) files, you could stream through them to compare hashes to look for duplicates. Streaming usually has very low memory requirements — you can set the buffer you want, say, one line per hashed record. The tradeoff is generally time, particularly if you add compression, but if you order the files by some criteria, then you may be able to walk through the uncompressed streams to more quickly compare records.
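
A rough sketch of that idea, assuming the hashes are written out during the cursor pass as one "hash<TAB>record id" line per record, the file is then sorted (an external sort, for example the OS sort utility, keeps memory flat), and the sorted stream is scanned so that equal neighbouring hashes flag duplicates. All function, field, and path names here are illustrative.

import hashlib

def writeHashes(rows, field, idField, path):
    # Pass 1: one line per record -- the hash of the checked value plus the
    # record ID. str(value) is a placeholder; a geometry field would need a
    # stable text or binary form (e.g. WKT) before hashing.
    f = open(path, "w")
    try:
        for row in rows:
            digest = hashlib.md5(str(row.getValue(field))).hexdigest()
            f.write("%s\t%s\n" % (digest, row.getValue(idField)))
    finally:
        f.close()

def findDuplicates(sortedPath):
    # Pass 2: stream the *sorted* file and collect IDs whose hash equals the
    # previous line's hash. Only two lines are held in memory at a time.
    dupIds = set()
    prevHash, prevId = None, None
    for line in open(sortedPath):
        h, recId = line.rstrip("\n").split("\t")
        if h == prevHash:
            dupIds.add(prevId)
            dupIds.add(recId)
        prevHash, prevId = h, recId
    return dupIds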

莳間冲淡了誓言ζ 2024-12-18 04:12:54

I think I'd evaluate storing persistent data (such as the known field values and counts) in a SQLite database. It is of course a trade-off between memory usage and performance.

If you use a persistence mechanism that supports concurrent access, you can probably parallelise the processing of your data, using multiprocessing. Once complete, a summary of errors can be generated from the database.
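
A minimal sketch of pulling that summary from SQLite once a table like the one in the question's edit is populated. The function, table, and column names are assumptions; the query simply reports every value of a checked column that occurs more than once, which can then feed the csv report.

import sqlite3, csv

def writeDuplicateSummary(dbPath, tableName, column, csvPath):
    # Report each value of one checked column that occurs more than once,
    # together with its count, as a csv file for the user.
    conn = sqlite3.connect(dbPath)
    cur = conn.cursor()
    query = ("select %s, count(*) from %s group by %s having count(*) > 1"
             % (column, tableName, column))
    f = open(csvPath, "wb")
    try:
        writer = csv.writer(f)
        writer.writerow([column, "count"])
        for value, count in cur.execute(query):
            writer.writerow([value, count])
    finally:
        f.close()
        conn.close()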
