Array inside an object becomes empty when passed with multiprocessing, even though it has content when passed in

Published 2025-01-19 11:01:24 · 2,230 characters · 0 views · 0 comments

I'm new to Python and am having trouble passing an object into a function. Basically, I'm trying to read a large file with over 1.4 billion lines.

I am passing in an object that contains information about the file. One of its fields is a very large array containing the start position of each line in the file.

Since this is a large array, by passing just the object reference I hope to have only a single instance of the array, shared by the multiple processes, although I don't know whether that actually happens.

The problem is that the array is empty by the time it arrives inside the process_line function, which leads to errors.

Here is where the function is called (see the p.starmap):

line_args = []
with open(file_name, 'r') as f:
    line_start = file_inf.start_offset
    # Iterate over all lines and construct arguments for `process_line`
    while line_start < file_inf.file_size:
        # End is the smaller of the file size and line_start + line_size
        line_end = min(file_inf.file_size, line_start + line_size)

        # Save the `process_line` arguments
        args = [file_name, line_start, line_end, file_inf.line_offset]
        line_args.append(args)

        # Move to the next line
        line_start = line_end

print(line_args[1])
# Run `process_line` on each line in parallel; starmap() is like map()
# except each element of line_args holds multiple arguments
with multiprocessing.Pool(cpu_count) as p:
    line_result = p.starmap(process_line, line_args)
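For reference, starmap simply unpacks each inner list or tuple into positional arguments; Pool.starmap behaves the same way but runs the calls in worker processes. A minimal illustration (`add_range` is a made-up stand-in, not the question's code):

```python
from itertools import starmap

def add_range(start, end):
    # starmap(f, [(a, b)]) calls f(a, b) for each pair
    return end - start

pairs = [(0, 10), (10, 25)]
print(list(starmap(add_range, pairs)))  # [10, 15]
```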

This is the function:

def process_line(file_name, line_start, line_end, file_obj):
    line_results = register()
    c2 = register()
    c1 = register()
    with open(file_name, 'r') as f:
        # Move the stream position to the start of `line_start`
        f.seek(file_obj[line_start])
        i = 0
        if line_start == 63400:
            print("hello")
        # Read and process lines until `line_end`
        for line in f:
            line_start += 1
            if line_start > line_end:
                line_results.__append__(c2)
                c2.clear()
                break
            c1 = func(line)
            c2.__add__(c1)
            i = i + 1
    return line_results.countf

where file_obj holds line_offset, which is the array in question.

Now, if I remove the multiprocessing and just use: line_result = starmap(process_line, line_args)

the array is passed in just fine, although of course without multiprocessing.
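That difference is a strong hint: `itertools.starmap` runs in the same process and never serializes anything, while `Pool.starmap` pickles every argument to send it to a worker. If the object's pickling support drops the array, workers receive an empty copy. A hypothetical reproduction (the `FileInfo` class and its buggy `__getstate__` are made up, not taken from the question):

```python
import pickle

class FileInfo:
    def __init__(self, line_offset):
        self.line_offset = line_offset

    def __getstate__(self):
        # A bug like this silently discards the array during pickling,
        # so only the pickled (multiprocessing) copy is empty.
        return {"line_offset": []}

inf = FileInfo([0, 10, 25])
clone = pickle.loads(pickle.dumps(inf))
print(len(inf.line_offset), len(clone.line_offset))  # 3 0
```

Round-tripping the real file_inf through `pickle.loads(pickle.dumps(file_inf))` and inspecting `line_offset` would confirm or rule out this cause.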

Also, if I pass in just the array instead of the whole object, it works too, but then for some reason only 2 processes do any work (on Linux; on Windows, Task Manager shows only 1 working while the rest just hold memory without using CPU), instead of the expected 20, which is critical for this task.
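One plausible explanation for the idle workers (an assumption, not confirmed by the question) is that every entry of line_args is pickled separately, so shipping the huge offsets array inside each tuple makes the parent process spend its time serializing instead of feeding all 20 workers. A rough way to see the per-task payload cost (toy values standing in for the real data):

```python
import pickle

line_offset = list(range(100_000))  # stand-in for the real offsets array
args_with_array = ["file.txt", 0, 100, line_offset]
args_without = ["file.txt", 0, 100]

# Each starmap task pays the full serialization cost of its tuple,
# so the array-carrying version is orders of magnitude larger.
print(len(pickle.dumps(args_with_array)))  # large, paid once per task
print(len(pickle.dumps(args_without)))     # tiny
```

Passing the array once per worker (e.g. via a Pool `initializer`) instead of once per task avoids this bottleneck.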

[Screenshot: process list]

Is there any solution to this? Please help.

