Document AI fails with InvalidArgument when processing a document from GCS

Posted 2025-01-14 05:04:16


I am getting an error very similar to the one below, but I am not in the EU:
Document AI: google.api_core.exceptions.InvalidArgument: 400 Request contains an invalid argument

When I use raw_document and process a local PDF file, it works fine. However, when I specify a PDF file at a GCS location, it fails.

Error message:

the processor name: projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97
the form process request: name: "projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97"
inline_document {
  uri: "gs://xxxx/temp/test1.pdf"
}

Traceback (most recent call last):
  File "C:\Python39\lib\site-packages\google\api_core\grpc_helpers.py", line 66, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "C:\Python39\lib\site-packages\grpc\_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "C:\Python39\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INVALID_ARGUMENT
        details = "Request contains an invalid argument."
        debug_error_string = "{"created":"@1647296055.582000000","description":"Error received from peer ipv4:142.250.80.74:443","file":"src/core/lib/surface/call.cc","file_line":1070,"grpc_message":"Request contains an invalid argument.","grpc_status":3}"
>

Code:

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    print(f'the processor name: {name}')

    # document = {"uri": gcs_path, "mime_type": "application/pdf"}
    document = {"uri": gcs_path}
    inline_document = documentai.Document()
    inline_document.uri = gcs_path
    # inline_document.mime_type = "application/pdf"

    # Configure the process request
    # request = {"name": name, "inline_document": document}
    request = documentai.ProcessRequest(
        inline_document=inline_document,
        name=name
    )

    print(f'the form process request: {request}')

    result = client.process_document(request=request)

I do not believe I have permission issues on the bucket, since the same setup works fine for a document classification process on the same bucket.
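For comparison, the working local-file path looks roughly like this (a sketch, not my exact code; `file_path` and the endpoint option are placeholders, and the MIME-guessing helper is something I added for illustration):

```python
import mimetypes


def guess_mime_type(file_path: str) -> str:
    """Guess the MIME type from the file extension, defaulting to PDF."""
    mime, _ = mimetypes.guess_type(file_path)
    return mime or "application/pdf"


def process_local_file(project_id, location, processor_id, file_path):
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
    )
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # raw_document carries the bytes inline, so no GCS URI is involved
    with open(file_path, "rb") as f:
        raw_document = documentai.RawDocument(
            content=f.read(), mime_type=guess_mime_type(file_path)
        )

    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    return client.process_document(request=request)
```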

Comments (2)

梦醒时光 2025-01-21 05:04:16

This is a known issue for Document AI, and has already been reported in this issue tracker. Unfortunately, the only workarounds for now are to either:

  1. Download your file, read it as bytes, and use process_documents(). See Document AI local processing for the sample code.
  2. Use batch_process_documents(), since by default it only accepts files from GCS. Use this if you don't want the extra step of downloading the file.
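A rough sketch of Option 1, assuming the google-cloud-storage and google-cloud-documentai client libraries are installed (the URI-splitting helper and all names are placeholders for illustration):

```python
def split_gcs_uri(gcs_uri: str):
    """Split gs://bucket/path/to/file into (bucket, blob_name)."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"not a GCS URI: {gcs_uri}")
    bucket, _, blob_name = gcs_uri[len("gs://"):].partition("/")
    return bucket, blob_name


def process_via_download(project_id, location, processor_id, gcs_uri):
    # Download the object into memory, then use the raw_document path,
    # which sidesteps the inline_document.uri problem described above.
    from google.cloud import documentai, storage

    bucket_name, blob_name = split_gcs_uri(gcs_uri)
    content = (
        storage.Client().bucket(bucket_name).blob(blob_name).download_as_bytes()
    )

    client = documentai.DocumentProcessorServiceClient(
        client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
    )
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(
            content=content, mime_type="application/pdf"
        ),
    )
    return client.process_document(request=request)
```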
半﹌身腐败 2025-01-21 05:04:16

This is still an issue 5 months later. Something not mentioned in the accepted answer (I could be wrong, but it seems to me) is that batch processes can only output their results to GCS, so you still incur the extra step of downloading something from a bucket (either the input document under Option 1 or the result under Option 2). On top of that, you end up having to clean up the bucket if you don't want the results there, so in many circumstances Option 2 offers little advantage other than that the result download will probably be smaller than the input-file download.

I'm using the client library in a Python Cloud Function and I'm affected by this issue. I'm implementing Option 1 for the reason that it seems simplest and I'm holding out for the fix. I also considered using the Workflow client library to fire a Workflow that runs a Document AI process, or calling the Document AI REST API, but it's all very suboptimal.
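For completeness, Option 2 looks roughly like this as I understand the v1 client (a sketch only; the URIs, timeout, and prefix helper are placeholders, and the results land back in GCS, which is the cleanup burden mentioned above):

```python
def as_gcs_prefix(uri: str) -> str:
    """Batch output is written under a gs:// prefix, conventionally ending in '/'."""
    return uri if uri.endswith("/") else uri + "/"


def batch_process(project_id, location, processor_id, gcs_input_uri, gcs_output_uri):
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
    )
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=documentai.BatchDocumentsInputConfig(
            gcs_documents=documentai.GcsDocuments(
                documents=[
                    documentai.GcsDocument(
                        gcs_uri=gcs_input_uri, mime_type="application/pdf"
                    )
                ]
            )
        ),
        document_output_config=documentai.DocumentOutputConfig(
            gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri=as_gcs_prefix(gcs_output_uri)
            )
        ),
    )

    operation = client.batch_process_documents(request)
    operation.result(timeout=300)  # block until the long-running operation finishes
```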
