Document AI fails with InvalidArgument when processing a document from GCS

Posted 2025-01-14 05:04:16


I am getting an error very similar to the one below, but I am not in the EU:
Document AI: google.api_core.exceptions.InvalidArgument: 400 Request contains an invalid argument

When I use raw_document and process a local PDF file, it works fine. However, when I specify a PDF file at a GCS location, it fails.

Error message:

the processor name: projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97
the form process request: name: "projects/xxxxxxxxx/locations/us/processors/f7502cad4bccdd97"
inline_document {
  uri: "gs://xxxx/temp/test1.pdf"
}

Traceback (most recent call last):
  File "C:\Python39\lib\site-packages\google\api_core\grpc_helpers.py", line 66, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "C:\Python39\lib\site-packages\grpc\_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "C:\Python39\lib\site-packages\grpc\_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.INVALID_ARGUMENT
        details = "Request contains an invalid argument."
        debug_error_string = "{"created":"@1647296055.582000000","description":"Error received from peer ipv4:142.250.80.74:443","file":"src/core/lib/surface/call.cc","file_line":1070,"grpc_message":"Request contains an invalid argument.","grpc_status":3}"
>

Code:

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    print(f'the processor name: {name}')

    # document = {"uri": gcs_path, "mime_type": "application/pdf"}
    document = {"uri": gcs_path}
    inline_document = documentai.Document()
    inline_document.uri = gcs_path
    # inline_document.mime_type = "application/pdf"

    # Configure the process request
    # request = {"name": name, "inline_document": document}
    request = documentai.ProcessRequest(
        inline_document=inline_document,
        name=name
    )

    print(f'the form process request: {request}')

    result = client.process_document(request=request)

I do not believe I have permission issues on the bucket, since the same setup works fine for a document classification process on the same bucket.
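For comparison, the working local-file path looks roughly like this (a sketch, not my exact code; `file_path` and the endpoint option are placeholders, and the MIME-guessing helper is something I added for illustration):

```python
import mimetypes


def guess_mime_type(file_path: str) -> str:
    """Guess the MIME type from the file extension, defaulting to PDF."""
    mime, _ = mimetypes.guess_type(file_path)
    return mime or "application/pdf"


def process_local_file(project_id, location, processor_id, file_path):
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
    )
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # raw_document carries the bytes inline, so no GCS URI is involved
    with open(file_path, "rb") as f:
        raw_document = documentai.RawDocument(
            content=f.read(), mime_type=guess_mime_type(file_path)
        )

    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    return client.process_document(request=request)
```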

Comments (2)

梦醒时光 2025-01-21 05:04:16

This is a known issue for Document AI, and has already been reported in this issue tracker. Unfortunately, the only workarounds for now are to either:

  1. Download your file, read it as bytes, and use process_documents(). See Document AI local processing for the sample code.
  2. Use batch_process_documents(), since by default it only accepts files from GCS. Use this if you don't want the extra step of downloading the file.
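A rough sketch of Option 1, assuming the google-cloud-storage and google-cloud-documentai client libraries are installed (the URI-splitting helper and all names are placeholders for illustration):

```python
def split_gcs_uri(gcs_uri: str):
    """Split gs://bucket/path/to/file into (bucket, blob_name)."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"not a GCS URI: {gcs_uri}")
    bucket, _, blob_name = gcs_uri[len("gs://"):].partition("/")
    return bucket, blob_name


def process_via_download(project_id, location, processor_id, gcs_uri):
    # Download the object into memory, then use the raw_document path,
    # which sidesteps the inline_document.uri problem described above.
    from google.cloud import documentai, storage

    bucket_name, blob_name = split_gcs_uri(gcs_uri)
    content = (
        storage.Client().bucket(bucket_name).blob(blob_name).download_as_bytes()
    )

    client = documentai.DocumentProcessorServiceClient(
        client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
    )
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(
            content=content, mime_type="application/pdf"
        ),
    )
    return client.process_document(request=request)
```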
半﹌身腐败 2025-01-21 05:04:16

This is still an issue 5 months later. Something not mentioned in the accepted answer (I could be wrong, but it seems to me) is that batch processes can only output their results to GCS, so you still incur the extra step of downloading something from a bucket (either the input document under Option 1 or the result under Option 2). On top of that, you end up having to clean up the bucket if you don't want the results there, so in many circumstances Option 2 offers little advantage other than that the result download will probably be smaller than the input-file download.

I'm using the client library in a Python Cloud Function and I'm affected by this issue. I'm implementing Option 1 for the reason that it seems simplest and I'm holding out for the fix. I also considered using the Workflow client library to fire a Workflow that runs a Document AI process, or calling the Document AI REST API, but it's all very suboptimal.
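For completeness, Option 2 looks roughly like this as I understand the v1 client (a sketch only; the URIs, timeout, and prefix helper are placeholders, and the results land back in GCS, which is the cleanup burden mentioned above):

```python
def as_gcs_prefix(uri: str) -> str:
    """Batch output is written under a gs:// prefix, conventionally ending in '/'."""
    return uri if uri.endswith("/") else uri + "/"


def batch_process(project_id, location, processor_id, gcs_input_uri, gcs_output_uri):
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
    )
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=documentai.BatchDocumentsInputConfig(
            gcs_documents=documentai.GcsDocuments(
                documents=[
                    documentai.GcsDocument(
                        gcs_uri=gcs_input_uri, mime_type="application/pdf"
                    )
                ]
            )
        ),
        document_output_config=documentai.DocumentOutputConfig(
            gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri=as_gcs_prefix(gcs_output_uri)
            )
        ),
    )

    operation = client.batch_process_documents(request)
    operation.result(timeout=300)  # block until the long-running operation finishes
```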
