Streaming responses from Triton Inference Server with the Python backend

Posted on 2025-01-09 10:42:35

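I am using Triton Inference Server with the Python backend, and at the moment I send gRPC requests. Does anyone know how to stream output (for example, model responses) from the Python backend? I could not find any streaming-related example in the documentation.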

梦里泪两行 2025-01-16 10:42:35

To stream responses to tritonclient progressively (e.g. as they are generated by an LLM), you can use the decoupled mode of the Triton Python backend:

The relevant part of the config.pbtxt file:

model_transaction_policy {
  decoupled: True
}

This way the backend can return multiple responses for a single request when needed. For example, it can progressively send a client such as a chatbot app the text generated by an LLM served by Triton, which improves the perceived inference speed because output appears before generation has finished.
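As an illustration of the server side, here is a minimal sketch of a decoupled `model.py`. It is not taken from the Triton examples: the tensor names `PROMPT` and `TEXT` and the helper `split_into_chunks` are hypothetical, and the helper only stands in for an LLM that produces text incrementally. The guarded import is there only so that the helper can be exercised outside the Triton container.

```python
try:
    import numpy as np
    # Provided by the Triton Python backend; only importable inside Triton.
    import triton_python_backend_utils as pb_utils
except ImportError:
    np = pb_utils = None  # allows testing the helper outside Triton


def split_into_chunks(text, size=4):
    """Hypothetical stand-in for incremental LLM output."""
    return [text[i:i + size] for i in range(0, len(text), size)]


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            # In decoupled mode, responses go through a per-request sender.
            sender = request.get_response_sender()

            # "PROMPT" is an assumed BYTES input declared in config.pbtxt.
            prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT")
            text = prompt.as_numpy().reshape(-1)[0].decode("utf-8")

            # Send one response per chunk as soon as it is available.
            for chunk in split_into_chunks(text):
                data = np.array([chunk.encode("utf-8")], dtype=np.object_)
                out = pb_utils.Tensor("TEXT", data)  # "TEXT" is an assumed output
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))

            # Tell Triton that no more responses will follow for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

        # A decoupled model returns None from execute().
        return None
```

The final `send` with `TRITONSERVER_RESPONSE_COMPLETE_FINAL` matters: without it Triton does not consider the request finished.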

Note that in Triton only the gRPC endpoint supports the decoupled transaction policy (the standard HTTP/REST inference endpoint does not).
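On the client side, the gRPC streaming API of `tritonclient` can consume these responses. The following is a sketch under assumptions, not a reference implementation: the model name `streamer`, the tensor names `PROMPT` and `TEXT`, and the URL `localhost:8001` are all hypothetical and must match your own deployment. The client libraries are imported inside the function only so the callback can be tested without them installed.

```python
import queue
from functools import partial


def on_response(results, result, error):
    """Stream callback: tritonclient calls it once per response."""
    results.put(error if error is not None else result)


def stream_prompt(prompt, url="localhost:8001", model_name="streamer"):
    """Yield text pieces as the (assumed) decoupled model sends them."""
    import numpy as np
    import tritonclient.grpc as grpcclient

    results = queue.Queue()
    data = np.array([prompt.encode("utf-8")], dtype=np.object_)
    infer_input = grpcclient.InferInput("PROMPT", list(data.shape), "BYTES")
    infer_input.set_data_from_numpy(data)

    client = grpcclient.InferenceServerClient(url=url)
    try:
        client.start_stream(callback=partial(on_response, results))
        client.async_stream_infer(
            model_name=model_name,
            inputs=[infer_input],
            # Ask Triton to deliver the flag-only final response as well,
            # so the loop below knows when the stream has ended.
            enable_empty_final_response=True,
        )
        while True:
            item = results.get()
            if isinstance(item, Exception):
                raise item
            output = item.as_numpy("TEXT")  # None when the response is empty
            if output is not None:
                yield output.reshape(-1)[0].decode("utf-8")
            final = item.get_response().parameters.get("triton_final_response")
            if final is not None and final.bool_param:
                break
    finally:
        client.stop_stream()
        client.close()
```

A caller could then print output as it arrives with `for piece in stream_prompt("hello"): print(piece, end="", flush=True)`. Routing responses through a queue keeps the callback, which runs on a client-owned thread, as small as possible.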

More info: docs | examples
