How is Flyte tailored to "Data and Machine Learning"?

Posted on 2025-02-08 06:46:18

https://flyte.org/ says that it is

The Workflow Automation Platform for Complex, Mission-Critical Data and Machine Learning Processes at Scale

I went through quite a bit of documentation and I fail to see why it is "Data and Machine Learning". It seems to me that it is a workflow manager on top of a container orchestrator (here Kubernetes), where workflow manager means that I can define a Directed Acyclic Graph (DAG), and then the DAG nodes are deployed as containers and the DAG is run.

Of course this is useful and important for "Data and Machine Learning", but I might as well use it for any other microservice DAG. Except for features/details, how is this different from https://airflow.apache.org or other workflow managers (of which there are many)? There are even more specialized workflow managers for "Data and Machine Learning", e.g., https://spark.apache.org.

What should I keep in mind as a Software Architect?


Comments (1)

一个人的夜不怕黑 2025-02-15 06:46:18

That is a great question. You are right about one thing: at its core it is a serverless workflow orchestrator (serverless because it brings up the infrastructure needed to run the code). And yes, it can be used for many other situations. It may not be the best tool for some other systems, like microservice orchestration.

But what really makes it good for ML & data orchestration is a combination of

  1. Features (list below) &
  2. Integrations (list below)
  3. Community of folks using it
  4. Roadmap

Features

  1. Long-running tasks: It is designed for extremely long-running tasks. Tasks can run for days or weeks, and even if the control plane goes down, you will not lose the work. You can keep deploying without impacting existing work.
  2. Versioning: allows multiple users to work independently on the same workflow, using different libraries, models, inputs, etc.
  3. Memoization. Let's take the example of a pipeline with 10 steps: you can memoize the first 9 steps, so if the 10th fails, or if you modify the 10th, it will reuse the results from the previous 9. This leads to drastically faster iteration.
  4. Strong typing and ML-specific type support.
    Flyte understands dataframes and can translate them between spark.DataFrame -> pandas.DataFrame -> Modin -> polars etc. without the user having to think about how to do it efficiently. It also supports things like tensors (correctly serialized), numpy arrays, etc. Models can also be saved and retrieved from past executions, so it is in fact the model truth store.
  5. Native support for intra-task checkpointing. This can help in recovering model training between node failures and even across executions. New support is being added for checkpointing callbacks.
  6. Flyte Decks: a way to visualize metrics like the ROC curve etc., or to automatically visualize the distribution of the data input to a task.
  7. Extensible programming interface that can orchestrate distributed jobs or run locally,
    e.g. Spark, MPI, SageMaker
  8. Reference task for library isolation
  9. Scheduler independent of user code
  10. Understanding of resources like GPUs etc.: it automatically schedules on GPUs and/or spot machines, with smart handling of spot machines. The first n-1 retries run on spot machines, and the last one is automatically moved to an on-demand machine for better guarantees.
  11. Map tasks and dynamic tasks. Map tasks map over a list (e.g. a list of regions); dynamic tasks create new static graphs dynamically, based on inputs.
  12. Multiple launch plans. Schedule 2 runs of a workflow with slightly different hyperparameters or model values etc.
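
To make the memoization point (item 3) concrete, here is a minimal sketch in plain Python. It is not Flyte's API or storage layer: the `_cache` dictionary, the `cache_key` helper, and the `(name, version, fn)` step tuples are all hypothetical names chosen for illustration. The assumption is only that a cached result is keyed by the task name, a version string, and the inputs, so changing the 10th step's version invalidates that step alone.

```python
import hashlib
import json

# Hypothetical in-memory cache; a real orchestrator would persist results.
_cache = {}
calls = []  # records which steps actually executed


def cache_key(name, version, inputs):
    """Derive a stable key from task name, version, and inputs."""
    payload = json.dumps(
        {"task": name, "version": version, "inputs": inputs}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def run_step(name, version, fn, value):
    """Run fn(value) unless an identical call was already memoized."""
    key = cache_key(name, version, value)
    if key not in _cache:
        calls.append(name)
        _cache[key] = fn(value)
    return _cache[key]


def run_pipeline(steps, value):
    """Feed the output of each step into the next one."""
    for name, version, fn in steps:
        value = run_step(name, version, fn, value)
    return value


# Ten steps: the first nine each add 1, the tenth doubles the result.
steps = [(f"step{i}", "v1", lambda x: x + 1) for i in range(1, 10)]
steps.append(("step10", "v1", lambda x: x * 2))

first = run_pipeline(steps, 0)  # all ten steps execute

# Modify only the tenth step and bump its version.
steps[9] = ("step10", "v2", lambda x: x * 3)
calls.clear()
second = run_pipeline(steps, 0)  # steps 1-9 are served from the cache
```

After the second run, `calls` contains only `"step10"`: the nine unchanged steps were reused, which is where the faster iteration comes from.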

For Admins

  1. For really long-running tasks, the admin can deploy the management layer without killing the tasks
  2. Support for spot/ARM/GPU machines (with different versions etc.)
  3. Quotas and throttles per project/domain
  4. Upgrade infra without upgrading user libraries
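
The spot-machine handling described under Features (n-1 retries on spot, the last attempt on an on-demand machine) can be pictured roughly as follows. This is a plain-Python sketch of the policy as stated in the answer, not Flyte's scheduler code; `machine_for_attempt`, `run_with_retries`, and the `flaky` task are hypothetical.

```python
def machine_for_attempt(attempt, max_attempts):
    """Pick the machine type for a 1-based attempt number.

    Assumed policy: every attempt except the last runs on a spot machine;
    the final attempt runs on an on-demand machine.
    """
    return "on-demand" if attempt >= max_attempts else "spot"


def run_with_retries(task, max_attempts):
    """Run task(machine) until it succeeds, recording the machines used."""
    history = []
    for attempt in range(1, max_attempts + 1):
        machine = machine_for_attempt(attempt, max_attempts)
        history.append(machine)
        try:
            return task(machine), history
        except RuntimeError:
            continue  # treat the failure as a preemption and retry
    raise RuntimeError("all attempts failed")


def flaky(machine):
    """Hypothetical task that is always preempted on spot machines."""
    if machine == "spot":
        raise RuntimeError("preempted")
    return "done"


result, history = run_with_retries(flaky, max_attempts=3)
```

Here `history` ends with `"on-demand"`, so even a task that is preempted on every spot attempt still gets one attempt on a machine that will not be reclaimed.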

Integrations

  1. pandas dataframe native support
  2. Spark
  3. MPI jobs (gang scheduled)
  4. pandera / Great Expectations for data quality
  5. SageMaker
  6. Easy deployment of models for serving
  7. Polars / Modin / Spark dataframes
  8. Tensors / checkpointing etc.,
    and many others on the roadmap

Community

Focused on ML specific features

Roadmap

  1. CD4ML, with human-in-the-loop and external-signal-based workflows. This will allow users to automate deployment of models, perform human-in-the-loop labeling, etc.
  2. Support for Ray/Spark/Dask cluster reuse across tasks
  3. Integration with whylogs and other tools for monitoring
  4. Integration with MLflow etc.
  5. More native Flyte Decks renderers

Hopefully this answers your questions. Also, please join the Slack community and help spread this information, and feel free to ask more questions.
