Polymorphic data transformation techniques / data lake / big data

Posted on 2025-01-26 18:11:02

Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in XML format and contains multiple independent groups of information with many nested elements. Clients run different versions, and as a result the data is ingested into the data lake with different but similar schemas. For instance, a startDate field can be a string or an object containing a date. Our goal is to visualise the accumulated information in a BI tool.

Questions:
What are the best practices for dealing with polymorphic data?

  • Process and transform the required piece of data (a reduced version) into a uni-schema file using a programming language, then process it in Spark and Databricks and consume it in a BI tool (a minimal sketch of this step follows after the list).
  • Decompose the data into meaningful groups, process them, and join them (using data relationships) with Spark and Databricks.
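
To make the first option concrete, here is a minimal sketch of the kind of pre-processing we have in mind, assuming the raw XML has already been parsed into plain dictionaries; the field names (clientId, startDate) and the one-record-per-line file layout are only illustrative:

import json

def normalize_start_date(value):
    # Older clients send startDate as a plain string, newer ones as an
    # object that wraps the date, e.g. {'date': '2025-01-26'}.
    if isinstance(value, str):
        return value
    if isinstance(value, dict):
        return value.get("date")
    return None

def to_uni_schema(record):
    # Map one raw record (any client version) onto the single target schema.
    return {
        "clientId": record.get("clientId"),
        "startDate": normalize_start_date(record.get("startDate")),
    }

def refine_file(raw_path, out_path):
    # Assumes one JSON record per line after an XML-to-JSON conversion step.
    with open(raw_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(json.dumps(to_uni_schema(json.loads(line))) + "\n")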

I appreciate your comments and your sharing of opinions and experiences on this topic, especially from subject matter experts and data engineers. It would be super nice if you could also share some useful resources about this particular topic.

Cheers!

Comments (2)

离鸿 2025-02-02 18:11:02

One of the tags that you have selected for this thread points out that you would like to use Databricks for this transformation. Databricks is one of the tools that I am using, and I think it is powerful and effective enough to do this kind of data processing. Since the data processing platforms that I have been using the most are Azure and Cloudera, my answer will rely on the Azure stack, as it is integrated with Databricks. Here is what I would recommend based on my experience.

The first thing you have to do is to define data layers and create a platform for them. In particular, for your case, it should have Landing Zone, Staging, ODS, and Data Warehouse layers.

Landing Zone

Will be used for polymorphic data ingestion from your clients. This can be done with just Azure Data Factory (ADF) connecting the clients to Azure Blob Storage. In this layer, I recommend not putting any transformation into the ADF pipeline, so that we can create a common one for ingesting raw files. If you have many clients that can send data into Azure Storage, that is fine. You can also create dedicated folders for them.

Normally, I create folders aligned with client types. For example, if I have 3 types of clients, Oracle, SQL Server, and SAP, the root folders on my Azure Storage will be oracle, sql_server, and sap, followed by server/database/client names.

Additionally, it seems you may have to set up Azure IoT Hub if you are going to ingest data from IoT devices. If that is the case, this page would be helpful.

Staging Area

Will be an area for schema cleanup. I would have multiple ADF pipelines that transform the polymorphic data from the Landing Zone and ingest it into the Staging Area. You will have to create a schema (Delta table) for each of your decomposed datasets and data sources. I recommend utilizing Delta Lake, as it will make it easy to manage and retrieve data.
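
As a rough illustration, a minimal sketch of one such staging table, assuming a Databricks workspace with Delta Lake and an active spark session; the table and column names are hypothetical:

# One staging Delta table per decomposed dataset; names are illustrative only.
spark.sql("""
    CREATE TABLE IF NOT EXISTS staging.telemetry_events (
        client_id      STRING,
        client_version STRING,
        start_date     DATE,
        ingested_at    TIMESTAMP
    )
    USING DELTA
""")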

The transformation options you will have are:

  • Use only ADF transformations. This will allow you to unnest some nested XML columns as well as do some data cleansing and wrangling from the Landing Zone, so that the same dataset can be inserted into the same table.

    For your case, you may have to create a particular ADF pipeline for each of the datasets multiplied by client versions.

  • Use an additional common ADF pipeline that runs a Databricks transformation based on the dataset and client version. This will allow more complex transformations that ADF transformations are not capable of (see the notebook sketch after this list).

    For your case, there will also be a particular Databricks notebook for each of the datasets multiplied by client versions.
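
As a rough illustration of the second option, a minimal sketch of one such per-version notebook cell, assuming the spark-xml package is available in the cluster; the path, rowTag, and column names are hypothetical:

from pyspark.sql import functions as F

# Read raw XML for one client version from the Landing Zone.
raw = (spark.read.format("xml")
       .option("rowTag", "telemetry")
       .load("/mnt/landing/sap/client_a/v2/*.xml"))

cleaned = (raw
           # In this hypothetical v2 schema, startDate arrives as a struct,
           # so unnest it; a v1 notebook would instead cast the plain string.
           .withColumn("start_date", F.col("startDate.date").cast("date"))
           .withColumn("client_version", F.lit("v2"))
           .withColumn("ingested_at", F.current_timestamp()))

(cleaned
    .select(F.col("clientId").alias("client_id"),
            "client_version", "start_date", "ingested_at")
    .write.format("delta").mode("append")
    .saveAsTable("staging.telemetry_events"))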

As a result, different versions of one particular dataset will be extracted from raw files, cleaned up in terms of schema, and ingested into one table for each data source. There will be some duplicated data for master datasets across different data sources.

ODS Area

Will be an area for the so-called single source of truth of your data. Multiple data sources will be merged into one. Therefore, all duplicated data gets eliminated and the relationships between datasets get clarified, which corresponds to the second item in your question. If you have just one data source, this will also be the area for applying more data cleansing, such as validation and consistency checks. As a result, one dataset will be stored in one table.

I recommend using ADF running Databricks, but this time we can use a SQL notebook instead of Python, because the data has already been inserted cleanly into the Staging area tables.
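
A minimal sketch of that merge step, written as Delta Lake SQL but submitted via spark.sql here so all snippets stay in one language; table and key names are hypothetical:

# Deduplicate staging rows and merge them into the single ODS table.
spark.sql("""
    MERGE INTO ods.telemetry_events AS tgt
    USING (
        SELECT client_id, client_version, start_date, ingested_at
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY client_id, start_date
                                      ORDER BY ingested_at DESC) AS rn
            FROM staging.telemetry_events
        ) ranked
        WHERE rn = 1
    ) AS src
    ON  tgt.client_id = src.client_id
    AND tgt.start_date = src.start_date
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")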

The data at this stage can be consumed by Power BI. Read more about Power BI integration with Databricks.

Furthermore, if you still want a data warehouse or star schema for advanced analytics, you can do further transformation (again via ADF and Databricks) and utilize Azure Synapse.

Source Control

Fortunately, the tools that I mentioned above are already integrated with source code version control, thanks to Microsoft's acquisition of GitHub. Databricks notebooks and ADF pipeline source code can be versioned. Check Azure DevOps.

不寐倦长更 2025-02-02 18:11:02

Many thanks for your comprehensive answer PhuriChal! Indeed, the data sources are always the same software, but with various different versions, and unfortunately the data properties do not always remain stable across those versions. Would it be an option to process the raw data after ingestion, in order to unify and resolve unmatched properties using a high-level programming language, before processing it further in Databricks? (We may have many of these processing scripts to refine the raw data for specific purposes.) I have added an example in the original post.

Version 1: {
    'theProperty': 8
}
Version 2: {
    'data': {
        'myProperty': 10
    }
}

Processing =>

Refined version: [{
    'property': 8
},
{
    'property': 10
}]

So that the inconsistencies are resolved before the data reaches Databricks for further processing. Can this also be an option?
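
For example, a minimal sketch of that refinement step over the records above (the shapes are the ones from my example; the real mapping would of course be more involved):

def refine(record):
    # Map either known raw shape onto the unified 'property' key.
    if "theProperty" in record:                 # Version 1 shape
        return {"property": record["theProperty"]}
    if "data" in record:                        # Version 2 shape
        return {"property": record["data"]["myProperty"]}
    raise ValueError(f"Unknown record shape: {record}")

raw = [{"theProperty": 8}, {"data": {"myProperty": 10}}]
print([refine(r) for r in raw])   # [{'property': 8}, {'property': 10}]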
