Why is writing from a Databricks Spark notebook (Hadoop FileUtils) to a DBFS mount location 13 times slower?



A Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage).
The same job takes 8 minutes to write to /dbfs/FileStore.

I would like to understand why write performance differs in the two cases.
I also want to know which backend storage /dbfs/FileStore uses.

I understand that DBFS is an abstraction on top of scalable object storage.
In that case it should take the same amount of time for both /dbfs/mnt/blobstorage and /dbfs/FileStore/.

Problem statement:

  • Source file format: .tar.gz
  • Avg size: 10 MB
  • Number of tar.gz files: 1000
  • Each tar.gz file contains around 20,000 CSV files.

Requirement:
Untar each tar.gz file and write the CSV files to blob storage / an intermediate storage layer for further processing.

Untar and write to the mount location (attached screenshot):

Here I am using the Hadoop FileUtil library and its unTar function to untar and write the CSV files to the target storage (/dbfs/mnt/ - blob storage).
It takes 1.5 hours to complete the job on a cluster with 2 worker nodes (4 cores each). A rough sketch of this step is shown below.
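For reference, here is a minimal sketch of how this step might look in a Scala notebook. The source/target paths, directory layout, and parallelism are assumptions for illustration, not the original notebook code.

```scala
import java.io.File
import org.apache.hadoop.fs.FileUtil

// Assumed locations; adjust to the real source and target directories.
val sourceDir = "/dbfs/mnt/source"   // where the .tar.gz files live
val targetDir = "/dbfs/mnt/output"   // switch to "/dbfs/FileStore/output" for the second run

// Collect the archive paths on the driver (about 1000 files, ~10 MB each).
val tarFiles = new File(sourceDir).listFiles()
  .filter(_.getName.endsWith(".tar.gz"))
  .map(_.getAbsolutePath)
  .toSeq

// Distribute the archives across the 8 available cores; each task untars
// one archive directly onto the DBFS FUSE mount via Hadoop's FileUtil.
sc.parallelize(tarFiles, numSlices = 8).foreach { path =>
  val outDir = new File(targetDir, new File(path).getName.stripSuffix(".tar.gz"))
  outDir.mkdirs()
  FileUtil.unTar(new File(path), outDir)
}
```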

Untar and write to the DBFS Root FileStore:

Here I am using the same Hadoop FileUtil library and unTar function to untar and write the CSV files to the target storage (/dbfs/FileStore/).
It takes just 8 minutes to complete the job on the same cluster with 2 worker nodes (4 cores each). A quick latency probe comparing the two targets is sketched after this paragraph.
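To separate raw storage latency from the untar work itself, one quick check (hypothetical, not from the original post) is to time a small write to each location from the driver:

```scala
import java.nio.file.{Files, Paths}

// Time a single small write through the DBFS FUSE mount, in milliseconds.
def timeWriteMs(path: String, bytes: Array[Byte]): Long = {
  val start = System.nanoTime()
  Files.write(Paths.get(path), bytes)
  (System.nanoTime() - start) / 1000000L
}

val payload = Array.fill[Byte](1024 * 1024)(0)   // 1 MB of zeros

// Example paths; both target directories are assumed to exist.
println(s"/dbfs/mnt write:       ${timeWriteMs("/dbfs/mnt/output/_probe.bin", payload)} ms")
println(s"/dbfs/FileStore write: ${timeWriteMs("/dbfs/FileStore/_probe.bin", payload)} ms")
```

If the per-file latency gap is already large here, the difference in the full job is mostly on the storage side rather than in the untar code.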

Questions:
Why is writing to /dbfs/FileStore or /dbfs/databricks/driver 15 times faster than writing to /dbfs/mnt storage?

What storage and file system does the DBFS root (/FileStore, /databricks-datasets, /databricks/driver) use in the backend? What is the size limit for each subfolder?


Comments (1)

无言温柔 2025-01-30 05:59:26


There could be multiple factors affecting this, and it requires more information to investigate:

  • Your /mnt mount point could point to blob storage in another region, so you have higher latency.
  • You may be hitting throttling on your blob storage, for example if there are a lot of read/write or list operations against it from other clusters; this can lead to retries of the Spark tasks (check the Spark UI for tasks with errors). /FileStore, on the other hand, is located in a dedicated blob storage account (the so-called DBFS Root) that is not as heavily loaded.

Usually, Azure Blob Storage is used for the DBFS Root, not ADLS. ADLS with a hierarchical namespace has additional overhead per operation because it needs to check permissions, etc. This could also affect performance. One way to check what your mount actually points to is sketched below.
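dbutils.fs.mounts() is a standard Databricks utility for listing mounts and their backing storage URIs; the filter below is just an illustrative choice:

```scala
// List DBFS mount points and their backing storage URIs.
// A wasbs:// source means plain Azure Blob Storage; abfss:// means ADLS Gen2
// (hierarchical namespace), which carries the extra per-operation overhead
// mentioned above.
dbutils.fs.mounts()
  .filter(_.mountPoint.startsWith("/mnt"))
  .foreach(m => println(s"${m.mountPoint} -> ${m.source}"))
```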

But to really solve this problem it's better to open a support ticket, as it may require a backend investigation.

P.S. Please note that the DBFS Root should be used only for temporary data, as it's accessible only from within the workspace, so you can't share data on it with other workspaces or other consumers.
