Why is writing from a Databricks Spark notebook (Hadoop FileUtil) to a DBFS mount location 13x slower?
Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage).
Same job is taking 8 minutes to write to /dbfs/FileStore.
I would like to understand why write performance is different in the two cases.
I also want to know which backend storage /dbfs/FileStore uses.
I understand that DBFS is an abstraction on top of scalable object storage.
In this case it should take the same amount of time for both /dbfs/mnt/blobstorage and /dbfs/FileStore/.
Problem statement:
Source file format : .tar.gz
Avg size: 10 MB
Number of tar.gz files: 1000
Each tar.gz file contains around 20,000 CSV files.
Requirement :
Untar the tar.gz file and write CSV files to blob storage / intermediate storage layer for further processing.
Untar and write to mount location (attached screenshot):
Here I am using the Hadoop FileUtil library and its unTar function to untar and write CSV files to the target storage (/dbfs/mnt/ - blob storage).
It takes 1.5 hours to complete the job on a cluster with 2 worker nodes (4 cores each).
Untar and write to DBFS Root FileStore:
Here I am using the Hadoop FileUtil library and its unTar function to untar and write CSV files to the target storage (/dbfs/FileStore/).
It takes just 8 minutes to complete the job on a cluster with 2 worker nodes (4 cores each).
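The job above uses Hadoop's FileUtil.unTar from a notebook. As a rough illustration of the same untar-and-write loop, here is a minimal Python sketch using the standard tarfile module; the function name untar_to and the flattening of archive paths are my own illustrative choices, not part of the original job, and the target directory could be either /dbfs/mnt/... or /dbfs/FileStore/...:

```python
import tarfile
from pathlib import Path

def untar_to(archive_path: str, target_dir: str) -> int:
    """Extract every CSV from a .tar.gz into target_dir; return the file count."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".csv"):
                # Flatten any directory structure inside the archive
                member.name = Path(member.name).name
                tar.extract(member, path=target_dir)
                count += 1
    return count
```

With ~20,000 small CSVs per archive, each extraction is a separate write to the target file system, which is why the choice of backend storage dominates the runtime.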
Questions:
Why is writing to DBFS/FileStore or DBFS/databricks/driver 15 times faster than writing to DBFS/mnt storage?
What storage and file system does DBFS Root (/FileStore, /databricks-datasets, /databricks/driver) use in the backend? What is the size limit for each subfolder?
There could be multiple factors affecting that, but it requires more information to investigate:
The /mnt mount point could point to blob storage in another region, so you have higher latency.
/FileStore is located in a dedicated blob storage (the so-called DBFS Root) that is not as loaded. Usually, Azure Blob Storage is used for DBFS Root, not ADLS. ADLS with hierarchical namespace has additional overhead for operations because it needs to check permissions, etc. This could also affect performance.
But to solve that problem it's better to open a support ticket as it may require backend investigation.
P.S. Please note that DBFS Root should be used only for temporary data, as it's accessible only from the workspace, so you can't share data on it with other workspaces or other consumers.
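One common mitigation for slow mount-point writes (an assumption on my part, not something confirmed by this answer) is to avoid pushing thousands of small files directly through the /dbfs FUSE layer: extract onto the node's fast local disk first, then copy the results to the mount in a second pass. A minimal sketch using only the Python standard library; untar_via_local is a hypothetical helper name, and on Databricks the staging directory would typically live on the local SSD rather than under /dbfs:

```python
import shutil
import tempfile
from pathlib import Path

def untar_via_local(archive_path: str, final_dir: str) -> int:
    """Extract on fast local disk first, then copy CSVs to the slow target."""
    copied = 0
    # Stage the extraction on local disk instead of untarring straight
    # onto the FUSE-mounted /dbfs path; only the final copies hit the mount.
    with tempfile.TemporaryDirectory() as staging:
        shutil.unpack_archive(archive_path, staging, format="gztar")
        Path(final_dir).mkdir(parents=True, exist_ok=True)
        for csv in Path(staging).rglob("*.csv"):
            shutil.copy2(csv, Path(final_dir) / csv.name)
            copied += 1
    return copied
```

This does not change the total bytes written, but it replaces many small metadata-heavy operations against remote storage with sequential copies, which tends to matter most when each archive contains tens of thousands of tiny files.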