Counting files in an HDFS directory with Scala
In Scala, I am trying to count the files in an HDFS directory.
I tried to get a list of the files with val files = fs.listFiles(path, false) and count it or get its size, but it doesn't work because files is of type RemoteIterator[LocatedFileStatus].
Any idea on how I should proceed?
Thanks for helping.
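For context, a minimal sketch of the setup described above (the directory path is hypothetical), showing why the direct approach fails:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val path = new Path("/some/hdfs/dir")  // hypothetical directory

    // listFiles returns a Hadoop RemoteIterator[LocatedFileStatus],
    // not a Scala collection, so methods like .size or .count are not available on it.
    val files = fs.listFiles(path, false)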
Answers (2)
This has been done before, but generally people use the FSImage (a copy of the NameNode file).
They'll then load that into a Hive table, and then you can query it for information about your HDFS file system.
Here's a really good tutorial that explains how to export the fsimage and load it into a Hive table.
Here's another that I think I prefer:
Once it's in a table, you really can do the rest in Scala/Spark.
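Assuming the fsimage has already been exported (for example with the hdfs oiv Delimited processor) and loaded into a Hive table, a minimal sketch of the kind of query meant here; the table name fsimage and its path column are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("fsimage-query")
      .enableHiveSupport()   // allow querying Hive tables from Spark
      .getOrCreate()

    // Count entries whose path falls under a given directory prefix.
    val fileCount = spark.sql(
      "SELECT count(*) FROM fsimage WHERE path LIKE '/some/hdfs/dir/%'"
    ).first().getLong(0)

    println(s"Files under /some/hdfs/dir: $fileCount")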
I ended up using:
As a Scala beginner, I didn't know how to write count++ (the answer is count += 1). This actually works quite well.
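The code block this answer refers to isn't preserved above; a minimal sketch of the counting loop it describes, assuming the same fs and path as in the question, might look like:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val files = fs.listFiles(new Path("/some/hdfs/dir"), false)  // hypothetical path

    var count = 0
    while (files.hasNext) {
      files.next()   // advance the RemoteIterator
      count += 1     // Scala has no count++; use count += 1
    }
    println(s"Number of files: $count")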