Query and validate the data directly in Databricks, or convert it to a DB for faster queries?
We have a 10GB CSV file, and reading it and running validations on a normal machine is quite difficult, so we decided to use Databricks for this.
The 10GB file changes on a weekly basis, meaning we upload the 10GB data file once a week to pick up any changes.
We get client requests for validations very frequently, and we have to process them as soon as possible and provide results.
Option 1:
1. Keep the Databricks cluster running all the time.
2. For every client request:
# run the job
# get the validation output from Databricks itself
Option 2:
1. Once a week, load all the data into a database (a rough sketch of this load step follows this list):
# run the job (upload to DB)
# terminate/stop the cluster afterwards (since it runs only once a week; Databricks has an on-demand option)
2. For every client request, query the database directly and perform the validation.
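For context, here is a minimal sketch of what the weekly load step in Option 2 might look like in a Databricks notebook using PySpark. The paths, table name, and JDBC connection settings are placeholders I made up for illustration, not part of the original setup:

```python
# Minimal sketch of the weekly "upload to DB" job (Option 2).
# All paths, table names, and connection settings are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined as `spark` in Databricks notebooks

# Read the weekly 10GB CSV drop (placeholder path)
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/raw/weekly_dump/"))

# Either write to a Delta table that later validation jobs can query ...
df.write.format("delta").mode("overwrite").saveAsTable("weekly_data")

# ... or push it to an external database over JDBC (Option 2's "upload to DB");
# requires the matching JDBC driver on the cluster.
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://db-host:5432/validations")  # placeholder URL
   .option("dbtable", "weekly_data")
   .option("user", "app_user")
   .option("password", "***")
   .mode("overwrite")
   .save())
```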
Which is the more cost- and performance-effective solution for my use case, approach 1 or 2? Or are both approaches bad, and is there another standard method to achieve the same thing?
I don't have any background in big data or Databricks; if more details are needed, please let me know. I am also interested in learning how this kind of problem is actually solved in industry.
Answer:
If the data changes at a specific time, then trigger the validation process as a job at that specific time.
If your data can arrive at any time and has to be processed as soon as it is received, then the job cluster needs to be running all the time.
If there is any wiggle room for latency, then you can schedule the job every 'n' hour(s); this way you save money, and the data will be processed within the next 'n' hour(s).
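As a rough illustration of the "every n hours" option, a cron schedule can be attached when a job is created through the Databricks Jobs API. The host, token, notebook path, Spark version, and node type below are placeholders, and the field names should be double-checked against the Jobs API documentation for your workspace:

```python
# Hypothetical sketch: create a Databricks job scheduled to run every 4 hours.
# Host, token, notebook path, and cluster settings are placeholders; verify the
# payload fields against the Databricks Jobs API docs before using.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "csv-validation",
    "tasks": [{
        "task_key": "validate",
        "notebook_task": {"notebook_path": "/Repos/validation/run_validation"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
    # Quartz cron expression: fire at minute 0 of every 4th hour
    "schedule": {
        "quartz_cron_expression": "0 0 0/4 * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success
```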
Based on the choice you make, you can use Databricks' new COPY INTO statement, which will read only the new files.
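A minimal example of what that could look like; the table name and source path are placeholders, and COPY INTO skips files it has already ingested, so re-running it after each weekly drop only loads the new files:

```python
# Sketch: incrementally load only new CSV files into a Delta table with COPY INTO.
# `spark` is the SparkSession predefined in Databricks notebooks;
# table name and source path are placeholders.
spark.sql("CREATE TABLE IF NOT EXISTS weekly_data")  # schemaless target; schema is inferred on load

spark.sql("""
  COPY INTO weekly_data
  FROM '/mnt/raw/weekly_dump/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")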
Note: instead of uploading the CSV file as-is, try compressing it with bzip2; this way you can save some money on storage, and processing will be faster with fewer resources.
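Spark decompresses bzip2 files transparently based on the file extension, so no extra step is needed in the job to read the compressed upload (the path below is a placeholder):

```python
# Sketch: Spark reads .bz2 CSVs on the fly; bzip2 is also splittable,
# so the read can still be parallelised across the cluster. Path is a placeholder.
df = (spark.read
      .option("header", "true")
      .csv("/mnt/raw/weekly_dump/data.csv.bz2"))
df.printSchema()
```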
Cost effectiveness is a grey area; you need to find the sweet spot for your use case through trial and error.