Query and validate the data directly in Databricks, or convert it to a DB for faster queries?
We have a 10GB CSV file, and reading it and running validations on a normal machine is quite difficult, so we decided to use Databricks for this.
The 10GB file changes on a weekly basis, meaning we upload the 10GB data file once a week to pick up any changes.
We get client requests for validations very frequently, and we have to process them as soon as possible and provide results.
Option 1:
1. Keep the Databricks cluster running all the time.
2. For every client request:
# run the job
# get the validation output from Databricks itself
Option 2:
1. Once a week, load all the data into a database (a rough sketch of this load step follows this list):
# run the job (upload to DB)
# terminate/stop the cluster afterwards (since it runs only once a week; Databricks has an on-demand option)
2. For every client request, query the database directly and perform the validation.
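For context, here is a minimal sketch of what the weekly load step in Option 2 might look like in a Databricks notebook using PySpark. The paths, table name, and JDBC connection settings are placeholders I made up for illustration, not part of the original setup:

```python
# Minimal sketch of the weekly "upload to DB" job (Option 2).
# All paths, table names, and connection settings are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already defined as `spark` in Databricks notebooks

# Read the weekly 10GB CSV drop (placeholder path)
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/raw/weekly_dump/"))

# Either write to a Delta table that later validation jobs can query ...
df.write.format("delta").mode("overwrite").saveAsTable("weekly_data")

# ... or push it to an external database over JDBC (Option 2's "upload to DB");
# requires the matching JDBC driver on the cluster.
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://db-host:5432/validations")  # placeholder URL
   .option("dbtable", "weekly_data")
   .option("user", "app_user")
   .option("password", "***")
   .mode("overwrite")
   .save())
```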
Which is the more cost- and performance-effective solution for my use case, approach 1 or 2? Or are both approaches bad, and is there another standard method to achieve the same thing?
I don't have any background in big data or Databricks; if more details are needed, please let me know. I am also interested in learning how this kind of problem is actually solved in industry.
Answer:
If the data changes at a specific time, then trigger the validation process as a job at that specific time.
If your data can arrive at any time and has to be processed as soon as it is received, then the job cluster needs to be running all the time.
If there is any wiggle room for latency, then you can schedule the job every 'n' hour(s); this way you save money, and the data will be processed within the next 'n' hour(s).
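As a rough illustration of the "every n hours" option, a cron schedule can be attached when a job is created through the Databricks Jobs API. The host, token, notebook path, Spark version, and node type below are placeholders, and the field names should be double-checked against the Jobs API documentation for your workspace:

```python
# Hypothetical sketch: create a Databricks job scheduled to run every 4 hours.
# Host, token, notebook path, and cluster settings are placeholders; verify the
# payload fields against the Databricks Jobs API docs before using.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "csv-validation",
    "tasks": [{
        "task_key": "validate",
        "notebook_task": {"notebook_path": "/Repos/validation/run_validation"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
    # Quartz cron expression: fire at minute 0 of every 4th hour
    "schedule": {
        "quartz_cron_expression": "0 0 0/4 * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success
```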
Based on the choice you make, you can use Databricks' new COPY INTO statement, which will read only the new files.
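A minimal example of what that could look like; the table name and source path are placeholders, and COPY INTO skips files it has already ingested, so re-running it after each weekly drop only loads the new files:

```python
# Sketch: incrementally load only new CSV files into a Delta table with COPY INTO.
# `spark` is the SparkSession predefined in Databricks notebooks;
# table name and source path are placeholders.
spark.sql("CREATE TABLE IF NOT EXISTS weekly_data")  # schemaless target; schema is inferred on load

spark.sql("""
  COPY INTO weekly_data
  FROM '/mnt/raw/weekly_dump/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")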
Note: instead of uploading the CSV file as-is, try compressing it with bzip2; this way you can save some money on storage, and processing will be faster with fewer resources.
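Spark decompresses bzip2 files transparently based on the file extension, so no extra step is needed in the job to read the compressed upload (the path below is a placeholder):

```python
# Sketch: Spark reads .bz2 CSVs on the fly; bzip2 is also splittable,
# so the read can still be parallelised across the cluster. Path is a placeholder.
df = (spark.read
      .option("header", "true")
      .csv("/mnt/raw/weekly_dump/data.csv.bz2"))
df.printSchema()
```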
Cost effectiveness is a grey area; you need to find the sweet spot for your use case through trial and error.