GCP Cloud Logging costs increasing with Dataproc image version 2.0.39-ubuntu18
I have a Dataproc cluster with image version 2.0.39-ubuntu18, which seems to be putting all logs into Cloud Logging, and this is increasing our costs a lot.
I added the following properties - spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs - to stop using Cloud Logging, however that is not working: logs are still being redirected to Cloud Logging as well.
Here is the command used to create the Dataproc cluster:
REGION=us-east1
ZONE=us-east1-b
IMG_VERSION=2.0-ubuntu18
NUM_WORKER=3
# in versa-sml-googl
gcloud beta dataproc clusters create $CNAME \
--enable-component-gateway \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--no-address --master-machine-type $TYPE \
--master-boot-disk-size 100 \
--master-boot-disk-type pd-ssd \
--num-workers $NUM_WORKER \
--worker-machine-type $TYPE \
--worker-boot-disk-type pd-ssd \
--worker-boot-disk-size 500 \
--image-version $IMG_VERSION \
--autoscaling-policy versa-dataproc-autoscaling \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project $PROJECT \
--initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh' \
--metadata 'gcs-connector-version=2.0.0' \
--metadata 'bigquery-connector-version=1.2.0' \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=false,spark:spark.executor.instances=6,spark:spark.executor.cores=2,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs,spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2'
We have another Dataproc cluster (image version 1.4.37-ubuntu18) with a configuration similar to the 2.0-ubuntu18 cluster, but it does not seem to use Cloud Logging as much.
Attached are screenshots of the properties of both clusters.
What do I need to change to ensure the Dataproc jobs (PySpark) do not use Cloud Logging?
tia!
Comments (2)
I saw dataproc:dataproc.logging.stackdriver.job.driver.enable is set to true. By default, the value is false, which means driver logs will be saved to GCS and streamed back to the client for viewing, but they won't be saved to Cloud Logging. You can try disabling it. BTW, when it is enabled, the job driver logs will be available in Cloud Logging under the job resource (instead of the cluster resource).
If you want to disable Cloud Logging completely for a cluster, you can either add dataproc:dataproc.logging.stackdriver.enable=false when creating the cluster, or write an init action with systemctl stop google-fluentd.service. Both will stop Cloud Logging on the cluster's side, but using the property is recommended. See Dataproc cluster properties for the property.
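For illustration, a minimal sketch of both options; the cluster name, region, and script name below are placeholders, not values from this post:

# Option 1 (recommended): disable Cloud Logging for the cluster via a cluster property
gcloud dataproc clusters create my-cluster \
    --region us-east1 \
    --properties 'dataproc:dataproc.logging.stackdriver.enable=false'

Option 2 would be an init action like the hypothetical stop-fluentd.sh below, uploaded to GCS and passed with --initialization-actions:

#!/bin/bash
# stop-fluentd.sh: stop the Cloud Logging agent (google-fluentd) on this node
systemctl stop google-fluentd.service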
Here is the update on this (based on discussions with GCP Support):
In GCP Logging, we need to create a Log Routing sink with an inclusion filter - this will write the logs to BigQuery or Cloud Storage, depending on the target you specify.
Additionally, the _Default sink needs to be modified to add exclusion filters so that specific logs will NOT be redirected to GCP Logging.
Attached are screenshots of the _Default log sink and the inclusion sink for Dataproc.
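As a rough sketch of what those sink changes might look like with gcloud; the sink name, bucket, and filter below are illustrative assumptions, not the exact ones agreed with GCP Support:

# Inclusion sink: route Dataproc cluster logs to a Cloud Storage bucket
# (a BigQuery dataset destination works similarly; job driver logs use
#  resource.type="cloud_dataproc_job" and would need the same treatment)
gcloud logging sinks create dataproc-logs-to-gcs \
    storage.googleapis.com/my-dataproc-log-bucket \
    --log-filter='resource.type="cloud_dataproc_cluster"'

# Exclusion on the _Default sink so those same logs are no longer stored in Cloud Logging
gcloud logging sinks update _Default \
    --add-exclusion='name=exclude-dataproc,filter=resource.type="cloud_dataproc_cluster"'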