GCP Cloud Logging costs increasing with Dataproc image version 2.0.39-ubuntu18
I have a Dataproc cluster with image version 2.0.39-ubuntu18, which seems to be putting all logs into Cloud Logging, and this is increasing our costs a lot.
I added the following properties - spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs - to stop using Cloud Logging, however that is not working: logs are still being redirected to Cloud Logging as well.
Here is the command used to create the Dataproc cluster:
REGION=us-east1
ZONE=us-east1-b
IMG_VERSION=2.0-ubuntu18
NUM_WORKER=3
# in versa-sml-googl
gcloud beta dataproc clusters create $CNAME \
--enable-component-gateway \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--no-address --master-machine-type $TYPE \
--master-boot-disk-size 100 \
--master-boot-disk-type pd-ssd \
--num-workers $NUM_WORKER \
--worker-machine-type $TYPE \
--worker-boot-disk-type pd-ssd \
--worker-boot-disk-size 500 \
--image-version $IMG_VERSION \
--autoscaling-policy versa-dataproc-autoscaling \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project $PROJECT \
--initialization-actions 'gs://dataproc-spark-configs/pip_install.sh','gs://dataproc-spark-configs/connectors-feb1.sh' \
--metadata 'gcs-connector-version=2.0.0' \
--metadata 'bigquery-connector-version=1.2.0' \
--properties 'dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:job.history.to-gcs.enabled=true,spark:spark.dynamicAllocation.enabled=false,spark:spark.executor.instances=6,spark:spark.executor.cores=2,spark:spark.eventLog.dir=gs://dataproc-spark-logs/joblogs,spark:spark.history.fs.logDirectory=gs://dataproc-spark-logs/joblogs,spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2'
We have another Dataproc cluster (image version 1.4.37-ubuntu18) with a configuration similar to the 2.0-ubuntu18 cluster, but it does not seem to use Cloud Logging as much.
Attached are screenshots of the properties of both clusters.
What do I need to change to ensure the Dataproc jobs (PySpark) do not use Cloud Logging?
tia!
Comments (2)
I saw dataproc:dataproc.logging.stackdriver.job.driver.enable is set to true. By default, the value is false, which means driver logs will be saved to GCS and streamed back to the client for viewing, but they won't be saved to Cloud Logging. You can try disabling it. BTW, when it is enabled, the job driver logs will be available in Cloud Logging under the job resource (instead of the cluster resource).
If you want to disable Cloud Logging completely for a cluster, you can either add dataproc:dataproc.logging.stackdriver.enable=false when creating the cluster, or write an init action with systemctl stop google-fluentd.service. Both will stop Cloud Logging on the cluster's side, but using the property is recommended. See Dataproc cluster properties for the property.
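For illustration, a minimal sketch of both options; the cluster name, region, and script name below are placeholders, not values from this post:

# Option 1 (recommended): disable Cloud Logging for the cluster via a cluster property
gcloud dataproc clusters create my-cluster \
    --region us-east1 \
    --properties 'dataproc:dataproc.logging.stackdriver.enable=false'

Option 2 would be an init action like the hypothetical stop-fluentd.sh below, uploaded to GCS and passed with --initialization-actions:

#!/bin/bash
# stop-fluentd.sh: stop the Cloud Logging agent (google-fluentd) on this node
systemctl stop google-fluentd.service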
Here is the update on this (based on discussions with GCP Support):
In GCP Logging, we need to create a Log Routing sink with an inclusion filter - this will write the logs to BigQuery or Cloud Storage, depending on the target you specify.
Additionally, the _Default sink needs to be modified to add exclusion filters so that specific logs will NOT be redirected to GCP Logging.
Attached are screenshots of the _Default log sink and the inclusion sink for Dataproc.
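As a rough sketch of what those sink changes might look like with gcloud; the sink name, bucket, and filter below are illustrative assumptions, not the exact ones agreed with GCP Support:

# Inclusion sink: route Dataproc cluster logs to a Cloud Storage bucket
# (a BigQuery dataset destination works similarly; job driver logs use
#  resource.type="cloud_dataproc_job" and would need the same treatment)
gcloud logging sinks create dataproc-logs-to-gcs \
    storage.googleapis.com/my-dataproc-log-bucket \
    --log-filter='resource.type="cloud_dataproc_cluster"'

# Exclusion on the _Default sink so those same logs are no longer stored in Cloud Logging
gcloud logging sinks update _Default \
    --add-exclusion='name=exclude-dataproc,filter=resource.type="cloud_dataproc_cluster"'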