Sink for a user-activity data stream to train an ML model online
I am writing a consumer that consumes user activity data (activityId, userId, timestamp, cta, duration) from Google Pub/Sub, and I want to create a sink for it so that I can train my ML model in an online fashion.
Since this sink is the source from which I will fetch each user's last x (say 100) activities to update the ML model, storing the data in user-sharded form (in, say, a NoSQL DB such as Bigtable) would make retrieval easy. However, updates would be costly, because every time I receive an activity event for a user I would have to append to that user's value. Which type of sink should I consider in this situation?
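To make the update cost concrete, here is a minimal in-memory sketch (the names and store are illustrative, not a real Pub/Sub or NoSQL API) of the naive user-sharded approach, where each incoming event triggers a read-modify-write of the user's entire activity list:

```python
from dataclasses import dataclass

@dataclass
class Activity:
    activity_id: str
    user_id: str
    timestamp: int
    cta: str
    duration: float

# Naive user-sharded store: one value (a whole list) per user.
store: dict[str, list[Activity]] = {}

def append_activity(event: Activity) -> None:
    """Read-modify-write: fetch the user's full history, append, write back.
    In a real KV store this means re-serializing the whole value per event."""
    history = store.get(event.user_id, [])  # read
    history.append(event)                   # modify
    store[event.user_id] = history          # write back

def last_n(user_id: str, n: int = 100) -> list[Activity]:
    """Retrieval is easy: the value is already sharded by user."""
    return store.get(user_id, [])[-n:]
```

The write path grows linearly with history length, which is exactly the cost the question is trying to avoid.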

Comments (1)
Use Bigtable cell versions, with a garbage-collection policy set to keep the last 100 cell versions; while re-training/updating the ML model, iterate over the historical cell versions.
I will update this answer with the final read/write throughput and latencies.
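The cell-version idea can be mimicked in plain Python: each event becomes an independent write (no read-modify-write of a growing value), and a max-versions garbage-collection rule, modeled here with a `deque(maxlen=100)`, silently drops anything older than the last 100 cells. This is an in-memory analogy for the pattern, not the Bigtable client API:

```python
from collections import deque, defaultdict

MAX_VERSIONS = 100  # mirrors a Bigtable max-versions GC rule

# One "row" per user; each event is appended as a new cell version.
rows: dict[str, deque] = defaultdict(lambda: deque(maxlen=MAX_VERSIONS))

def write_event(user_id: str, event: dict) -> None:
    """Cheap write: a single cell insert; old versions age out automatically."""
    rows[user_id].append(event)

def read_versions(user_id: str) -> list[dict]:
    """Iterate the retained historical cell versions (oldest first)
    when re-training/updating the model."""
    return list(rows[user_id])
```

The key property is that write cost is constant per event, while the GC policy bounds both storage and the number of versions a training read has to scan.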