Systematic sampling in PySpark
I’m quite new to PySpark and I’ve been struggling to find the answer I’m looking for.
I have a large sample of households and I want to conduct systematic sampling. Like true systematic sampling, I would like to begin at a random starting point and then select a household at regular intervals (e.g. every 50th household). I have looked into sample() and sampleBy(), but I don't think these are quite what I need. Can anyone give any advice on how I can do this? Many thanks in advance for your help!
2 Answers
monotonically_increasing_id gives you consecutive IDs only if the DataFrame has a single partition, so if you have more than one partition, you can consider row_number instead. Check the "Notes" section in https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html
With row_number (or monotonically_increasing_id on a single partition), you might want to index the rows and then take the index modulo 50 to get what you want.
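Putting that together, here is a minimal sketch of systematic sampling with a random start and a fixed interval, assuming a DataFrame named `households` with a `household_id` column to order by (both names are made up for illustration):

```python
import random

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the real household data.
households = spark.range(100000).withColumnRenamed("id", "household_id")

interval = 50                        # take every 50th household
start = random.randrange(interval)   # random start in [0, interval)

# Number the rows 1..N in a deterministic order, then keep every
# row whose zero-based index falls on the sampling lattice.
w = Window.orderBy("household_id")
sample = (
    households
    .withColumn("rn", F.row_number().over(w))
    .where((F.col("rn") - 1) % interval == start)
    .drop("rn")
)

sample.show(5)
```

Note that the window has no partitionBy, so the sort is pulled onto a single partition; for a frame too large for that, `rdd.zipWithIndex()` is one way to get a global consecutive index without the single-partition window.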