How much is "large"? (data set)
Assume infinite storage, so that size/volume/physical metrics (gigabytes/terabytes) don't matter, only the number of elements and their labels. Statistically, a pattern should already emerge at 30 subsets, but would you agree that fewer than 1,000 subsets is too few to test with, and that at least 10,000 distinct subsets / "elements" / "entries" / entities counts as "a large data set"? Or should it be larger?
Thanks
Comments (1)
I'm not sure I understand your question, but it sounds like you are asking how many elements of a data set you need to sample in order to ensure a certain degree of accuracy (30 is a magic number from the Central Limit Theorem that comes into play frequently).
If that is the case, the sample size you need depends on the confidence level and confidence interval. If you want a 95% confidence level and a 5% confidence interval (i.e. you want to be 95% confident that the proportion you determine from your sample is within 5% of the proportion in the full data set), you end up needing a sample size of no more than 385 elements. The greater the confidence level and the smaller the confidence interval that you want to generate, the larger the sample size you need.
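For reference, the 385 figure comes from the standard formula for the sample size needed to estimate a proportion, n = z² · p(1 − p) / e², evaluated at the worst case p = 0.5. Here is a minimal sketch in Python (the function name and defaults are illustrative, not from the original answer; requires Python 3.8+ for statistics.NormalDist):

```python
# Minimal sketch of the standard sample-size formula for estimating a
# proportion: n = z^2 * p * (1 - p) / e^2, using worst-case p = 0.5.
# Function name and defaults are illustrative assumptions.
import math
from statistics import NormalDist

def sample_size(confidence: float = 0.95, margin: float = 0.05, p: float = 0.5) -> int:
    """Smallest n such that a sample proportion lands within `margin` of the
    true proportion at the given two-sided confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # e.g. ~1.96 for 95%
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size())            # -> 385   (95% confidence, +/- 5%)
print(sample_size(0.99, 0.01))  # -> 16588 (99% confidence, +/- 1%)
```

Note that this is the large-population approximation; for small populations a finite-population correction shrinks the required n further, which is why the answer above says "no more than 385".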
Here is a nice discussion of the mathematics of determining sample size, and a handy sample size calculator if you just want to run the numbers.