Can I define data filters with an Intake catalog?
I would like to use Intake not only to link to published datasets, but also to filter them in the catalog itself. Filtering is trivial to do in Python once you open the data, but this means giving the user code in addition to the metadata in order to provide some guidance.
Motivation: often the user is not as familiar with the dataset as the producer, and it would be nice to do some preprocessing for them without adding a series of different filtering steps in Python.
For example, if we have already opened a CSV, we can filter with:
df[df['rain'] > 70]
but I don't see any arguments in read_csv for either pandas or dask to do this.
Comments (1)
There is, indeed, no way to pass a filter to pandas' or dask's read_csv functions, and therefore this is not an option supported by Intake's CSV driver.
However, Intake does support dataset transforms: https://intake.readthedocs.io/en/latest/transforms.html. This means that you can operate on the output of one data source and assign a new catalogue entry to the result. The transform/computation is performed on every access; the filtered dataset is not stored anywhere (unless you also use the persist functionality).
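As a rough sketch of how such a transform entry might look, assuming the legacy (pre-2.0) Intake API described on the transforms page: the filter lives in an importable function, and a second catalogue entry points at it through the `intake.source.derived.DataFrameTransform` driver. The module name `my_filters`, the entry names `rain_raw` and `rainy`, and the file name `rain.csv` are all hypothetical, and the driver arguments (`targets`, `transform`, `transform_kwargs`) should be checked against the Intake version you have installed.

```python
# my_filters.py (hypothetical module; it must be importable wherever
# the catalogue is opened)


def rain_above(df, threshold=70):
    """Keep only the rows whose 'rain' value exceeds `threshold`.

    Boolean-mask indexing like this works on both pandas and dask
    dataframes, so the function does not need to import either library.
    """
    return df[df["rain"] > threshold]


# Catalogue text, kept as a string here only so the sketch stays in one
# file. Assumption: legacy Intake's DataFrameTransform driver, which is
# expected to call transform(dataframe, **transform_kwargs).
CATALOG_YAML = """\
sources:
  rain_raw:
    description: Published rainfall data (hypothetical file name)
    driver: csv
    args:
      urlpath: "{{ CATALOG_DIR }}/rain.csv"
  rainy:
    description: Rows of rain_raw with rain above 70
    driver: intake.source.derived.DataFrameTransform
    args:
      targets:
        - rain_raw
      transform: "my_filters.rain_above"
      transform_kwargs:
        threshold: 70
"""

# Hypothetical usage, after saving CATALOG_YAML as catalog.yml:
#   import intake
#   cat = intake.open_catalog("catalog.yml")
#   df = cat.rainy.read()   # the filter is re-applied on every access
```

With this layout the user only needs the entry name `rainy`; the threshold is recorded in the catalogue rather than in their own code.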