Concat sorted Dask DataFrames

Posted on 2025-01-23 20:02:45

I have N Dask DataFrames, each sorted by the ts column (no index set). I would like to concatenate all of them into a single DataFrame that is still sorted by ts.

Note: the ts ranges of different DataFrames can overlap.

Can someone recommend an efficient way to implement this?
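
To make the goal concrete: on plain in-memory sequences, what I am after is a k-way merge of already-sorted inputs. The sketch below uses only the standard library; the frame_a, frame_b and frame_c lists are hypothetical stand-ins for my DataFrames, and this is not Dask code, just an illustration of the desired result when the ts ranges overlap.

```python
import heapq

# Hypothetical stand-ins for the sorted inputs: (ts, payload) rows,
# each list already sorted by ts, with overlapping ts ranges.
frame_a = [(1, "a1"), (4, "a2"), (9, "a3")]
frame_b = [(2, "b1"), (4, "b2"), (7, "b3")]
frame_c = [(3, "c1"), (8, "c2")]

# heapq.merge lazily merges inputs that are already sorted;
# key tells it to compare rows by their ts value only.
merged = list(heapq.merge(frame_a, frame_b, frame_c, key=lambda row: row[0]))

for ts, payload in merged:
    print(ts, payload)
```

The merged rows come out ordered by ts without re-sorting everything from scratch, which is the property I would like to keep at Dask scale.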

UPDATE:

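import dask.dataframe as dd
import pandas as pd

# PRODUCTS, PRODUCT_NAMESPACE, PRODUCT_MESSAGE_TYPE, PRODUCT_EXPECTED_CHANNELS,
# storage and P are not shown here.
dfs = []
for product in PRODUCTS:
    namespace = PRODUCT_NAMESPACE[product]
    message_type = PRODUCT_MESSAGE_TYPE[product]

    num_expected_channels = PRODUCT_EXPECTED_CHANNELS[product]
    for channel in range(num_expected_channels):
        # Load the data of one channel for one date.
        df = storage.load(
            namespace,
            partition_filter=(P.date == '2022-02-01') & (P.channel == str(channel)),
        )

        # Tag each row with its product and message type, convert to
        # categorical dtypes, and drop the columns that are no longer needed.
        df = df.assign(
            product=product,
            message_type=message_type,
        ).astype(
            dict(
                dtype='category',
                product=pd.api.types.CategoricalDtype(PRODUCTS),
                message_type=pd.api.types.CategoricalDtype(['trade', 'quote']),
            )
        ).drop(columns=['channel', 'feed'])

        # The data is already sorted by ts, so sorted=True avoids a shuffle;
        # drop=False keeps ts as a regular column as well.
        df = df.set_index('ts', sorted=True, drop=False).persist()

        dfs.append(df)

# Concatenate with interleaved partitions, then sort each partition by index.
df = dd.concat(dfs, interleave_partitions=True)
df = df.map_partitions(lambda pdf: pdf.sort_index())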
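
My reasoning for the last two lines, as a simplified pure-Python model: I am assuming that interleave_partitions=True regroups the rows into partitions that cover non-overlapping index ranges, so that sorting inside each partition is enough for a global order. The function name, the boundaries and the sample rows below are hypothetical, and this is not how Dask is implemented internally.

```python
from bisect import bisect_right


def interleave_and_sort(frames, boundaries):
    """Regroup rows from several ts-sorted inputs into ts-range buckets,
    then sort every bucket on its own."""
    # One bucket per ts range: (-inf, b0), [b0, b1), ..., [b_last, +inf).
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for frame in frames:
        for row in frame:
            buckets[bisect_right(boundaries, row[0])].append(row)
    # Sorting within a bucket is sufficient because buckets do not overlap.
    return [sorted(bucket, key=lambda row: row[0]) for bucket in buckets]


# Two hypothetical inputs, each sorted by ts, with overlapping ts ranges.
frames = [
    [(1, "a"), (5, "a"), (9, "a")],
    [(2, "b"), (5, "b"), (8, "b")],
]
partitions = interleave_and_sort(frames, boundaries=[4, 7])

# Reading the partitions in order gives rows that are globally sorted by ts.
flat = [row for part in partitions for row in part]
print(flat)
```

If that assumption about the partition ranges does not hold, the per-partition sort alone would not guarantee a globally sorted result.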
