Palantir TypeScript top N sorting in Workshop

Posted on 2025-02-13 20:39:59


I am new to the concepts of TypeScript and Workshop in Palantir.

1) I would like to be able to show the top GROUPs by a certain characteristic (for example the total AMOUNT). There are more than 12,000 GROUPs (thus exceeding the Palantir limit).

To clarify: there are more than 12,000 categories, and I would like the sum (of the column AMOUNT) for each category!

Is there any way to avoid approximation of the results and show the top 100 GROUPs (in descending order) by amount? I wish I could do this both through a Pivot Table and through a Histogram. I would also like to be able to show the correct sum of the amounts of the top 100 GROUPs (or, even better, of the selected histogram bars).

2) Moreover, on the histogram, is it possible to select several bars at the same time (not just 1 and not all)?

3) Could I filter out groups whose post-aggregation value is below a certain threshold? (Not from the initial dataset, but from the aggregation.)
Could I save a pivot output in an Object Set?

I guess both questions are solvable through a function; could you kindly provide the TypeScript code to achieve this?

Thank you!!


Comments (4)

毁梦 2025-02-20 20:39:59


You can do 1 and 3 in a transform instead of trying to do this aggregation in Quiver or Workshop directly on top of the objects. This will produce a pre-aggregated dataset that you can then create an object from. You'll then be able to use it in Workshop where you need the aggregation.

This is likely to work best for your data scale, as you mention you have 12000 buckets.

If you want to use a cut-off, the transform might look like:

from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/path/to/route_aggregation"),
    source_df=Input("/path/to/flights"),
)
def compute(source_df):
    return (
        source_df
        .groupBy(F.col("route_id"))
        .agg(F.sum("distance").alias("total_distance_travelled"))
        .filter(F.col("total_distance_travelled") > 500000)
    )

or if you only want the top 100:

from pyspark.sql import functions as F, Window as W
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/path/to/route_aggregation"),
    source_df=Input("/path/to/flights"),
)
def compute(source_df):
    return (
        source_df
        .groupBy(F.col("route_id"))
        .agg(F.sum("distance").alias("total_distance_travelled"))
        .withColumn("row_number", F.row_number().over(W.orderBy(F.col("total_distance_travelled").desc())))
        .filter(F.col("row_number") <= 100)
    )

If you are using write back on the objects, you should use the object's write back dataset to ensure the aggregations stay in sync with any edits you make via actions.

The downside of this approach is that you have to maintain another object type, and it's less flexible than the others (to make a change you have to change the transform logic and potentially update your ontology definition). It also means there may be times when the aggregations and individual object values are out of sync, as it takes time for the dataset to be updated. Additionally, these aggregation objects can clutter your ontology and aren't really 'objects' in the true sense.

One way you can avoid these downsides is if you can attach this aggregation to an existing object where it makes sense to do so. This works if you have ontology objects to represent the categories. For example, if you were grouping flights by route and then summing the distance travelled, rather than adding a new aggregation object you could just add a property to the route e.g. 'total distance travelled on route'.
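As a quick sanity check of the approach, the exact (non-approximated) top-N-by-sum logic that the transforms above compute can be sketched in plain Python; the sample rows and the helper name `top_n_by_sum` are made up for illustration:

```python
from collections import defaultdict


def top_n_by_sum(rows, n):
    """Group (category, amount) rows, sum per category, and return the
    n largest groups in descending order together with their exact total."""
    sums = defaultdict(float)
    for category, amount in rows:
        sums[category] += amount
    top = sorted(sums.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return top, sum(v for _, v in top)


rows = [("A", 10), ("B", 5), ("A", 7), ("C", 20), ("B", 1)]
top, total = top_n_by_sum(rows, 2)
print(top)    # [('C', 20.0), ('A', 17.0)]
print(total)  # 37.0
```

This mirrors the groupBy + sum + row_number <= N pipeline: the totals are exact because every row is aggregated before the cut is applied, which is what avoids the approximation you see when aggregating over 12,000 buckets directly in the front end.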

坏尐絯℡ 2025-02-20 20:39:59


If you want to aggregate lots of data with many categories into a histogram for filtering, you might want to consider using Contour instead of Workshop. The built-in 'Histogram' board seems ideal for this, for example:

[Image: Configuration of the Contour Histogram board, showing the y-axis column set to 'route_id' and the x-axis set to display the 'Sum' of 'distance']

This will allow you to select multiple groups, and these filters are then applied to all boards beneath the Histogram board.

You can use Contour to create a dashboard, which can have similar functionality to some Workshop applications that don't require actions. You can also use Contour to build a pre-aggregated dataset (use the 'Switch to pivoted data' button and then 'Save as dataset' at the bottom) which can back an object to be used in Workshop.

苏璃陌 2025-02-20 20:39:59


This may have issues if you are trying to aggregate into more than 1,000 buckets. If your object set is larger than this (or you expect it to be in the near future), use a different method.

You could create a Foundry Function in TypeScript to compute the aggregations, and then use this function to populate a Workshop table. For example to find the routes with the greatest total distance travelled along them (i.e. the group by property here is routeId and the summed property is distance):

import { Function, TwoDimensionalAggregation, BucketKey, BucketValue } from "@foundry/functions-api";
import { Objects, ExampleDataFlight } from "@foundry/ontology-api";

export class MyFunctions {
    @Function()
    public async aircraftAggregationExample(): Promise<TwoDimensionalAggregation<string>> {
        const aggregation = await Objects.search().exampleDataFlight()
                 .filter(o => o.distance.range().gt(0))
                 .groupBy(o => o.routeId.topValues())
                 .sum(o => o.distance);

        return sortBucketsByValue2D(aggregation, 'desc');
    }
}

/**
 * Sort buckets of a 2D aggregation by their value in the specified order
 * 
 * Example input 1:
 * { buckets: [
 *   { key: { min: "2022-01-01", max: "2022-12-31" }, value: 456 },
 *   { key: { min: "2021-01-01", max: "2021-12-31" }, value: 123 },
 *   { key: { min: "2023-01-01", max: "2023-12-31" }, value: 789 },
 * ]}
 * 
 * Example output 1:
 * { buckets: [
 *   { key: { min: "2021-01-01", max: "2021-12-31" }, value: 123 },
 *   { key: { min: "2022-01-01", max: "2022-12-31" }, value: 456 },
 *   { key: { min: "2023-01-01", max: "2023-12-31" }, value: 789 },
 * ]}
 * 
 * Example input 2:
 * { buckets: [
 *   { key: 17, value: 456 },
 *   { key: 21, value: 123 },
 *   { key: 23, value: 789 },
 * ]}
 * 
 * Example output 2:
 * { buckets: [
 *   { key: 21, value: 123 },
 *   { key: 17, value: 456 },
 *   { key: 23, value: 789 },
 * ]}
 */
function sortBucketsByValue2D<K extends BucketKey, V extends BucketValue>(
    buckets: TwoDimensionalAggregation<K, V>,
    order: 'asc' | 'desc' = 'asc'
): TwoDimensionalAggregation<K, V> {
    return {
        // See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort
        buckets: buckets.buckets.sort(({ value: v1 }, { value: v2 }) => {
            // These can be either numbers, Timestamps, or LocalDates, all of which can be compared like this
            return (order === 'desc' ? -1 : 1) * (v1.valueOf() - v2.valueOf());
        }),
    };
}
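Workshop will respect the order the function returns, but if you also want to cap the result at the top 100 and show the exact sum of what is kept (per question 1), you can trim the sorted buckets before returning them. A minimal sketch in plain TypeScript (the `Bucket` shape and the helper names `topNBuckets` / `sumBuckets` are illustrative, not Foundry API):

```typescript
// Keep only the top N buckets from a sorted aggregation and compute the
// exact sum of the selected buckets. The bucket shape mirrors the
// aggregation result above; sample data is made up.
interface Bucket {
    key: string;
    value: number;
}

// Keep the N largest buckets, assuming `buckets` is already sorted descending.
function topNBuckets(buckets: Bucket[], n: number): Bucket[] {
    return buckets.slice(0, n);
}

// Exact (non-approximated) total of the kept buckets.
function sumBuckets(buckets: Bucket[]): number {
    return buckets.reduce((total, b) => total + b.value, 0);
}

const sorted: Bucket[] = [
    { key: "JFK-LHR", value: 789 },
    { key: "SFO-NRT", value: 456 },
    { key: "CDG-DXB", value: 123 },
];

const top2 = topNBuckets(sorted, 2);
console.log(top2.map(b => b.key)); // [ 'JFK-LHR', 'SFO-NRT' ]
console.log(sumBuckets(top2));     // 1245
```

The same slice-and-sum could be applied inside `aircraftAggregationExample` after `sortBucketsByValue2D`, keeping the total consistent with exactly the buckets that are displayed.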

恏ㄋ傷疤忘ㄋ疼 2025-02-20 20:39:59


This may have issues if you are trying to aggregate into more than 1,000 buckets. If your object set is larger than this (or you expect it to be in the near future), use a different method.

The easiest way to group and then sort on an aggregate in Workshop is to add a pivot table. It can be configured as follows:

  • Row grouping(s): the category to group by
  • Aggregation(s): SUM(AMOUNT)

Then press the dropdown on the aggregation column in the preview and select 'Sort descending'.

If you have a lot of data, you might get the error: "Too many values for , not all are displayed. Filter your data for more accurate results." In this case, you should either pre-filter your data if possible (e.g. remove irrelevant or zero values) or use one of the other approaches.
