Palantir TypeScript top N sorting in Workshop

Posted on 2025-02-13 20:39:59


I am new to the concepts of TypeScript and Workshop in Palantir.

1) I would like to be able to show the top GROUPs by a certain characteristic (for example the total AMOUNT). There are more than 12,000 GROUPs (thus exceeding the Palantir limit).

To clarify: there are more than 12,000 categories, and I would like the sum (of the column AMOUNT) for each category!

Is there any way to avoid approximation of the results and show the top 100 GROUPs (in descending order) by amount? I wish I could do this both through a Pivot Table and through a Histogram. I would also like to be able to show the correct sum of the amounts of the top 100 GROUPs (or, even better, of the selected histogram bars).

2) Moreover, on the histogram, is it possible to select several bars at the same time (not just 1 and not all)?

3) Could I filter out groups whose post-aggregation value is below a certain threshold? (Not from the initial dataset, but from the aggregation.)
Could I save a pivot output in an Object Set?

I guess both questions are solvable through a function; could you kindly provide the TypeScript code to achieve this?

Thank you!!


Comments (4)

毁梦 2025-02-20 20:39:59


You can do 1 and 3 in a transform instead of trying to do this aggregation in Quiver or Workshop directly on top of the objects. This will produce a pre-aggregated dataset that you can then create an object from. You'll then be able to use it in Workshop where you need the aggregation.

This is likely to work best for your data scale, as you mention you have 12000 buckets.

If you want to use a cut-off, the transform might look like:

from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/path/to/route_aggregation"),
    source_df=Input("/path/to/flights"),
)
def compute(source_df):
    return (
        source_df
        .groupBy(F.col("route_id"))
        .agg(F.sum("distance").alias("total_distance_travelled"))
        .filter(F.col("total_distance_travelled") > 500000)
    )

or if you only want the top 100:

from pyspark.sql import functions as F, Window as W
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/path/to/route_aggregation"),
    source_df=Input("/path/to/flights"),
)
def compute(source_df):
    return (
        source_df
        .groupBy(F.col("route_id"))
        .agg(F.sum("distance").alias("total_distance_travelled"))
        .withColumn("row_number", F.row_number().over(W.orderBy(F.col("total_distance_travelled").desc())))
        .filter(F.col("row_number") <= 100)
    )

If you are using write back on the objects, you should use the object's write back dataset to ensure the aggregations stay in sync with any edits you make via actions.

The downside of this approach is that you have to maintain another object type, and it's less flexible than the others (to make a change you have to change the transform logic and potentially update your ontology definition). It also means there may be times when the aggregations and individual object values are out of sync, as it takes time for the dataset to be updated. Additionally, these aggregation objects can clutter your ontology and aren't really 'objects' in the true sense.

One way you can avoid these downsides is if you can attach this aggregation to an existing object where it makes sense to do so. This works if you have ontology objects to represent the categories. For example, if you were grouping flights by route and then summing the distance travelled, rather than adding a new aggregation object you could just add a property to the route e.g. 'total distance travelled on route'.
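As a quick sanity check of the approach, the exact (non-approximated) top-N-by-sum logic that the transforms above compute can be sketched in plain Python; the sample rows and the helper name `top_n_by_sum` are made up for illustration:

```python
from collections import defaultdict


def top_n_by_sum(rows, n):
    """Group (category, amount) rows, sum per category, and return the
    n largest groups in descending order together with their exact total."""
    sums = defaultdict(float)
    for category, amount in rows:
        sums[category] += amount
    top = sorted(sums.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return top, sum(v for _, v in top)


rows = [("A", 10), ("B", 5), ("A", 7), ("C", 20), ("B", 1)]
top, total = top_n_by_sum(rows, 2)
print(top)    # [('C', 20.0), ('A', 17.0)]
print(total)  # 37.0
```

This mirrors the groupBy + sum + row_number <= N pipeline: the totals are exact because every row is aggregated before the cut is applied, which is what avoids the approximation you see when aggregating over 12,000 buckets directly in the front end.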

坏尐絯℡ 2025-02-20 20:39:59


If you want to aggregate lots of data with many categories into a histogram for filtering, you might want to consider using Contour instead of Workshop. The built-in 'Histogram' board seems ideal for this, for example:

[Image: Configuration of the Contour Histogram board, showing the y-axis column set to 'route_id' and the x-axis set to display the 'Sum' of 'distance']

This will allow you to select multiple groups, and these filters are then applied to all boards beneath the Histogram board.

You can use Contour to create a dashboard, which can have similar functionality to some Workshop applications that don't require actions. You can also use Contour to build a pre-aggregated dataset (use the 'Switch to pivoted data' button and then 'Save as dataset' at the bottom) which can back an object to be used in Workshop.

苏璃陌 2025-02-20 20:39:59


This may have issues if you are trying to aggregate into more than 1,000 buckets. If your object set is larger than this (or you expect it to be in the near future), use a different method.

You could create a Foundry Function in TypeScript to compute the aggregations, and then use this function to populate a Workshop table. For example to find the routes with the greatest total distance travelled along them (i.e. the group by property here is routeId and the summed property is distance):

import { Function, TwoDimensionalAggregation, BucketKey, BucketValue } from "@foundry/functions-api";
import { Objects, ExampleDataFlight } from "@foundry/ontology-api";

export class MyFunctions {
    @Function()
    public async aircraftAggregationExample(): Promise<TwoDimensionalAggregation<string>> {
        const aggregation = await Objects.search().exampleDataFlight()
                 .filter(o => o.distance.range().gt(0))
                 .groupBy(o => o.routeId.topValues())
                 .sum(o => o.distance);

        return sortBucketsByValue2D(aggregation, 'desc');
    }
}

/**
 * Sort buckets of a 2D aggregation by their value in the specified order
 * 
 * Example input 1:
 * { buckets: [
 *   { key: { min: "2022-01-01", max: "2022-12-31" }, value: 456 },
 *   { key: { min: "2021-01-01", max: "2021-12-31" }, value: 123 },
 *   { key: { min: "2023-01-01", max: "2023-12-31" }, value: 789 },
 * ]}
 * 
 * Example output 1:
 * { buckets: [
 *   { key: { min: "2021-01-01", max: "2021-12-31" }, value: 123 },
 *   { key: { min: "2022-01-01", max: "2022-12-31" }, value: 456 },
 *   { key: { min: "2023-01-01", max: "2023-12-31" }, value: 789 },
 * ]}
 * 
 * Example input 2:
 * { buckets: [
 *   { key: 17, value: 456 },
 *   { key: 21, value: 123 },
 *   { key: 23, value: 789 },
 * ]}
 * 
 * Example output 2:
 * { buckets: [
 *   { key: 21, value: 123 },
 *   { key: 17, value: 456 },
 *   { key: 23, value: 789 },
 * ]}
 */
function sortBucketsByValue2D<K extends BucketKey, V extends BucketValue>(
    buckets: TwoDimensionalAggregation<K, V>,
    order: 'asc' | 'desc' = 'asc'
): TwoDimensionalAggregation<K, V> {
    return {
        // See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort
        buckets: buckets.buckets.sort(({ value: v1 }, { value: v2 }) => {
            // These can be either numbers, Timestamps, or LocalDates, all of which can be compared like this
            return (order === 'desc' ? -1 : 1) * (v1.valueOf() - v2.valueOf());
        }),
    };
}
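Workshop will respect the order the function returns, but if you also want to cap the result at the top 100 and show the exact sum of what is kept (per question 1), you can trim the sorted buckets before returning them. A minimal sketch in plain TypeScript (the `Bucket` shape and the helper names `topNBuckets` / `sumBuckets` are illustrative, not Foundry API):

```typescript
// Keep only the top N buckets from a sorted aggregation and compute the
// exact sum of the selected buckets. The bucket shape mirrors the
// aggregation result above; sample data is made up.
interface Bucket {
    key: string;
    value: number;
}

// Keep the N largest buckets, assuming `buckets` is already sorted descending.
function topNBuckets(buckets: Bucket[], n: number): Bucket[] {
    return buckets.slice(0, n);
}

// Exact (non-approximated) total of the kept buckets.
function sumBuckets(buckets: Bucket[]): number {
    return buckets.reduce((total, b) => total + b.value, 0);
}

const sorted: Bucket[] = [
    { key: "JFK-LHR", value: 789 },
    { key: "SFO-NRT", value: 456 },
    { key: "CDG-DXB", value: 123 },
];

const top2 = topNBuckets(sorted, 2);
console.log(top2.map(b => b.key)); // [ 'JFK-LHR', 'SFO-NRT' ]
console.log(sumBuckets(top2));     // 1245
```

The same slice-and-sum could be applied inside `aircraftAggregationExample` after `sortBucketsByValue2D`, keeping the total consistent with exactly the buckets that are displayed.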

恏ㄋ傷疤忘ㄋ疼 2025-02-20 20:39:59


This may have issues if you are trying to aggregate into more than 1,000 buckets. If your object set is larger than this (or you expect it to be in the near future), use a different method.

The easiest way to group and then sort on an aggregate in Workshop is to add a pivot table. It can be configured as follows:

  • Row grouping(s): the category to group by
  • Aggregation(s): SUM(AMOUNT)

Then press the dropdown on the aggregation column in the preview and select 'Sort descending'.

If you have a lot of data, you might get the error: "Too many values for , not all are displayed. Filter your data for more accurate results." In this case, you should either pre-filter your data if possible (e.g. remove irrelevant or zero values) or use one of the other approaches.
