Sorting and splitting together in DFSORT?

Posted on 2024-11-29 03:59:52


Input file layout (start position, length):
01, 10 - 10-digit account number
53, 01 - An indicator with the value 'Y' or 'N'
71, 10 - Timestamp
(The rest of the fields are insignificant for this sort.)

Sorting the input file, splitting it, and eliminating duplicates in the two ways shown below gives different results. I want to know why.

Case i: Splitting and eliminating duplicates in the same step.

SORT FIELDS=(01,10,CH,A,53,01,CH,A)
SUM FIELDS=NONE
OUTFIL FILES=01,
  INCLUDE=(53,01,CH,EQ,C'Y',AND,71,10,CH,GT,&DATE2(-))
OUTFIL FILES=02,
  INCLUDE=(53,01,CH,EQ,C'N',AND,71,10,CH,GT,&DATE2(-))

Case ii: Splitting and eliminating duplicates in two different steps:

STEP:01
SORT FIELDS=(01,10,CH,A,53,01,CH,A)
SUM FIELDS=NONE

STEP:02
SORT FIELDS=COPY
OUTFIL FILES=01,
  INCLUDE=(53,01,CH,EQ,C'Y',AND,71,10,CH,GT,&DATE2(-))
OUTFIL FILES=02,
  INCLUDE=(53,01,CH,EQ,C'N',AND,71,10,CH,GT,&DATE2(-))

These two cases produce different output. Do you see any difference between them? Please clarify.


把时间冻结 2024-12-06 03:59:52


You are asking to sort on an Account Number (10 characters ascending) and then on an Indicator (1 character ascending).
These two fields alone determine the key of the record; Timestamp is not part of the sort key. Consequently, if there
are two or more records with the same key, the sort may place them in any (effectively random) order. There is no telling
in what order the Timestamp values will appear.

Keeping the above in mind, consider what happens when you have two records with the same key but different
Timestamp values. One of these Timestamp values meets the given INCLUDE criteria and the other one doesn't.
The SUM FIELDS=NONE parameter asks DFSORT to remove duplicates based on the key. It does this by grouping
all of the records with the same key together and then selecting the last one in the group. Since the key
does not include the Timestamp, the chosen record is essentially a random event. Consequently, it is unpredictable
whether you get the record that meets the subsequent INCLUDE condition.

There are a couple of ways to fix this:

Add Timestamp to the sort key. This might not work, because it may leave multiple records for the same Account Number / Indicator; that is, it may break your duplicate-removal requirement.
Request a stable sort.

A stable sort causes records having the same sort key to keep their relative positions after the sort.
This preserves the original order of the Timestamp values in your file for records with the same key. When the removal of duplicates occurs, DFSORT will choose the last record from the set of duplicates. This should bring the predictability you are looking for to the duplicate elimination process. Specify
a stable sort by adding an OPTION EQUALS control card before the SORT card.
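
A minimal sketch of the stable-sort variant, assuming the record layout given in the question. Only the OPTION EQUALS statement is new relative to the asker's cards. As far as IBM's documentation describes it, when EQUALS is in effect SUM retains the first record of each set of duplicates, in input order.

```
* Stable sort: records with equal keys keep their input order
  OPTION EQUALS
  SORT FIELDS=(1,10,CH,A,53,1,CH,A)
* Keep one record per account number / indicator key
  SUM FIELDS=NONE
```

Because the retained record is now determined by input order, the same input file should give the same result whether the split is done in the same step or in a later copy step.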

EDIT Comment: ...picks the VERY FIRST record

The book I based my original answer on clearly stated that the last record in a group of records with the same
key would be selected when SUM FIELDS=NONE is specified. However, it is always
best to consult the vendor's own manuals. IBM's DFSORT Application Programming Guide only states
that one record with each key will be selected. However,
it also has the following note:

The FIRST operand of ICETOOL's SELECT operator can be used to perform the same
function as SUM FIELDS=NONE with OPTION EQUALS. Additionally, SELECT's FIRSTDUP,
ALLDUPS, NODUPS, HIGHER(x), LOWER(y), EQUAL(v), LASTDUP, and LAST operands can be
used to select records based on other criteria related to duplicate and non-duplicate
keys. SELECT's DISCARD(savedd) operand can be used to save the records discarded by
FIRST, FIRSTDUP, ALLDUPS, NODUPS, HIGHER(x), LOWER(y), EQUAL(v), LASTDUP, or
LAST. See SELECT Operator for complete details on the SELECT operator.

Based on this information I would suggest using ICETOOL's SELECT operator to select the correct record.
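
As an illustration only, a SELECT step for this layout might look like the sketch below. The step name, the DD names IN and OUT, the data set names, and the SPACE values are placeholders that do not come from the question. FIRST keeps the first record, in input order, for each account number / indicator key; LAST could be substituted if the last record is the one wanted.

```
//DEDUP    EXEC PGM=ICETOOL
//TOOLMSG  DD SYSOUT=*
//DFSMSG   DD SYSOUT=*
//* IN and OUT are placeholder DD names and data set names
//IN       DD DSN=YOUR.INPUT.FILE,DISP=SHR
//OUT      DD DSN=YOUR.DEDUPED.FILE,DISP=(NEW,CATLG,DELETE),
//            SPACE=(CYL,(5,5),RLSE)
//TOOLIN   DD *
* One record per key: account number (1,10) and indicator (53,1)
  SELECT FROM(IN) TO(OUT) ON(1,10,CH) ON(53,1,CH) FIRST
/*
```

The de-duplicated output can then be split by indicator and date in a following copy step, as in Case ii of the question.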

Sorry for the misinformation.


兮子 2024-12-06 03:59:52


The problem is as NealB identified.

The easiest thing to do is to drop the records you don't want, by date, before the SORT. The SORT will then take less time. This assumes that SORTOUT is not required. If it is, you have to keep your INCLUDE= on the OUTFILs.

SELECT is a good option. SELECT uses OPTION EQUALS by default. The control cards below can be included in an xxxxCNTL data set and invoked from the SELECT with USING(xxxx). SELECT gives you greater flexibility than SUM (you can get the last record, amongst other things).

The whole task sounds flawed. If there are several records per account with different dates, I'd expect either the first date or the last date, or something else specific, to be required, not just whatever record happens to be left at the end of the SUM.

 OPTION EQUALS

 INCLUDE COND=(71,10,CH,GT,&DATE2(-))

 SORT FIELDS=(01,10,CH,A,53,01,CH,A)

 SUM FIELDS=NONE

 OUTFIL FILES=01,                                             
      INCLUDE=(53,01,CH,EQ,C'Y')

 OUTFIL FILES=02,                                             
      INCLUDE=(53,01,CH,EQ,C'N')

Or, if the Y/N cover all records: