Sort by most recent date and cluster (group) similar titles
I'm looking for a LINQ query that sorts on a date field but also keeps similar titles grouped and sorted. Consider something like the following desired ordering:
Title Date
"Some Title 1/3" 2009/1/3 "note1: even though this one is old, title 3/3 causes this group to be 1st"
"Some Title 2/3" 2011/1/31 "note2: dates may not be in sequence with titles"
"Some Title 3/3" 2011/1/1 "note3: this date is the most recent between 'groups' of titles"
"Title XYZ 1of2" 2010/2/1
"Title XYz 2of2" 2010/2/21
I've shown titles varying by some suffix. What if a poster used something like the following for titles?
"1 LINQ Tutorial"
"2 LINQ Tutorial"
"3 LINQ Tutorial"
How would the query recognize these are similar titles?
You don't have to solve everything; a solution for the 1st example would be much appreciated.
Thank you.
Addendum #1 20110605
@svick Also, title authors typically don't think to use, say, 2 digits when their numbering scheme goes beyond 9, for example 01, 02 ... 10, 11, etc.
Typical patterns I've seen tend to be a prefix or a suffix, or even buried inside the title, such as:
1/10 1-10 ...
(1/10) (2/10) ...
1 of 10 2 of 10
Part 1 Part 2 ...
You pointed out a valid pattern as well:
xxxx Tutorial : first session, xxxx Tutorial : second session, ....
If I have a Levenshtein function StringDistance( s1, s2 ), how would I fit it into the LINQ query? :)
Comments (3)
Normal grouping in LINQ (and in SQL, but that's not relevant here) works by selecting some key for every element in the collection. You don't have such a key, so I wouldn't use LINQ for the grouping, but two nested foreach loops.
The loops gradually create a list of groups. Each book is compared with the first one in each group. If it matches, it is added to that group. If no group matches, the book starts a new group. In the end, we sort the results using LINQ with dot notation.
It would be more correct if books were compared with every book in a group, not just the first. But you may not get completely correct results anyway, so I think this optimization is worth it.
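The grouping step described above could be sketched roughly as follows. This is not the original snippet: the `Book` type, the `BookGrouper` class and the `AreSimilar` test (titles are equal once digits are removed, ignoring case) are assumptions made for illustration, and any other similarity test, such as a string-distance threshold, could be swapped in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical item type; the original post does not show its definition.
public class Book
{
    public string Title { get; set; }
    public DateTime Date { get; set; }
}

public static class BookGrouper
{
    // Assumed similarity test: equal after removing digits, ignoring case.
    public static bool AreSimilar(string a, string b)
    {
        string Strip(string s) => new string(s.Where(c => !char.IsDigit(c)).ToArray());
        return string.Equals(Strip(a), Strip(b), StringComparison.OrdinalIgnoreCase);
    }

    public static List<List<Book>> Group(IEnumerable<Book> books)
    {
        var groups = new List<List<Book>>();
        foreach (var book in books)
        {
            bool placed = false;
            foreach (var group in groups)
            {
                // Compare only with the first book of each existing group.
                if (AreSimilar(group[0].Title, book.Title))
                {
                    group.Add(book);
                    placed = true;
                    break;
                }
            }
            // No group matched, so this book starts a new one.
            if (!placed)
                groups.Add(new List<Book> { book });
        }
        return groups;
    }
}
```

With the titles from the question, "Some Title 1/3" and "Some Title 2/3" land in one group, and "Title XYZ 1of2" and "Title XYz 2of2" in another.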
This approach has time complexity O(N²), so it's probably not the best solution if you have millions of books.
EDIT: To sort the groups, use something like:
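The snippet that followed did not survive on this page. A sketch of one way to do it, assuming the groups are non-empty lists of (Title, Date) tuples and using the hypothetical name `GroupSorter`: order the groups by their most recent date, newest first, then order the titles inside each group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupSorter
{
    // Assumes every group contains at least one item (Max throws on an empty group).
    public static List<List<(string Title, DateTime Date)>> Sort(
        IEnumerable<IEnumerable<(string Title, DateTime Date)>> groups)
    {
        return groups
            // Group with the most recent date comes first.
            .OrderByDescending(g => g.Max(b => b.Date))
            // Inside each group, order by title.
            .Select(g => g.OrderBy(b => b.Title, StringComparer.OrdinalIgnoreCase).ToList())
            .ToList();
    }
}
```

Note that a plain string comparison puts "Title 10/12" before "Title 2/12", so titles without zero-padded numbers would need a natural-sort comparer instead.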
For ordering by date you should use the OrderBy operator.
Example:
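The example itself is missing from this page. A minimal sketch of what such a call might look like, using (Title, Date) tuples and the hypothetical class name `DateOrdering`; `OrderByDescending` is the variant to use when the most recent date should come first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DateOrdering
{
    // Oldest date first.
    public static List<(string Title, DateTime Date)> OldestFirst(
        IEnumerable<(string Title, DateTime Date)> items) =>
        items.OrderBy(i => i.Date).ToList();

    // Most recent date first.
    public static List<(string Title, DateTime Date)> NewestFirst(
        IEnumerable<(string Title, DateTime Date)> items) =>
        items.OrderByDescending(i => i.Date).ToList();
}
```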
For grouping strings by similarity you should consider something like the Hamming distance or the Metaphone algorithm. (Although I do not know of any direct implementations of these in .NET.)
EDIT: As suggested in the comment by svick, the Levenshtein distance may also be considered, as a better alternative to the Hamming distance.
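Since the Hamming distance is only defined for strings of equal length, the Levenshtein distance is the easier fit for titles. A sketch of a textbook two-row implementation follows; the method name `StringDistance` is taken from the question, while the class name `TextDistance` is made up here.

```csharp
using System;

public static class TextDistance
{
    // Levenshtein distance: minimum number of single-character
    // insertions, deletions and substitutions turning s1 into s2.
    public static int StringDistance(string s1, string s2)
    {
        var previous = new int[s2.Length + 1];
        var current = new int[s2.Length + 1];

        // Distance from the empty prefix of s1 to each prefix of s2.
        for (int j = 0; j <= s2.Length; j++) previous[j] = j;

        for (int i = 1; i <= s1.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= s2.Length; j++)
            {
                int cost = s1[i - 1] == s2[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            var tmp = previous; previous = current; current = tmp;
        }
        return previous[s2.Length];
    }
}
```

Two titles could then be treated as similar when `StringDistance(a, b)` is at most some small threshold; the right threshold depends on the data and is left open here.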
Assuming that your Title and Date fields are contained in a class called Model, consider the following class definition:
public class Model
Alongside the Date and Title properties, I have created a Prefix property with no setter; it returns the common prefix using Substring. You can use any method of your choice in the getter of this property. The rest of the job is simple. Consider this LINQPad program:
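Neither the class body nor the LINQPad program survived on this page. The following is only a sketch of what such a class and grouping might look like; the prefix length of 8 and the `PrefixGrouping` helper are assumptions, not the answerer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Model
{
    // Assumed prefix length; the value used in the original answer is unknown.
    private const int PrefixLength = 8;

    public string Title { get; set; }
    public DateTime Date { get; set; }

    // Read-only grouping key: the first PrefixLength characters of the title.
    public string Prefix =>
        Title.Length <= PrefixLength ? Title : Title.Substring(0, PrefixLength);
}

public static class PrefixGrouping
{
    // Groups items that share the same Prefix value.
    public static Dictionary<string, List<Model>> ByPrefix(IEnumerable<Model> items) =>
        items.GroupBy(m => m.Prefix)
             .ToDictionary(g => g.Key, g => g.ToList());
}
```

A fixed-length prefix handles suffix numbering such as "Some Title 1/3", but not titles numbered at the front such as "1 LINQ Tutorial"; a different getter would be needed for those.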
Edits >>>
If we put the prefix aside, the query itself does not return what I was after, which is: 1) sort the groups by their most recent date; 2) sort by title within clusters. Try the following:
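The query that followed is not preserved here. One possible shape for it, as a sketch: group by a caller-supplied key, order the clusters by their most recent date, then flatten them with the titles ordered inside each cluster. The names `ClusterQuery` and `keyOf` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ClusterQuery
{
    // keyOf maps a title to its cluster key (for example, a prefix).
    public static List<(string Title, DateTime Date)> Order(
        IEnumerable<(string Title, DateTime Date)> items,
        Func<string, string> keyOf)
    {
        var query =
            from item in items
            group item by keyOf(item.Title) into cluster
            // 1) clusters with the most recent date first
            orderby cluster.Max(i => i.Date) descending
            // 2) titles in order within each cluster
            from member in cluster.OrderBy(i => i.Title, StringComparer.OrdinalIgnoreCase)
            select member;
        return query.ToList();
    }
}
```

Called with the five rows from the question and a key such as the first 8 characters of the title, this yields the "Some Title" rows first (their newest date is 2011/1/31), followed by the "Title XYZ" rows.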