LINQ to DataSet 中的 GroupBy
背景:
我正在将数据从 MySQL 数据库导入到 SQL-Server 数据库(用于报告,随后用于 SSAS-Cube)。我想同时对数据进行标准化。我想将重复的 Ticket_IDs
与其他有用信息一起分组到表 Contact
中的一条记录,并将原始数据保留在子表 ContactDetail
中(使用联系人的外键)。因此,Contact
中的每条记录都有一个唯一的 Ticket_ID
。
我决定使用强类型数据集进行导入。现在我想知道检测我是否已添加 Ticket_ID 的最佳方法是什么。我可以在每个循环中检查它(~100000 条记录),但我假设有更好/更快的方法。
简化的示例数据:
Ticket_ID ID fiContact
89442226 1 1
89442226 2 1
89442226 3 1
89442261 4 2
89442261 5 2
89442354 6 3
89442359 7 4
89442359 8 4
89442367 9 5
89442504 10 6
这应该是 Contact
-table
Ticket_ID idContact
89442226 1
89442261 2
89442354 3
89442359 4
89442367 5
89442504 6
问题:
是否可以使用 LINQ/LINQ-to-DataSet 按 Ticket_ID 进行分组并获取 ContactDetailRows 列表每个联系人行?我知道有一个 GroupBy-扩展,但我不确定如何使用以及它是否满足我的需要(保留 ContactDetail-Rows,就像以 Ticket_ID 作为键和 List(of EmailRow)
作为字典 价值)。
这就是我所拥有的(简化的):
For Each srcEmail In src.email 'i want to group src.email by Ticket_ID'
'so far i check for existence in every loop'
Dim res = From c In dest.Contact
Where c.Ticket_ID = srcEmail.ticket_id
If Not res.Any Then
'create new Contact
Dim newContact = Me.dest.Contact.NewContactRow
newContact.Ticket_ID = srcEmail.ticket_id
' ..... '
dest.Contact.AddContactRow(newContact)
End If
'TODO: create ContactDetail row and add it to the DataTable '
Next
src
: typed DataSet(MySQL)src.email
: typed DataTable =>进入ContactDetail
dest
:类型化 DataSet(SQL-Server)dest.Contact
类型化 DataTabledest.ContactDetail
类型化 DataTable fk toContact
我更喜欢 VB.NET,因为我对 LINQ 还不太熟悉,而且 C# 中的语法有很大不同。
编辑:
感谢@Magnus,我按照以下方式进行:
Dim emailsPerTicketID = src.email.ToLookup(Function(email) email.ticket_id)
For Each ticket In emailsPerTicketID
'create new Contact
Dim newContact = Me.dest.Contact.NewContactRow
newContact.Ticket_ID = ticket.Key
newContact.CreatedAt = ticket.First().modified_time
' ...... '
dest.Contact.AddContactRow(newContact)
'TODO: add now all EmailRows'
For Each emailRow In ticket
Dim newContactDetail = dest.ContactDetail.NewContactDetailRow
newContactDetail.ContactRow = newContact
newContactDetail.Interaction = emailRow.interaction
' .... '
dest.ContactDetail.AddContactDetailRow(newContactDetail)
Next
Next
我将看看这是否比使用 HashSet 检测联系人是否已创建的迭代方法更快。
Background:
I'm importing data from a MySQL database into a SQL-Server database(for reports and later a SSAS-Cube). I want to normalize the data at the same time. I want to group repeating Ticket_IDs
to one record in a table Contact
with other useful informations and leave the rawdata in the sub-table ContactDetail
(with foreignkey to Contact). Hence every record in Contact
has a unique Ticket_ID
.
I've decided to use strong typed datasets for the import. Now i'm wondering what's the best way to detect if i've already added a Ticket_ID. I could check for it in every loop(~100000 records) but i'm assuming that there is a better/faster way.
Simplified sample-data:
Ticket_ID ID fiContact
89442226 1 1
89442226 2 1
89442226 3 1
89442261 4 2
89442261 5 2
89442354 6 3
89442359 7 4
89442359 8 4
89442367 9 5
89442504 10 6
This should be the Contact
-table
Ticket_ID idContact
89442226 1
89442261 2
89442354 3
89442359 4
89442367 5
89442504 6
Question:
Is it possible with LINQ/LINQ-to-DataSet to group by Ticket_ID and get a list of ContactDetailRows for every ContactRow? I know there is a GroupBy-Extension, but i'm unsure how to use and if it does what i need(keep the ContactDetail-Rows, f.e. like a dicitonary with Ticket_ID as key and a List(of EmailRow)
as value).
This is what i have(simplified):
For Each srcEmail In src.email 'i want to group src.email by Ticket_ID'
'so far i check for existence in every loop'
Dim res = From c In dest.Contact
Where c.Ticket_ID = srcEmail.ticket_id
If Not res.Any Then
'create new Contact
Dim newContact = Me.dest.Contact.NewContactRow
newContact.Ticket_ID = srcEmail.ticket_id
' ..... '
dest.Contact.AddContactRow(newContact)
End If
'TODO: create ContactDetail row and add it to the DataTable '
Next
src
: typed DataSet(MySQL)src.email
: typed DataTable => intoContactDetail
dest
: typed DataSet(SQL-Server)dest.Contact
typed DataTabledest.ContactDetail
typed DataTable with fk toContact
I would prefer VB.NET because i'm yet not as familiar with LINQ and the Syntax is quite different in C#.
Edit:
Thanks to @Magnus i've get it going in the following way:
Dim emailsPerTicketID = src.email.ToLookup(Function(email) email.ticket_id)
For Each ticket In emailsPerTicketID
'create new Contact
Dim newContact = Me.dest.Contact.NewContactRow
newContact.Ticket_ID = ticket.Key
newContact.CreatedAt = ticket.First().modified_time
' ...... '
dest.Contact.AddContactRow(newContact)
'TODO: add now all EmailRows'
For Each emailRow In ticket
Dim newContactDetail = dest.ContactDetail.NewContactDetailRow
newContactDetail.ContactRow = newContact
newContactDetail.Interaction = emailRow.interaction
' .... '
dest.ContactDetail.AddContactDetailRow(newContactDetail)
Next
Next
I will have a look if this is faster than the iterating approach with a HashSet to detect if the contact was already created.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我认为使用 Lookup (类似于字典,但用 key/Collection 代替)对你来说是一个很好的解决方案。像这样:
如果您需要任何帮助将任何语法翻译为 VB,请评论我。
I think using Lookup(like a dictionary but with key/Collection instead) would be a good solution for you. something like this:
If you need any help translating any of the syntax to VB comment me.
我的 VB 已经生锈了,但这里有一个尝试:
或者,如果您想每次通过循环检查它,您可以使用 HashSet,每次只需将票证 ID 添加到哈希集中,然后通过
Contains
方法检查其是否存在。这会比你正在做的更快,但我怀疑 LINQ 分组会比 HashSet 更快。My VB is rusty, but here's a shot at it:
Alternatively, if you want to check it every time through the loop, you could use a HashSet, just adding the ticket ID to the hashset each time through and then checking for its presence via the
Contains
method. That would be faster than what you're doing, but I suspect the LINQ grouping will be faster than the HashSet.