How to process a list of sets with multiple elements in each set
When I scrape each website for all of its email addresses and output the result, I get the following data frame, in which each website has a list of sets and each set can hold multiple elements:
URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': [{'[email protected]', '[email protected]'}, set(),{'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}]},
{'main_url': 'http://kirsebaergaarden.com', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}]},
{'main_url': 'http://koglernes.dk', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'},set(), set(), {'[email protected]'}, {'[email protected]'}]},
{'main_url': 'http://kongehojensbornehave.dk', 'emails': [set()]}
])
However, I want to process the data frame to look like the following:
URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]','[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']},
{'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]']},
{'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
{'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
])
How can it be achieved?
I have tried the following code, but it only manages to return the first element of the first set, and it raises an error when the email list for a given website has no elements:
URL_WITH_EMAILS_DF['emails'] = [', '.join(x.pop()) if not None else "" for x in URL_WITH_EMAILS_DF['emails'].values]
PS: As the first data frame shows, I needed to collect the emails as sets because a single website can have multiple web pages, and I do not want to take duplicate emails from each web page.
chain.from_iterable (from the itertools module) can solve this problem.
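A minimal sketch of that idea, assuming each cell of the emails column holds a list of sets as in the question. The helper name flatten_unique and the sample addresses are hypothetical, and the result is sorted only to make the order deterministic, since sets have no fixed iteration order:

```python
from itertools import chain


def flatten_unique(list_of_sets):
    """Merge a list of sets into one sorted list without duplicates.

    flatten_unique is a hypothetical helper name, not part of any library.
    """
    # chain.from_iterable walks every element of every set in turn;
    # wrapping the result in set() drops duplicates across pages.
    return sorted(set(chain.from_iterable(list_of_sets)))


# Hypothetical sample data standing in for one website's scraped pages.
pages = [{'b@site.example', 'a@site.example'}, set(), {'a@site.example'}]
result = flatten_unique(pages)

# A website whose only page yielded no emails gives an empty list.
empty_result = flatten_unique([set()])
```

With the data frame from the question, the helper could presumably be applied per row with URL_WITH_EMAILS_DF['emails'] = URL_WITH_EMAILS_DF['emails'].apply(flatten_unique). Empty sets contribute nothing, so a website with no emails ends up with an empty list rather than an error.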