Getting emails that contain specific text

Published 2025-02-11 06:42:03 · 1,931 characters · 0 views · 0 comments

I'm creating a script that lists all the businesses from one website; it needs to scrape the name, address, website, email, and telephone number. I've got to the point where I can more or less scrape emails, but I have a small problem: I can't just tell my script to take all of them. The addresses I want are specific and need to contain "Biuro", "Sekretariat", or the name part of the site's domain (www.(namePart).com), and I don't quite know how to do that. Here is my code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess


class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = ['https://rynekpierwotny.pl/deweloperzy/?page=1']

    def parse(self, response):
        websites = response.css('div#root')[0]
        # href of the "next page" pagination link, if any
        PAGETEST = response.xpath(
            '//a[contains(@class,"rp-173nt6g")]/../following-sibling::li'
        ).css('a::attr(href)').get()
        for website in websites.css('li.rp-np9kb1'):
            page = website.css('a::attr(href)').get()
            address = website.css('address.rp-o9b83y::text').get()
            name = website.css('h2.rp-69f2r4::text').get()
            params = {
                'address': address,
                'name': name,
                'href': page,
            }
            url = response.urljoin(page)
            yield Request(url=url, cb_kwargs={'params': params},
                          callback=self.parseMain)
        if PAGETEST:  # stop paginating when there is no next page
            yield Request(url=response.urljoin(PAGETEST), callback=self.parse)

    def parseMain(self, response, params=None):
        website = response.css('div.rp-l0pkv6 a::attr(href)').get()
        params['website'] = website
        yield Request(url=response.urljoin(website),
                      cb_kwargs={'params': params}, callback=self.parseEmail)

    def parseEmail(self, response, params=None):
        # mailto: links are the only hrefs containing "@"
        email = response.xpath('//a[contains(@href, "@")]/@href').get()
        params['email'] = email
        yield params


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(RynekMainSpider)
    process.start()

Thanks in advance for your help!
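The filtering requirement described in the question can be expressed as a small standalone predicate before wiring it into the spider. This is only a sketch; the function name `email_matches` and the default keyword list are illustrative assumptions, not part of the original script:

```python
from urllib.parse import urlsplit


def email_matches(href, page_url, keywords=('biuro', 'sekretariat')):
    """Return True if a mailto: href looks like an address worth keeping.

    An address qualifies when it contains one of the keywords or the
    "name part" of the site's domain (www.<namePart>.com -> namePart).
    Matching is case-insensitive.
    """
    if not href:
        return False
    email = href.lower()
    if email.startswith('mailto:'):
        email = email[len('mailto:'):]
    # Domain labels minus the TLD, e.g. 'www.example.com' -> ['example']
    labels = [l for l in urlsplit(page_url).netloc.lower().split('.')[:-1]
              if l != 'www']
    return any(k in email for k in keywords) or any(l in email for l in labels)
```

In `parseEmail`, the yielded item could then include the email only when `email_matches(email, response.url)` is true.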



Comments (1)

孤独陪着我 2025-02-18 06:42:03

In your parseEmail method, after extracting the email address, just check the extracted string the way you would check any string.

For example:

from urllib.parse import urlsplit

def parseEmail(self, response, params=None):
    email = response.xpath('//a[contains(@href, "@")]/@href').get()
    netloc = urlsplit(response.url).netloc
    if email:  # guard: .get() returns None when no link matches
        if 'Biuro' in email or 'Sekretariat' in email:
            params['email'] = email
        elif any(i in email for i in netloc.split('.')[:-1] if i != 'www'):
            params['email'] = email
    yield params
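One caveat worth noting: `mailto:` hrefs are usually lower-case, while the check above compares against the capitalised strings 'Biuro' and 'Sekretariat', so lower-casing before matching may be necessary. The `netloc.split('.')[:-1]` expression produces the domain labels minus the TLD; a quick standard-library illustration (the example URL is made up):

```python
from urllib.parse import urlsplit

# Extract the "name part" labels of a domain, as the answer's filter does.
netloc = urlsplit('https://www.example-dev.com/kontakt').netloc
name_parts = [p for p in netloc.split('.')[:-1] if p != 'www']
print(netloc)      # www.example-dev.com
print(name_parts)  # ['example-dev']
```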
