How to collect data from multiple URLs into a single item with Scrapy (Python)

Posted on 2025-01-30 23:02:29

In simpler terms, I would like to grab the return value from the callback function on each iteration until the for loop is exhausted, and then yield a single item after that.

What I am trying to do is the following:
I am creating new links, each of which represents a click on a tab on https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/,
such as

  1. https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/?r=2#ah;2

  2. https://www.oddsportal.com/soccer/africa/caf-champions-league/al-ahly-es-setif-GKW6I7T6/?r=2#over-under;2
    and so on.
    They are basically all data for the same match, so I am trying to collect the betting info into one single item.

Basically, I am using a for loop over a dict to create each new link and yielding a request with a callback function.

from collections import OrderedDict
import re
import time
import urllib.parse

import fake_useragent
import requests
import scrapy


class CountryLinksSpider(scrapy.Spider):
    name = 'country_links'
    allowed_domains = ['oddsportal.com']
    start_urls = ['https://www.oddsportal.com/soccer/africa/caf-champions-league/es-setif-al-ahly-AsWAHRrD/']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.create_all_tabs_links_from_url)

    def create_all_tabs_links_from_url(self, response):
        current_url = response.request.url
        _other_useful_scrape_data_dict = OrderedDict(
            [('time', '19:00'), ('day', '14'), ('month', 'May'), ('year', '22'), ('Country', 'Africa'),
             ('League', 'CAF Champions'), ('Home', 'ES Setif'), ('Away', 'Al Ahly'), ('FT1', '2'), ('FT2', '2'),
             ('FT', 'FT'), ('1H H', '1'), ('1H A', '1'), ('1HHA', 'D'), ('2H H', '1'), ('2H A', 1), ('2HHA', 'D')])

        with requests.Session() as s:
            s.headers = {
                "accept": "*/*",
                "accept-encoding": "gzip, deflate, br",
                "accept-language": "en-US,en;q=0.9,pl;q=0.8",
                "referer": 'https://www.oddsportal.com',
                "user-agent": fake_useragent.UserAgent().random
            }
            r = s.get(current_url)
            version_id = re.search(r'"versionId":(\d+)', r.text).group(1)
            sport_id = re.search(r'"sportId":(\d+)', r.text).group(1)
            xeid = re.search(r'"id":"(.*?)"', r.text).group(1)

            xhash = urllib.parse.unquote(re.search(r'"xhash":"(.*?)"', r.text).group(1))

        unix = int(round(time.time() * 1000))

        tabs_dict = {'#ah;2': ['5-2', 'AH full time', ['1', '2']], '#ah;3': ['5-3', 'AH 1st half', ['1', '2']],
                     '#ah;4': ['5-4', 'AH 2nd half', ['1', '2']], '#dnb;2': ['6-2', 'DNB full_time', ['1', '2']]}
        all_tabs_data = OrderedDict()
        all_tabs_data = all_tabs_data | _other_useful_scrape_data_dict

        for key, value in tabs_dict.items():
            api_url = f'https://fb.oddsportal.com/feed/match/{version_id}-{sport_id}-{xeid}-{value[0]}-{xhash}.dat?_={unix}'

            # go to each main tab, get its data, and yield it here
            single_tab_scrape_data = yield scrapy.http.Request(
                api_url, callback=self.scrape_single_tab)
        # what I want to do next: collect the data from all tabs into a single item
        # all_tabs_data = all_tabs_data | single_tab_scrape_data  # (as a dict)

    # yield all_tabs_data  # yield one dict with the scraped data from all the tabs

    def scrape_single_tab(self, response):
        # sample scraped data from the response
        scrape_dict = OrderedDict([('AH full time -0.25 closing 2', 1.59), ('AH full time -0.25 closing 1', 2.3),
                                   ('AH full time -0.25 opening 2', 1.69), ('AH full time -0.25 opening 1', 2.12),
                                   ('AH full time -0.50 opening 1', ''), ('AH full time -0.50 opening 2', '')])

        yield scrape_dict

What I have tried:
First, I tried simply returning the scraped item from the scrape_match_data function, but I could not find a way to grab the return value of the callback function from the yielded request.

I have tried using the following libraries:
from inline_requests import inline_requests
from twisted.internet.defer import inlineCallbacks

But I cannot make them work. I feel like there must be a simpler way to append the scraped items from different links into one item and yield it.
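For reference, the way scrapy-inline-requests is normally wired up is to decorate the callback and receive each yielded request's response inline; a minimal sketch of that pattern applied to the tab loop might look like the following (tab_api_urls and parse_tab_response are illustrative placeholders, not part of my code above):

import scrapy
from inline_requests import inline_requests


class TabsInlineSpider(scrapy.Spider):
    name = 'tabs_inline'
    start_urls = ['https://www.oddsportal.com/soccer/africa/caf-champions-league/es-setif-al-ahly-AsWAHRrD/']

    @inline_requests
    def parse(self, response):
        all_tabs_data = {}
        tab_api_urls = []  # placeholder: built the same way as api_url above

        for api_url in tab_api_urls:
            # Inside an @inline_requests callback, yielding a Request returns
            # the Response right here instead of calling another callback.
            tab_response = yield scrapy.Request(api_url)
            all_tabs_data.update(self.parse_tab_response(tab_response))

        # A dict yielded from the decorated callback is emitted as one item.
        yield all_tabs_data

    def parse_tab_response(self, response):
        # placeholder: extract this tab's odds and return them as a dict
        return {}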

Please help me to solve this issue.

Comments (1)

东京女 2025-02-06 23:02:29

Technically, in Scrapy we have two approaches for transferring data between the callback functions we use to construct items from multiple requests:

1. Request meta dictionary:

def parse(self, response):
    ...
    yield Request(
        url,
        callback=self.parse_details,
        meta={'scraped_item_data': data})

def parse_details(self, response):
    scraped_data = response.meta.get('scraped_item_data')  # <- not present in your code
    ...

You probably missed calling response.meta.get('_scrape_dict') to access the data scraped in the previous callback function.
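Concretely, for the loop in the question this would mean attaching the partially built dict to each tab request and reading it back inside the tab callback; a minimal sketch of that wiring, reusing the question's all_tabs_data and scrape_single_tab names, could look like this:

# inside create_all_tabs_links_from_url, for each tab
yield scrapy.Request(
    api_url,
    callback=self.scrape_single_tab,
    meta={'scraped_item_data': dict(all_tabs_data)})

def scrape_single_tab(self, response):
    # read back what the previous callback attached to the request
    all_tabs_data = response.meta.get('scraped_item_data', {})
    # ... merge this tab's odds into all_tabs_data before yielding it ...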

2. cb_kwargs, available in Scrapy 1.7 and newer:

def parse(self, response):
    ...
    yield Request(
        url,
        callback=self.parse_details,
        cb_kwargs={'scraped_item_data': data})

def parse_details(self, response, scraped_item_data):  # <- already accessible data from previous request
    ...
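As a fuller illustration of this second approach, the tab requests can be chained one after another via cb_kwargs, carrying the partially merged dict along and yielding a single item only when the last tab has been scraped. This is only a sketch under the assumption that the tab API URLs are built the same way as api_url in the question; parse_tab, extract_tab_dict, match_data and pending_urls are illustrative names:

import scrapy


class TabsChainSpider(scrapy.Spider):
    name = 'tabs_chain'
    start_urls = ['https://www.oddsportal.com/soccer/africa/caf-champions-league/es-setif-al-ahly-AsWAHRrD/']

    def parse(self, response):
        match_data = {}   # the "_other_useful_scrape_data_dict" from the question
        tab_urls = []     # placeholder: the tab API URLs built as in the question

        if tab_urls:
            yield scrapy.Request(
                tab_urls[0],
                callback=self.parse_tab,
                cb_kwargs={'match_data': match_data,
                           'pending_urls': tab_urls[1:]})

    def parse_tab(self, response, match_data, pending_urls):
        # merge this tab's data into the running dict for the match
        match_data = {**match_data, **self.extract_tab_dict(response)}

        if pending_urls:
            # chain the next tab request, passing the accumulated data along
            yield scrapy.Request(
                pending_urls[0],
                callback=self.parse_tab,
                cb_kwargs={'match_data': match_data,
                           'pending_urls': pending_urls[1:]})
        else:
            # last tab reached: emit one merged item for the whole match
            yield match_data

    def extract_tab_dict(self, response):
        # placeholder for per-tab parsing (like scrape_single_tab in the question)
        return {}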

3. A single item from multiple responses of the same type.
The easiest way to implement this is to assign the data to a class variable.
The code will look like this:

def parse(self, response):
    self.tabs_data = []
    ...
    self.tabs_number = len(tabs)  # or len(list(tabs))  <- number of tabs
    for tab in tabs:
        yield Request(...

def parse_details(self, response):
    scraped_tab_data = ...
    self.tabs_data.append(scraped_tab_data)
    if len(self.tabs_data) == self.tabs_number: # when data from all tabs collected
        # compose one big item
        ...
        yield item
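One thing to keep in mind with this third approach: self.tabs_data and self.tabs_number live on the spider instance and are shared by every request, so if more than one match page were crawled in the same run, the tab data of different matches would end up appended to the same list. It works best when a spider run handles a single match; otherwise keying the buffer by a match identifier, or falling back to meta/cb_kwargs as above, keeps the items separate.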
