How do I keep the HTTP header, API params, and request object [for a web scraper] separate but easily accessible and mutable, in a Pythonic way?

Posted on 2025-02-01 03:37:22 · 4644 characters · 4 views · 0 comments

I would like to write a scraper that has 3 different "groups" of attributes (or data) that would [and likely should] be kept separately.

I was hoping to use DataClasses and aim at Pythonic practices, but DataClasses don't feel appropriate for reasons stated in more detail later.

The 3 groups [or "interfaces"] are as follows:

#1: HTTP Header fields

  • has defaults, but needs to be mutable at/after object instantiation of the (#3) request class object
  • ideally acts like a dict when using a request method inside #3 request object

#2: API parameters for the URL query request

  • has defaults, but also needs to be mutable at/after instantiation
  • ideally acts like a dict when using a request method inside #3 request object

#3: Response object (the data) returned to the user once the API server has answered the request.

  • I would later implement methods for the object to have output formats such as CSV, JSON, SQL DB, S3, etc. That would be [at least] a 4th interface.
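A minimal sketch of what such an output interface could look like, using only the standard library. The class name `ResponseData`, its methods, and the sample column names and rows are hypothetical; the question does not show the shape of the API's JSON, so this assumes the data has already been reduced to column names plus rows.

```python
import csv
import io
import json


class ResponseData:
    """Hypothetical wrapper around tabular data (column names plus rows)."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]

    def to_records(self):
        # One dict per row, keyed by column name.
        return [dict(zip(self.columns, row)) for row in self.rows]

    def to_json(self):
        return json.dumps(self.to_records())

    def to_csv(self, file=None):
        # Write to the given text file object, or return the CSV as a string.
        buffer = file if file is not None else io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(self.columns)
        writer.writerows(self.rows)
        return None if file is not None else buffer.getvalue()


# Made-up sample data, only to exercise the methods.
data = ResponseData(
    ["PERSON_ID", "DISPLAY_FIRST_LAST"],
    [[1, "Player One"], [2, "Player Two"]],
)
print(data.to_csv())
```

Keeping the output formats on a separate object like this means the request class never needs to know about CSV or SQL, which matches the idea of a 4th interface.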

The Task I've been trying to Accomplish

I want an interface where a user can instantiate a class, e.g. Players, with the API params they need and can also update the HTTP header (as needed).

Here's my current code (pic form):
[Image: Player Class Request Object]

The HTTP Header and the API Params are both easily stored as Python dicts (or JSON). I have included them below.

=> The question is how do I make them mutable in the Request object (Class) at instantiation (creation) and able to be updated after instantiation (creation)?

  • Inheritance via dataclasses? I have tried putting these dictionaries in dataclasses, but dataclasses reject a mutable default such as a dict, and the workaround is field(default_factory=...) from the dataclasses module. It's possible, but it defeats the point of using dataclasses to avoid all the extra syntax. Using dataclasses also means MyDataClass.__dict__ has way more stuff in it than PythonClass.__dict__.
    => Thus use a regular Python class or dict...

  • Using a regular Python class: There seem to be two options to allow mutability of the HTTP header at creation. 1) Inheritance, but that muddies the waters by mixing the attributes of the HTTP header with the API params. 2) Composition, setting an attribute field to the HTTPClassHeader and doing some work to be able to convert back to a dict to use in the request_data() method.

  • Putting the Dicts into the Players (Request Class) doesn't allow mutability via a nice keyword interface (or I'm not aware how to implement it).
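For reference, the dataclass option from the first bullet can be sketched as follows. This is only an illustration under assumptions: `PlayersParams`, `PlayersRequest`, and the shortened `DEFAULT_HEADERS` dict are hypothetical names, not part of the code in this question.

```python
from dataclasses import asdict, dataclass, field

# Shortened stand-in for the full header dict (assumption for brevity).
DEFAULT_HEADERS = {"Referer": "https://www.nba.com/", "x-nba-stats-origin": "stats"}


@dataclass
class PlayersParams:
    # Group #2: the API params, each with its default.
    IsOnlyCurrentSeason: int = 0
    LeagueID: str = "00"
    Season: str = "2021-22"


@dataclass
class PlayersRequest:
    params: PlayersParams = field(default_factory=PlayersParams)
    # A dict default has to go through default_factory so that every
    # instance gets its own copy instead of sharing one dict.
    headers: dict = field(default_factory=lambda: dict(DEFAULT_HEADERS))


req = PlayersRequest(params=PlayersParams(Season="1999-00"))
req.headers["Referer"] = "https://www.another-website.com/"

print(asdict(req.params))  # plain dict of the API params only
print(req.headers)
```

With the params in their own dataclass, `asdict()` returns just the three API params, so nothing else leaks into the query string. The header field names contain hyphens, which is one reason to keep them in a dict rather than as dataclass attributes.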

Here's my code in text form:

class Players:

    __endpoint__ = "CommonallPlayers"

    def __init__(self, IsOnlyCurrentSeason=0, LeagueID="00", Season="2021-22", header=HTTPHeader) -> None:
        # these first 3 attributes constitute the (#2) API Params
        self.IsOnlyCurrentSeason = IsOnlyCurrentSeason
        self.LeagueID = LeagueID
        self.Season = Season

        self.header = header # (1) inherit as a Class or Dict?



    def encode_api_params(self):
        return self.__dict__ # if only 3 attributes, this works, but not if I add more attributes HTTP or self.request_data

    def get_http_header(self):
        # ideally can return the http_header as a dict
        pass

    # ideally this is NOT instantiated (as doesn't have data, shouldn't be accessible to user until AFTER request)
    def request_data(self):
        url_api = f"{BASE_URL}/{self.__endpoint__}"
        return requests.get(url_api, 
                            params=self.encode_api_params(), 
                            headers=self.get_http_header())

# works, has current defaults (current season)
c = Players()

# a common use case, using a different Season than the default (current season)
c = Players(Season="1999-00")

# A possible needed change, with 2 possible desired interface
c = Players(Season="1999-00", header={"Referer": "https://www.another-website.com/"})
c = Players(Season="1999-00").header(Referer="https://www.another-website.com/")

# Final outputs
c.request_data().to_csv("downloads/my_data.csv")
c.request_data().to_sql("table-name")
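One way the composition option could look with a regular class is sketched below. It is an assumption-laden illustration, not a definitive design: `set_headers`, `set_params`, and the shortened `DEFAULT_HEADERS` are hypothetical names, and the network call is left out so the sketch runs on its own.

```python
# Shortened stand-in for the full header dict (assumption for brevity).
DEFAULT_HEADERS = {
    "Referer": "https://www.nba.com/",
    "User-Agent": "Mozilla/5.0",
    "x-nba-stats-origin": "stats",
}


class Players:
    endpoint = "CommonallPlayers"

    def __init__(self, IsOnlyCurrentSeason=0, LeagueID="00", Season="2021-22", headers=None):
        # Group #2: API params live in their own dict, apart from everything else.
        self.params = {
            "IsOnlyCurrentSeason": IsOnlyCurrentSeason,
            "LeagueID": LeagueID,
            "Season": Season,
        }
        # Group #1: start from a copy of the defaults, then apply overrides.
        self.headers = {**DEFAULT_HEADERS, **(headers or {})}

    def set_headers(self, **overrides):
        # Keyword names cannot contain hyphens, so "User_Agent" maps to "User-Agent".
        for name, value in overrides.items():
            self.headers[name.replace("_", "-")] = value
        return self  # returning self allows chaining after the constructor

    def set_params(self, **overrides):
        unknown = set(overrides) - set(self.params)
        if unknown:
            raise TypeError(f"unknown API params: {sorted(unknown)}")
        self.params.update(overrides)
        return self


# Both desired interfaces: override at creation, or chain an update afterwards.
a = Players(Season="1999-00", headers={"Referer": "https://www.another-website.com/"})
b = Players(Season="1999-00").set_headers(Referer="https://www.another-website.com/")
print(a.headers == b.headers)
```

Because `params` and `headers` are already plain dicts, a `request_data()` method could pass them to `requests.get(url, params=self.params, headers=self.headers)` without any conversion step.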


The HTTP header, the API params, and the request in their simplest form are as follows (running these together would return some data):

import requests

HTTP_HEADER = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Host": "stats.nba.com",
    "Origin": "https://www.nba.com",
    "Referer": "https://www.nba.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-site",
    "Sec-GPC": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    "x-nba-stats-origin": "stats",
    "x-nba-stats-token": "true",
}

params = {'IsOnlyCurrentSeason': 0, 'LeagueID': '00', 'Season': '2021-22'}


r = requests.get("https://stats.nba.com/stats/commonallplayers", # full endpoint url
                 params=params, # expects (#2) params, the api parameters to be a dict
                 headers=HTTP_HEADER) # expects (#1) headers to be a dict

r.json()
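To see roughly what the params dict turns into on the wire, the query string can be built by hand with the standard library. This is only an illustration of the encoding, not a claim about the internals of requests; the value of `BASE_URL` is an assumption inferred from the URL used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://stats.nba.com/stats"  # assumed from the URL used above
params = {"IsOnlyCurrentSeason": 0, "LeagueID": "00", "Season": "2021-22"}

# urlencode turns the dict into key=value pairs joined with "&".
url = f"{BASE_URL}/commonallplayers?{urlencode(params)}"
print(url)
```

This is why the API params need to end up as a flat dict: each key becomes one query parameter, so any extra attribute mixed into that dict would be sent to the server as well.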
