Is there a way to customize the Scrapy JsonLines exporter to not include null/default values?

Posted 2025-02-10 05:39:11

I'm building some web scrapers using Scrapy with Pydantic. We are currently using the JsonLines item exporter to output the data to a file. Here is an example of a JSON line created by the scraper.

{
  "timestamp": null, 
  "deposit_date": "2022-01-14", 
  "secondary_date": null, 
  "termination_date": "2024-01-12", 
  "tax_structure": "UNKNOWN", 
  "initial_pop": "10.00", 
  "initial_liq": null, 
  "term": "Y02", 
  "narrative_objective": "The trust seeks to provide ....", 
  "narrative_inv_strategy": "",
  "narrative_selection": "", 
  "narrative_risks": ""
}

The fields marked with null or an empty string are default values provided from the model when the scraper doesn't find the field/value on the page. The issue is that these default values override values input from other sources (manual input, for instance). I would like the output to not include these empty fields so they can be manually populated later.

Desired output:

{ 
  "deposit_date": "2022-01-14", 
  "termination_date": "2024-01-12", 
  "tax_structure": "UNKNOWN", 
  "initial_pop": "10.00", 
  "term": "Y02", 
  "narrative_objective": "The trust seeks to provide ...."
}

One possible solution is to change the model to only include fields that are scraped. I'd like to avoid doing this: I am building similar scrapers for 4 different sites and would like to avoid maintaining 4+ different models. Even pages on the same site include or omit fields depending on the product. The solution I'd like to implement is to customize the feed exporter to not include these "empty" fields so they can be manually populated later. I've read through Scrapy's docs on feed exports but would like a bit more detail on how to go about this.
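For what it's worth, Scrapy's intended hook for this is subclassing `JsonLinesItemExporter`, overriding `export_item()` to drop empty fields, and pointing the `jsonlines` format at the subclass via the `FEED_EXPORTERS` setting. The sketch below mirrors that `export_item()` logic in plain Python (no Scrapy dependency) so the filtering step is easy to see; the class name and the sample fields are illustrative only.

```python
import io
import json


class NonEmptyJsonLinesExporter:
    """Stand-in for a JsonLinesItemExporter subclass: export_item()
    drops None / empty-string fields before encoding the line."""

    def __init__(self, file):
        self.file = file

    def export_item(self, item):
        # In a real Scrapy subclass this dict would come from
        # self._get_serialized_fields(item).
        cleaned = {k: v for k, v in dict(item).items() if v not in (None, "")}
        self.file.write(json.dumps(cleaned) + "\n")


buf = io.StringIO()
exporter = NonEmptyJsonLinesExporter(buf)
exporter.export_item({
    "timestamp": None,
    "deposit_date": "2022-01-14",
    "narrative_risks": "",
})
print(buf.getvalue().strip())  # {"deposit_date": "2022-01-14"}
```

In an actual project the same filtering would live in a subclass of `scrapy.exporters.JsonLinesItemExporter`, registered with something like `FEED_EXPORTERS = {"jsonlines": "myproject.exporters.NonEmptyJsonLinesItemExporter"}` (the module path is an assumption). Since the models are Pydantic, another angle is serializing with `model_dump(exclude_none=True)` before yielding, though that alone won't drop empty strings.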

Any help would be appreciated.


Answered by 久光, 2025-02-17 05:39:11


Another possible solution would be to use an item pipeline. In your Scrapy project's pipelines.py file you can filter out any keys that hold empty values, and/or drop an item altogether if it has no populated fields.

pipelines.py

from scrapy.exceptions import DropItem

class SpidersPipeline:

    def process_item(self, item, spider):
        # Drop keys whose value is None or an empty string.
        new_item = {k: v for k, v in item.items() if v not in (None, "")}
        if not new_item:
            # Nothing useful was scraped; discard the item entirely.
            raise DropItem("item has no populated fields")
        return new_item

Then, in your settings.py file, enable the pipeline by uncommenting (or adding) the ITEM_PIPELINES setting.
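For reference, a minimal settings.py entry might look like this (the `myproject` module path is an assumption; adjust it to your project's name):

```python
# settings.py
ITEM_PIPELINES = {
    # Lower numbers run earlier; 300 is the conventional default slot.
    "myproject.pipelines.SpidersPipeline": 300,
}
```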
