spaCy model en_core_web_lg: how to prevent the package from downloading every time the code runs

Posted on 2025-02-01 19:42:24


I am using spaCy and its model en_core_web_lg to perform summarisation in Python. The code runs perfectly and there are no errors at all. However, I am trying to find a way to make sure that en_core_web_lg doesn't keep downloading into an environment that already has it. I have googled a lot to find a solution (I will list what I tried below), but none of it achieves what I am after.
This code will be packaged and used by multiple people, and I want to make sure that en_core_web_lg isn't downloaded again every time they run the code if it already exists. Below is the spaCy excerpt of my code, followed by the solutions I tried:

# Importing necessary libraries
from heapq import nlargest
from string import punctuation
import nltk
import spacy
from spacy.cli.download import download
from spacy.lang.en.stop_words import STOP_WORDS

nltk.download('punkt')
download(model="en_core_web_lg")
nlp_g = spacy.load('en_core_web_lg')  # downloads every time the code is run, even if the model is already present in the environment

def spacy_summarize(text):
    """
    Returns the summary for an input string text

            Parameters:
                :param text: Input String
                :type text: str

            Returns:
                :return: The summary for the input text
                :rtype: String

    """
    nlp = nlp_g
    doc = nlp(text)
    # count how often each non-stop-word, non-punctuation token appears
    word_frequencies = {}
    for word in doc:
        if word.text.lower() not in STOP_WORDS and word.text not in punctuation:
            key = word.text.lower()
            word_frequencies[key] = word_frequencies.get(key, 0) + 1
    # normalise the counts by the most frequent token
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / max_frequency
    sentence_tokens = [sent for sent in doc.sents]
    sentence_scores = {}
    spacy_frequencies(word_frequencies, sentence_tokens, sentence_scores)
    select_length = max(1, int(len(sentence_tokens) * 0.05))
    summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
    final_summary = [sent.text for sent in summary]
    summary = ' '.join(final_summary)
    return summary

def spacy_frequencies(word_frequencies, sentence_tokens, sentence_scores):
    """
    Child function of the spacy function, for calculating sentence scores
            Parameters:
                    :param: word frequencies, sentence tokens and scores,
                        which are provided through the parent function

    """
    # score each sentence as the sum of its tokens' normalised frequencies
    for sent in sentence_tokens:
        for word in sent:
            if word.text.lower() in word_frequencies:
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]



Things Tried:

import sys
import subprocess
import pkg_resources

required = {'en_core_web_lg'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
  python = sys.executable
  subprocess.check_call([python, '-m', 'spacy', 'download', *missing], stdout=subprocess.DEVNULL)

try:
  nlp_lg = spacy.load("en_core_web_lg")
except ModuleNotFoundError:
  download(model="en_core_web_lg")
  nlp_lg = spacy.load("en_core_web_lg")


Neither solution gave a satisfactory result and the package was downloaded again. I would appreciate it if someone could help me with this.
Thank you so much!

Comments (1)

暖风昔人 2025-02-08 19:42:24


spaCy doesn't automatically download models at all, so this must be a bug in the part of your code that checks whether the model is already installed.

Looking at this code:

try:
  nlp_lg = spacy.load("en_core_web_lg")
except ModuleNotFoundError:
  download(model="en_core_web_lg")
  nlp_lg = spacy.load("en_core_web_lg")

The issue is that if the model is not installed, spacy.load raises an OSError, not a ModuleNotFoundError. First you need to fix that.
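
For reference, a minimal corrected sketch of that snippet (reusing spacy and download as imported in the question) would catch OSError instead:

try:
    nlp_lg = spacy.load("en_core_web_lg")
except OSError:  # spacy.load raises OSError when the model package is missing
    download(model="en_core_web_lg")
    nlp_lg = spacy.load("en_core_web_lg")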

This approach seems like it should work, except that loading models in the same process you installed them in doesn't work very reliably - the list of installed packages is not updated while Python is running. So even after fixing the above issue, it may not work as intended.
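
If you do need to install and load in the same run anyway, one mitigation that may help (this is an assumption on my part, since spacy.load imports the model as a regular Python package, and it is not guaranteed given the caveat above) is invalidating the import system's caches after the download:

import importlib

import spacy
from spacy.cli.download import download

download(model="en_core_web_lg")
importlib.invalidate_caches()  # assumption: lets the import system notice the newly installed package
nlp_lg = spacy.load("en_core_web_lg")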

I would recommend either:

  1. Download the model to a known directory, extract it there, and load it from that path instead of by model name
  2. Check the output of pip list to see if the model is installed, and install it if not (a sketch of this idea follows the list)
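
Here is a minimal sketch of option 2. Instead of parsing pip list output, it checks importlib.util.find_spec, since an installed spaCy model is a regular importable package; note that pip records the distribution under the hyphenated name en-core-web-lg, which is presumably why the pkg_resources check in the question never matched:

import importlib
import importlib.util
import subprocess
import sys

import spacy

MODEL = "en_core_web_lg"

# An installed spaCy model is an importable package, so a missing spec
# means the model is not installed in this environment yet.
if importlib.util.find_spec(MODEL) is None:
    subprocess.check_call([sys.executable, "-m", "spacy", "download", MODEL])
    importlib.invalidate_caches()  # as noted above, loading right after installing in the same run can be flaky

nlp_lg = spacy.load(MODEL)

Option 1 avoids the installed-package lookup entirely: pass the directory you extracted the model into to spacy.load() instead of the package name.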