spaCy model en_core_web_lg: how to prevent the package from downloading every time the code runs

Posted on 2025-02-01 19:42:24


I am using spaCy and its model en_core_web_lg to perform summarisation in Python. The code runs perfectly and there are no errors at all. However, I am trying to find a way to make sure that en_core_web_lg doesn't keep downloading into an environment that already has it. I have googled a lot to find a solution (I will list what I tried below), but none of it achieves what I am after.
This code will be packaged and used by multiple people, and I want to make sure that en_core_web_lg isn't downloaded again every time they run the code if it already exists. Below is the spaCy excerpt of my code, followed by the solutions I tried:

# Importing necessary libraries
from heapq import nlargest
from string import punctuation
import nltk
import spacy
from spacy.cli.download import download
from spacy.lang.en.stop_words import STOP_WORDS

nltk.download('punkt')
download(model="en_core_web_lg")
nlp_g = spacy.load('en_core_web_lg')  # downloads every time the code is run, even if the model is already present in the environment

def spacy_summarize(text):
    """
    Returns the summary for an input string text

            Parameters:
                :param text: Input String
                :type text: str

            Returns:
                :return: The summary for the input text
                :rtype: String

    """
    nlp = nlp_g
    doc = nlp(text)
    # count how often each non-stop-word, non-punctuation token appears
    word_frequencies = {}
    for word in doc:
        if word.text.lower() not in STOP_WORDS and word.text not in punctuation:
            key = word.text.lower()
            word_frequencies[key] = word_frequencies.get(key, 0) + 1
    # normalise the counts by the most frequent token
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / max_frequency
    sentence_tokens = [sent for sent in doc.sents]
    sentence_scores = {}
    spacy_frequencies(word_frequencies, sentence_tokens, sentence_scores)
    select_length = max(1, int(len(sentence_tokens) * 0.05))
    summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
    final_summary = [sent.text for sent in summary]
    summary = ' '.join(final_summary)
    return summary

def spacy_frequencies(word_frequencies, sentence_tokens, sentence_scores):
    """
    Child function of the spacy function, for calculating sentence scores
            Parameters:
                    :param: word frequencies, sentence tokens and scores,
                        which are provided through the parent function

    """
    # score each sentence as the sum of its tokens' normalised frequencies
    for sent in sentence_tokens:
        for word in sent:
            if word.text.lower() in word_frequencies:
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]



Things Tried:

import sys
import subprocess
import pkg_resources

required = {'en_core_web_lg'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
  python = sys.executable
  subprocess.check_call([python, '-m', 'spacy', 'download', *missing], stdout=subprocess.DEVNULL)

try:
  nlp_lg = spacy.load("en_core_web_lg")
except ModuleNotFoundError:
  download(model="en_core_web_lg")
  nlp_lg = spacy.load("en_core_web_lg")


Neither solution gave a satisfactory result and the package was downloaded again. I would appreciate it if someone could help me with this.
Thank you so much!

Comments (1)

暖风昔人 2025-02-08 19:42:24


spaCy doesn't automatically download models at all, so this must be a bug in the part of your code that checks whether the model is already installed.

Looking at this code:

try:
  nlp_lg = spacy.load("en_core_web_lg")
except ModuleNotFoundError:
  download(model="en_core_web_lg")
  nlp_lg = spacy.load("en_core_web_lg")

The issue is that if the model is not installed, spacy.load raises an OSError, not a ModuleNotFoundError. First you need to fix that.
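
For reference, a minimal corrected sketch of that snippet (reusing spacy and download as imported in the question) would catch OSError instead:

try:
    nlp_lg = spacy.load("en_core_web_lg")
except OSError:  # spacy.load raises OSError when the model package is missing
    download(model="en_core_web_lg")
    nlp_lg = spacy.load("en_core_web_lg")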

This approach seems like it should work, except that loading models in the same process you installed them in doesn't work very reliably - the list of installed packages is not updated while Python is running. So even after fixing the above issue, it may not work as intended.
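
If you do need to install and load in the same run anyway, one mitigation that may help (this is an assumption on my part, since spacy.load imports the model as a regular Python package, and it is not guaranteed given the caveat above) is invalidating the import system's caches after the download:

import importlib

import spacy
from spacy.cli.download import download

download(model="en_core_web_lg")
importlib.invalidate_caches()  # assumption: lets the import system notice the newly installed package
nlp_lg = spacy.load("en_core_web_lg")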

I would recommend either:

  1. Download the model to a known directory, extract it there, and load it from that path instead of by model name
  2. Check the output of pip list to see if the model is installed, and install it if not (a sketch of this idea follows the list)
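
Here is a minimal sketch of option 2. Instead of parsing pip list output, it checks importlib.util.find_spec, since an installed spaCy model is a regular importable package; note that pip records the distribution under the hyphenated name en-core-web-lg, which is presumably why the pkg_resources check in the question never matched:

import importlib
import importlib.util
import subprocess
import sys

import spacy

MODEL = "en_core_web_lg"

# An installed spaCy model is an importable package, so a missing spec
# means the model is not installed in this environment yet.
if importlib.util.find_spec(MODEL) is None:
    subprocess.check_call([sys.executable, "-m", "spacy", "download", MODEL])
    importlib.invalidate_caches()  # as noted above, loading right after installing in the same run can be flaky

nlp_lg = spacy.load(MODEL)

Option 1 avoids the installed-package lookup entirely: pass the directory you extracted the model into to spacy.load() instead of the package name.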