How can I install NLTK corpora/models programmatically, i.e., without the GUI downloader?



My project uses the NLTK. How can I list the project's corpus & model requirements so they can be automatically installed? I don't want to click through the nltk.download() GUI, installing packages one by one.

Also, any way to freeze that same list of requirements (like pip freeze)?

Comments (4)

羁绊已千年 · 2024-11-11 15:17:28

To install all NLTK corpora & models:

python -m nltk.downloader all

Alternatively, on Linux, you can use:

sudo python -m nltk.downloader -d /usr/local/share/nltk_data all

Replace all with popular if you just want the most popular corpora & models.
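
The downloader also accepts explicit package identifiers, so you can script a project's exact data requirements in one command. For example (the package names below are only illustrative; substitute your project's own):

python -m nltk.downloader punkt wordnet stopwords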


You may also browse the corpora & models through the command line:

mlee@server:/scratch/jjylee/tests$ sudo python -m nltk.downloader
[sudo] password for jjylee:
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.............. Chat-80 Data Files
  [ ] city_database....... City Database
  [ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
  [ ] comparative_sentences Comparative Sentence Dataset
  [ ] comtrans............ ComTrans Corpus Sample
  [ ] conll2000........... CONLL 2000 Chunking Corpus
  [ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
  [ ] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
                           and Basque Subset)
  [ ] crubadan............ Crubadan Corpus
  [ ] dependency_treebank. Dependency Parsed Treebank
  [ ] europarl_raw........ Sample European Parliament Proceedings Parallel
                           Corpus
  [ ] floresta............ Portuguese Treebank
  [ ] framenet_v15........ FrameNet 1.5
Hit Enter to continue: 
  [ ] framenet_v17........ FrameNet 1.7
  [ ] gazetteers.......... Gazeteer Lists
  [ ] genesis............. Genesis Corpus
  [ ] gutenberg........... Project Gutenberg Selections
  [ ] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
  [ ] ieer................ NIST IE-ER DATA SAMPLE
  [ ] inaugural........... C-Span Inaugural Address Corpus
  [ ] indian.............. Indian Language POS-Tagged Corpus
  [ ] jeita............... JEITA Public Morphologically Tagged Corpus (in
                           ChaSen format)
  [ ] kimmo............... PC-KIMMO Data Files
  [ ] knbc................ KNB Corpus (Annotated blog corpus)
  [ ] large_grammars...... Large context-free and feature-based grammars
                           for parser comparison
  [ ] lin_thesaurus....... Lin's Dependency Thesaurus
  [ ] mac_morpho.......... MAC-MORPHO: Brazilian Portuguese news text with
                           part-of-speech tags
  [ ] machado............. Machado de Assis -- Obra Completa
  [ ] masc_tagged......... MASC Tagged Corpus
  [ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
  [ ] moses_sample........ Moses Sample Models
Hit Enter to continue: x


Download which package (l=list; x=cancel)?
  Identifier> conll2002
    Downloading package conll2002 to
        /afs/mit.edu/u/m/mlee/nltk_data...
      Unzipping corpora/conll2002.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader>

信仰 · 2024-11-11 15:17:28

In addition to the command line option already mentioned, you can programmatically install NLTK data in your Python script by adding an argument to the download() function.

See the help(nltk.download) text, specifically:

Individual packages can be downloaded by calling the ``download()``
function with a single argument, giving the package identifier for the
package that should be downloaded:

    >>> download('treebank') # doctest: +SKIP
    [nltk_data] Downloading package 'treebank'...
    [nltk_data]   Unzipping corpora/treebank.zip.

I can confirm that this works for downloading one package at a time, or when passed a list or tuple.

>>> import nltk
>>> nltk.download('wordnet')
[nltk_data] Downloading package 'wordnet' to
[nltk_data]     C:\Users\_my-username_\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.
True

You can also re-download an already installed package without problems:

>>> nltk.download('wordnet')
[nltk_data] Downloading package 'wordnet' to
[nltk_data]     C:\Users\_my-username_\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
True

Also, the function appears to return a boolean value that you can use to check whether the download succeeded:

>>> nltk.download('not-a-real-name')
[nltk_data] Error loading not-a-real-name: Package 'not-a-real-name'
[nltk_data]     not found in index
False
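
Since download() accepts a list and appears to report success as a boolean, one way to script a project's data requirements is a small bootstrap snippet. A minimal sketch, assuming the package names below are illustrative:

import sys

import nltk

# Illustrative list of the NLTK data packages this project needs.
REQUIRED_NLTK_PACKAGES = ["wordnet", "conll2002"]

# nltk.download() accepts a list and returns False if a download fails.
if not nltk.download(REQUIRED_NLTK_PACKAGES):
    sys.exit("Failed to download required NLTK data")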

黯淡〆 · 2024-11-11 15:17:28

I've managed to install the corpora and models inside a custom directory using the following code:

import nltk
nltk.download(info_or_id="popular", download_dir="/path/to/dir")
nltk.data.path.append("/path/to/dir")

This will install the "popular" collection of corpora/models inside /path/to/dir and let NLTK know where to look for them (data.path.append).

You can't "freeze" the data in a requirements file, but you could add this code to your __init__, alongside some code that checks whether the files are already there; a sketch of such a check follows.
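
A minimal sketch of such a check, assuming hypothetical resource names; nltk.data.find() raises LookupError when a resource is missing from every directory on nltk.data.path:

import nltk

DOWNLOAD_DIR = "/path/to/dir"

# Map each resource's lookup path to its downloader package id (illustrative).
REQUIRED = {
    "corpora/wordnet": "wordnet",
    "tokenizers/punkt": "punkt",
}

nltk.data.path.append(DOWNLOAD_DIR)
for resource_path, package_id in REQUIRED.items():
    try:
        nltk.data.find(resource_path)  # succeeds if already installed
    except LookupError:
        nltk.download(package_id, download_dir=DOWNLOAD_DIR)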

千纸鹤带着心事 · 2024-11-11 15:17:27

The NLTK site lists a command-line interface for downloading packages and collections at the bottom of this page:

http://www.nltk.org/data

The command-line usage varies with the version of Python you are using, but on my Python 2.6 install I noticed I was missing the 'spanish_grammars' model, and this worked fine:

python -m nltk.downloader spanish_grammars

You mention listing the project's corpus and model requirements, and while I'm not sure of a way to do that automagically, I figured I would at least share this.
