Using NLTK in AWS Glue

Posted 2025-01-30 14:57:22

I'm struggling to get a script working and wondering if anyone else has successfully done this.
I'm using Glue to execute a Spark script and am trying to use the NLTK module to analyze some text. I've been able to import the NLTK module by uploading it to S3 and referencing that location in the Glue additional Python modules config. However, I'm using the word_tokenize method, which requires the punkt model to be downloaded into the nltk_data directory.
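For context, word_tokenize resolves the punkt model through NLTK's data search path, which is why the download location matters. A minimal illustration (the paths shown are only the typical defaults):

import nltk

# word_tokenize looks punkt up via nltk.data.find, which searches the
# directories listed in nltk.data.path (plus any NLTK_DATA env var entries).
print(nltk.data.path)  # typically includes ~/nltk_data, /usr/nltk_data, ...

# If the model is not found under any of these directories,
# NLTK raises a LookupError listing the locations it searched.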

I've followed this (Download a folder from S3 using Boto3) to copy the punkt files to the /tmp directory in Glue. However, if I look into the /tmp folder in an interactive Glue session I don't see the files. When I run the word_tokenize method I get an error saying that the package can't be found in the default locations (variations of /usr/nltk_data).
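For reference, a minimal sketch of that copy step, assuming a hypothetical bucket name and prefix where the punkt files were uploaded:

import os
import boto3

s3 = boto3.client('s3')
bucket = 'my-glue-assets'                # hypothetical bucket
prefix = 'nltk_data/tokenizers/punkt/'   # hypothetical prefix holding the punkt files

# Walk every object under the prefix and mirror it below /tmp/nltk_data.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/'):
            continue
        local_path = os.path.join('/tmp/nltk_data', os.path.relpath(key, 'nltk_data'))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)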

I'm going to move the required files into the NLTK package in S3 and try to rewrite the NLTK tokenizer to load the files directly instead of from the nltk_data location. But I wanted to check here first whether anyone has been able to get this working, since this seems like a fairly common use case.


Comments (2)

送舟行 2025-02-06 14:57:22

I have limited experience with NLTK, but I think nltk.download() will put punkt in the right spot.

import nltk

print('nltk.__version__', nltk.__version__)

# Download punkt into the default nltk_data location for this environment.
nltk.download('punkt')

from nltk import word_tokenize
print(word_tokenize('Glue is good, but it has some rough edges'))

From the logs

nltk.__version__ 3.6.3
[nltk_data] Downloading package punkt to /home/spark/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
['Glue', 'is', 'good', ',', 'but', 'it', 'has', 'some', 'rough', 'edges']
葵雨 2025-02-06 14:57:22

I wanted to follow up here in case anyone else encounters these issues and can't find a working solution.

After leaving this project alone for a while, I finally came back and was able to get a working solution. Initially I was adding my /tmp location to the nltk_data path and downloading the required packages there. However, this wasn't working.

import nltk

# Point NLTK at /tmp/nltk_data and download the models there (this only runs on the driver).
nltk.data.path.append("/tmp/nltk_data")
nltk.download("punkt", download_dir="/tmp/nltk_data")
nltk.download("averaged_perceptron_tagger", download_dir="/tmp/nltk_data")

Ultimately, I believe the issue was that the file I needed from punkt was not available on the worker nodes. Using the addFile method, I was finally able to use the NLTK data.

sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
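For anyone following along, here is a hedged sketch of how the distributed file might then be used on the executors; SparkFiles.get and the file: URL form of nltk.data.load are real APIs, but the exact wiring below is an assumption rather than the author's verbatim code:

from pyspark import SparkFiles
import nltk

# SparkFiles.get resolves the local copy that sc.addFile shipped to this node.
local_punkt = SparkFiles.get('english.pickle')

# nltk.data.load can read the pickled Punkt sentence tokenizer straight from a
# file: URL, bypassing the nltk_data directory lookup entirely.
sentence_tokenizer = nltk.data.load('file:' + local_punkt)
print(sentence_tokenizer.tokenize('Glue is good. It has some rough edges.'))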

The next issue I had was that I was trying to call a UDF-style function from a .withColumn() method to get the nouns for each row. The issue here is that withColumn requires that a column be passed, but NLTK will only work with string values.

Not working:

df2 = df.select(['col1','col2','col3']).filter(df['col2'].isin(date_list)).withColumn('col4', find_nouns(col('col1')))

In order to get NLTK to work, I passed in my full DataFrame and looped over every row, using collect to get the text value of each row, then building a new DataFrame and returning it with all the original columns plus the new NLTK column. To me this seems incredibly inefficient, but I wasn't able to get a working solution without it.

df2 = find_nouns(df)

from pyspark.sql.types import StructType

def find_nouns(df):
    data = []
    schema = StructType([...])  # schema for col1, col2, col3 plus the new nouns column
    is_noun = lambda pos: pos[:2] == 'NN'
    rows = df.collect()  # collect once on the driver instead of once per iteration
    for row in rows:
        tokenized = nltk.word_tokenize(row[0])
        nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
        data.append((row[0], row[1], row[2], nouns))
    df2 = spark.createDataFrame(data=data, schema=schema)
    return df2

I'm sure there's a better solution out there, but I hope this can help someone get their project to an initial working solution.
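As one possible "better solution", here is a hedged sketch of expressing the same per-row logic as a regular PySpark UDF, so withColumn can hand each row's value to NLTK as a plain string. It assumes the required NLTK data (punkt and averaged_perceptron_tagger) is resolvable on every executor, e.g. via the addFile approach above:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
import nltk

def nouns_from_text(text):
    # Runs on the executors with a plain Python string per row.
    if text is None:
        return []
    tokens = nltk.word_tokenize(text)
    return [word for word, pos in nltk.pos_tag(tokens) if pos[:2] == 'NN']

find_nouns_udf = F.udf(nouns_from_text, ArrayType(StringType()))

df2 = (df.select('col1', 'col2', 'col3')
         .filter(df['col2'].isin(date_list))
         .withColumn('col4', find_nouns_udf(F.col('col1'))))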
