Using NLTK in AWS Glue

Posted 2025-01-30 14:57:22

I'm struggling to get a script working and wondering if anyone else has successfully done this.
I'm using Glue to execute a Spark script and am trying to use the NLTK module to analyze some text. I've been able to import the NLTK module by uploading it to S3 and referencing that location in the Glue additional Python modules config. However, I'm using the word_tokenize method, which requires the punkt model to be downloaded into the nltk_data directory.
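For context, word_tokenize resolves the punkt model through NLTK's data search path, which is why the download location matters. A minimal illustration (the paths shown are only the typical defaults):

import nltk

# word_tokenize looks punkt up via nltk.data.find, which searches the
# directories listed in nltk.data.path (plus any NLTK_DATA env var entries).
print(nltk.data.path)  # typically includes ~/nltk_data, /usr/nltk_data, ...

# If the model is not found under any of these directories,
# NLTK raises a LookupError listing the locations it searched.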

I've followed this (Download a folder from S3 using Boto3) to copy the punkt files to the /tmp directory in Glue. However, if I look into the /tmp folder in an interactive Glue session I don't see the files. When I run the word_tokenize method I get an error saying that the package can't be found in the default locations (variations of /usr/nltk_data).
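For reference, a minimal sketch of that copy step, assuming a hypothetical bucket name and prefix where the punkt files were uploaded:

import os
import boto3

s3 = boto3.client('s3')
bucket = 'my-glue-assets'                # hypothetical bucket
prefix = 'nltk_data/tokenizers/punkt/'   # hypothetical prefix holding the punkt files

# Walk every object under the prefix and mirror it below /tmp/nltk_data.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/'):
            continue
        local_path = os.path.join('/tmp/nltk_data', os.path.relpath(key, 'nltk_data'))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)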

I'm going to move the required files into the NLTK package in S3 and try to rewrite the NLTK tokenizer to load the files directly instead of from the nltk_data location. But I wanted to check here first whether anyone has been able to get this working, since this seems like a fairly common use case.


Comments (2)

送舟行 2025-02-06 14:57:22

I have limited experience with NLTK, but I think nltk.download() will put punkt in the right spot.

import nltk

print('nltk.__version__', nltk.__version__)

# Download punkt into the default nltk_data location for this environment.
nltk.download('punkt')

from nltk import word_tokenize
print(word_tokenize('Glue is good, but it has some rough edges'))

From the logs

nltk.__version__ 3.6.3
[nltk_data] Downloading package punkt to /home/spark/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
['Glue', 'is', 'good', ',', 'but', 'it', 'has', 'some', 'rough', 'edges']
葵雨 2025-02-06 14:57:22

I wanted to follow up here in case anyone else encounters these issues and can't find a working solution.

After leaving this project alone for a while, I finally came back and was able to get a working solution. Initially I was adding my /tmp location to the nltk_data path and downloading the required packages there. However, this wasn't working.

import nltk

# Point NLTK at /tmp/nltk_data and download the models there (this only runs on the driver).
nltk.data.path.append("/tmp/nltk_data")
nltk.download("punkt", download_dir="/tmp/nltk_data")
nltk.download("averaged_perceptron_tagger", download_dir="/tmp/nltk_data")

Ultimately, I believe the issue was that the file I needed from punkt was not available on the worker nodes. Using the addFile method, I was finally able to use the NLTK data.

sc.addFile('/tmp/nltk_data/tokenizers/punkt/PY3/english.pickle')
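For anyone following along, here is a hedged sketch of how the distributed file might then be used on the executors; SparkFiles.get and the file: URL form of nltk.data.load are real APIs, but the exact wiring below is an assumption rather than the author's verbatim code:

from pyspark import SparkFiles
import nltk

# SparkFiles.get resolves the local copy that sc.addFile shipped to this node.
local_punkt = SparkFiles.get('english.pickle')

# nltk.data.load can read the pickled Punkt sentence tokenizer straight from a
# file: URL, bypassing the nltk_data directory lookup entirely.
sentence_tokenizer = nltk.data.load('file:' + local_punkt)
print(sentence_tokenizer.tokenize('Glue is good. It has some rough edges.'))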

The next issue I had was that I was trying to call a UDF-style function from a .withColumn() method to get the nouns for each row. The issue here is that withColumn requires that a column be passed, but NLTK will only work with string values.

Not working:

df2 = df.select(['col1','col2','col3']).filter(df['col2'].isin(date_list)).withColumn('col4', find_nouns(col('col1')))

In order to get NLTK to work, I passed in my full DataFrame and looped over every row, using collect to get the text value of each row, then building a new DataFrame and returning it with all the original columns plus the new NLTK column. To me this seems incredibly inefficient, but I wasn't able to get a working solution without it.

df2 = find_nouns(df)

from pyspark.sql.types import StructType

def find_nouns(df):
    data = []
    schema = StructType([...])  # schema for col1, col2, col3 plus the new nouns column
    is_noun = lambda pos: pos[:2] == 'NN'
    rows = df.collect()  # collect once on the driver instead of once per iteration
    for row in rows:
        tokenized = nltk.word_tokenize(row[0])
        nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
        data.append((row[0], row[1], row[2], nouns))
    df2 = spark.createDataFrame(data=data, schema=schema)
    return df2

I'm sure there's a better solution out there, but I hope this can help someone get their project to an initial working solution.
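As one possible "better solution", here is a hedged sketch of expressing the same per-row logic as a regular PySpark UDF, so withColumn can hand each row's value to NLTK as a plain string. It assumes the required NLTK data (punkt and averaged_perceptron_tagger) is resolvable on every executor, e.g. via the addFile approach above:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
import nltk

def nouns_from_text(text):
    # Runs on the executors with a plain Python string per row.
    if text is None:
        return []
    tokens = nltk.word_tokenize(text)
    return [word for word, pos in nltk.pos_tag(tokens) if pos[:2] == 'NN']

find_nouns_udf = F.udf(nouns_from_text, ArrayType(StringType()))

df2 = (df.select('col1', 'col2', 'col3')
         .filter(df['col2'].isin(date_list))
         .withColumn('col4', find_nouns_udf(F.col('col1'))))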
