How do I extract item IDs from HTML files using shell tools?

Posted on 2024-11-28 01:13:31 · 432 characters · 1 view · 0 comments

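I have a text file of catalog names, one per line, and I need to loop through that list: take one name at a time, download the corresponding HTML page, and extract the "item_id" from that page.

The item ID appears in the HTML like this: ?item_id=55963573">.

This is what I have so far: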

#!/bin/sh

for productID in (catIDs.txt) #I know this part is not correct
do
    wget -q -U Mozilla "http://www.example.com/$productID/" -O - \
     | tr '"' '\n' | grep "^item_id" | cut -d ' ' -f 4 >> itemIDs.txt
    sleep 15
done

无需解释 2024-12-05 01:13:31

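This should work:

#!/bin/sh

# Read one catalog name per line and print the numeric item_id found on each page.
while read -r productID; do
    wget -q -U Mozilla "http://www.example.com/$productID/" -O - |
    sed -n -E 's/.*\?item_id=([0-9]+)">.*/\1/p'  # plain ">": GNU sed treats \> as a word boundary
done <catIDs.txt >itemIDs.txt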
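
As a hedged illustration of the sed approach above, the extraction step can be pulled out into a small function and tried on a sample line without touching the network. The function name `extract_item_id` and the sample markup are assumptions made for this sketch; only the `?item_id=...">` pattern comes from the question. It uses a basic regular expression, so neither `-r` nor `-E` is needed.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): reads HTML on stdin and prints
# the digits that follow "?item_id=" and precede '">'.
# Because ".*" is greedy, only the last match on each input line is printed.
extract_item_id() {
    sed -n 's/.*?item_id=\([0-9][0-9]*\)">.*/\1/p'
}

# Sample markup is an assumption; real pages may differ.
printf '%s\n' '<a href="/view?item_id=55963573">Item</a>' | extract_item_id
```

On the sample line this prints 55963573. Lines without a matching link produce no output, which is why the loop can append results straight to itemIDs.txt.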

帅冕 2024-12-05 01:13:31

If the file is small, use:

for productID in `cat catIDs.txt`
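
One caveat worth sketching: a `for` loop over command substitution splits on any whitespace, not only on newlines, so a name containing a space becomes two iterations. The function names and sample names below are made up for illustration, and the comparison assumes the default `IFS`.

```shell
#!/bin/sh
# Hypothetical helpers: count the iterations each loop style performs on a file.

count_with_for() {
    n=0
    # Unquoted on purpose: the shell word-splits the file contents here.
    for name in $(cat "$1"); do
        n=$((n + 1))
    done
    echo "$n"
}

count_with_read() {
    n=0
    # IFS= and -r keep each line intact, including spaces and backslashes.
    while IFS= read -r name; do
        n=$((n + 1))
    done < "$1"
    echo "$n"
}

# Two lines, one of which contains a space (sample data, not from the thread).
list=$(mktemp)
printf '%s\n' 'books' 'garden tools' > "$list"
count_with_for "$list"    # prints 3
count_with_read "$list"   # prints 2
rm -f "$list"
```

If the catalog names never contain whitespace or glob characters, both styles behave the same.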

鹿港小镇 2024-12-05 01:13:31

cat catIDs.txt | while read productID;
do
  wget -q -U Mozilla "http://www.domain.com/$productID/" -O - \
  | tr '"' '\n' | grep "^item_id" | cut -d ' ' -f 4 >> itemIDs.txt
  sleep 15
done

Or:

while read productID;
do
  wget -q -U Mozilla "http://www.domain.com/$productID/" -O - \
  | tr '"' '\n' | grep "^item_id" | cut -d ' ' -f 4 >> itemIDs.txt
  sleep 15
done < catIDs.txt
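
The two forms above write the same output file, but they can differ if the loop needs to set a variable that is read afterwards. A rough sketch, with hypothetical function names: in shells such as bash (default settings) and dash, the piped loop runs in a subshell, so its counter is lost, while the redirected loop runs in the current shell. Some shells, for example ksh93 and zsh, behave differently for the piped case.

```shell
#!/bin/sh
# Hypothetical helpers comparing the two loop styles from the answer above.

count_piped() {
    n=0
    # In bash and dash this loop runs in a subshell; n is not updated out here.
    cat "$1" | while IFS= read -r name; do
        n=$((n + 1))
    done
    echo "$n"
}

count_redirected() {
    n=0
    # Redirection keeps the loop in the current shell, so n survives.
    while IFS= read -r name; do
        n=$((n + 1))
    done < "$1"
    echo "$n"
}

# Sample list (made-up names).
list=$(mktemp)
printf '%s\n' 'alpha' 'beta' 'gamma' > "$list"
echo "piped:      $(count_piped "$list")"
echo "redirected: $(count_redirected "$list")"
rm -f "$list"
```

For the script in this thread, which only appends to itemIDs.txt, either form is fine; the redirected form is the safer default if counters or flags are added later.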
