Unsubscriptable int error with string slicing

I'm writing a web scraper and I have a table full of links to .pdf files that I want to download, save, and later analyze. I was using Beautiful Soup and I had soup find all the links. They are normally Beautiful Soup tag objects, but I've turned them into strings. The string is actually a bunch of junk with the link text buried in the middle. I want to cut out that junk and just leave the link. Then I will turn these into a list and have Python download them later. (My plan is for Python to keep a list of the pdf link names to keep track of what it's downloaded, and then it can name the files according to those link names or a portion thereof.)

But the .pdf files come with names of variable length, e.g.:

  • I_am_the_first_file.pdf
  • And_I_am_the_second_file.pdf

and as they exist in the table, they have a bunch of junk text:

  • a href = ://blah/blah/blah/I_am_the_first_file.pdf[plus other annotation stuff that gets into my string accidentally]
  • a href = ://blah/blah/blah/And_I_am_the_second_file.pdf[plus other annotation stuff that gets into my string accidentally]

So I want to cut ("slice") the front part and the last part off of the string and just leave the part that points to my URL (so what follows is the desired output for my program):

  • ://blah/blah/blah/I_am_the_first_file.pdf
  • ://blah/blah/blah/And_I_am_the_second_file.pdf

As you can see, though, the second file has more characters in the string than the first. So I can't do:

string[9:40]

or whatever because that would work for the first file but not for the second.

So I'm trying to come up with a variable for the end of the string slice, like so:

string[9:x]

where x is the location of the '.pdf' ending in the string (my thought was to use the string.index('.pdf') function to find it).
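
To make the plan concrete on a made-up string (the 9 assumes exactly nine junk characters of "a href = " in front, and the + 4 covers the four characters of '.pdf' itself):

s = "a href = ://blah/blah/blah/I_am_the_first_file.pdf"
x = s.index('.pdf')   # an int: the position where '.pdf' starts
print s[9:x + 4]      # -> ://blah/blah/blah/I_am_the_first_file.pdf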

But that fails, because I get an error when I try to use a variable to do this

("TypeError: 'int' object is unsubscriptable")

Probably there's an easy answer and a better way to do this other than messing with strings, but you guys are way smarter than me and I figured you'd know straight off.

Here's my full code so far:

import urllib, urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)

table_with_my_pdf_links = soup.find('table', id='searchResults')
# "searchResults" is just what the table I was looking for happened to be called.

for pdf_link in table_with_my_pdf_links.findAll('a'):
    # this says find all the links and loop over them
    pdf_link_string = str(pdf_link)
    # turn the links into strings (they are usually soup tag objects, which don't help me much that I know of)
    if 'pdf' in pdf_link_string:
        # some links in the table are .html and I don't want those, I just want the pdfs
        end_of_link = pdf_link_string.index('.pdf')
        # I want to know where the .pdf file extension ends because that's the end of the link, so I'll slice backward from there
        just_the_link = end_of_link[9:end_of_link]
        # here, the first 9 characters are junk ("a href = yadda yadda yadda"), so I'm slicing from just after that junk
        # up to the .pdf (I realize that I will actually have to do .pdf + 3 or something to get to the end of the string,
        # but this makes it easier for now)
        print just_the_link
        # I debug by print statement because I'm an amateur

The line (second from the bottom) that reads:
just_the_link = end_of_link[9:end_of_link]

returns the error (TypeError: 'int' object is unsubscriptable).

Also, each "://" should have the hypertext transfer protocol in front of it, but the site won't let me post that because new users can't post more than two links, so I took them out.

Comments (2)

初懵 2024-11-03 06:02:19
just_the_link = end_of_link[9:end_of_link]

This is your problem, just like the error message says. end_of_link is an integer -- the index of ".pdf" in pdf_link_string, which you calculated in the preceding line. So naturally you can't slice it. You want to slice pdf_link_string.
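
A minimal sketch of the corrected line, keeping the question's assumption that the first nine characters are junk (the + 4 keeps the four characters of '.pdf' itself):

end_of_link = pdf_link_string.index('.pdf')
just_the_link = pdf_link_string[9:end_of_link + 4]   # slice the string, not the int
print just_the_link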

无人问我粥可暖 2024-11-03 06:02:19

Sounds like a job for regular expressions:

import urllib, urllib2, re
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)

table_with_my_pdf_links = soup.find('table', id='searchResults')
# "searchResults" is just what the table I was looking for happened to be called.

for pdf_link in table_with_my_pdf_links.findAll('a'):
    # find all the links and loop over them
    pdf_link_string = str(pdf_link)
    # turn the links into strings (they are usually soup tag objects)
    if 'pdf' in pdf_link_string:
        # raw string, with the dot escaped so '.pdf' is matched literally
        pdfURLPattern = re.compile(r"://(\w+/)+\S+\.pdf")
        pdfURLMatch = pdfURLPattern.search(pdf_link_string)
        # if there is no match, search() returns None; otherwise group(0) is the whole matched URL
        if pdfURLMatch:
            print pdfURLMatch.group(0)
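
For what it's worth, since the asker wanted "a better way to do this other than messing with strings": Beautiful Soup tags let you read attributes directly, so a sketch like the following (same assumed page and table as above) pulls the href out of each tag without any string surgery:

for pdf_link in table_with_my_pdf_links.findAll('a'):
    href = pdf_link.get('href')        # the tag's href attribute, or None if it has none
    if href and href.endswith('.pdf'):
        print href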