Parsing HTML with go-colly returns an empty slice

Posted 2025-01-25 18:19:46

I'm parsing a website with the colly framework and something is going wrong. I have a very basic function, getWeeks(), that should grab some values and return them, yet I'm getting an empty slice instead.

func getWeeks(c *colly.Collector) []string {
    var wks []string
    c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
        weekName := div.DOM.Find("span").Text()  // a string Week 1, Week 2 etc 
        wks = append(wks, weekName)  // weekName has actual value is not empty
        // If `wks` printed here it shows correctly how the slice gets populated on each iteration
    })
    return wks  // returns []
}

func main() {
    c := colly.NewCollector(
    )

    w := getWeeks(c)
    fmt.Println(w)  // []

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64)")
    })

    c.Visit("target url")

}


Comments (1)

南城追梦 2025-02-01 18:19:46

tl;dr: The slice header is updated inside the OnHTML callback, but the value you print in main is the old slice header. You should work with *[]string instead.


First of all, the callback you pass to c.OnHTML only runs after you call c.Visit, so printing w right after getWeeks would show an empty slice in any case.
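To make the ordering concrete, here is a minimal sketch that counts how often a callback has run before and after Visit. The collector type and the countCalls helper are hypothetical stand-ins written for this illustration, not part of go-colly; the stand-in simply stores callbacks and replays three made-up matches when Visit is called.

```go
package main

import "fmt"

// collector is a hypothetical stand-in for *colly.Collector: it only stores
// callbacks and runs them when Visit is called.
type collector struct {
	handlers []func(text string)
}

// OnHTML registers a callback. It does not run it.
func (c *collector) OnHTML(selector string, f func(text string)) {
	c.handlers = append(c.handlers, f)
}

// Visit pretends to fetch a page and feeds three made-up matches to every handler.
func (c *collector) Visit(url string) {
	for _, text := range []string{"Week 1", "Week 2", "Week 3"} {
		for _, h := range c.handlers {
			h(text)
		}
	}
}

// countCalls reports how many times the callback had run before and after Visit.
func countCalls() (before, after int) {
	c := &collector{}
	calls := 0
	c.OnHTML("div.ltbluediv", func(string) { calls++ })
	before = calls // registration alone has not invoked the callback
	c.Visit("target url")
	after = calls
	return before, after
}

func main() {
	before, after := countCalls()
	fmt.Println(before, after) // 0 3
}
```

With the real library the same ordering applies to the code in the question: anything read before c.Visit cannot contain scraped data yet.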

However, it would still be an empty slice even if you printed it after c.Visit. Why?

A slice in Go is implemented as a small data structure called a slice header, which holds a pointer to the backing array together with a length and a capacity.
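As a small illustration of what copying a slice value means, the sketch below copies a slice and then appends to the original only. The helper name copyThenAppend is made up for this example.

```go
package main

import "fmt"

// copyThenAppend copies a slice value, then appends to the original only.
func copyThenAppend() (origLen, copyLen int) {
	var orig []string
	cp := orig // copies the header (pointer, length, capacity), nothing else
	orig = append(orig, "Week 1")
	return len(orig), len(cp)
}

func main() {
	origLen, copyLen := copyThenAppend()
	fmt.Println(origLen, copyLen) // 1 0
}
```

The copy keeps the length it had at the moment of the assignment, no matter what happens to the original afterwards.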

When you assign the return value of getWeeks, you're essentially copying the slice header, including its fields Data, Len and Cap. You can see this by printing the addresses of the two slice variables with the %p verb (using a stand-in struct instead of go-colly to make the example self-contained):

func getWeeks(c *Foo) []string {
    var wks []string
    c.OnHTML("div.ltbluediv", func(text string) {
        weekName := text
        wks = append(wks, weekName)
    })
    fmt.Printf("%p\n", &wks)
    return wks
}

func main() {
    c := &Foo{}

    w := getWeeks(c)

    c.Visit("target url")
    fmt.Printf("%p\n", &w)

}

This prints two different memory addresses:

0xc0000ac030
0xc0000ac018
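The snippet above leaves Foo undefined. A minimal runnable sketch could look like the following, assuming a Foo that stores callbacks and replays three made-up strings on Visit. Foo, the peek function and run are illustration-only names; peek is added purely so the variable captured by the closure can be inspected, and it is not something the original code has.

```go
package main

import "fmt"

// Foo is a hypothetical stand-in for colly.Collector.
type Foo struct {
	callbacks []func(text string)
}

// OnHTML stores the callback for later.
func (f *Foo) OnHTML(selector string, cb func(text string)) {
	f.callbacks = append(f.callbacks, cb)
}

// Visit replays three made-up matches through every stored callback.
func (f *Foo) Visit(url string) {
	for _, text := range []string{"foo", "bar", "baz"} {
		for _, cb := range f.callbacks {
			cb(text)
		}
	}
}

// getWeeks returns the slice value as in the question, plus a peek function
// (demonstration only) that reads the variable the closure captured.
func getWeeks(c *Foo) ([]string, func() []string) {
	var wks []string
	c.OnHTML("div.ltbluediv", func(text string) {
		wks = append(wks, text)
	})
	return wks, func() []string { return wks }
}

// run returns the stale copy held by the caller and the captured variable.
func run() (returned, captured []string) {
	c := &Foo{}
	w, peek := getWeeks(c)
	c.Visit("target url")
	return w, peek()
}

func main() {
	w, captured := run()
	fmt.Println(len(w), captured) // 0 [foo bar baz]
}
```

The captured variable is populated, while the value returned earlier is a separate header that never sees the appends.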

Now, if you keep fishing around on Stack Overflow about slice and append behavior, you may find out that the backing array is not reallocated when the slice has sufficient capacity.

However, even if you make sure the backing array stays the same by initializing wks with sufficient capacity, the value of w is still a copy of the original slice header, and therefore still has length 0. Printing the headers at each step shows this:

in getWeeks reflect.SliceHeader{Data:0xc0000121b0, Len:0, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:1, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:2, Cap:3}
in callback reflect.SliceHeader{Data:0xc0000121b0, Len:3, Cap:3}
[]
in main reflect.SliceHeader{Data:0xc0000121b0, Len:0, Cap:3}

You could adjust the length of w by reslicing it:

c.Visit("target url")
w = w[0:3]
fmt.Println(w) // [foo bar baz]

But this means that you need to know beforehand both a capacity large enough to avoid reallocation and the final length to reslice to.
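The reslicing behavior can be reproduced without any collector at all. The following sketch, with a made-up helper name resliceDemo, preallocates capacity 3, keeps a stale copy of the header, appends through the original, and then reslices the stale copy.

```go
package main

import "fmt"

// resliceDemo appends through one header and reads through a stale copy.
func resliceDemo() (stale, resliced []string) {
	wks := make([]string, 0, 3) // capacity 3, so three appends never reallocate
	w := wks                    // stale copy: same backing array, length stays 0
	for _, s := range []string{"foo", "bar", "baz"} {
		wks = append(wks, s)
	}
	return w, w[0:3] // reslicing within capacity exposes the appended elements
}

func main() {
	stale, resliced := resliceDemo()
	fmt.Println(len(stale), resliced) // 0 [foo bar baz]
}
```

A fourth append would exceed the capacity and move wks to a new backing array, at which point reslicing w would no longer show the new data. That fragility is why this approach is not recommended.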

Instead, return a pointer to a slice:

func getWeeks(c *colly.Collector) *[]string {
    wks := &[]string{}
    c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
        weekName := div.DOM.Find("span").Text()
        *wks = append(*wks, weekName) 
    })
    return wks
}

Or pass a pointer into getWeeks:

func getWeeks(c *colly.Collector, wks *[]string) {
    c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
        weekName := div.DOM.Find("span").Text()
        *wks = append(*wks, weekName)
    })
}

Fixed playground: https://go.dev/play/p/yhq8YYnkFsv
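For completeness, here is a sketch of how the pointer-returning variant might be called. It uses a hypothetical stand-in collector with made-up data instead of go-colly so that it runs on its own; collectWeeks is also an illustration-only name. The point is that the pointer is dereferenced only after Visit has run.

```go
package main

import "fmt"

// collector is a hypothetical stand-in for *colly.Collector.
type collector struct{ handlers []func(text string) }

// OnHTML stores the callback for later.
func (c *collector) OnHTML(selector string, f func(text string)) {
	c.handlers = append(c.handlers, f)
}

// Visit replays three made-up matches through every stored callback.
func (c *collector) Visit(url string) {
	for _, text := range []string{"Week 1", "Week 2", "Week 3"} {
		for _, h := range c.handlers {
			h(text)
		}
	}
}

// getWeeks mirrors the pointer-returning version, using the stand-in collector.
func getWeeks(c *collector) *[]string {
	wks := &[]string{}
	c.OnHTML("div.ltbluediv", func(text string) {
		*wks = append(*wks, text)
	})
	return wks
}

// collectWeeks registers the callback, runs the visit, then reads the result.
func collectWeeks() []string {
	c := &collector{}
	w := getWeeks(c)
	c.Visit("target url")
	return *w // dereference after Visit, once the callbacks have run
}

func main() {
	fmt.Println(collectWeeks()) // [Week 1 Week 2 Week 3]
}
```

Because the caller and the closure share one pointer to one slice header, every append made inside the callback is visible through *w.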
