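Parsing HTML with go-colly returns an empty slice
I am using the Colly framework to scrape a website and I am running into a problem. I have a very basic function, getWeeks(),
that is supposed to grab and return some values, but I get an empty slice.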
func getWeeks(c *colly.Collector) []string {
	var wks []string
	c.OnHTML("div.ltbluediv", func(div *colly.HTMLElement) {
		weekName := div.DOM.Find("span").Text() // a string: "Week 1", "Week 2", etc.
		wks = append(wks, weekName)             // weekName holds an actual value; it is not empty
		// If `wks` is printed here, it shows the slice being populated correctly on each iteration
	})
	return wks // returns []
}

func main() {
	c := colly.NewCollector()
	w := getWeeks(c)
	fmt.Println(w) // []
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64)")
	})
	c.Visit("target url")
}
tl;dr: The slice header is updated inside the OnHTML callback, but the value you print in main is the old slice header. You should work with a *[]string instead.
First of all, the callback you pass to c.OnHTML actually runs only after you call c.Visit, so printing w right after getWeeks would show an empty slice in any case.
However, it would still be an empty slice even if you printed it after c.Visit. Why? A slice in Go is implemented as a small data structure called a slice header.
When you assign the return value of getWeeks, you are essentially copying the slice header, including its fields Data, Len and Cap. You can see this by printing the addresses of the two slice variables with the %p verb (using some other struct instead of go-colly to make the example self-contained): it prints two different memory addresses.
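A minimal sketch of that experiment follows. The collector type here is a hypothetical stand-in for colly.Collector (it only stores callbacks and runs them on Visit), and getWeeks additionally returns the address of its local variable purely so the two headers can be compared; neither detail comes from the original code.

```go
package main

import "fmt"

// collector is a hypothetical stand-in for colly.Collector: it stores
// callbacks and runs them later, which is all this example needs.
type collector struct {
	callbacks []func(text string)
}

// OnHTML only registers the callback; nothing runs yet.
func (c *collector) OnHTML(cb func(text string)) {
	c.callbacks = append(c.callbacks, cb)
}

// Visit runs every registered callback once per fake element.
func (c *collector) Visit(elements ...string) {
	for _, e := range elements {
		for _, cb := range c.callbacks {
			cb(e)
		}
	}
}

// getWeeks mirrors the function from the question. It also returns the
// address of its local slice variable (an addition for illustration only).
func getWeeks(c *collector) ([]string, *[]string) {
	var wks []string
	c.OnHTML(func(text string) {
		wks = append(wks, text) // updates the header stored in wks
	})
	return wks, &wks // the first result is a copy of the header
}

func main() {
	c := &collector{}
	w, inner := getWeeks(c)
	c.Visit("Week 1", "Week 2")

	fmt.Printf("%p\n", &w)           // address of the copy held by main
	fmt.Printf("%p\n", inner)        // address of the variable the callback appends to
	fmt.Println(len(w), len(*inner)) // 0 2
}
```

The two addresses differ from run to run, but they are never equal, and only the header that the callback appends to ever grows.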
Now, if you keep fishing around on Stack Overflow about slice and append behavior, you may find out that if the slice has sufficient capacity, the backing array is not reallocated.
However, even if you make sure the backing array stays the same by initializing wks with sufficient capacity, the value of w is still a copy of the original slice header, and therefore still has length 0, so printing it still shows an empty slice.
You could adjust the length of w by reslicing it, but this means that you need to know beforehand both a reasonable capacity that doesn't cause reallocation and the final length to reslice to.
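The capacity and reslicing behavior described above can be sketched as follows. This is only an illustration under assumptions: the capacity of 4 is arbitrary, and the returned fill closure is a hypothetical substitute for the OnHTML callback so that the example does not depend on go-colly.

```go
package main

import "fmt"

// getWeeks returns a zero-length slice with room for 4 elements, plus a
// hypothetical fill function that plays the role of the OnHTML callback.
func getWeeks() ([]string, func(string)) {
	wks := make([]string, 0, 4) // capacity 4 is an arbitrary assumption
	fill := func(s string) {
		wks = append(wks, s) // fits in the existing backing array
	}
	return wks, fill
}

func main() {
	w, fill := getWeeks()
	fill("Week 1")
	fill("Week 2")

	fmt.Println(len(w), cap(w)) // 0 4
	w = w[:2]                   // reslice to the number of appended elements
	fmt.Println(w)              // [Week 1 Week 2]
}
```

The reslice only works because both appends stayed within the preallocated capacity; a third, fourth and fifth append would eventually move the data to a new backing array that w knows nothing about.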
Instead, return a pointer to a slice:
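A minimal sketch of the pointer-returning version, again using a hypothetical collector stand-in rather than the real colly.Collector so that the example runs on its own:

```go
package main

import "fmt"

// collector is a hypothetical stand-in for colly.Collector.
type collector struct {
	callbacks []func(text string)
}

// OnHTML registers a callback to be run later by Visit.
func (c *collector) OnHTML(cb func(text string)) {
	c.callbacks = append(c.callbacks, cb)
}

// Visit runs every registered callback once per fake element.
func (c *collector) Visit(elements ...string) {
	for _, e := range elements {
		for _, cb := range c.callbacks {
			cb(e)
		}
	}
}

// getWeeks returns a pointer, so the caller and the callback share
// one slice header instead of each holding a copy.
func getWeeks(c *collector) *[]string {
	wks := &[]string{}
	c.OnHTML(func(text string) {
		*wks = append(*wks, text)
	})
	return wks
}

func main() {
	c := &collector{}
	w := getWeeks(c)
	c.Visit("Week 1", "Week 2")
	fmt.Println(*w) // [Week 1 Week 2]
}
```

Note that the slice is still only populated after Visit has run, so it must be dereferenced after that call.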
Or pass a pointer into getWeeks.
Fixed playground: https://go.dev/play/p/yhq8YYnkFsv
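The pointer-passing variant can be sketched in the same way. This is not necessarily the code behind the playground link; the collector type is a hypothetical stand-in for colly.Collector, and the two-argument getWeeks signature is an assumption made for illustration.

```go
package main

import "fmt"

// collector is a hypothetical stand-in for colly.Collector.
type collector struct {
	callbacks []func(text string)
}

// OnHTML registers a callback to be run later by Visit.
func (c *collector) OnHTML(cb func(text string)) {
	c.callbacks = append(c.callbacks, cb)
}

// Visit runs every registered callback once per fake element.
func (c *collector) Visit(elements ...string) {
	for _, e := range elements {
		for _, cb := range c.callbacks {
			cb(e)
		}
	}
}

// getWeeks appends into the slice owned by the caller, through a pointer.
func getWeeks(c *collector, wks *[]string) {
	c.OnHTML(func(text string) {
		*wks = append(*wks, text)
	})
}

func main() {
	c := &collector{}
	var w []string
	getWeeks(c, &w)
	c.Visit("Week 1", "Week 2")
	fmt.Println(w) // [Week 1 Week 2]
}
```

With this shape the caller owns the slice variable, and the callback writes directly to that one header.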