Reading a large file is very slow, please help
This code takes about 30 minutes to run and uses a lot of CPU. What is the problem?
Do
    strLine = objReader.ReadLine()
    If strLine Is Nothing Then
        Exit Do
    End If
    'check valid proxy
    m = Regex.Match(strLine.Trim, strProxyParttern)
    strMatch = m.Value.Trim
    If String.IsNullOrEmpty(strMatch) = True OrElse _
       strMatch.Contains("..") = True Then
        Continue Do
    End If
    ' create proxy
    With tmpProxy
        .IP = strMatch.Substring(0, strMatch.IndexOf(":"))
        .Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
        .Status = "new"
    End With
    ' check
    If lstProxys.Contains(tmpProxy) = True Then
        Continue Do
    End If
    lstProxys.Add(tmpProxy)
    Debug.Print(lstProxys.Count.ToString)
Loop Until strLine Is Nothing
If lstProxys.Count < 1 Then
    Exit Sub
End If
Is the slowness coming from the comparison, from reading the file, or from the regex?
EDIT
Profiling the code like this:
Dim myTimer As New System.Diagnostics.Stopwatch()
Dim t1 As Integer = 0
Dim t2 As Integer = 0
Dim t3 As Integer = 0
'read the file line by line, collecting valid proxy
Do
    'Read a line from the file
    myTimer.Reset()
    myTimer.Start()
    strLine = objReader.ReadLine()
    If strLine Is Nothing Then
        Exit Do
    End If
    myTimer.Stop()
    t1 = myTimer.Elapsed.Milliseconds
    'check valid proxy
    myTimer.Reset()
    myTimer.Start()
    m = Regex.Match(strLine.Trim, strProxyParttern)
    strMatch = m.Value.Trim
    If String.IsNullOrEmpty(strMatch) = True OrElse _
       strMatch.Contains("..") = True Then
        Continue Do
    End If
    myTimer.Stop()
    t2 = myTimer.Elapsed.Milliseconds
    ' create proxy
    myTimer.Reset()
    myTimer.Start()
    tmpProxy.IP = strMatch.Substring(0, strMatch.IndexOf(":"))
    tmpProxy.Port = CInt(strMatch.Substring(strMatch.IndexOf(":") + 1))
    tmpProxy.Status = "new"
    ' check
    If lstProxys.Contains(tmpProxy) = True Then
        Continue Do
    End If
    lstProxys.Add(tmpProxy)
    myTimer.Stop()
    ' store the add/compare time in t3 (the posted code wrote it to t2, leaving t3 at 0)
    t3 = myTimer.Elapsed.Milliseconds
    Debug.Print(String.Format("Read={0}, Match={1}, Add={2}", t1, t2, t3))
Loop Until strLine Is Nothing
gave these results:
Read=0, Match=0, Add=1
Read=0, Match=0, Add=1
Read=0, Match=0, Add=2
...
Read=0, Match=0, Add=9
Read=0, Match=0, Add=9
Read=0, Match=0, Add=10
...
...
Read=0, Match=0, Add=39
Read=0, Match=0, Add=39
Read=0, Match=0, Add=40
etc
So it looks like the code is fine, except for the step that adds to the list.
Comments (3)
The speed issue is because you are using a List(Of Structure). The List.Contains method is a linear search (it goes through each item of the list to see if it matches) so it takes increasingly longer the more unique items you add to the list.
Because you're dealing with a large number of items, change lstProxys into a HashSet(Of T). You should see a huge performance boost. All you should need to do is change the definition of lstProxys:
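The definition referred to above did not survive in this copy of the answer. As a minimal sketch only: the `Proxy` structure below is a hypothetical stand-in for the asker's type (its real definition is not shown), and treating two proxies as equal when `IP` and `Port` match is an assumption. Implementing `IEquatable(Of Proxy)` and overriding `GetHashCode` lets the set hash items instead of scanning them.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical stand-in for the asker's proxy type; the real definition is not shown.
Public Structure Proxy
    Implements IEquatable(Of Proxy)

    Public IP As String
    Public Port As Integer
    Public Status As String

    ' Assumption: two proxies are the same when IP and Port match (Status is ignored).
    Public Overloads Function Equals(other As Proxy) As Boolean _
        Implements IEquatable(Of Proxy).Equals
        Return String.Equals(IP, other.IP) AndAlso Port = other.Port
    End Function

    Public Overloads Overrides Function Equals(obj As Object) As Boolean
        Return TypeOf obj Is Proxy AndAlso Equals(DirectCast(obj, Proxy))
    End Function

    ' Must agree with Equals: equal proxies produce equal hash codes.
    Public Overrides Function GetHashCode() As Integer
        Dim ipHash As Integer = If(IP Is Nothing, 0, IP.GetHashCode())
        Return ipHash Xor Port
    End Function
End Structure

Module HashSetSketch
    Sub Main()
        ' The only change to the collection itself: HashSet instead of List.
        Dim lstProxys As New HashSet(Of Proxy)()

        ' Sample input standing in for the regex matches from the file.
        Dim matches As String() = {"10.0.0.1:8080", "10.0.0.2:3128", "10.0.0.1:8080"}

        For Each strMatch As String In matches
            Dim tmpProxy As New Proxy()
            Dim sep As Integer = strMatch.IndexOf(":"c)
            tmpProxy.IP = strMatch.Substring(0, sep)
            tmpProxy.Port = CInt(strMatch.Substring(sep + 1))
            tmpProxy.Status = "new"
            ' Add returns False and leaves the set unchanged for a duplicate,
            ' so a separate Contains check is not needed.
            lstProxys.Add(tmpProxy)
        Next

        Console.WriteLine(lstProxys.Count) ' prints 2
    End Sub
End Module
```

Because `HashSet(Of T).Add` already reports whether the item was new, the `Contains` test in the original loop can probably be dropped along with the list.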
Disk I/O is usually the limiting factor for something like this. Depending on the disk speed, you could expect a throughput of about 5-20 megabytes per second.
Regular expressions can be slow if they contain expressions that cause a lot of backtracking, so that is a possibility, but the pattern would have to be pretty bad for it to be noticeable compared to the disk I/O.
As there will never be more than one item in the proxy list, that comparison can't be the problem. You are not creating any new proxy object, but reusing the same one, which means that you change the properties of the object that you have already put in the list. As you are comparing the object with itself, the list will always contain the object after the first iteration, and it will never be added a second time.
Does the proxy class do anything when you assign values to its properties? If it does something like creating a connection, that might be what's taking so long.
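Whether the reuse argument above holds depends on whether the proxy type is a `Class` or a `Structure`, which the question does not show (another answer assumes a structure). The sketch below uses two hypothetical types, `ProxyClass` and `ProxyStruct`, to illustrate the difference: a reused class instance is found in the list by reference, while a structure is copied on `Add`, so each distinct value is stored separately.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical types for illustration; neither is the asker's actual definition.
Public Class ProxyClass
    Public IP As String
    Public Port As Integer
End Class

Public Structure ProxyStruct
    Public IP As String
    Public Port As Integer
End Structure

Module ReuseSketch
    Sub Main()
        Dim ips As String() = New String() {"10.0.0.1", "10.0.0.2"}

        ' Class: the list holds a reference to the one shared object, so
        ' Contains finds it again even after its fields have changed.
        Dim classList As New List(Of ProxyClass)()
        Dim sharedObj As New ProxyClass()
        For Each ip As String In ips
            sharedObj.IP = ip
            sharedObj.Port = 8080
            If Not classList.Contains(sharedObj) Then classList.Add(sharedObj)
        Next
        Console.WriteLine(classList.Count) ' prints 1

        ' Structure: Add stores a copy, and Contains compares field values,
        ' so each distinct IP ends up as its own entry.
        Dim structList As New List(Of ProxyStruct)()
        Dim sharedVal As New ProxyStruct()
        For Each ip As String In ips
            sharedVal.IP = ip
            sharedVal.Port = 8080
            If Not structList.Contains(sharedVal) Then structList.Add(sharedVal)
        Next
        Console.WriteLine(structList.Count) ' prints 2
    End Sub
End Module
```

The growing `Add` times in the question's profile are consistent with the structure case, where the list keeps growing and every `Contains` call scans it.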
We could take educated guesses, but why not measure it instead?
For example, run the following three tests separately in release mode, without the debugger attached, and see how long each one takes
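The three tests themselves are missing from this copy of the answer. What follows is only a guess at their shape, not the answerer's code: one pass that just reads, one that reads and matches, and one that also de-duplicates with `List.Contains`. The regex `Pattern`, the generated sample file, and all function names are assumptions made for illustration; the asker's `strProxyParttern` is not shown.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Diagnostics
Imports System.IO
Imports System.Text.RegularExpressions

Module MeasureSketch
    ' Hypothetical pattern; the asker's strProxyParttern is not shown.
    Const Pattern As String = "\d{1,3}(\.\d{1,3}){3}:\d{1,5}"

    ' Test 1: read every line and do nothing else.
    Function ReadOnlyTest(fileName As String) As Integer
        Dim count As Integer = 0
        For Each line As String In File.ReadLines(fileName)
            count += 1
        Next
        Return count
    End Function

    ' Test 2: read every line and run the regex on it.
    Function ReadAndMatchTest(fileName As String) As Integer
        Dim rx As New Regex(Pattern)
        Dim count As Integer = 0
        For Each line As String In File.ReadLines(fileName)
            If rx.IsMatch(line.Trim()) Then count += 1
        Next
        Return count
    End Function

    ' Test 3: read, match, and de-duplicate with List.Contains (linear search).
    Function ReadMatchAndAddTest(fileName As String) As Integer
        Dim rx As New Regex(Pattern)
        Dim seen As New List(Of String)()
        For Each line As String In File.ReadLines(fileName)
            Dim m As Match = rx.Match(line.Trim())
            If m.Success AndAlso Not seen.Contains(m.Value) Then seen.Add(m.Value)
        Next
        Return seen.Count
    End Function

    ' Runs one test and prints its wall-clock time.
    Sub Time(label As String, test As Func(Of String, Integer), fileName As String)
        Dim sw As Stopwatch = Stopwatch.StartNew()
        Dim result As Integer = test(fileName)
        sw.Stop()
        Console.WriteLine("{0}: {1} ms, result={2}", label, sw.ElapsedMilliseconds, result)
    End Sub

    Sub Main()
        ' Generate a sample file of unique host:port lines so the sketch is self-contained.
        Dim fileName As String = Path.GetTempFileName()
        Dim lines(19999) As String
        For i As Integer = 0 To 19999
            lines(i) = String.Format("10.0.{0}.{1}:8080", i \ 256, i Mod 256)
        Next
        File.WriteAllLines(fileName, lines)

        Time("Read only", AddressOf ReadOnlyTest, fileName)
        Time("Read + match", AddressOf ReadAndMatchTest, fileName)
        Time("Read + match + add", AddressOf ReadMatchAndAddTest, fileName)

        File.Delete(fileName)
    End Sub
End Module
```

Comparing the three timings shows how much each stage adds; pointing `fileName` at the real input file instead of the generated one would give numbers that reflect the actual workload.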