Problem extracting data from a website in .NET and C#
I have written a web scraping program to go to a list of pages and write all the html to a file. The problem is that when I pull a block of text some of the characters get written as '�'. How do I pull those characters into my text file? Here is my code:
string baseUri = String.Format("http://www.rogersmushrooms.com/gallery/loadimage.asp?did={0}&blockName={1}", id.ToString(), name.Trim());
// our third request is for the actual webpage after the login.
HttpWebRequest request =
(HttpWebRequest)WebRequest.Create(baseUri);
request.Method = "GET";
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";
//get the response object, so that we may get the session cookie.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());
// and read the response
string page = reader.ReadToEnd();
StreamWriter SW;
string filename = string.Format("{0}.txt", id.ToString());
SW = File.AppendText("C:\\Share\\" + filename);
SW.Write(page);
reader.Close();
response.Close();
Comments (3)
You're saving a page named loadimage to a text file. Are you sure that's really all text?
Either way, you can save yourself a lot of code by using System.Net.WebClient.DownloadFile().
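
A minimal sketch of the DownloadFile suggestion, assuming the same URL pattern, User-Agent string and C:\Share folder as the question. The PageDownloader class and its method names are hypothetical, made up for illustration. DownloadFile copies the response bytes to disk without decoding them, so no character encoding is involved at this step; note that it replaces an existing file rather than appending to it.

```csharp
using System;
using System.Net;

static class PageDownloader
{
    // Same URL pattern as the question; id and name are the caller's values.
    public static string BuildUri(int id, string name)
    {
        return String.Format(
            "http://www.rogersmushrooms.com/gallery/loadimage.asp?did={0}&blockName={1}",
            id, name.Trim());
    }

    // Saves the raw response body to C:\Share\<id>.txt (folder taken from the question).
    public static void Save(int id, string name)
    {
        using (var client = new WebClient())
        {
            // Same User-Agent string the question sends.
            client.Headers.Add("user-agent",
                "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)");
            client.DownloadFile(BuildUri(id, name), "C:\\Share\\" + id + ".txt");
        }
    }
}
```

Because the bytes are stored unchanged, whatever later opens the file still has to read it with the encoding the server used.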

You need to specify your encoding in this line:
and
File.AppendText("C:\\Share\\" + filename);
uses UTF-8.

Specify Unicode encoding, like so:
..same for the StreamWriter
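
The encoding advice in these answers can be sketched as below. Which encoding is correct depends on what the server actually sends, so the sketch takes the charset name as a parameter (for example the value of HttpWebResponse.CharacterSet) and falls back to a caller-supplied default. The EncodingAwareCopy class, its method names and the ISO-8859-1 fallback in the usage note are assumptions for illustration, not part of the original answers.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingAwareCopy
{
    // Returns the encoding named by charset, or fallback when the name is missing or unknown.
    public static Encoding Resolve(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset)) return fallback;
        try
        {
            // Strip whitespace and any surrounding quotes before the lookup.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return fallback;
        }
    }

    // Decodes the whole stream with an explicit encoding instead of StreamReader's UTF-8 default.
    public static string ReadAll(Stream body, Encoding encoding)
    {
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Appends text as UTF-8; disposing the writer flushes and closes the file.
    public static void AppendAll(string path, string text)
    {
        using (var writer = new StreamWriter(path, true, Encoding.UTF8))
        {
            writer.Write(text);
        }
    }
}
```

With the question's variables this would be called as ReadAll(response.GetResponseStream(), Resolve(response.CharacterSet, Encoding.GetEncoding("ISO-8859-1"))) followed by AppendAll("C:\\Share\\" + filename, page). The using blocks also close the writer, which flushes buffered text to disk; the question's snippet as shown never closes SW.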