C#: parsing a Google Trends CSV that has no apparent delimiter

Posted on 2024-10-17 10:43:23

I'm trying to parse CSV files from Google Trends, but there doesn't appear to be any delimiter between the columns. Is there any way to get this working so that the data ends up separated into columns after parsing, or is the best I can do to have each row in a single column?

I've tried numerous csv readers:
http://www.codeproject.com/KB/database/CsvReader.aspx
http://www.stellman-greene.com/CSVReader/

I could try to substring out the data in each row, but that seems like a very poor solution.

Example csv file from google trends:
http://www.google.com/trends/viz?q=stackoverflow&date=all&geo=all&graph=all_csv&sort=0&sa=N

Anyone got any ideas?


Comments (3)

你爱我像她 2024-10-24 10:43:23

It seems to me the columns are delimited with tabs (U+0009), aren’t they? Just do

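using System;
using System.IO;

// Passing 'true' asks StreamReader to detect the encoding from a byte order mark, if the file has one
using (var reader = new StreamReader(@"trends.csv", true))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Columns are separated by tab characters (U+0009)
        var items = line.Split('\t');
        if (items.Length == 3) // recognizing the header etc. left as an exercise for the reader
        {
            Console.WriteLine("Date: {0}, value = {1}, error = {2}", items[0], items[1], items[2]);
        }
    }
}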
煮酒 2024-10-24 10:43:23

Looks to me like it's encoded in UTF-16 with a delimiter of tab (U+0009).
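
If the file really is UTF-16, one option is to state the encoding explicitly instead of relying on detection. This is a minimal sketch, assuming little-endian UTF-16; the class name `Utf16TabReader` and the method `ReadTabSeparated` are made up for illustration and do not come from any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class Utf16TabReader
{
    // Hypothetical helper: decodes the stream as UTF-16 and splits each line on tabs.
    public static List<string[]> ReadTabSeparated(Stream input)
    {
        var rows = new List<string[]>();
        // Encoding.Unicode is UTF-16 little-endian (an assumption about the file);
        // the 'true' flag still lets a byte order mark override it if one is present.
        using (var reader = new StreamReader(input, Encoding.Unicode, true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                rows.Add(line.Split('\t'));
            }
        }
        return rows;
    }
}
```

It could be called as `Utf16TabReader.ReadTabSeparated(File.OpenRead("trends.csv"))`, where the file name is only a placeholder.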

软的没边 2024-10-24 10:43:23

There are two possible reasons why those libraries fail to parse it well:

  1. The first 4 lines could "trick" those parsers into believing there are only 2 columns

  2. This is not really a CSV (Comma-Separated Values) file; tabs are used instead of commas




It's easy and straightforward to write your own parser for this particular case (there are no escaped tabs in values):

  1. Open the file

  2. Skip the first 5 lines

  3. For each line you read, split it by \t and get column values
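
The three steps above could be sketched roughly as follows. The class `TrendsParser` and the method `ParseRows` are hypothetical names chosen for illustration, and the default of 5 skipped lines is taken from the steps above rather than checked against a real export, so adjust it if the file differs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TrendsParser
{
    // Reads tab-separated rows from the reader, skipping a fixed number of leading lines.
    // Assumes values never contain tab characters, as noted above.
    public static List<string[]> ParseRows(TextReader reader, int headerLines = 5)
    {
        var rows = new List<string[]>();
        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (lineNumber <= headerLines) continue; // step 2: skip the leading lines
            if (line.Length == 0) continue;          // ignore blank lines
            rows.Add(line.Split('\t'));              // step 3: split on tab
        }
        return rows;
    }
}
```

For step 1, something like `using (var reader = new StreamReader("trends.csv", true)) { var rows = TrendsParser.ParseRows(reader); }` would open the file and hand it to the parser; taking a `TextReader` rather than a path keeps the parsing logic easy to exercise with a `StringReader`.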
