Parsing a very busy space-delimited file

Posted on 2024-12-23 02:10:42


I'm trying to help my dad out -- he gave me an export from a scheduling application at his work. We are trying to see if we can import it into a MySQL database so that he and his co-workers can collaborate on it online.

I've tried a number of different methods but none seem to work right -- and this is not my area of expertise.

Export can be seen here: http://roikingon.com/export.txt

Any help / advice on how to go about parsing this would be greatly appreciated!

Thanks !!

Comments (4)

迷迭香的记忆 2024-12-30 02:10:42


I've made an attempt to write a (somewhat dynamic) fixed-width-column parser. Take a look: http://codepad.org/oAiKD0e7 (it's too long for SO, but it's mostly just "data").

What I've noticed:

  • Text data is left-aligned with padding on the right, like "hello___" (_ = space)
  • Numerical data is right-aligned with padding on the left, like "___42"

If you want to use my code, there is still work to do:

  • The record types 12.x have a variable column count (after some static columns), so you'd have to implement another "handler" for them
  • Some of my widths are most probably wrong. I think there is a system (like numbers are 4 characters long and text is 8 characters long, with some variations for special cases). Someone with domain knowledge and more than one sample file could figure out the columns.
  • Getting the raw data out is only the first step; you have to map the raw data to some useful model and write that model to the database.
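To make the fixed-width idea concrete, here is a minimal sketch (not the code from the codepad link). The layout table, its offsets and widths, the function name parseFixedWidth and the sample line are all made up for illustration; the real column widths would have to be worked out from the export.

```php
<?php
// Hypothetical column layout: name => [offset, width]. These numbers are
// illustrative assumptions, not the real layout of the export.
$layout = [
    'type'  => [0, 6],
    'count' => [6, 4],
    'label' => [10, 8],
];

/**
 * Cut one fixed-width line into named fields and trim the padding.
 */
function parseFixedWidth(string $line, array $layout): array
{
    $fields = [];
    foreach ($layout as $name => [$offset, $width]) {
        // trim() removes the right padding of text columns and the
        // left padding of numeric columns alike.
        $fields[$name] = trim(substr($line, $offset, $width));
    }
    return $fields;
}

// Made-up sample line: type "12.1", right-aligned 42, left-aligned "hello".
$line = '12.1    42hello   ';
$record = parseFixedWidth($line, $layout);
print_r($record);
```

Each record type would presumably get its own layout table, which keeps the guessing about widths in one place where it is easy to correct.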
私藏温柔 2024-12-30 02:10:42


With that file structure you basically need to reverse engineer a proprietary format. Yes, it is space-delimited, but the format does not follow any kind of standard like CSV or YAML. It is completely proprietary, with what seems to be a header and separate sections with headers of their own.

I think your best bet is to see if there's some other type of export available, such as Excel or XML, and work from there. If there isn't, see if there's an HTML output of some kind that can be screen scraped and pasted into Excel, and see what you get.

Due to everything I mentioned above, it will be VERY difficult to massage the file in its current form into something that can be sensibly imported into a database. (Note that, judging from the file structure, a number of tables would be needed.)

迷迭香的记忆 2024-12-30 02:10:42


You can use split with a regular expression (one or more spaces).

I will try and let you know.

There doesn't seem to be a structure to your data.

$data = "12.1  0    1144713      751  17  Y   8  517  526  537  542  550  556  561  567                                     17 ";

$arr = preg_split("/ +/", $data);
print_r($arr);

Array
(
    [0] => 12.1
    [1] => 0
    [2] => 1144713
    [3] => 751
    [4] => 17
    [5] => Y
    [6] => 8
    [7] => 517
    [8] => 526
    [9] => 537
    [10] => 542
    [11] => 550
    [12] => 556
    [13] => 561
    [14] => 567
    [15] => 17
    [16] =>
)

Try preg_split("/ +/", $data); which splits the line on runs of one or more spaces and gives you an array that you can process. The empty last element comes from the trailing space in the line. Looking at your data, though, there is no apparent structure, so you will have to know which array element corresponds to which piece of data.
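As a possible next step, the tokens can be given names. The sketch below is a guess, not a known property of the format: it assumes that token 6 (the "8" in the sample line) is a count of the values that follow it, which happens to fit this one line. The keys type, static, values and rest are hypothetical names.

```php
<?php
$data = "12.1  0    1144713      751  17  Y   8  517  526  537  542  550  556  561  567                                     17 ";

// "/ +/" matches runs of one or more spaces; PREG_SPLIT_NO_EMPTY drops the
// empty element that the trailing space would otherwise produce.
$tokens = preg_split('/ +/', $data, -1, PREG_SPLIT_NO_EMPTY);

// Assumption: token 6 is a count of the values that follow it.
$count  = (int) $tokens[6];
$record = [
    'type'   => $tokens[0],
    'static' => array_slice($tokens, 1, 5),
    'values' => array_slice($tokens, 7, $count),
    'rest'   => array_slice($tokens, 7 + $count),
];
print_r($record);
```

If the count assumption fails on other 12.x lines, the mismatch should show up quickly as an unexpected number of elements in rest.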

Good luck.

爱人如己 2024-12-30 02:10:42


Open it with Excel as a space-delimited file, choosing whether or not to treat consecutive delimiters as one. Then save it from Excel as a CSV, which will be comma-separated and easier to import into MySQL.

EDIT:
The guy who says to use preg_split on "/ +/" is giving you essentially the same answer as I just did above.

The question, then, is what to do after that.

Have you determined yet how many "row types" there are? Once you've determined that and defined their characteristics it will be a lot easier to write some code to go through it.
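One way to get a first answer to that question is to tally the leading token of every line. This is only a sketch: it assumes the first token identifies the row type, the function name tallyRowTypes is made up, and the sample lines are invented rather than taken from the real export.

```php
<?php
/**
 * Count how often each leading token (the presumed row type) occurs.
 */
function tallyRowTypes(iterable $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        // Only the first token matters here, so split into at most 2 parts.
        $tokens = preg_split('/ +/', trim($line), 2, PREG_SPLIT_NO_EMPTY);
        if (!$tokens) {
            continue; // skip blank lines
        }
        $type = $tokens[0];
        $counts[$type] = ($counts[$type] ?? 0) + 1;
    }
    ksort($counts, SORT_NATURAL);
    return $counts;
}

// Made-up sample lines; with a local copy of the export you could pass
// file('export.txt') instead (the local file name is an assumption).
$sample = [
    '4   caco  Example Company',
    '12.1  0    1144713',
    '12.1  0    1144714',
    '',
    '12.2  1    17',
];
print_r(tallyRowTypes($sample));
```

The resulting list of types and counts gives a rough inventory of how many handlers, and probably how many tables, are needed.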

If you save it as CSV, you can use the PHP fgetcsv function and related functions. For each row, you would check its type and perform operations depending on the type.

I noticed that your data rows could possibly be divided on whether or not the first column's data contains a ".", so here's an example of how you might loop through the file.

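// $file_handle is an open handle to the CSV file, e.g. from fopen().
while (($row = fgetcsv($file_handle)) !== false) {
    // A blank line yields [null], hence the string cast.
    if (strpos((string) $row[0], '.') === false) {
        // do something: first column has no "."
    } else {
        // do something else: first column contains a "."
    }
}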

"do something" would be something like "CREATE TABLE table_$row[0]" or "INSERT INTO table" etc.
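For the INSERT case, a hedged sketch of building a parameterized statement from one parsed row follows. The function name buildInsert, the table naming scheme (rows_<type>) and the column names (c0, c1, ...) are hypothetical placeholders, not anything taken from the export.

```php
<?php
/**
 * Build a parameterized INSERT for one parsed row. Table and column names
 * (rows_<type>, c0, c1, ...) are hypothetical placeholders.
 */
function buildInsert(array $row): array
{
    // Identifiers cannot be bound as parameters, so reduce the record type
    // to a safe character set before using it in a table name.
    $suffix = preg_replace('/[^A-Za-z0-9]+/', '_', (string) $row[0]);
    $values = array_slice($row, 1);
    $columns = [];
    foreach (array_keys($values) as $i) {
        $columns[] = "c$i";
    }
    $sql = sprintf(
        'INSERT INTO rows_%s (%s) VALUES (%s)',
        $suffix,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($values), '?'))
    );
    return [$sql, $values];
}

[$sql, $params] = buildInsert(['12.1', '0', '1144713']);
echo $sql, "\n";
```

The returned pair could then be handed to PDO::prepare() and PDOStatement::execute(), so that the data values themselves never end up inside the SQL text.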

Ok, and here's some more observation:

Your file is really like multiple files glued together. It contains multiple formats. Notice that every row starting with "4" has a 4-letter company abbreviation next, followed by the full company name. One of them is "caco". If you search for "caco", you find it in multiple "tables" within the file.

I also notice "smuwtfa" (days of the week) sprinkled around.

Use clues like that to determine the logic of how to treat each row.
