Simplifying a website for a platform migration: existing software or a parsing script?
I am migrating a website from one platform to another. One of the requirements is to preserve URLs that may have been bookmarked, which I will do with rewrite rules.
Because the old system is kind of a mess, I need to take special care to make sure that all the links work. Because there are many pages, it's unrealistic to do this by hand, so I'll need to automate the process. There is a main menu at the top, a submenu right below it, and a side menu, but this may not hold for every page.
As a first step, I'd like to parse the site and generate a simplified version of it. In this simplified version, I only care about links.
So what I would like to do is this: parse each page and throw out most of the HTML, keeping only the links (internal or external). If a set of links all live within a particular HTML tag (say, a <ul> that acts as a menu, or a <div> that acts as a content area), I would like to preserve that nesting of HTML tags.
Basically what I want to end up with is this:
index.html
<html>
  <body>
    <tag>
      <a href='page2.html'>Menu Item 1</a>
      <a href='page3.html'>Menu Item 2</a>
      <a href='page4.html'>Menu Item 3</a>
    </tag>
    <tag>
      <a href='page5.html'>SubMenu Item 4</a>
      <a href='page6.html'>SubMenu Item 5</a>
      <a href='page7.html'>SubMenu Item 6</a>
    </tag>
    <tag>
      <a href='page8.html'>Side Menu Item 1</a>
      <a href='page9.html'>Side Menu Item 2</a>
      <a href='page10.html'>Side Menu Item 3</a>
    </tag>
    <tag>
      <a href='site.com'>Content External link</a>
      <a href='about_us.html'>Content Internal Link</a>
    </tag>
  </body>
</html>
Where <tag> could be any block-style HTML tag; it doesn't actually have to be the first tag that the links share.
The script/program doesn't have to be smart enough to know "This is a menu" or "This is a navigation panel"; so long as it can group links in the first html tag they all exist in, that will be good enough.
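To make the grouping idea concrete, here is a minimal sketch of one way it might be done, assuming Python and its standard-library `html.parser`. The names `LinkGrouper`, `simplify`, and `BLOCK_TAGS` are made up for illustration, and the set of tags treated as "block-style" is an assumption that would need tuning for the real site. It groups each link under its nearest enclosing block tag, which is a simplification of "the first tag they all share".

```python
from html import escape
from html.parser import HTMLParser

# Assumption: which tags count as "block-style" containers for grouping.
BLOCK_TAGS = {"div", "ul", "ol", "nav", "table", "section",
              "header", "footer", "aside"}


class LinkGrouper(HTMLParser):
    """Collect links, grouped by their nearest enclosing block tag."""

    def __init__(self):
        super().__init__()
        # Group 0 (tag None) holds links that sit outside every block tag.
        self.groups = [{"tag": None, "links": []}]
        self.stack = [0]     # indexes into self.groups for open block tags
        self.current = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.groups.append({"tag": tag, "links": []})
            self.stack.append(len(self.groups) - 1)
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.current = {"group": self.stack[-1],
                                "href": href, "text": ""}

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            link = self.current
            self.current = None
            self.groups[link["group"]]["links"].append(
                (link["href"], link["text"].strip()))
        elif tag in BLOCK_TAGS:
            # Close the innermost matching block tag; this also tolerates
            # sloppy markup where inner tags were never closed.
            for i in range(len(self.stack) - 1, 0, -1):
                if self.groups[self.stack[i]]["tag"] == tag:
                    del self.stack[i:]
                    break


def simplify(html_text):
    """Return a links-only version of the page, one wrapper per group."""
    parser = LinkGrouper()
    parser.feed(html_text)
    parser.close()
    lines = ["<html>", "<body>"]
    for group in parser.groups:
        if not group["links"]:
            continue  # drop containers that hold no links directly
        if group["tag"]:
            lines.append(f"<{group['tag']}>")
        for href, text in group["links"]:
            lines.append(f"<a href='{escape(href)}'>{escape(text)}</a>")
        if group["tag"]:
            lines.append(f"</{group['tag']}>")
    lines += ["</body>", "</html>"]
    return "\n".join(lines)


sample = """
<html><body>
<div id="header"><ul class="menu">
<li><a href="page2.html">Menu Item 1</a></li>
<li><a href="page3.html">Menu Item 2</a></li>
</ul></div>
<div class="content"><p>Some text
<a href="about_us.html">Content Internal Link</a></p></div>
</body></html>
"""
print(simplify(sample))
```

In this sketch the output is flat: each group is written with its own wrapper tag and the wrappers are not nested inside one another, which matches the target layout shown earlier. Preserving deeper nesting would need a small tree instead of a flat list of groups.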
Is there a script or piece of software out there that already does this? Or will I have to write my own?
If I write my own, how should I do the HTML parsing? I've heard that regexes aren't the answer, since they can't track state and therefore cannot know about nested tag structure.
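For comparison with the regex concern, here is a small sketch of how an event-based parser can track nesting with an explicit stack. It again assumes Python's standard-library `html.parser`; `LinkPathParser` and `VOID_TAGS` are illustrative names, not an existing tool. For every link it records the chain of open tags leading to it, which is exactly the state a single regex does not keep.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkPathParser(HTMLParser):
    """Record, for each link, the path of open tags that contains it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.link_paths = []  # (path, href) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.open_tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.link_paths.append((" > ".join(self.open_tags), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


parser = LinkPathParser()
parser.feed('<div><ul><li><a href="a.html">A</a></li></ul>'
            '<p>text<br><a href="c.html">C</a></p></div>')
parser.close()
for path, href in parser.link_paths:
    print(path, "->", href)
```

Unclosed tags in messy markup (for example a `<p>` with no `</p>`) will stay on the stack in this sketch, so paths on a real site may be longer than expected; a more forgiving HTML parser library may cope better with that.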