Using string manipulation to solve directory separator madness?
I'm working on converting a website. It involves standardizing the directory structure of images and media files. I'm parsing path information from various tags, standardizing it, checking whether the media exists in the new standardized location, and putting it there if it doesn't. I'm using string manipulation to do so.
This is a little open-ended, but is there a class, tool, or concept out there I can use to save myself some headaches? For instance, I'm running into problems where, say, a page in a subdirectory (website.com/subdir/dir/page.php) has relative image paths (../images/image.png), or other things like this. It's not that there's one overarching problem, just a lot of little things that add up.
When I think I've got my script covering most cases, I get errors like "Could not find file at export/standardized_folder/proper_image_folderimage.png" where it should be export/standardized_folder/proper_image_folder/image.png. It's kind of driving me mad, doing string parsing and checks to make sure that directory separators are in the proper places.
I feel like I'm putting too much work into making a one-off import script very robust. Perhaps someone has already untangled this mess in a reusable way that I can take advantage of?
Post Script: So here's a more in-depth scoop. I write my script to parse one "type" of page and pull content from pages of that kind. Then I turn my script to parse another type of page, get all kinds of errors, and learn that all my assumptions about how paths are referenced must be thrown out the window. Wash, rinse, repeat.
So I'm looking at doing some major refactoring of my script, throwing out all assumptions, and checking, re-checking, and double-checking path information. Since I'm really trying to build a robust path-building script, hopefully I can avoid reinventing the wheel. Is there a wheel out there?
Comments (2)
If the root of your problem is resolving the relative links in a document into absolute ones (which should be half the job of mapping the linked image paths onto the file system), I normally use Net_URL2 from PEAR. It's a simple class that just does the job. It is installed with the pear installer, run as root. Even though it's a beta package, it's really stable.
A little example: say you have an array of all the image src values in question, plus a base URL for the document. Resolving each src against the base URL converts any relative link into an absolute one. The base URL is, first of all, the document's address. The document can override it by specifying another one with the base element, so you could look that up with the HTML parser you're already using (along with the src and href values). Net_URL2 follows the current RFC 3986 for URL resolving.

Another thing that might be handy for your URL handling is the getNormalizedURL function. It removes some potential error cases, such as needless dot segments, which is useful if you need to compare one URL with another and, naturally, for mapping the URL to a path.

Since you can resolve all URLs to absolute ones and get them normalized, you can decide whether or not they are relevant to your site. As long as the URL is still a Net_URL2 instance, you can use one of its many functions to do that. What is left is the concrete path to the file in the URL.
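To make the resolve, normalize, and filter steps concrete, here is a minimal sketch in Python's standard library. It is a stand-in for the idea, not Net_URL2 itself (which is PHP): urljoin does the RFC 3986 style resolution, and posixpath.normpath is used here as an assumed substitute for URL normalization of the path. The function name resolve_image_paths and the sample URLs are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def resolve_image_paths(base_url, srcs):
    """Map each same-site src to its normalized absolute URL path.

    Hypothetical helper: links pointing at another host are skipped.
    """
    site_host = urlsplit(base_url).hostname
    paths = {}
    for src in srcs:
        # Resolve the (possibly relative) reference against the base URL.
        absolute = urljoin(base_url, src)
        parts = urlsplit(absolute)
        if parts.hostname != site_host:
            continue  # not a file on this site
        # Collapse leftover "." and ".." segments in the path.
        paths[src] = posixpath.normpath(parts.path or "/")
    return paths


# Assumed sample data, modeled on the paths in the question.
base_url = "http://website.com/subdir/dir/page.php"
srcs = [
    "../images/image.png",
    "/media/logo.png",
    "thumb.png",
    "http://website.com/a/./b/../c.png",
    "http://other.example/pic.jpg",
]

for src, path in resolve_image_paths(base_url, srcs).items():
    print(src, "->", path)
```

Every path that comes out starts with a slash and contains no dot segments, so the later step of prefixing a base directory does not need any guessing about separators.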
That path, considering you're comparing against a UNIX file system, should be easy to prefix with a concrete base directory. If you have problems combining the base path with the image path, keep in mind that the image path will always have a slash at the beginning.
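As a rough illustration of that last step, the sketch below joins a base directory and a URL path so that exactly one separator ends up between them, whichever side happens to carry a slash. It is written in Python for illustration only; to_export_path is a hypothetical helper name, and the directory names are taken from the error message in the question.

```python
import posixpath


def to_export_path(base_dir, url_path):
    """Prefix a URL path with a base directory using exactly one separator.

    Hypothetical helper: the URL path is forced to be rooted and then
    normalized, so ".." segments cannot climb above base_dir.
    """
    rooted = posixpath.normpath("/" + url_path.lstrip("/"))
    return base_dir.rstrip("/") + rooted


base = "export/standardized_folder/proper_image_folder"
print(to_export_path(base, "image.png"))
print(to_export_path(base + "/", "/image.png"))
```

Both calls give the same result, which is the point: the caller no longer has to remember whether the base directory ends with a slash or the image path starts with one.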
Hope this helps.
Truepath() to the rescue! No, you shouldn't use realpath() (see why).