duanjiani6826 2011-09-16 15:05

Untangling directory separator madness with string manipulation?

I'm working on converting a website. It involves standardizing the directory structure of images and media files. I'm parsing path information from various tags, standardizing it, checking whether the media exists in the new standardized location, and putting it there if it doesn't. I'm using string manipulation to do so.

This is a little open-ended, but is there a class, tool, or concept out there I can use to save myself some headaches? For instance, I'm running into problems where, say, a page in a subdirectory (website.com/subdir/dir/page.php) has relative image paths (../images/image.png), and other things like this. It's not that there's one overarching problem, just a lot of little things that add up.

Just when I think my script covers most cases, I get errors like "Could not find file at export/standardized_folder/proper_image_folderimage.png" where it should be export/standardized_folder/proper_image_folder/image.png. It's driving me mad, doing string parsing and checks to make sure that directory separators are in the proper places.

I feel like I'm putting too much work into making a one-off import script very robust. Perhaps someone has already untangled this mess in a reusable way that I can take advantage of?

Post Script: So here's a more in-depth scoop. I write my script to parse one "type" of page and pull content from pages of the same kind. Then I turn my script to another type of page, get all kinds of errors, and learn that all my assumptions about how paths are referenced must be thrown out the window. Wash, rinse, repeat.

So I'm looking at doing some major refactoring of my script, throwing out all assumptions, and checking, re-checking, and double-checking path information. Since I'm really trying to build a robust path-building script, hopefully I can avoid reinventing the wheel. Is there a wheel out there?


  • dth42345 2011-09-16 16:35

    If your problems are rooted in resolving the relative links in a document to absolute ones (which should be half the job of mapping the linked image paths onto the file system), I normally use Net_URL2 from PEAR. It's a simple class that just does the job.

    To install it, run as root:

    # pear install channel://pear.php.net/Net_URL2-0.3.1
    

    Even though it's a beta package, it's really stable.

    A small example: let's say there is an array with all the image srcs in question and a base URL for the document:

    require_once('Net/URL2.php');
    
    $baseUrl = 'http://www.example.com/test/images.html';
    
    $docSrcs = array(...);
    
    $baseUrl = new Net_URL2($baseUrl);
    
    foreach($docSrcs as $href)
    {
        $url = $baseUrl->resolve($href);
        echo ' * ', $href, ' -> ', $url->getURL(), "\n";
        // or
        echo " $href -> $url\n"; # Net_URL2 supports string context
    }
    

    This converts any relative links into absolute ones based on your base URL. By default, the base URL is the document's own address. The document can override it by specifying another one with the base element, so you could look that up with the HTML parser you're already using (along with the src and href values).
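
    If you happen to be parsing with PHP's bundled DOM extension, the base element lookup could be sketched roughly as below. The function name findBaseUrl and its parameters are made up for illustration, and the sketch assumes PHP 7 or later; adapt it to whatever parser you actually use.

```php
<?php
// Sketch only: findBaseUrl() is a hypothetical helper, not part of Net_URL2.
// Returns the href of the first <base> element, or the document's own URL
// when the document does not declare one.
function findBaseUrl(string $html, string $documentUrl): string
{
    if (trim($html) === '') {
        return $documentUrl; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('base') as $base) {
        $href = trim($base->getAttribute('href'));
        if ($href !== '') {
            // The href may itself be relative; resolve it against
            // $documentUrl before using it as the base.
            return $href;
        }
    }

    return $documentUrl;
}
```

    The returned value would then be what you pass to new Net_URL2(...) in place of the plain document address.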

    Net_URL2 follows the current RFC 3986 when resolving URLs.

    Another thing that might be handy for your URL handling is the getNormalizedURL method. It removes some potential error cases, such as needless dot segments, which is useful if you need to compare one URL with another and, naturally, for mapping the URL to a path:

    foreach($docSrcs as $href)
    {
        $url = $baseUrl->resolve($href);
        $url = $url->getNormalizedURL();
        echo " $href -> $url\n";
    }
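
    To make "needless dot segments" concrete, here is a rough standalone sketch of the idea behind RFC 3986's dot-segment removal. It is not Net_URL2's implementation; the function name removeDotSegments is invented for this example, and it assumes the input is an absolute path (one starting with a slash), which is what you get after resolving against a base URL.

```php
<?php
// Sketch only: hypothetical helper illustrating dot-segment removal.
// Assumes $path is absolute, i.e. it starts with "/".
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $last = count($segments) - 1;
    $output = array();

    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            // ".." drops the previous segment, but never climbs above the root
            // (the leading empty string in $output stands for the root).
            if ($segment === '..' && count($output) > 1) {
                array_pop($output);
            }
            // A trailing "." or ".." leaves a trailing slash behind.
            if ($i === $last) {
                $output[] = '';
            }
            continue;
        }
        $output[] = $segment;
    }

    return implode('/', $output);
}

echo removeDotSegments('/test/../images/./image.png'), "\n"; // /images/image.png
```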
    

    Since you can resolve all URLs to absolute ones and get them normalized, you can decide whether or not they are relevant to your site. As long as the URL is still a Net_URL2 instance, you can use one of its many methods to do that:

    $host = strtolower($url->getHost());
    if (in_array($host, array('example.com', 'www.example.com')))
    {
        # URL is on my server, process it further
    }
    

    What's left is the concrete path to the file in the URL:

    $path = $url->getPath();
    

    That path, considering you're comparing against a UNIX file-system, should be easy to prefix with a concrete base directory:

    $filesystemImagePath = '/var/www/site-new/images';
    $newPath = $filesystemImagePath . $path;
    if (is_file($newPath))
    {
        # new image already exists.
    }
    

    If you have trouble combining the base path with the image path, remember that the image path will always have a slash at the beginning.
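
    For the missing-separator errors from the question (proper_image_folderimage.png), one option is to stop caring whether either side carries a slash and normalize at the point where the two parts are joined. A minimal sketch, with joinPath being a made-up helper name and UNIX-style separators assumed:

```php
<?php
// Sketch only: joinPath() is a hypothetical helper.
// Joins a base directory and a path with exactly one "/" between them,
// regardless of trailing or leading slashes on either part.
function joinPath(string $base, string $path): string
{
    return rtrim($base, '/') . '/' . ltrim($path, '/');
}

echo joinPath('export/standardized_folder/proper_image_folder', 'image.png'), "\n";
// export/standardized_folder/proper_image_folder/image.png
echo joinPath('/var/www/site-new/images/', '/test/image.png'), "\n";
// /var/www/site-new/images/test/image.png
```

    Note that an empty base yields an absolute path ("/image.png"), so check for that case separately if it can occur in your data.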

    Hope this helps.

    This answer was accepted by the asker as the best answer.