dounuo1881 2015-03-14 12:58

PHP script to check the SHA1 or MD5 hashes of all files in a directory against checksums scraped from an XML file; recursive, loop

I've done a bulk download from archive.org using wget, which was set up to pull all the files for each IDENTIFIER into their respective folders.

wget -r -H -nc -np -nH -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'

This results in folders organised like so under the root, for example:

./IDENTIFIER1/file.blah
./IDENTIFIER1/something.la
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb001.gif
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb002.gif
./IDENTIFIER1/IDENTIFIER_files.xml

./IDENTIFIER2/etc.etc
./IDENTIFIER2/blah.blah
./IDENTIFIER2/thumbnails/IDENTIFIER_thumb001.gif

 etc

The IDENTIFIER is the name of a collection of files on archive.org; hence, in each folder there is also a file called IDENTIFIER_files.xml, which contains the checksums for every file in that folder, wrapped in various XML tags.

Since this is a bulk download and there are hundreds of files, the idea is to write some sort of script (preferably bash? Edit: Maybe PHP?) that can pick up each .xml file, scrape it for the hashes, and test them against the files to reveal any corrupted, failed or modified downloads.

For example:

For archive.org/details/NuclearExplosion, the XML is:

https://archive.org/download/NuclearExplosion/NuclearExplosion_files.xml

If you check that link, you can see that the XML offers both MD5 and SHA1 hashes, as well as the relative path of each file in its file tag (which will be the same as the local path).
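
The relevant part of the structure looks roughly like this (the file names here are only illustrative, and the hash values and the other per-file elements are omitted):

    <files>
      <file name="somefile.mpeg">
        <md5>...32 hex characters...</md5>
        <sha1>...40 hex characters...</sha1>
      </file>
      <file name="thumbnails/somefile_thumb001.gif">
        ...
      </file>
    </files>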

So. How do we:

  1. For each folder of IDENTIFIER, select and scrape the XML for each filename and the checksum of choice;

  2. Actually test the checksum for each file;

  3. Log the failed checksums to a file that lists only the failed IDENTIFIERs (say a file called ./RetryIDs.txt, for example), so a download reattempt can be made using that list...

    wget -r -H -nc -np -nH -e robots=off -l1 -i ./RetryIDs.txt -B 'http://archive.org/download/'
    

Any leads on how to piece this together would be extremely helpful.

And another added incentive: if there is a solution, it's probably a good idea to let archive.org know so they can put it on their blog. I'm sure I'm not the only one who will find this very useful!

Thanks all in advance.


Edit: Okay, so a bash script looks tricky. Could it be done with PHP?


1 answer

  • dsvbtgo639708 2015-03-14 13:52

    If you really want to go the bash route, here's something to get you started. You can use the xml2 suite of tools to convert the XML into something more amenable to traditional shell scripting, and then do something like this:

    #!/bin/sh
    # Takes the path of an IDENTIFIER_files.xml as its only argument.
    # xml2 flattens the XML into "path=value" lines; awk then pairs each
    # file's name attribute with the sha1 element that follows it.
    xml2 < "$1" | awk -F= '
        $1 == "/files/file/@name" {name=$2}
        $1 == "/files/file/sha1" {
            sha1=$2
            print name, sha1
        }
    '
    

    This will produce on standard output a list of filenames and their corresponding SHA1 checksums. That should get you substantially closer to a solution.

    Actually using that output to validate the files is left as an exercise to the reader.
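
    If it helps with that exercise, here is one rough, untested way to wire the same idea into a full pass over the download root: print the hashes in the "HASH  FILENAME" shape that sha1sum -c expects, run the check inside each IDENTIFIER folder, and append any IDENTIFIER that fails to ./RetryIDs.txt (the file name is just the one suggested in the question).

    #!/bin/sh
    # Rough sketch: run from the download root that holds the IDENTIFIER
    # folders; assumes xml2, awk and GNU coreutils sha1sum are available.
    : > ./RetryIDs.txt                      # start with an empty retry list
    for dir in ./*/; do
        id=$(basename "$dir")               # IDENTIFIER
        [ -f "${dir}${id}_files.xml" ] || continue   # no _files.xml, skip
        # Print "sha1  relative/file/name" lines and let sha1sum verify them,
        # running inside the folder so the relative paths line up.
        if ! (
            cd "$dir" || exit 1
            xml2 < "${id}_files.xml" | awk -F= '
                $1 == "/files/file/@name" {name=$2}
                $1 == "/files/file/sha1"  {print $2 "  " name}
            ' | sha1sum -c --quiet -
        ); then
            echo "$id" >> ./RetryIDs.txt    # remember the failed IDENTIFIER
        fi
    done

    One caveat: the _files.xml lists everything the item has, so if some files were deliberately not downloaded, sha1sum will report them as missing and that IDENTIFIER will end up in RetryIDs.txt even though nothing is corrupt. On the plus side, RetryIDs.txt comes out one IDENTIFIER per line, so it should slot straight into the retry wget command from the question.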

