dounuo1881 2015-03-14 12:58

PHP script to check the SHA1 or MD5 hashes of all files in a directory against checksums scraped from an XML file; recursive, loop

I've done a bulk download from archive.org using wget, which was set up to pull all the files for each IDENTIFIER into their respective folders.

wget -r -H -nc -np -nH -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'

This results in folders organised like so under the root, for example:

./IDENTIFIER1/file.blah
./IDENTIFIER1/something.la
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb001.gif
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb002.gif
./IDENTIFIER1/IDENTIFIER_files.xml

./IDENTIFIER2/etc.etc
./IDENTIFIER2/blah.blah
./IDENTIFIER2/thumbnails/IDENTIFIER_thumb001.gif

 etc

The IDENTIFIER is the name of a collection of files on archive.org; hence, in each folder there is also a file called IDENTIFIER_files.xml, which contains the checksums for every file in that folder, wrapped in various XML tags.

Since this is a bulk download and there are hundreds of files, the idea is to write some sort of script (preferably bash? Edit: Maybe PHP?) that can pick up each .xml file, scrape it for the hashes, and test them against the files to reveal any corrupted, failed or modified downloads.

For example:

For archive.org/details/NuclearExplosion, the XML is:

https://archive.org/download/NuclearExplosion/NuclearExplosion_files.xml

If you check that link, you can see that the XML offers both MD5 and SHA1 hashes, as well as the relative path of each file in its file tag (which will be the same as the local path).
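
The relevant part of the structure looks roughly like this (the file names here are only illustrative, and the hash values and the other per-file elements are omitted):

    <files>
      <file name="somefile.mpeg">
        <md5>...32 hex characters...</md5>
        <sha1>...40 hex characters...</sha1>
      </file>
      <file name="thumbnails/somefile_thumb001.gif">
        ...
      </file>
    </files>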

So. How do we:

  1. For each folder of IDENTIFIER, select and scrape the XML for each filename and the checksum of choice;

  2. Actually test the checksum for each file;

  3. Log the failed checksums to a file that lists only the failed IDENTIFIERs (say a file called ./RetryIDs.txt, for example), so a download reattempt can be made using that list...

    wget -r -H -nc -np -nH -e robots=off -l1 -i ./RetryIDs.txt -B 'http://archive.org/download/'
    

Any leads on how to piece this together would be extremely helpful.

And another added incentive: if there is a solution, it's probably a good idea to let archive.org know so they can put it on their blog. I'm sure I'm not the only one who will find this very useful!

Thanks all in advance.


Edit: Okay, so a bash script looks tricky. Could it be done with PHP?


1 answer

  • dsvbtgo639708 2015-03-14 13:52

    If you really want to go the bash route, here's something to get you started. You can use the xml2 suite of tools to convert the XML into something more amenable to traditional shell scripting, and then do something like this:

    #!/bin/sh
    # Takes the path of an IDENTIFIER_files.xml as its only argument.
    # xml2 flattens the XML into "path=value" lines; awk then pairs each
    # file's name attribute with the sha1 element that follows it.
    xml2 < "$1" | awk -F= '
        $1 == "/files/file/@name" {name=$2}
        $1 == "/files/file/sha1" {
            sha1=$2
            print name, sha1
        }
    '
    

    This will produce on standard output a list of filenames and their corresponding SHA1 checksums. That should get you substantially closer to a solution.

    Actually using that output to validate the files is left as an exercise to the reader.
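
    If it helps with that exercise, here is one rough, untested way to wire the same idea into a full pass over the download root: print the hashes in the "HASH  FILENAME" shape that sha1sum -c expects, run the check inside each IDENTIFIER folder, and append any IDENTIFIER that fails to ./RetryIDs.txt (the file name is just the one suggested in the question).

    #!/bin/sh
    # Rough sketch: run from the download root that holds the IDENTIFIER
    # folders; assumes xml2, awk and GNU coreutils sha1sum are available.
    : > ./RetryIDs.txt                      # start with an empty retry list
    for dir in ./*/; do
        id=$(basename "$dir")               # IDENTIFIER
        [ -f "${dir}${id}_files.xml" ] || continue   # no _files.xml, skip
        # Print "sha1  relative/file/name" lines and let sha1sum verify them,
        # running inside the folder so the relative paths line up.
        if ! (
            cd "$dir" || exit 1
            xml2 < "${id}_files.xml" | awk -F= '
                $1 == "/files/file/@name" {name=$2}
                $1 == "/files/file/sha1"  {print $2 "  " name}
            ' | sha1sum -c --quiet -
        ); then
            echo "$id" >> ./RetryIDs.txt    # remember the failed IDENTIFIER
        fi
    done

    One caveat: the _files.xml lists everything the item has, so if some files were deliberately not downloaded, sha1sum will report them as missing and that IDENTIFIER will end up in RetryIDs.txt even though nothing is corrupt. On the plus side, RetryIDs.txt comes out one IDENTIFIER per line, so it should slot straight into the retry wget command from the question.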

