drsc10888 2012-04-18 17:44
74 views
Accepted

Download a large XML file from an external source in the background, resuming the download if it is incomplete

Some background information

The files I would like to download are kept on the external server for a week, and a new XML file (10-50 MB) is created there every hour with a different name. I would like the large file to be downloaded to my server chunk by chunk in the background each time my website is loaded, perhaps 0.5 MB each time, and then have the download resume the next time someone else loads the website. This would require my site to have at least 100 page loads each hour to stay updated, so perhaps a bit more of the file each time if possible. I have researched SimpleXML, XMLReader, and SAX parsing, but whatever I do, it seems to take too long to parse the file directly, so I would like a different approach, namely downloading it as described above.

If I download a 30 MB XML file, I can parse it locally with XMLReader in only 3 seconds (250k iterations), but when I try to do the same from the external server, limiting it to 50k iterations, it takes 15 seconds to read just that small part, so it seems it is not possible to parse it directly from that server.
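
For reference, a minimal sketch of the kind of local streaming parse described above, assuming a hypothetical items.xml made of repeated <item> elements (the filename and element name are placeholders, not the actual feed structure):

<?php
// Stream through a large local XML file with XMLReader so memory use
// stays flat. "items.xml" and <item> are placeholder assumptions.
$reader = new XMLReader();
$reader->open('items.xml');
$doc = new DOMDocument();

// Skip ahead to the first <item> element
while ($reader->read() && $reader->name !== 'item') { }

$count = 0;
while ($reader->name === 'item') {
    // Expand only the current node into a SimpleXML fragment
    $node = simplexml_import_dom($doc->importNode($reader->expand(), true));
    // ... insert the fields of $node into MySQL here ...
    $count++;
    $reader->next('item'); // jump straight to the next sibling <item>
}
$reader->close();
echo "Parsed $count items\n";
?>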

Possible solutions

I think cURL is probably the best fit, but then again, perhaps fopen(), fsockopen(), copy() or file_get_contents() are the way to go. I'm looking for advice on which functions to use to make this happen, or for different solutions on how I can parse a 50 MB external XML file into a MySQL database.
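
To illustrate the chunk-by-chunk idea, here is a minimal sketch of a resumable download using cURL byte ranges. It assumes the remote server honors HTTP Range requests; the URL, local filename, and 0.5 MB chunk size are placeholders:

<?php
// Download the next 0.5 MB chunk of a remote file, resuming from
// wherever the previous page load left off. The URL and filename
// are placeholder assumptions.
$url   = 'http://www.external-site-example.com/2012/04/18/12.xml';
$local = 'partial_download.xml';
$chunk = 512 * 1024;

$start = file_exists($local) ? filesize($local) : 0;
$end   = $start + $chunk - 1;

$fp = fopen($local, 'ab');            // append the next chunk
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FILE, $fp);  // write the response body to $fp
curl_setopt($ch, CURLOPT_RANGE, $start . '-' . $end);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_exec($ch);

$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
fclose($fp);

if ($status === 206) {
    echo "Fetched bytes $start-$end\n";    // partial content delivered
} elseif ($status === 416) {
    echo "Range past end of file: done\n"; // nothing left to download
}
?>

Note that this only works if the server supports Range requests; if it ignores them and always returns the whole file, the chunking scheme falls apart.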

I suspect a cron job every hour would be the best solution, but I am not sure how well that is supported by web hosting companies, and I have no clue how to set something like that up. But if that's the best solution, and the majority thinks so, I will have to do my research in that area too.
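
For what it's worth, a cron job is usually just one line in a crontab (edited with crontab -e); assuming the download script lives at a hypothetical /path/to/fetch_xml.php, an hourly entry would look like:

# Run the fetch script at the top of every hour (paths are placeholders)
0 * * * * php /path/to/fetch_xml.php >> /var/log/fetch_xml.log 2>&1

Many shared hosts expose this through a control-panel scheduler even when shell access is limited.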

If a Java applet or JavaScript running in the background would be a better solution, please point me in the right direction when it comes to functions/methods/libraries there as well.

Summary

  • What's the best solution to downloading parts of a file in the background, and resuming the download each time my website is loaded until it's completed?
  • If the above solution would be moronic to even try, what language/software would you use to achieve the same thing (download a large file every hour)?

Thanks in advance for all answers, and sorry for the long story/question.

Edit: I ended up using this solution to get the files, with a cron job scheduling a PHP script. It checks my folder for the files I already have, generates a list of the possible downloads for the last four days, then downloads the next XML file in line.

<?php
$date = new DateTime();
$current_time = $date->getTimestamp();
$four_days_ago = $current_time - 345600; // 4 days * 24 h * 3600 s

echo 'Downloading: ' . "\n";
for ($i = $four_days_ago; $i <= $current_time; $i += 3600) {
    $date->setTimestamp($i);

    // Files for hour 00 are skipped on purpose
    if ($date->format('H') !== '00') {
        $temp_filename = $date->format('Y_m_d_H') . "_full.xml";
        if (!glob($temp_filename)) {
            $temp_url = 'http://www.external-site-example.com/' . $date->format('Y/m/d/H') . ".xml";
            echo $temp_filename . ' --- ' . $temp_url . '<br>' . "\n";
            break; // with a break here, the loop only yields the next file to download
        }
    }
}

// Nothing missing: every file from the last four days is already here
if (!isset($temp_url)) {
    exit('Nothing to download.');
}

set_time_limit(300);
$Start = getTime();

$objInputStream = fopen($temp_url, "rb");
$objTempStream  = fopen($temp_filename, "w+b");

// Copy the remote stream into the local file (cap of ~200 MB)
stream_copy_to_stream($objInputStream, $objTempStream, 1024 * 200000);

fclose($objInputStream);
fclose($objTempStream);

$End = getTime();
echo '<br>It took ' . number_format($End - $Start, 2) . ' secs to download "' . $temp_filename . '".';

// Current Unix time as a float (seconds plus microseconds)
function getTime() {
    $a = explode(' ', microtime());
    return (double) $a[0] + $a[1];
}
?>

Edit 2: I just wanted to let you know that there is a way to do what I asked; it just wouldn't work in my case. With the amount of data I need, the website would have to have 400+ visitors an hour for it to work properly. But with smaller amounts of data there are some options: http://www.google.no/search?q=poormanscron


2 answers

  • dongmi1941 2012-04-18 20:52

    You need to have a scheduled, offline task (e.g., a cron job). The solution you are pursuing is just plain wrong.

    The simplest thing that could possibly work is a PHP script you run every hour (scheduled via cron, most likely) that downloads the file and processes it.

    This answer was accepted by the asker.
