drsc10888 2012-04-18 17:44
74 views
Accepted

Download a large XML file from an external source in the background, resuming the download if it is incomplete

Some background information

The files I would like to download are kept on the external server for a week, and a new XML file (10-50 MB) is created there every hour with a different name. I would like the large file to be downloaded to my server chunk by chunk in the background each time my website is loaded, perhaps 0.5 MB each time, and then have the download resume the next time someone else loads the website. This would require my site to have at least 100 page loads each hour to stay updated, so perhaps a bit more of the file each time if possible. I have researched SimpleXML, XMLReader, and SAX parsing, but whatever I do, it seems to take too long to parse the file directly, so I would like a different approach, namely downloading it as described above.

If I download a 30 MB XML file, I can parse it locally with XMLReader in only 3 seconds (250k iterations), but when I try to do the same from the external server, limiting it to 50k iterations, it takes 15 seconds to read just that small part, so it seems it is not possible to parse it directly from that server.
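
For reference, a minimal sketch of the kind of local streaming parse described above, assuming a hypothetical items.xml made of repeated <item> elements (the filename and element name are placeholders, not the actual feed structure):

<?php
// Stream through a large local XML file with XMLReader so memory use
// stays flat. "items.xml" and <item> are placeholder assumptions.
$reader = new XMLReader();
$reader->open('items.xml');
$doc = new DOMDocument();

// Skip ahead to the first <item> element
while ($reader->read() && $reader->name !== 'item') { }

$count = 0;
while ($reader->name === 'item') {
    // Expand only the current node into a SimpleXML fragment
    $node = simplexml_import_dom($doc->importNode($reader->expand(), true));
    // ... insert the fields of $node into MySQL here ...
    $count++;
    $reader->next('item'); // jump straight to the next sibling <item>
}
$reader->close();
echo "Parsed $count items\n";
?>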

Possible solutions

I think cURL is probably the best fit, but then again, perhaps fopen(), fsockopen(), copy() or file_get_contents() are the way to go. I'm looking for advice on which functions to use to make this happen, or for different solutions on how I can parse a 50 MB external XML file into a MySQL database.
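
To illustrate the chunk-by-chunk idea, here is a minimal sketch of a resumable download using cURL byte ranges. It assumes the remote server honors HTTP Range requests; the URL, local filename, and 0.5 MB chunk size are placeholders:

<?php
// Download the next 0.5 MB chunk of a remote file, resuming from
// wherever the previous page load left off. The URL and filename
// are placeholder assumptions.
$url   = 'http://www.external-site-example.com/2012/04/18/12.xml';
$local = 'partial_download.xml';
$chunk = 512 * 1024;

$start = file_exists($local) ? filesize($local) : 0;
$end   = $start + $chunk - 1;

$fp = fopen($local, 'ab');            // append the next chunk
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FILE, $fp);  // write the response body to $fp
curl_setopt($ch, CURLOPT_RANGE, $start . '-' . $end);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_exec($ch);

$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
fclose($fp);

if ($status === 206) {
    echo "Fetched bytes $start-$end\n";    // partial content delivered
} elseif ($status === 416) {
    echo "Range past end of file: done\n"; // nothing left to download
}
?>

Note that this only works if the server supports Range requests; if it ignores them and always returns the whole file, the chunking scheme falls apart.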

I suspect a cron job every hour would be the best solution, but I am not sure how well that is supported by web hosting companies, and I have no clue how to set something like that up. But if that's the best solution, and the majority thinks so, I will have to do my research in that area too.
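
For what it's worth, a cron job is usually just one line in a crontab (edited with crontab -e); assuming the download script lives at a hypothetical /path/to/fetch_xml.php, an hourly entry would look like:

# Run the fetch script at the top of every hour (paths are placeholders)
0 * * * * php /path/to/fetch_xml.php >> /var/log/fetch_xml.log 2>&1

Many shared hosts expose this through a control-panel scheduler even when shell access is limited.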

If a Java applet or JavaScript running in the background would be a better solution, please point me in the right direction when it comes to functions/methods/libraries there as well.

Summary

  • What's the best solution to downloading parts of a file in the background, and resuming the download each time my website is loaded until it's completed?
  • If the above solution would be moronic to even try, what language/software would you use to achieve the same thing (download a large file every hour)?

Thanks in advance for all answers, and sorry for the long story/question.

Edit: I ended up using this solution to get the files, with a cron job scheduling a PHP script. It checks my folder for the files I already have, generates a list of the possible downloads for the last four days, then downloads the next XML file in line.

<?php
$date = new DateTime();
$current_time = $date->getTimestamp();
$four_days_ago = $current_time - 345600; // 4 days * 24 h * 3600 s

echo 'Downloading: ' . "\n";
for ($i = $four_days_ago; $i <= $current_time; $i += 3600) {
    $date->setTimestamp($i);

    // Files for hour 00 are skipped on purpose
    if ($date->format('H') !== '00') {
        $temp_filename = $date->format('Y_m_d_H') . "_full.xml";
        if (!glob($temp_filename)) {
            $temp_url = 'http://www.external-site-example.com/' . $date->format('Y/m/d/H') . ".xml";
            echo $temp_filename . ' --- ' . $temp_url . '<br>' . "\n";
            break; // with a break here, the loop only yields the next file to download
        }
    }
}

// Nothing missing: every file from the last four days is already here
if (!isset($temp_url)) {
    exit('Nothing to download.');
}

set_time_limit(300);
$Start = getTime();

$objInputStream = fopen($temp_url, "rb");
$objTempStream  = fopen($temp_filename, "w+b");

// Copy the remote stream into the local file (cap of ~200 MB)
stream_copy_to_stream($objInputStream, $objTempStream, 1024 * 200000);

fclose($objInputStream);
fclose($objTempStream);

$End = getTime();
echo '<br>It took ' . number_format($End - $Start, 2) . ' secs to download "' . $temp_filename . '".';

// Current Unix time as a float (seconds plus microseconds)
function getTime() {
    $a = explode(' ', microtime());
    return (double) $a[0] + $a[1];
}
?>

Edit 2: I just wanted to let you know that there is a way to do what I asked; it just wouldn't work in my case. With the amount of data I need, the website would have to have 400+ visitors an hour for it to work properly. But with smaller amounts of data there are some options: http://www.google.no/search?q=poormanscron


2 answers

  • dongmi1941 2012-04-18 20:52

    You need to have a scheduled, offline task (e.g., a cron job). The solution you are pursuing is just plain wrong.

    The simplest thing that could possibly work is a PHP script you run every hour (scheduled via cron, most likely) that downloads the file and processes it.

    This answer was accepted by the asker.
