qizi456258 2012-07-02 14:23
Views: 379
Accepted

Java web scraping: extracting part of a page's information

http://www.fedex.com/Tracking?clienttype=dotcomreg&ascend_header=1&cntry_code=cn&language=sim&mi=n&tracknumbers=874589732820

In the URL above, the trailing 874589732820 is the parameter that has to be replaced for each fetch.
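Since only the trailing tracking number changes between requests, the URL can be assembled from a fixed prefix plus that number. The following is a minimal sketch of that idea; the class name `TrackingUrl` and the helper `build` are hypothetical names chosen for illustration, and the prefix is simply the URL from the question with the number cut off.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class TrackingUrl {
    // Fixed part of the URL from the question; only tracknumbers changes per request.
    private static final String BASE =
            "http://www.fedex.com/Tracking?clienttype=dotcomreg&ascend_header=1"
            + "&cntry_code=cn&language=sim&mi=n&tracknumbers=";

    // Hypothetical helper: appends the trimmed, URL-encoded tracking number to the prefix.
    static String build(String trackingNumber) {
        try {
            return BASE + URLEncoder.encode(trackingNumber.trim(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("874589732820"));
    }
}
```

The result of `build(...)` can then be passed as the first argument of `getPageContent`.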

// Test.java
package ups.test;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;

public class Test {

    public String getPageContent(String strUrl, String strPostRequest, int maxLength) {
        // Buffer for the fetched page
        StringBuilder buffer = new StringBuilder();
        System.setProperty("sun.net.client.defaultConnectTimeout", "5000");
        System.setProperty("sun.net.client.defaultReadTimeout", "5000");
        try {
            URL newUrl = new URL(strUrl);
            HttpURLConnection hConnect = (HttpURLConnection) newUrl.openConnection();
            // Extra data for a POST request; a non-empty body turns the request into a POST
            if (strPostRequest.length() > 0) {
                hConnect.setDoOutput(true);
                OutputStreamWriter out = new OutputStreamWriter(hConnect.getOutputStream());
                out.write(strPostRequest);
                out.flush();
                out.close();
            }
            // Read the response, up to maxLength characters (no limit if maxLength <= 0)
            BufferedReader rd = new BufferedReader(new InputStreamReader(hConnect.getInputStream(), "utf-8"));
            int ch;
            for (int length = 0; (ch = rd.read()) > -1 && (maxLength <= 0 || length < maxLength); length++)
                buffer.append((char) ch);
            // Strings are immutable: replaceAll returns a new string, so the result must be kept.
            // The entity pattern had a stray "//" prefix that stopped it from matching "&nbsp;" etc.
            String s = buffer.toString()
                    .replaceAll("&[a-zA-Z]{1,10};", "")
                    .replaceAll("<[^>]*>", "");
            System.out.println(s);

            rd.close();
            hConnect.disconnect();
            // The raw page (markup included) is returned; only the printed copy is stripped
            return buffer.toString().trim();
        } catch (Exception e) {
            return "Error: failed to read the page!";
        }
    }

}
// Test1.java
package ups.test;

public class Test1 {

    public static void main(String[] args) {

        String url = "http://www.fedex.com/Tracking?clienttype=dotcomreg&ascend_header=1&cntry_code=cn&language=sim&mi=n&tracknumbers=874589732820";

        Test p = new Test();
        // Note: the non-empty second argument is sent as a POST body; pass "" for a plain GET
        p.getPageContent(url, "post", 100500);

        System.out.print("Executed!");
    }

}
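I can now fetch the full source of the page, but what I need is the information in the shipment travel history: date/time, activity, location, and details. I don't know how to extract it. Any guidance would be appreciated.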

3 answers

  • wayne_ren 2012-07-03 09:06
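    Look closely at the page's HTML and you will see that the shipment travel history is stored in a JavaScript object named detailInfoObject. No special HTML parser is needed: a regular expression is enough to cut this object out of the page, and a JSON parser such as Jackson can then deserialize it into a bean.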
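As a rough illustration of the extraction step, the sketch below uses a regular expression only to find where `var detailInfoObject = {` starts, then counts braces to find the matching closing brace. The brace counter is a swapped-in detail rather than something stated in the answer: it is used here because the object contains nested objects, which a single regular expression handles poorly. The class name `DetailInfoExtractor` and the method `extract` are hypothetical, and the sketch assumes the object literal is valid JSON with double-quoted strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DetailInfoExtractor {
    // Finds the start of the assignment; the variable name comes from the page source.
    private static final Pattern START =
            Pattern.compile("var\\s+detailInfoObject\\s*=\\s*\\{");

    // Returns the object literal text, or null if it is missing or unbalanced.
    static String extract(String html) {
        Matcher m = START.matcher(html);
        if (!m.find()) {
            return null;
        }
        int start = m.end() - 1;      // index of the opening '{'
        int depth = 0;
        boolean inString = false;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;              // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return html.substring(start, i + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up miniature page, only to show the call.
        String html = "<script>var detailInfoObject = {\"shipDate\":\"Jun 16, 2012\","
                + "\"scans\":[{\"scanTime\":\"3:34 PM\"}]};</script>";
        System.out.println(extract(html));
    }
}
```

The returned string is what would be handed to the JSON parser in the second step.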

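    When parsing HTML, always study how the markup is put together first; only then can you extract the information you need effectively.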

    [code="js"]var detailInfoObject = {"shipDate":"Jun 16, 2012","emailResults":false,"scans":[
    {"scanStatus":"已送达","scanLocation":"CANADA, ON","scanTime":"11:36 AM","GMTOffset":"-04:00","showReturnToShipper":false,"scanDate":"Jun 19, 2012"},
    {"scanStatus":"货件已装车,派送途中","scanLocation":"MISSISSAUGA, ON","scanTime":"9:30 AM","GMTOffset":"-04:00","showReturnToShipper":false,"scanDate":"Jun 19, 2012"},
    {"scanStatus":"位于当地的FedEx工作地点","scanLocation":"MISSISSAUGA, ON","scanTime":"8:44 AM","GMTOffset":"-04:00","showReturnToShipper":false,"scanDate":"Jun 19, 2012"},
    {"scanStatus":"国际货物放行 - 进口","scanLocation":"MISSISSAUGA, ON","scanTime":"6:35 AM","GMTOffset":"-04:00","showReturnToShipper":false,"scanDate":"Jun 19, 2012"},
    {"scanStatus":"位于目的地分拣中心","scanLocation":"MISSISSAUGA, ON","scanTime":"5:27 AM","GMTOffset":"-04:00","showReturnToShipper":false,"scanDate":"Jun 19, 2012"},
    {"scanStatus":"正在运输","scanLocation":"INDIANAPOLIS, IN","scanTime":"3:55 AM","GMTOffset":"-04:00","showReturnToShipper":false,"scanDate":"Jun 19, 2012"},
    {"scanStatus":"离开联邦快递工作地点","scanLocation":"INDIANAPOLIS, IN","scanTime":"3:32 AM","GMTOffset":"-04:00","showReturnToShipper":false,"scanDate":"Jun 19, 2012"},
    {"scanStatus":"到达联邦快递工作地点","scanLocation":"INDIANAPOLIS, IN","scanTime":"2:00 AM","GMTOffset":"-04:00","showReturnToShipper":false,"scanDate":"Jun 19, 2012"},
    {"scanStatus":"离开联邦快递工作地点","scanLocation":"ANCHORAGE, AK","scanTime":"4:15 PM","GMTOffset":"-08:00","showReturnToShipper":false,"scanDate":"Jun 18, 2012"},
    {"scanStatus":"到达联邦快递工作地点","scanLocation":"ANCHORAGE, AK","scanTime":"12:06 PM","GMTOffset":"-08:00","showReturnToShipper":false,"scanDate":"Jun 18, 2012"},
    {"scanStatus":"正在运输","scanLocation":"NARITA-SHI JP","scanTime":"10:38 PM","GMTOffset":"+09:00","showReturnToShipper":false,"scanDate":"Jun 18, 2012"},
    {"scanStatus":"清关延误 - 进口","scanLocation":"MISSISSAUGA, ON","scanTime":"4:16 AM","GMTOffset":"-04:00","showReturnToShipper":false,"scanDate":"Jun 18, 2012"},
    {"scanStatus":"正在运输","scanLocation":"SHANGHAI CN","scanTime":"4:58 AM","GMTOffset":"+08:00","showReturnToShipper":false,"scanDate":"Jun 18, 2012"},
    {"scanStatus":"正在运输","scanLocation":"SHANGHAI CN","scanTime":"12:06 AM","GMTOffset":"+08:00","showReturnToShipper":false,"scanDate":"Jun 18, 2012"},
    {"scanStatus":"国际货物放行 - 出口","scanLocation":"SHANGHAI CN","scanTime":"11:40 PM","GMTOffset":"+08:00","showReturnToShipper":false,"scanDate":"Jun 17, 2012"},
    {"scanStatus":"已离开发件地FedEx站点","scanLocation":"SHANGHAI CN","scanTime":"4:40 PM","GMTOffset":"+08:00","showReturnToShipper":false,"scanDate":"Jun 16, 2012"},
    {"scanStatus":"已取件","scanComments":"在FedEx截件时间之后才收到包裹","scanLocation":"SHANGHAI CN","scanTime":"3:34 PM","GMTOffset":"+08:00","showReturnToShipper":false,"scanDate":"Jun 16, 2012"},
    {"scanStatus":"托运资讯发送给FedEx ","scanTime":"12:40 AM","GMTOffset":"-05:00","showReturnToShipper":false,"scanDate":"Jun 16, 2012"}],......[/code]
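The entries of the `scans` array line up with the columns asked about in the question: `scanDate` and `scanTime` give the date/time, `scanStatus` the activity, `scanLocation` the location, and `scanComments` (present only on some entries) presumably the details. That last mapping is an assumption read off the sample, not something the page confirms. The sketch below pulls those fields out with the standard library only, so it runs without extra dependencies; `ScanParser`, `field` and `parse` are hypothetical names. It relies on each scan being a flat object whose string values contain no braces or escaped quotes, which holds for the sample above but is not guaranteed in general, so a real JSON parser such as Jackson remains the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScanParser {
    // One flat {...} object without nested braces that contains a scanStatus field.
    private static final Pattern SCAN =
            Pattern.compile("\\{[^{}]*\"scanStatus\"[^{}]*\\}");

    // Returns the string value of a field, or "" when the field is absent.
    static String field(String obj, String name) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(name) + "\":\"([^\"]*)\"")
                .matcher(obj);
        return m.find() ? m.group(1) : "";
    }

    // Each row holds: date/time, activity, location, details (assumed mapping).
    static List<String[]> parse(String json) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = SCAN.matcher(json);
        while (m.find()) {
            String o = m.group();
            rows.add(new String[] {
                field(o, "scanDate") + " " + field(o, "scanTime"),
                field(o, "scanStatus"),
                field(o, "scanLocation"),
                field(o, "scanComments")
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        // One entry taken from the sample; the status text is written as Unicode escapes.
        String json = "{\"scans\":[{\"scanStatus\":\"\u5df2\u53d6\u4ef6\","
                + "\"scanLocation\":\"SHANGHAI CN\",\"scanTime\":\"3:34 PM\","
                + "\"GMTOffset\":\"+08:00\",\"showReturnToShipper\":false,"
                + "\"scanDate\":\"Jun 16, 2012\"}]}";
        for (String[] row : parse(json)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Entries without `scanLocation` or `scanComments`, such as the last one in the sample, simply yield empty strings in those columns.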

    The asker selected this answer as the best answer.