dqm4977 2010-04-15 13:49
Views: 137
Accepted

How can I scrape dynamic content from a website and save it?

For example, I need to grab the amount of free storage shown on http://gmail.com/:

Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.

I then want to store that number in a MySQL database. As you can see, the number changes dynamically.

Is there a way I can set up a server-side script that grabs that number every time it changes and saves it to the database?

Thanks.

4 answers

  • duandanai6470 2010-04-15 13:57

    Since Gmail doesn't provide any API to get this information, it sounds like you want to do some web scraping.

    Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites.

    There are numerous ways of doing this, as listed in the Wikipedia article quoted above:

    Human copy-and-paste: Sometimes even the best Web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites being scraped explicitly set up barriers to prevent machine automation.

    Text grepping and regular expression matching: A simple yet powerful approach to extract information from Web pages can be based on the UNIX grep command or regular expression matching facilities of programming languages (for instance Perl or Python).

    HTTP programming: Static and dynamic Web pages can be retrieved by posting HTTP requests to the remote Web server using socket programming.

    DOM parsing: By embedding a full-fledged Web browser, such as the Internet Explorer or the Mozilla Web browser control, programs can retrieve the dynamic contents generated by client side scripts. These Web browser controls also parse Web pages into a DOM tree, based on which programs can retrieve parts of the Web pages.

    HTML parsers: Some semi-structured data query languages, such as the XML query language (XQL) and the hyper-text query language (HTQL), can be used to parse HTML pages and to retrieve and transform Web content.

    Web-scraping software: Many Web-scraping tools are available that can be used to customize Web-scraping solutions. These tools may provide a Web recording interface that removes the need to write Web-scraping code by hand, scripting functions that can be used to extract and transform Web content, and database interfaces that can store the scraped data in local databases.

    Semantic annotation recognizing: The Web pages may contain metadata or semantic markup/annotations which can be used to locate specific data snippets. If the annotations are embedded in the pages, as with Microformats, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer, are stored and managed separately from the Web pages, so the Web scrapers can retrieve the data schema and instructions from this layer before scraping the pages.
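    To make two of the techniques above concrete, here is a minimal Python sketch that pulls the number out of the markup quoted in the question, once with a regular expression and once with the standard library's HTML parser. The names `SAMPLE_HTML`, `quota_by_regex`, `QuotaParser` and `quota_by_parser` are made up for illustration, and the sample string stands in for a page you would normally fetch over HTTP. Note the assumption that the value is present in the HTML itself: if client-side script fills it in after load, neither approach will see the live number and a browser-driven tool is needed instead.

```python
import re
from html.parser import HTMLParser

# Sample markup copied from the question; a real page would be fetched over HTTP.
SAMPLE_HTML = (
    "Over <span id=quota>2757.272164</span> megabytes "
    "(and counting) of free storage."
)

# Approach 1: regular expression matching.
# Accepts quoted or unquoted id attributes, as in the question's markup.
QUOTA_RE = re.compile(
    r"<span\s+id=[\"']?quota[\"']?\s*>\s*([\d.]+)\s*</span>",
    re.IGNORECASE,
)


def quota_by_regex(html):
    """Return the quota as a float, or None if the span is not found."""
    match = QUOTA_RE.search(html)
    return float(match.group(1)) if match else None


# Approach 2: an HTML parser that watches for <span id="quota">.
class QuotaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._inside = False  # True while we are inside the quota span
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == "quota":
            self._inside = True

    def handle_data(self, data):
        text = data.strip()
        if self._inside and self.value is None and text:
            try:
                self.value = float(text)
            except ValueError:
                pass  # Not a number; leave value as None.

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


def quota_by_parser(html):
    """Return the quota as a float, or None if the span is not found."""
    parser = QuotaParser()
    parser.feed(html)
    return parser.value


if __name__ == "__main__":
    print(quota_by_regex(SAMPLE_HTML))
    print(quota_by_parser(SAMPLE_HTML))
```

    The regular expression is shorter, but it breaks as soon as the markup gains another attribute or changes its layout; the parser version tolerates those changes at the cost of a little more code.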

    Before I continue, please keep the legal implications of all this in mind. I don't know whether it complies with Gmail's terms of service, and I recommend checking them before moving forward. You might also end up blacklisted or run into similar problems.

    All that being said, in your case you need some kind of spider and DOM parser to log into Gmail and find the data you want. The choice of tool will depend on your technology stack.
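    Whatever tool does the fetching, the "save it every time it changes" part usually comes down to polling on a schedule and writing a row only when the value differs from the last one stored. The sketch below is one way that could look in Python. It is only an illustration under stated assumptions: it uses the standard library's `sqlite3` as a stand-in for MySQL so that it runs without a server, and the table name `quota_log`, the helper names, and the `fetch_quota` callable (whatever scraper you end up with) are all hypothetical.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) quota_log table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quota_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "megabytes REAL NOT NULL, "
        "seen_at REAL NOT NULL)"
    )
    return conn


def last_value(conn):
    """Return the most recently stored value, or None if the table is empty."""
    row = conn.execute(
        "SELECT megabytes FROM quota_log ORDER BY id DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def record_if_changed(conn, value):
    """Insert value only if it differs from the last stored one.

    Returns True when a row was written, False otherwise.
    """
    if value is None or value == last_value(conn):
        return False
    conn.execute(
        "INSERT INTO quota_log (megabytes, seen_at) VALUES (?, ?)",
        (value, time.time()),
    )
    conn.commit()
    return True


def poll(conn, fetch_quota, interval_seconds=60, max_polls=None):
    """Call fetch_quota repeatedly and store each new value.

    fetch_quota is any zero-argument callable returning a float or None.
    max_polls=None means poll forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        record_if_changed(conn, fetch_quota())
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
```

    A call such as `poll(open_store("quota.db"), my_scraper, interval_seconds=300)` would then check every five minutes, where `my_scraper` is your own fetching function. Running a single check from cron instead of a long-lived loop is an equally reasonable design, and it keeps the request rate easy to control.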

    As a Ruby dev, I like using Mechanize and Nokogiri. With PHP, you could take a look at solutions like Sphider.

    Accepted by the asker as the best answer.
