dsw8292301 2015-01-21 02:31

Getting a website's language using Simple HTML DOM

I am building a search engine and web crawler in PHP, and I would like to detect the language of a website. How would I detect the language of a page by:

  1. Checking the URL for a lang query parameter, as in https://twitter.com/?lang=jap
    If that is not set, then I would like to:
  2. Check the URL's domain, as in https://www.google.co.jp/

If I still can't find anything, then I would like to default to English.

The code I have so far for scraping pages is:

function crawl($url){
    global $weblinks; // shared list of discovered links; declared before first use
    $html = file_get_html($url);
    if($html && is_object($html) && isset($html->nodes)){
        $weblinks[] = $url;
        $base_url = parse_url($url, PHP_URL_HOST);
        foreach($html->find('a') as $element){
            $link = $element->href;
            if(substr($link,0,7)=="http://" || substr($link,0,8)=="https://"){
                // already absolute, keep as-is
            }else if(substr($link,0,2)=="//"){
                $link = substr($link, 2); // protocol-relative; the scheme is added below
            }else if(substr($link,0,1)=="#"){
                $link = $url; // fragment-only link points back to the current page
            }else if(substr($link,0,7)=="mailto:" || substr($link,0,11)=="javascript:"){
                $link = ""; // not crawlable
            }else if(substr($link,0,1)!="/"){
                $link = $base_url."/".$link;
            }else{
                $link = $base_url.$link;
            }
            // prepend the scheme of the page being crawled to host-relative links
            if($link != "" && substr($link,0,7)!="http://" && substr($link,0,8)!="https://"){
                $link = (substr($url,0,8)=="https://" ? "https://" : "http://").$link;
            }
            if($link != "" && !in_array($link, $weblinks)){
                $weblinks[] = $link;
            }
        }
        $html->clear();
    }
}
function info($weblinks){
    foreach($weblinks as $link){
        $linkhtml = file_get_html($link);
        if($linkhtml && is_object($linkhtml) && isset($linkhtml->nodes)){
            // find() returns null when nothing matches, so guard before reading properties
            $titleraw = $linkhtml->find('title',0);
            $title = $titleraw ? $titleraw->innertext : "";
            $desraw = $linkhtml->find("meta[name='description']",0);
            $des = $desraw ? $desraw->content : "";

            //detect language here

            echo "<tr><td>".$title."</td><td>".$link."</td><td>".$des."</td></tr>";
            $sql = mysql_query("INSERT into web once");
            $linkhtml->clear();
        }
    }
}

1 answer

  • dosf40815 2015-01-21 03:28

    To get the language from ?lang=:

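    $url = 'http://www.domain.org/?lang=IT'; // parse_url() needs the scheme to recognise the host
    $query = parse_url($url, PHP_URL_QUERY); // "lang=IT", or NULL when there is no query string
    parse_str((string) $query, $params);     // fills $params with array('lang' => 'IT')
    $lang = isset($params['lang']) ? $params['lang'] : FALSE;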
    

    You should then validate this with a switch/case statement and a list of languages that you support, like this:

    switch ($lang) {
        case 'EN':
            //language is English
            break;
        case 'IT':
            //language is Italian
            break;
        case 'FR':
            //language is French
            break;
        default:
            //?lang query was empty, or contained an unsupported language
            $lang = FALSE;
    } //end switch
    

    After that, you can use this logic to determine whether you need to check the URL for the language:

    if ($lang == FALSE) {
    //code to determine language from TLD
    }
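
    As a rough sketch of what that TLD check could look like: the function name lang_from_tld and the small mapping table are made up for illustration, and a country-code TLD is only a hint (it says nothing for .com or .org sites), so treat this as a starting point rather than a reliable detector.

```php
<?php
// Hypothetical helper: guess a language code from a URL's country-code TLD.
// The mapping is an illustrative assumption and deliberately tiny.
function lang_from_tld($url) {
    $tld_map = array(
        'jp' => 'JA',
        'it' => 'IT',
        'fr' => 'FR',
        'de' => 'DE',
    );
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return FALSE; // no host found (for example, the URL had no scheme)
    }
    // The TLD is the last dot-separated label of the host name.
    $labels = explode('.', strtolower($host));
    $tld = end($labels);
    return isset($tld_map[$tld]) ? $tld_map[$tld] : FALSE;
}
```

    Calling lang_from_tld('https://www.google.co.jp/') with this table gives 'JA', and an unmapped TLD gives FALSE so the caller can fall back to a default such as English.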
    

    Hopefully this will help get you started, although this is a big can of worms you're opening up. To be certain of a website's language, you need to check more than the two things you've mentioned. One of them is the language meta tag, which looks like <meta name="language" content="english"> and goes in the head of the page, though not all websites use it.
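
    Here is a minimal sketch of reading a declared language out of the page itself. It checks the meta tag described above, and first checks the lang attribute on the <html> element, which is my own addition here and not something mentioned earlier in this answer. The function name lang_from_html is hypothetical, and the regular expressions are a shortcut: they only handle the simple attribute orders shown, so a real HTML parser would be more robust.

```php
<?php
// Hypothetical helper: pull a declared language out of raw HTML.
// Regex-based for brevity; it does not handle every valid way of writing these tags.
function lang_from_html($html) {
    // 1. lang attribute on the <html> element, e.g. <html lang="it">
    if (preg_match('/<html\b[^>]*\blang\s*=\s*["\']?([a-zA-Z-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    // 2. <meta name="language" content="..."> (only when name comes before content)
    if (preg_match('/<meta\b[^>]*\bname\s*=\s*["\']language["\'][^>]*\bcontent\s*=\s*["\']([^"\']+)/i', $html, $m)) {
        return strtolower(trim($m[1])); // free-form value such as "english"
    }
    return FALSE; // nothing declared
}
```

    Note that the two sources return different kinds of values (a code like "it" versus a word like "english"), so the result still needs to be mapped onto your list of supported languages.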

    Some multilingual websites, like mine, use a subdomain like http://it.website.com or http://fr.website.com.

    Others use query strings that are different from ?lang=. So you'll need to do a significant amount of research to cover all your bases.
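
    To give an idea of how the subdomain and alternative query string cases might be handled together, here is a sketch. The function name, the supported list, and the extra keys hl and language are assumptions chosen for illustration; real sites use many other conventions.

```php
<?php
// Hypothetical helper: look for a language hint in the query string or the subdomain.
// The supported list and the query keys are illustrative assumptions.
function lang_from_query_or_subdomain($url) {
    $supported = array('EN', 'IT', 'FR');
    $parts = parse_url($url);
    if (!is_array($parts)) {
        return FALSE; // seriously malformed URL
    }

    // 1. Query string: ?lang=, ?hl= or ?language=
    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        foreach (array('lang', 'hl', 'language') as $key) {
            if (isset($params[$key]) && is_string($params[$key])) {
                $candidate = strtoupper($params[$key]);
                if (in_array($candidate, $supported, TRUE)) {
                    return $candidate;
                }
            }
        }
    }

    // 2. Leftmost host label, as in it.website.com
    if (isset($parts['host'])) {
        $labels = explode('.', strtolower($parts['host']));
        $sub = strtoupper($labels[0]);
        if (count($labels) > 2 && in_array($sub, $supported, TRUE)) {
            return $sub;
        }
    }

    return FALSE; // no usable hint; the caller decides on a default
}
```

    It returns FALSE when nothing matches, so it slots into the same if ($lang == FALSE) logic as before.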
