dongmozhui3805 2013-12-02 23:13
Viewed 21 times
Accepted


I recently finished building a website, and while trying to get the site indexed by Google I've run into some odd behaviour that I'm hoping someone can shed light on, as my Google-fu has turned up nothing.

The server stack I'm running is made up of:

Debian 7 / Apache 2.2.22 / MySQL 5.5.31 / PHP 5.4.4-14

The problem is that Google seems to want to index some odd URLs, and is currently ranking them higher than the actual legitimate pages. The odd ones are:

www.mydomain.com/srv/www/mydomain?srv/www/mydomain
www.mydomain.com/srv/www?srv/www
www.mydomain.com/srv/www?srv/www/index

As soon as I found the issue, I put some 301 redirects into the .htaccess file to send these requests to the homepage, and blocked the addresses in robots.txt; Webmaster Tools now tells me 'this is an important page blocked by robots.txt'.
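For reference, the stop-gap fixes I mention above looked roughly like this; these snippets are an illustrative sketch rather than my exact files:

```apache
# .htaccess: 301 the odd crawled URLs back to the homepage
# (the trailing '?' discards the query string on Apache 2.2)
RewriteRule ^srv/www(/.*)?$ http://www.mydomain.com/? [R=301,L]
```

```
# robots.txt: stop crawlers re-requesting the odd URLs
User-agent: *
Disallow: /srv/www
```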

Also, I have submitted an XML sitemap with all the correct URLs to webmaster tools.

All the website files are stored in:

/srv/www/mydomain/public_html/

Now, I think this has something to do with how I've set up my .htaccess mod_rewrite rules, but I can't work out what's causing it. It could also be my Apache vhost configuration. I'll include both below:

.htaccess mod_rewrite rules:

<IfModule mod_rewrite.c>
    RewriteEngine on

    # Redirect requests for all non-canonical domains
    # to the same page on www.mydomain.com
    RewriteCond %{HTTP_HOST} .
    RewriteCond %{HTTP_HOST} !^www\.mydomain\.com$
    RewriteRule (.*) http://www.mydomain.com/$1 [R=301,L]

    # Remove the .php file extension
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteCond %{REQUEST_FILENAME}\.php -f
    RewriteRule ^(.*)$ $1.php

    # Redirect all remaining traffic to index
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^ index [L]

    # Remove 'index' from the URL
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\s(.*)/index [NC]
    RewriteRule ^ / [R=301,L]
</IfModule>

Apache Vhost:

<VirtualHost *:80>
    ServerAdmin webmaster@mydomain.com
    ServerName mydomain.com
    ServerAlias www.mydomain.com
    DocumentRoot /srv/www/mydomain/public_html/
    ErrorLog /srv/www/mydomain/logs/error.log
    CustomLog /srv/www/mydomain/logs/access.log combined
</VirtualHost>

Also, in case it's relevant, my PHP page handling is:

# Declare the Page array
$Page = array();

# Get the requested path and trim leading slashes
$Page['Path'] = ltrim($_SERVER['REQUEST_URI'], '/');

# Check for query string
if (strpos($Page['Path'], '?') !== false) {

    # Separate the path and query string at the first '?'
    list($Page['Path'], $Page['Query']) = explode('?', $Page['Path'], 2);
}

# Check a path was supplied
if ($Page['Path'] != '') {

    # Select page data from the directory
    $Page['Data'] = SelectData('Directory', 'Path', '=', $Page['Path']);

    # Check a page was returned
    if ($Page['Data'] != null) {

        # switch through allowed page types
        switch ($Page['Data']['Type']) {

            # There are a bunch of switch cases here that
            # Determine what page to serve based on the
            # page type stored in the directory

        }

    # When no page is returned
    } else {

        # 404
        $Page = Build404ErrorPage($Page);
    }

# When no path supplied
} else {

    # Build the Home page
    $Page = BuildHomePage($Page);
}
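To make the behaviour concrete: because the handler splits the path and query itself, one of the odd crawled URLs ends up being looked up in the directory under a filesystem-looking path. Here is a small sketch of that split logic, rewritten in Python purely for illustration (the function name is mine):

```python
def split_request(request_uri):
    # Mirror the PHP handler: trim leading slashes, then separate
    # the path from the query string at the first '?'.
    trimmed = request_uri.lstrip('/')
    if '?' in trimmed:
        path, query = trimmed.split('?', 1)
    else:
        path, query = trimmed, None
    return path, query

# One of the odd URLs Google crawled is queried against the
# page directory with 'srv/www/mydomain' as the page path:
print(split_request('/srv/www/mydomain?srv/www/mydomain'))
# → ('srv/www/mydomain', 'srv/www/mydomain')
```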

Can anyone see anything here that would be causing this?


1 answer

  • dqpfl2508589 2014-01-24 02:49

    After much research I have concluded that my problems came about through a combination of Google attempting to index the website before it was completed and some incomplete page-handling scripts. My mistake was not blocking all robots during development.

    The solution to the problem was this:

    1. Submit an XML sitemap with all the valid URLs to Google Webmaster Tools

    2. 301-redirect all the odd URLs to the correct homepage

    3. Request removal of the incorrect URLs using Google Webmaster Tools

    4. Block Googlebot's access to the incorrect URLs with a robots.txt file

    5. Wait for Google to re-crawl the site and index it correctly.

    Waiting for Googlebot to correct the issues was the hardest part.
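    For completeness, the sitemap in step 1 was just a standard XML sitemap along these lines (the URLs here are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mydomain.com/</loc>
  </url>
  <url>
    <loc>http://www.mydomain.com/some-real-page</loc>
  </url>
</urlset>
```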

    This answer was accepted by the asker as the best answer.
