douan2478 2012-11-11 04:07
浏览 58

关键词相关性PHP MySQL搜索引擎

I don't know why I can't find this anywhere. I would think this would be pretty common request. I am writing a search engine in PHP to search a MySQL database of For Sale listings for keywords inputted by the user.

There are several columns in the table but only 2 that will need to be searched. They are named file_Title & file_Desc. Think of it like a classified ad. An item title and a description.

So for example a user would search for 'John Deere Lawn Tractor'. What I would like to happen is classifieds that have all 4 of those words show up at the top of the list. Then results that only have 3 an so on.

I've read a very good webpage at http://www.roscripts.com/PHP_search_engine-119.html

From that authors example I have the following code below:

<?php
    $search = 'John Deere Lawn Tractors';
    $keywords = split(' ', $search);

    $sql = "SELECT DISTINCT COUNT(*) As relevance, id, file_Title, file_Desc FROM Listings WHERE (";

    foreach ($keywords as $keyword) {
        echo 'Keyword is ' . $keyword . '<br />';
        $sql .= "(file_Title LIKE '%$keyword%' OR file_Desc LIKE '%$keyword%') OR ";
    }
    $sql=substr($sql,0,(strLen($sql)-3));//this will eat the last OR

    $sql .= ") GROUP BY id ORDER BY relevance DESC";
    echo 'SQL is ' . $sql;  

    $query = mysql_query($sql) or die(mysql_error());
    $Count = mysql_num_rows($query);
    if($Count != 0) {
                echo '<br />' . $Count . ' RESULTS FOUND';
        while ($row_sql = mysql_fetch_assoc($query)) {//echo out the results
            echo '<h3>'.$row_sql['file_Title'].'</h3><br /><p>'.$row_sql['file_Desc'].'</p>';
        }
    } else  {
        echo "No results to display";
    }

?>

The SQL String outputted is this:

 SELECT DISTINCT COUNT(*) As relevance, id, file_Title, file_Desc FROM Listings 
  WHERE ((file_Title LIKE '%John%'
    OR file_Desc LIKE '%John%')
    OR (file_Title LIKE '%Deere%' 
    OR file_Desc LIKE '%Deere%') 
    OR (file_Title LIKE '%Lawn%' 
    OR file_Desc LIKE '%Lawn%') 
    OR (file_Title LIKE '%Tractors%' 
    OR file_Desc LIKE '%Tractors%') ) 
 GROUP BY id 
 ORDER BY relevance DESC

With this code I get 275 results from my DB. My problem is it really doesn't order by the number of keywords found in the row. It seems to order the results by id instead. If I remove 'GROUP BY id' then it only returns 1 result instead of all of them, which is really messing with me!

I've also tried shifting to FULLTEXT in the db but can't seem to get that going either so I'd prefer to stick with LIKE %Keyword% syntax.

Any help is appreciated! Thanks!

  • 写回答

2条回答 默认 最新

  • dongqie2355 2012-11-11 04:24
    关注

    I would suggest a totally different approach. Your approach is cumbersome, inefficient, heavy on the DB and will likely be very slow with more and more records added to your database.

    What I would suggest is the following:

    1. Create a separate table for keywords.
    2. Create a list of non keywords you don't want to index (like the common English prepositions etc.) so that they are not included. You can probably find a list of them online, readily available.
    3. When a new entry is added, you split the string into separate keywords, omitting the ones in step 2., and inserting them in the table created in step 3 (if not already in it).
    4. In a separate table, with a foreign key pointing to the keywords table, associate the classifed_ad to the keyword.

    Steps 3 and 4 must happen again if your classified_ad is edited (i.e. any keywords inserted in step 4 deleted from the association table and the keywords analysed again and reassociated with the classified ad).

    Once you have this structure, all you have to do is search the association table and order by the number of matched keywords. You can even add an extra column to it and put the number of occurrences of that keyword in the article, so that you order by that too.

    That will be much faster.

    I had used a script once called Sphider which does something similar. Not sure if it is still maintained, but it works in a very similar way on web pages it parses.

    评论

报告相同问题?

悬赏问题

  • ¥15 一直显示正在等待HID—ISP
  • ¥15 Python turtle 画图
  • ¥15 关于大棚监测的pcb板设计
  • ¥15 stm32开发clion时遇到的编译问题
  • ¥15 lna设计 源简并电感型共源放大器
  • ¥15 如何用Labview在myRIO上做LCD显示?(语言-开发语言)
  • ¥15 Vue3地图和异步函数使用
  • ¥15 C++ yoloV5改写遇到的问题
  • ¥20 win11修改中文用户名路径
  • ¥15 win2012磁盘空间不足,c盘正常,d盘无法写入