Tuesday, 8 March 2016

How to create your own search engine with PHP and MySQL

As mentioned, MySQL is one of the prerequisites of our approach, so our first step is to get a MySQL database up and running. To connect to MySQL, we can use any of the free UI-based tools, e.g. SQuirreL, HeidiSQL or DbVisualizer, or the MySQL admin console. Once connected, let us run the following SQL, which creates a table called SEARCH_ENGINE.
Listing 1: An SQL statement which will create a table –
CREATE TABLE SEARCH_ENGINE (
       `id` INT(11) NOT NULL AUTO_INCREMENT,
       `title` VARCHAR(255) NOT NULL,
       `description` TEXT NOT NULL,
       `url` VARCHAR(255) NOT NULL,
       `keywords` TEXT NOT NULL,
       PRIMARY KEY (`id`));
The above statement creates the table that stores the pages our search engine will look through. The title, description, url and keywords columns are the ones that search.php queries and displays in the later listings.
Creating the Form:
Now that the database is ready, let us build the form that visitors will use to perform their search. Let us call this file 'index.php'; it is a simple search form with a text box and a button. Here we use GET instead of POST, so the submitted search terms are visible in the address bar.
Listing 2: Our index.php file –
 
<html>
       <head>
             <title> My search engine </title>
       </head>
       <body>
             <form action='search.php' method='GET'>
                    <center>
                           <h1> My Search Engine </h1>
                           <input type='text' size='90' name='search'>
                           <br>
                           <br>
                           <input type='submit' name='submit' value='Search source code'>
                           <select name='limit'>
                                  <option> 10 </option>
                                  <option> 20 </option>
                                  <option> 50 </option>
                           </select>
                    </center>
             </form>
       </body>
</html>
Our form is now complete and ready to be used. End users enter their query in it, and the drop-down list is meant to let them restrict the number of results shown. Note that search.php, as listed in this post, does not read that value yet.
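Since the form offers 10, 20 and 50 as choices, search.php could read the selected value and fall back to a default when it is missing or unexpected. Here is a minimal sketch of that idea. The parameter name 'limit' and the helper result_limit() are assumptions made for illustration; they are not part of the original listings.

```php
<?php
// Hypothetical helper: returns a safe result limit taken from the query string.
// The parameter name 'limit' is an assumption; the form must send it under that name.
function result_limit(array $query, array $allowed = array(10, 20, 50), $default = 10)
{
    if (!isset($query['limit']) || !is_scalar($query['limit'])) {
        return $default;
    }
    $limit = (int) $query['limit'];
    // Accept only the values that the form offers.
    return in_array($limit, $allowed, true) ? $limit : $default;
}

// Typical use inside search.php: append "LIMIT $limit" to the SELECT statement.
$limit = result_limit($_GET);
```

Because the value is checked against a fixed list, it can be placed into the SQL statement without further escaping.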
Processing the Query:
Let us create a new file, 'search.php', which is the page where the search results are listed. This file is divided into the following sections -
· Let us connect to the database first:

Listing 3: DB connection
 
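       // Open the connection and select the database; stop early if either step fails.
       mysql_connect( "localhost", "USER_NAME", "PASSWORD" ) or die( mysql_error() );
       mysql_select_db( "DB_NAME" ) or die( mysql_error() );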
· Form the query - Once we are connected to the DB, we then form the query using the tokens that the end users have entered. This is shown below -

Listing 4: Construct the query along with the tokens users have entered –
 
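       $search_exploded = explode ( " ", $search );
       $construct = "";   // initialise once, outside the loop, so the clauses accumulate
       $x = 0;
       foreach( $search_exploded as $search_each ) {
             $x++;
             // escape each token before placing it inside the SQL string
             $search_each = mysql_real_escape_string( $search_each );
             if( $x == 1 )
                    $construct .= "keywords LIKE '%$search_each%' ";
             else
                    $construct .= "AND keywords LIKE '%$search_each%' ";
       }
       $construct = " SELECT * FROM SEARCH_ENGINE WHERE $construct ";
       $run = mysql_query( $construct );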
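The listing above builds the SQL text by concatenating the search tokens into it. A common alternative is to keep the tokens out of the SQL text altogether by using placeholders and a prepared statement, which also works on PHP 7 and later, where the mysql_* functions no longer exist. The sketch below assumes the same SEARCH_ENGINE table and keywords column; the helper name build_search_query() and the connection details are placeholders.

```php
<?php
// Builds a parameterised query: one "keywords LIKE ?" clause per search token.
// Returns array(sql, params). Hypothetical helper, not part of the original file.
function build_search_query($search)
{
    // Split on whitespace and drop empty tokens.
    $tokens = preg_split('/\s+/', trim($search), -1, PREG_SPLIT_NO_EMPTY);
    $clauses = array();
    $params = array();
    foreach ($tokens as $token) {
        $clauses[] = 'keywords LIKE ?';
        $params[] = '%' . $token . '%';
    }
    if (count($clauses) === 0) {
        return array(null, array());   // nothing to search for
    }
    $sql = 'SELECT * FROM SEARCH_ENGINE WHERE ' . implode(' AND ', $clauses);
    return array($sql, $params);
}

// Usage sketch (needs the pdo_mysql extension and real credentials):
// list($sql, $params) = build_search_query($_GET['search']);
// $pdo = new PDO('mysql:host=localhost;dbname=DB_NAME', 'USER_NAME', 'PASSWORD');
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
// $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

The database driver sends the token values separately from the statement, so quotes inside a search term cannot change the structure of the query.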
· Our next job is to fetch the results from the database and present them to the user. If the search doesn't yield any results, we should show an appropriate message to the user as shown below -

Listing 4: Fetch the result and present it to the user –
 
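       $foundnum = mysql_num_rows( $run );   // number of matching rows
       $safe_search = htmlspecialchars( $search, ENT_QUOTES );   // escape user input before echoing it
       if ($foundnum == 0)
             echo "Sorry, there are no matching results for <b> $safe_search </b>.
             <br>
             <br> 1. Try more general words. For example, if you want to search for 'how to create a website', use general keywords like 'create' 'website'
             <br> 2. Try different words with similar meaning
             <br> 3. Please check your spelling";
       else {
             echo "$foundnum results found !<p>";
             while ( $runrows = mysql_fetch_assoc( $run ) ) {
                    // escape the stored values before placing them into the HTML
                    $title = htmlspecialchars( $runrows['title'], ENT_QUOTES );
                    $desc = htmlspecialchars( $runrows['description'], ENT_QUOTES );
                    $url = htmlspecialchars( $runrows['url'], ENT_QUOTES );
                    echo "<a href='$url'> <b> $title </b> </a> <br> $desc <br> <a href='$url'> $url </a> <p>";
             }
       }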
Now our search engine is ready to be used. The complete code, explained above in parts, is listed below -
Listing 5: The Complete Search.PHP file –
<?php
       // Read the submitted values; isset() avoids notices when the page is opened directly.
       $button = isset( $_GET['submit'] ) ? $_GET['submit'] : '';
       $search = isset( $_GET['search'] ) ? trim( $_GET['search'] ) : '';

       if( !$button )
             echo "you didn't submit a keyword";
       else {
             if( strlen( $search ) <= 1 )
                    echo "Search term too short";
             else {
                    // Escape user input before echoing it back into the page.
                    $safe_search = htmlspecialchars( $search, ENT_QUOTES );
                    echo "You searched for <b> $safe_search </b> <hr size='1' > <br> ";
                    mysql_connect( "localhost", "USERNAME", "PASSWORD" );
                    mysql_select_db( "DBNAME" );

                    $search_exploded = explode ( " ", $search );
                    $construct = "";   // initialise once, outside the loop, so the clauses accumulate
                    $x = 0;
                    foreach( $search_exploded as $search_each ) {
                           $x++;
                           // Escape each token before placing it inside the SQL string.
                           $search_each = mysql_real_escape_string( $search_each );
                           if( $x == 1 )
                                  $construct .= "keywords LIKE '%$search_each%' ";
                           else
                                  $construct .= "AND keywords LIKE '%$search_each%' ";
                    }

                    $construct = " SELECT * FROM SEARCH_ENGINE WHERE $construct ";
                    $run = mysql_query( $construct );

                    $foundnum = mysql_num_rows( $run );

                    if ($foundnum == 0)
                           echo "Sorry, there are no matching results for <b> $safe_search </b>. <br> <br> 1. Try more general words. For example, if you want to search for 'how to create a website', use general keywords like 'create' 'website' <br> 2. Try different words with similar meaning <br> 3. Please check your spelling";
                    else {
                           echo "$foundnum results found !<p>";

                           while( $runrows = mysql_fetch_assoc( $run ) ) {
                                  // Escape the stored values before placing them into the HTML.
                                  $title = htmlspecialchars( $runrows['title'], ENT_QUOTES );
                                  $desc = htmlspecialchars( $runrows['description'], ENT_QUOTES );
                                  $url = htmlspecialchars( $runrows['url'], ENT_QUOTES );

                                  echo "<a href='$url'> <b> $title </b> </a> <br> $desc <br> <a href='$url'> $url </a> <p>";
                           }
                    }

             }
       }
 ?>

Search Engine architecture

Before going into further details, let us talk about what our goals should be while developing a search engine. Listed below is a brief set of goals to focus on -
  • A web crawler, indexer and document store capable of handling a large volume of documents, perhaps 1 million or even more.
  • We should follow test-driven development, which helps to enforce good design and modular code.
  • We should be able to support various strategies for things like the index, document store, search etc.
A typical search engine consists of a few parts -
  • A crawler which is used to pull external documents.
  • An index, which stores references to the documents in an inverted structure (each term points to the documents that contain it), and
  • A document store to keep the documents.
THE CRAWLER
In order to crawl, we first need a list of URLs. There are a few generic ways to get one, as listed below -
  • The most common is to feed the crawler a list of seed pages that contain lots of links. Our next job is to crawl them and harvest further links as we go down the list.
  • Another approach is to download a ready-made list of URLs and then use that list.
Since our aim is to get only the website addresses, let us write a simple parser to extract the appropriate data. It is quite straightforward, as shown below -
Listing 6: The parser –
       $file_handle = fopen( "Quantcast-Top-Million.txt", "r" );

       while ( ( $line = fgets( $file_handle ) ) !== false ) {
             if( preg_match( '/^\d+/', $line ) ) { # if it starts with some amount of digits
                    $tmp = explode( "\t", $line );
                    if( count( $tmp ) < 2 )
                           continue; # skip lines without a second column
                    $rank = trim( $tmp[0] );
                    $url = trim( $tmp[1] );
                    if( $url != 'Hidden profile' ) { # Hidden profile appears sometimes, just ignore those lines
                           echo $url . "\n"; # assumed completion: the original line is cut off after "echo $"
                    }
             }
       }
       fclose( $file_handle );
DOWNLOADING
Downloading the data is going to take some time, so we should be prepared for a long wait. We can write a very basic crawler in PHP simply by calling file_get_contents with a URL. Let us have a look at the following code -

Listing 7: The crawler –
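        $file_handle = fopen("urllist.txt", "r");
        while (($line = fgets($file_handle)) !== false) {
                $url = trim($line);
                if ($url == '')
                        continue; // skip blank lines
                $content = file_get_contents($url);
                if ($content === false)
                        continue; // skip URLs that could not be downloaded
                // keep the URL with the content so we know where the document came from
                $document = array($url, $content);
                $serialized = serialize($document);
                $fp = fopen('./documents/'.md5($url), 'w');
                fwrite($fp, $serialized);
                fclose($fp);
        }
        fclose($file_handle);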
The above code is essentially a single-threaded crawler. It simply loops over every URL in the file, pulls down the content and saves it to disk. The only thing to note here is that it stores the URL and the content together in one document, since we might need the URL for ranking purposes and it is also helpful to keep track of where the document came from. We should keep in mind that we may run into file system limits when trying to store lots of documents in one folder.
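One common way around the single-folder problem is to spread the files over subfolders named after the first characters of the hash. Here is a minimal sketch, assuming the same ./documents folder and md5 file names as the crawler above. The helper names document_path() and save_document() and the two-level layout are illustrative choices, not part of the original code.

```php
<?php
// Hypothetical helper: maps a URL to a path such as ./documents/ab/cd/abcd...
// so that no single folder has to hold every file.
function document_path($url, $base = './documents')
{
    $hash = md5($url);
    // Two levels of up to 256 folders each, named after the first four hex characters.
    return $base . '/' . substr($hash, 0, 2) . '/' . substr($hash, 2, 2) . '/' . $hash;
}

// Saves one serialized document, creating the subfolders when needed.
function save_document($url, $content, $base = './documents')
{
    $path = document_path($url, $base);
    $dir = dirname($path);
    if (!is_dir($dir) && !mkdir($dir, 0777, true)) {
        return false;
    }
    return file_put_contents($path, serialize(array($url, $content))) !== false;
}
```

In the crawler loop, the fopen/fwrite/fclose lines could then be replaced by a single call to save_document($url, $content).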
THE INDEX
The reason I mentioned test-driven development earlier is that I prefer a bottom-up approach. The index which we are going to create should have a few very simple responsibilities, as listed below -
  • It needs to store its contents to disk and retrieve them.
  • It needs to be able to clear itself when we decide to regenerate things.
  • It should validate the documents that it is storing.
Having defined these tasks, let us put the following interface in place -

Listing 8: The interface –
        interface iindex {
                public function storeDocuments($name, array $documents);
                public function getDocuments($name);
                public function clearIndex();
                public function validateDocument(array $document);
        }
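To make the interface more concrete, here is a minimal sketch of one possible implementation that keeps each named entry in its own file on disk. The class name fileindex, the folder layout and the rule that a document is a non-empty list of integers are all assumptions made for illustration; the post itself does not specify them.

```php
<?php
interface iindex {
    public function storeDocuments($name, array $documents);
    public function getDocuments($name);
    public function clearIndex();
    public function validateDocument(array $document);
}

// Hypothetical file-backed index: one serialized file per name.
class fileindex implements iindex {
    private $folder;

    public function __construct($folder) {
        $this->folder = rtrim($folder, '/');
        if (!is_dir($this->folder)) {
            mkdir($this->folder, 0777, true);
        }
    }

    // File names are hashed so that any string can be used as a name.
    private function path($name) {
        return $this->folder . '/' . md5($name) . '.idx';
    }

    public function storeDocuments($name, array $documents) {
        foreach ($documents as $document) {
            if (!is_array($document) || !$this->validateDocument($document)) {
                return false;   // refuse the whole batch if one entry is invalid
            }
        }
        return file_put_contents($this->path($name), serialize($documents)) !== false;
    }

    public function getDocuments($name) {
        $path = $this->path($name);
        return is_file($path) ? unserialize(file_get_contents($path)) : array();
    }

    public function clearIndex() {
        $files = glob($this->folder . '/*.idx');
        foreach ($files ? $files : array() as $file) {
            unlink($file);
        }
    }

    // Assumed rule: a document is a non-empty list of integers (for example ids or counts).
    public function validateDocument(array $document) {
        if (count($document) === 0) {
            return false;
        }
        foreach ($document as $value) {
            if (!is_int($value)) {
                return false;
            }
        }
        return true;
    }
}
```

Because the class only depends on the interface methods, it could later be swapped for a database-backed or in-memory index without touching the code that uses it.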
THE DOCUMENT STORE
The document store is a somewhat odd component: if we are indexing things we already own, the documents are probably stored somewhere else already. The most obvious case is documents that already live in some database.
THE INDEXER
The next step in building our search engine is to create the indexer. An indexer takes a document, breaks it apart and feeds it into the index, and possibly also into the document store, depending on our implementation.
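To illustrate what breaking a document apart can mean, the sketch below splits text into lowercase words and records, for every word, the ids of the documents that contain it. The function names tokenize() and index_document() and the in-memory array are assumptions for illustration; a real indexer would hand the result to the index instead of keeping it in a variable.

```php
<?php
// Splits text into lowercase word tokens, with HTML tags removed first.
function tokenize($text)
{
    $text = strtolower(strip_tags($text));
    return preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Adds one document to the inverted index: word => list of document ids.
function index_document(array &$index, $docId, $content)
{
    // array_unique() makes sure a document is listed once per word.
    foreach (array_unique(tokenize($content)) as $word) {
        $index[$word][] = $docId;
    }
}

$index = array();
index_document($index, 1, '<h1>PHP search</h1> engine');
index_document($index, 2, 'MySQL search');
// $index['search'] now lists documents 1 and 2; $index['php'] lists document 1.
```

This simple tokenizer only keeps ASCII letters and digits, so text in other scripts would need a different splitting rule.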
INDEXING
Now that we have the ability to store and index some documents, let us go through the steps needed to put the indexing in place -
  • The first thing to do is to set the time limit to unlimited, since the indexing process might take longer than expected.
  • Our next step is to define the locations where the index and the documents are going to be stored, in order to avoid errors.
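In PHP, these two steps might look like the sketch below. The constant names and the folder paths are placeholders chosen for illustration only.

```php
<?php
// A value of 0 removes PHP's execution time limit for this script.
set_time_limit(0);

// Hypothetical locations; adjust them to the folders used on your system.
define('INDEXLOCATION', dirname(__FILE__) . '/index/');
define('DOCUMENTLOCATION', dirname(__FILE__) . '/documents/');

// Create the folders up front so that later writes do not fail.
foreach (array(INDEXLOCATION, DOCUMENTLOCATION) as $folder) {
    if (!is_dir($folder)) {
        mkdir($folder, 0777, true);
    }
}
```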
SEARCHING
Searching requires a relatively simple interface. In fact, we only require a single method, as shown below -

Listing 9: The search interface –
        interface isearch {
                public function dosearch($searchterms);
        }
Of course, the actual implementation is not that easy. It is rather more complex than it appears.
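As a rough idea of what an implementation involves, the sketch below answers a query from an in-memory inverted index (word => list of document ids) and returns the documents that contain every search term. The class name arraysearch and the array-based index are assumptions for illustration; ranking, stemming and disk access are left out entirely.

```php
<?php
interface isearch {
    public function dosearch($searchterms);
}

// Hypothetical searcher over an in-memory inverted index.
class arraysearch implements isearch {
    private $index;

    public function __construct(array $index) {
        $this->index = $index;
    }

    public function dosearch($searchterms) {
        $terms = preg_split('/\s+/', strtolower(trim($searchterms)), -1, PREG_SPLIT_NO_EMPTY);
        $result = null;
        foreach ($terms as $term) {
            $ids = isset($this->index[$term]) ? $this->index[$term] : array();
            // Keep only the documents that matched every term so far (AND semantics).
            $result = ($result === null) ? $ids : array_intersect($result, $ids);
        }
        return $result === null ? array() : array_values($result);
    }
}

$search = new arraysearch(array(
    'php'    => array(1, 3),
    'search' => array(1, 2, 3),
    'mysql'  => array(2, 3),
));
$hits = $search->dosearch('PHP search');   // documents 1 and 3
```

Even this small version shows where the complexity comes from: every extra feature, such as ordering the hits by relevance, has to be added on top of the plain set intersection.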
