Friday, March 16, 2012

A Simple Search Engine

I have built a simple search engine as a follow-up to my previous posts. It returns the list of URLs whose pages, among those the engine has crawled so far, contain a specific keyword. The functions from the last post are reused, except that the Crawl_web function is modified, and three new functions are added. All four are shown below. In the main part of the code, we first let the crawler run from a specific seed URL, building an index as it goes. Then we look up the word "how" in the index (any keyword would work; "how" is used just for demonstration). The Look_up function handles this second step, returning a list of the URLs that contain our keyword.

def Look_up(index, keyword):
    # Given an index, return the list of URLs recorded for the keyword
    # (an empty list if the keyword is not in the index).
    for entry in index:
        if entry[0] == keyword:
            return list(entry[1])
    return []

# Each element of the index has the form [<keyword>, [<list of URLs that contain the keyword>]]
def add_to_index(index, url, keyword):
    for entry in index:
        if entry[0] == keyword:
            if url not in entry[1]:  # list each URL only once per keyword
                entry[1].append(url)
            return
    index.append([keyword, [url]])

def add_page_to_index(index, url, content):
    # Add every whitespace-separated word of the page content to the index
    for word in content.split():
        add_to_index(index, url, word)

def Crawl_web(seed):
    # seed: the URL of the page where the crawl starts
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        p = tocrawl.pop()
        # Skip pages already crawled, so that back-links do not make the crawler loop
        if p not in crawled:
            c = get_page(p)
            add_page_to_index(index, p, c)
            union(tocrawl, get_all_links(c))
            crawled.append(p)  # mark the page as crawled once it has been processed
    return crawled, index  # the list of crawled links and the index built from them

crawled, index = Crawl_web('http://xkcd.com/353')  # crawl from the seed page and build the index
# print(index)
print(Look_up(index, "how"))  # look for the keyword "how" in the index

The output of the above code is shown below. Please note that the above code is not complete on its own; it has to be merged with the code from the previous post to run.


['http://xkcd.com/353']
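
To see the index format without crawling anything, here is a minimal offline sketch. It uses compact versions of the three indexing functions and feeds them two made-up pages; the example.com URLs and their text are assumptions for illustration only, not pages the crawler visited. It also assumes each URL should be listed once per keyword, and it uses print() with parentheses so it runs on Python 3 as well.

```python
def add_to_index(index, url, keyword):
    # Find the entry for this keyword; create it if it does not exist yet.
    for entry in index:
        if entry[0] == keyword:
            # Record the URL once per keyword, even if the word repeats on the page.
            if url not in entry[1]:
                entry[1].append(url)
            return
    index.append([keyword, [url]])


def add_page_to_index(index, url, content):
    # Index every whitespace-separated token of the page text.
    for word in content.split():
        add_to_index(index, url, word)


def Look_up(index, keyword):
    # Return the URLs recorded for the keyword, or [] if it was never indexed.
    for entry in index:
        if entry[0] == keyword:
            return list(entry[1])
    return []


# Hypothetical pages, made up for illustration (nothing is fetched).
index = []
add_page_to_index(index, "http://example.com/a", "learn how to fly")
add_page_to_index(index, "http://example.com/b", "how to crawl")

print(index)                  # a list of [keyword, [urls]] entries
print(Look_up(index, "how"))  # both pages contain "how"
print(Look_up(index, "How"))  # empty: the match is case-sensitive
```

Because the content is split only on whitespace and compared exactly, "How" and "how," (with a trailing comma) would be indexed as different keywords from "how". Lowercasing and stripping punctuation before indexing would be one way around that, though that goes beyond what the code in this post does.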
