Udacity Web Crawler in Python 3?




I am new to the world of programming (apologies in advance for any common newbie mistakes) and have just completed Udacity's introductory computer science course, which builds a web crawler in Python. However, the course teaches Python 2, and I'd like to have my own version running in Python 3. I can't figure out how to get the get_page() function right, despite much searching around the forums and the discussions there.

The program:

from urllib.request import urlopen

def compute_ranks(graph):
    d = 0.8  # damping factor
    numloops = 10

    ranks = {}
    npages = len(graph)
    for page in graph:
        ranks[page] = 1.0 / npages

    for i in range(0, numloops):
        newranks = {}
        for page in graph:
            newrank = (1 - d) / npages
            for node in graph:
                if page in graph[node]:
                    newrank = newrank + d * (ranks[node] / len(graph[node]))
            newranks[page] = newrank
        ranks = newranks
    return ranks

def crawl_web(seed):  # returns index, graph of outlinks
    tocrawl = [seed]
    crawled = []
    graph = {}  # url: [list of pages it links to]
    index = {}
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_index(index, page, content)
            outlinks = get_all_links(content)
            graph[page] = outlinks
            union(tocrawl, outlinks)
            crawled.append(page)
    return index, graph

def get_page(page):
    html = urlopen(page)
    page = html.read()
    str(page)
    return page
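One possible way to get get_page() working on Python 3, offered as a sketch rather than a definitive answer. In Python 3, urlopen(...).read() returns bytes, and the bare str(page) call above throws its result away, so the rest of the crawler ends up trying to search bytes with str patterns. The version below decodes the bytes instead. The UTF-8 fallback, the errors="replace" policy, and returning an empty string on failure are all assumptions made here, not part of the course code.

```python
from urllib.request import urlopen


def get_page(url):
    """Fetch url and return its body as a str, or "" if the fetch fails."""
    try:
        with urlopen(url) as response:
            raw = response.read()  # bytes in Python 3, not str
            # Use the charset from the Content-Type header when present;
            # falling back to UTF-8 is an assumption of this sketch.
            charset = response.headers.get_content_charset() or "utf-8"
        return raw.decode(charset, errors="replace")
    except (OSError, ValueError, LookupError):
        # URLError is a subclass of OSError; ValueError covers malformed
        # URLs; LookupError covers an unknown charset name.
        return ""
```

With the page returned as str, calls such as page.find(...) and content.split() in the other functions operate on text, as they did in Python 2.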

def get_next_target(page):
    start_link = page.find("
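The get_next_target() listing above is cut off in this copy. As a rough guide only, here is a minimal sketch of what such a function might look like. The '<a href=' search string and the double-quote handling are assumptions; the (url, endpos) return shape is inferred from how get_all_links() uses the result. It is not the author's original code.

```python
def get_next_target(page):
    """Return (url, end_position) for the first link in page, or (None, 0)."""
    # Assumed link marker; the original search string is truncated.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    if start_quote == -1 or end_quote == -1:
        # Malformed or unquoted link: treat it as "no more links".
        return None, 0
    url = page[start_quote + 1:end_quote]
    return url, end_quote
```

This only works if page is a str. With the bytes that urlopen(...).read() returns in Python 3, page.find('<a href=') raises a TypeError.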

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def add_page_to_index(index, url, content):
    words = content.split()
    for word in words:
        add_to_index(index, word, url)

def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    index.append([keyword, [url]])

def lookup(index, keyword):
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return [0]
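A side note on the index: crawl_web() creates it as a dict (index = {}), while add_to_index() calls index.append() and walks [keyword, urls] entries, which only works on a list. Once get_page() returns text, that mismatch would likely surface as an AttributeError. Below is a minimal sketch of a dict-based variant, assuming the intent is a mapping from keyword to a list of URLs. Returning an empty list for a missing keyword is also an assumption.

```python
def add_to_index(index, keyword, url):
    """Record that keyword appears at url; index maps keyword -> list of URLs."""
    index.setdefault(keyword, []).append(url)


def lookup(index, keyword):
    """Return the URLs recorded for keyword, or an empty list if there are none."""
    return index.get(keyword, [])
```

The other option is to keep the list-based functions as they are and change crawl_web() to start with index = [].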

index, graph = crawl_web("http://cs101.udacity.com/urank/index.html")
ranks = compute_ranks(graph)
print(ranks)

python python-3.x web-crawler
