Udacity Web Crawler in Python 3?
I am new to the world of programming (apologies in advance for common newbie mistakes) and have completed Udacity's introductory computer science course, building a web crawler in Python. However, the course taught the program in Python 2, and I'd like my own version running in Python 3. I can't figure out how to get the get_page() function right, despite much searching around the forums and the discussions on there.
The program:
from urllib.request import urlopen

def compute_ranks(graph):
    d = 0.8  # damping factor
    numloops = 10

    ranks = {}
    npages = len(graph)
    for page in graph:
        ranks[page] = 1.0 / npages

    for i in range(0, numloops):
        newranks = {}
        for page in graph:
            newrank = (1 - d) / npages
            for node in graph:
                if page in graph[node]:
                    newrank = newrank + d * (ranks[node] / len(graph[node]))
            newranks[page] = newrank
        ranks = newranks
    return ranks

def crawl_web(seed):  # returns index, graph of outlinks
    tocrawl = [seed]
    crawled = []
    graph = {}  # url: [list of pages it links to]
    index = []  # list of [keyword, [urls]] entries, as add_to_index() and lookup() expect
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_index(index, page, content)
            outlinks = get_all_links(content)
            graph[page] = outlinks
            union(tocrawl, outlinks)
            crawled.append(page)
    return index, graph

def get_page(page):
    html = urlopen(page)
    page = html.read()
    str(page)
    return page

def get_next_target(page):
    start_link = page.find("

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def union(p, q):
    for e in q:
        if e not in p:
            p.append(e)

def add_page_to_index(index, url, content):
    words = content.split()
    for word in words:
        add_to_index(index, word, url)

def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(url)
            return
    index.append([keyword, [url]])

def lookup(index, keyword):
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return [0]

index, graph = crawl_web("http://cs101.udacity.com/urank/index.html")
ranks = compute_ranks(graph)
print(ranks)
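
For reference, here is a minimal sketch of what I think get_page() probably needs to do in Python 3. As far as I can tell, read() on the object returned by urlopen() gives back bytes rather than str, and the bare str(page) line in my version throws its result away (it would only give the "b'...'" repr anyway), so the later page.find(...) calls are handed bytes. The decode_body helper name, the UTF-8 fallback and the sample HTML are my own assumptions for illustration, not anything from the course.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None):
    """Turn the bytes returned by read() into a str (hypothetical helper)."""
    # Assumption: fall back to UTF-8 when no charset is declared.
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(charset or "utf-8", errors="replace")


def get_page(url):
    """Fetch url and return its HTML as text, or "" if the fetch fails."""
    try:
        with urlopen(url) as response:
            # Charset from the Content-Type header, or None if not given.
            charset = response.headers.get_content_charset()
            return decode_body(response.read(), charset)
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; a malformed URL
        # raises ValueError before any request is made.
        return ""


# Offline illustration with a made-up snippet of HTML bytes:
raw = b'<html><a href="http://example.com/a">A</a></html>'
text = decode_body(raw)
start = text.find("<a href=")  # str.find with a str argument works on decoded text
```

With the page decoded to str, the string searching and content.split() in the rest of the crawler should behave as they did in Python 2. Returning "" on failure is just one choice; it lets the crawl carry on past pages that cannot be fetched.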
python python-3.x web-crawler