Crawler For LeetCode
This is a very simple tutorial on a Python crawler written with Requests and BeautifulSoup. The crawler grabs the LeetCode problems and your accepted (AC) solution code, and should be an easy read for beginners.
Motivation
I had solved almost 200 LeetCode problems before this blog went online, so moving all that code over by hand would have been a really tough job. I wanted to write some code, such as a crawler, to do the work for me, and in the end I chose Python with Requests and BeautifulSoup.
Thanks to Syaning for the help.
Preparation Work
First, make sure you have Python installed.
Requests is an HTTP library written in Python; it is built on urllib but is more convenient. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or days of work.
Install Requests with pip:

```shell
pip install requests
```

or manually:

```shell
git clone git://github.com/kennethreitz/requests.git
```

For Beautiful Soup, you can install it with `pip install beautifulsoup4`, or download the Beautiful Soup 4 source tarball and install it with setup.py:

```shell
python setup.py install
```
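As a quick sanity check that both libraries are importable, here is a minimal sketch that parses a literal HTML string (no network needed):

```python
import requests
from bs4 import BeautifulSoup

# parse a tiny HTML snippet to confirm BeautifulSoup works
soup = BeautifulSoup('<html><body><h1>hello</h1></body></html>', 'html.parser')
print(soup.h1.text)          # -> hello
print(requests.__version__)  # confirms Requests is importable
```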
Get the Problems List
All the problems are listed at https://leetcode.com/problemset/algorithms/. Inspecting the page with the Chrome developer console shows the request that returns the problem list as JSON.
We can then fetch that JSON with code like the following:

```python
import requests

# get all the algorithm list (the JSON endpoint seen in the developer console)
resp = requests.get('https://leetcode.com/api/problems/algorithms/')
```
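Historically, that response kept each problem under a top-level `stat_status_pairs` list. Here is a hedged sketch of pulling out the URL slugs, demonstrated on a trimmed-down sample rather than a live request; `extract_slugs` and the field names are assumptions based on the old API shape, not the post's exact code:

```python
def extract_slugs(data):
    """Pull every problem's URL slug out of the problem-list JSON.

    Assumes the historical response shape: a top-level 'stat_status_pairs'
    list whose items carry the slug under stat.question__title_slug.
    """
    return [pair['stat']['question__title_slug']
            for pair in data['stat_status_pairs']]

# a trimmed-down sample of the JSON the endpoint returned
sample = {
    'stat_status_pairs': [
        {'stat': {'question__title_slug': 'two-sum'}},
        {'stat': {'question__title_slug': 'add-two-numbers'}},
    ]
}
print(extract_slugs(sample))  # -> ['two-sum', 'add-two-numbers']
```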
After getting the problem list, we can use a for loop to get every problem's name and, from that, build each problem's URL.
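The loop that builds the URLs can be sketched as follows; `problem_urls` is a hypothetical helper name, and the URL pattern assumes each problem page lives at `/problems/<slug>/`:

```python
BASE = 'https://leetcode.com/problems/'

def problem_urls(slugs):
    # each problem page lives at /problems/<slug>/
    return [BASE + slug + '/' for slug in slugs]

print(problem_urls(['two-sum']))  # -> ['https://leetcode.com/problems/two-sum/']
```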
We use code like the following to fetch each problem's content:

```python
def get_alg_content(name):
    ...
```
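Only the signature of `get_alg_content` survives, so here is a sketch of what it might look like. It assumes the old page markup wrapped the description in a `<div class="question-content">`; both that selector and the `extract_content` helper are assumptions, and the post's real implementation may differ:

```python
import requests
from bs4 import BeautifulSoup

def get_alg_content(name):
    """Fetch one problem page and return its description text."""
    html = requests.get('https://leetcode.com/problems/%s/' % name).text
    return extract_content(html)

def extract_content(html):
    # assumed selector: the description div on the old problem pages
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', class_='question-content')
    return div.get_text(strip=True) if div else ''

# the parsing half can be tried without the network:
print(extract_content('<div class="question-content"><p>Given an array...</p></div>'))
```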
Once all of this is done, we end up with a folder containing all the problems.
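Writing the results out to that folder could look like the sketch below; `save_problems` and the one-file-per-slug layout are assumptions for illustration, not the post's actual code:

```python
from pathlib import Path

def save_problems(problems, folder='leetcode'):
    """Write each problem's content to <folder>/<slug>.txt.

    `problems` maps slug -> description text.
    """
    out = Path(folder)
    out.mkdir(exist_ok=True)
    for slug, content in problems.items():
        (out / (slug + '.txt')).write_text(content, encoding='utf-8')

save_problems({'two-sum': 'Given an array of integers...'}, folder='leetcode_demo')
print(sorted(p.name for p in Path('leetcode_demo').iterdir()))  # -> ['two-sum.txt']
```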
Get the full code on my GitHub.
Next Job
After grabbing all the problem content, the next job is to grab all my AC solutions. That is a little harder, and I will cover it in the next post.