Thursday, September 20, 2012

Download all Links from a Webpage with Python

Man, I love Python and BeautifulSoup!

I wanted to download a bunch of files from a webpage.  Fortunately we have Python.  Watch how ridiculously easy this is:

C:\>python
Python 2.6 (r26:66721, Oct  2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import *

>>> import urllib
>>> import urllib2
>>> baseurl = "http://whatever.com/files/here/"
>>> soup = BeautifulSoup(urllib2.urlopen(baseurl))
>>> links = soup.findAll("a")
>>> for link in links[5:]:
...    print link.text
...    urllib.urlretrieve(baseurl+link.text, link.text)



--- Now watch the fun, your downloads have begun. ---

So now a little explanation.  BeautifulSoup is an HTML parser, and a damn good one at that.  It can handle really badly formed HTML with grace, and makes it really easy to do screen-scraping.  Really cool stuff.  
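If you've never seen it untangle bad markup, here's a toy example (nothing to do with the page above, just a taste):

from BeautifulSoup import BeautifulSoup

# Two unclosed tags, no problem: BeautifulSoup closes them sensibly
messy = "<b>an unclosed tag <i>and another"
soup = BeautifulSoup(messy)
print soup.find("i").text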

Basically my variable 'soup' holds the parsed contents of the webpage as a tree of tag objects. That object has a ton of capabilities; you will want to check out the BeautifulSoup docs to learn everything it can do. How about this:

soup.findAll("a")  #Boom.  This will return a Python list of all "a" tags


Now all I do is loop through them all.  I skip the first five because, after inspecting the page, those links weren't files and I don't care about them.  Then I just call urllib.urlretrieve(url, filename).  link.text is the visible text of the link, which on this particular page happened to be the filename itself.
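By the way, findAll can also filter for you, which would've been cleaner than hard-coding that slice. Something like this should grab only the tags that actually carry an href (a hedged sketch; tweak it for your own page):

# Keep only <a> tags that have an href attribute at all
file_links = soup.findAll("a", href=True)

# Or match hrefs against a pattern, e.g. only .zip files
import re
zip_links = soup.findAll("a", href=re.compile(r"\.zip$"))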


In retrospect I probably should've done urlretrieve(baseurl + link['href'], link.text) (note that BeautifulSoup exposes tag attributes dict-style, so it's link['href'], not link.href), but you can figure that out for yourself.  This is meant to be inspiration, and to apply it you might have to make some changes to these nine lines.
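For the curious, here's roughly what that corrected version might look like as a standalone script. Treat it as a sketch, not gospel: urljoin handles relative and absolute hrefs alike, and os.path.basename is a quick-and-dirty way to pull out a filename (it'll misbehave on URLs with query strings).

import os
import urllib
import urllib2
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

baseurl = "http://whatever.com/files/here/"
soup = BeautifulSoup(urllib2.urlopen(baseurl))

for link in soup.findAll("a", href=True):
    fileurl = urljoin(baseurl, link["href"])  # resolves relative links against baseurl
    filename = os.path.basename(fileurl)
    if filename:                              # skip directory links, "..", and the like
        print "Downloading", filename
        urllib.urlretrieve(fileurl, filename)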

Nine lines!!



