Lots of websites contain information that can be of great value to you as a journalist. But this information might change every day, which makes it impossible to keep track of by hand. That’s why we let the computer do the work for us. This is called web scraping.
At NRC Handelsblad, I have written a script that scrapes a website where real estate brokers publish information about houses that are for sale. I scrape this site every week, which gives me a unique insight into the development of this turbulent market. I put all the information in a database that now contains information about almost 250,000 houses.
And last year, during the takeover of ABN Amro bank by Fortis, we wanted to make a map of all the branches of these two banks in the Netherlands. Because neither company wanted to give us all the addresses, I used a script to scrape the Dutch Yellow Pages, and I ended up with a long list that I could easily put on a map.
In this course, you will learn to do this yourself. You will learn to program your own web scraping machine. A computer program? This probably isn’t something you have ever done before, but it isn’t very difficult. Perhaps only slightly difficult.
There are several programming languages you can use for this job, but in this course we use Perl, a language that some people call the duct tape of the internet. In Perl, you are going to write a script that can go to a web address, fill out a search form and save the results you are looking for to your hard disk.
At the end of the session, you might have written your first ever computer program. But the story doesn’t end there. I never said programming wasn’t difficult at all. For every new task you come up with, you will have to write a new program. But after this course, you won’t be afraid to do so. And after writing several scripts, you’ll notice that it takes you only about an hour to write a new one.
During this hands-on session, we use the Firefox 3 web browser with the add-on ‘Links and forms’. The script is written in Perl, using the ActiveState Komodo debugger. Besides Perl, we also need the WWW::Mechanize module.
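To give an idea of what such a script might look like, here is a minimal sketch of the three steps described above: go to a web address, fill out a search form and save the results to disk. It uses WWW::Mechanize and is not the script from the session. It assumes that the search form is the first form on the page and that its text field is named `q`; both are hypothetical, and the sub name `scrape_search` is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Fetch a page, submit its first form with one search term and
# save the resulting page to a file.
# ASSUMPTIONS: the search form is form number 1 and its text field
# is called 'q'. Check the real page and adjust both.
sub scrape_search {
    my ($url, $query, $outfile) = @_;

    # autocheck => 1 makes the script die on any failed request
    my $mech = WWW::Mechanize->new( autocheck => 1 );

    # Step 1: go to the web address
    $mech->get($url);

    # Step 2: fill out the search form and submit it
    $mech->submit_form(
        form_number => 1,
        fields      => { q => $query },
    );

    # Step 3: save the results page on the hard disk
    $mech->save_content($outfile);

    return $outfile;
}

# Only run when called with exactly three arguments
if (@ARGV == 3) {
    scrape_search(@ARGV);
    print "Saved results to $ARGV[2]\n";
}
else {
    print STDERR "Usage: $0 URL QUERY OUTFILE\n";
}
```

You could run it as `perl scrape.pl http://www.example.com/search amsterdam results.html`, where the address is a placeholder. The field name and form number for a real site are the kind of details you would look up in the browser before writing the script.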
Trainer: Arlen Poort
When? Friday, November 21st, 3:45 PM
Where? Erasmushogeschool, Room 3.01