Scraping for houses
Having moved back to Romania, I decided I would need a place to live in, ideally one to buy. So we started looking online for various places and went to see a lot of them. Lots of work, especially footwork. But, being the data nerd that I am, I wanted to get smart about it and analyze the market.
For that, I needed data. For data, I turned to scraping. For scraping, I turned to Scrapy. While I did write my own scraper 5 years ago, I didn't want to reinvent the wheel yet again, and Scrapy is a well-known, widely used scraping framework in Python. And I was super impressed with it. I even started scraping things more often, just because it's so easy to do in Scrapy :D
In this post I am going to show you how to use it to scrape the olx website for housing posts in a given city, in 30 lines of Python. Later, we are going to analyze the data too.
First, you have to generate a new Scrapy project and a Scrapy spider. Run the following commands in your preferred Python environment (I currently prefer pipenv).
pip install scrapy
scrapy startproject olx_houses
scrapy genspider olx olx.ro
This will generate a file for you inside olx_houses/spiders, with some boilerplate already written, and you just have to extend it a bit.
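If memory serves, the generated spider looks roughly like this (the exact template depends on your Scrapy version):

import scrapy


class OlxSpider(scrapy.Spider):
    name = 'olx'
    allowed_domains = ['olx.ro']
    start_urls = ['http://olx.ro/']

    def parse(self, response):
        pass

In the rest of the post, this class gets renamed to OlxHousesSpider and filled in with the actual parsing logic.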
import scrapy
import datetime
today = datetime.date.today().strftime('%Y-%m-%d')
These are just imports, plus precomputing today's date, because I want each entry to record the date on which it was scraped.
class OlxHousesSpider(scrapy.Spider):
name = 'olx_houses'
allowed_domains = ['olx.ro']
start_urls = ['https://www.olx.ro/imobiliare/case-de-vanzare/oradea/',
'https://www.olx.ro/imobiliare/apartamente-garsoniere-de-inchiriat/oradea/']
Then we define our class, with the allowed domains. If we encounter a link that is not from these domains, it is not followed. We are interested only in olx stuff, so we allow only that. The start URLs are the initial pages from which the scraping starts. In our case, these are the listing pages for houses and flats.
def parse(self, response):
for href in response.css('a.detailsLink::attr(href)'):
yield response.follow(href, self.parse_details)
for href in response.css('a.pageNextPrev::attr(href)')[-1:]:
yield response.follow(href, self.parse)
parse is a special method, which is called by default for every URL, so the start URLs will be parsed with it. It is called with a response object, containing the HTML received from the website. This response object has all the HTML text, but it also holds a parsed DOM and allows direct querying with CSS and XPath selectors. If you return or yield a Request object from this method, Scrapy will add it to the queue of pages to be visited. A convenient way of doing this is the follow method on the response object: you pass it the URL to visit and the callback method to use for parsing (by default it's the parse method).
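To make that concrete, here is a minimal sketch of what the follow call is doing for us, written with explicit Request objects instead (the spider name here is made up, and this is not Scrapy's exact internals, just the idea):

import scrapy


class ExplicitRequestsSpider(scrapy.Spider):
    name = 'olx_explicit'  # hypothetical name, only for illustration
    allowed_domains = ['olx.ro']
    start_urls = ['https://www.olx.ro/imobiliare/case-de-vanzare/oradea/']

    def parse(self, response):
        for href in response.css('a.detailsLink::attr(href)').extract():
            # follow() resolves relative URLs against the current page for us;
            # with a plain Request we have to do it ourselves via urljoin
            yield scrapy.Request(response.urljoin(href), callback=self.parse_details)

    def parse_details(self, response):
        # placeholder; the real extraction is shown further down
        pass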
We are looking for two things on this page: 1) anchor links that have the detailsLink CSS class, which we want to parse with the parse_details method, and 2) anchor links that have the pageNextPrev CSS class. We look only at the last one of these links (that's what the [-1:] indexing does), because that one always points forward. We could look at all of them and it wouldn't cause duplicate requests, because Scrapy keeps track of which links it has already visited and doesn't visit them again. These links we parse with the default parse method.
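A convenient way to check that these selectors (or any others) still match what you expect is the scrapy shell; the class names below are the ones used in this post and may well have changed on the site since:

scrapy shell 'https://www.olx.ro/imobiliare/case-de-vanzare/oradea/'
# then, at the interactive prompt:
>>> response.css('a.detailsLink::attr(href)').extract()
>>> response.css('a.pageNextPrev::attr(href)')[-1:].extract()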
And now comes the fun part, getting the actual data.
def parse_details(self, response):
attrs = {
'url': response.url,
'text': response.css('#textContent>p::text').extract_first().strip(),
'title': response.css('h1::text').extract_first().strip(),
'price': response.css('.price-label > strong::text').extract_first().replace(" ", ""),
'date': today,
'nr_anunt': response.css('.offer-titlebox em small::text').re('\d+'),
'adaugat_la': response.css('.offer-titlebox em::text').re('Adaugat (de pe telefon) +La (.*),')
}
We extract various attributes from each listing's detail page. Some things are straightforward, like the URL, or the text and title, which are obtained by taking the text of some elements chosen with CSS selectors. For the price the selector is a bit more complicated and we have to prepare the text a bit (by removing spaces). For the ID of the listing and the date added field, we have to apply some regular expressions to obtain only the data that we want, without anything else.
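As a quick standalone aside (not part of the spider): the same extract_first() and re() calls can be tried out on a hand-built Selector, which is a handy way to sanity-check them. The HTML below is invented, just shaped like what the selectors expect:

from scrapy.selector import Selector

snippet = Selector(text="""
    <div class="price-label"><strong>100 000 €</strong></div>
    <div class="offer-titlebox"><em>Adaugat La 10:15, 2 ianuarie 2019
        <small>Anunt nr: 12345</small></em></div>
""")
print(snippet.css('.price-label > strong::text').extract_first().replace(" ", ""))
# -> '100000€'
print(snippet.css('.offer-titlebox em small::text').re(r'\d+'))
# -> ['12345']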
for tr in response.css('.details').xpath('tr/td//tr'):
title = tr.css('th::text').extract_first()
value = " ".join(x.strip() for x in tr.xpath('td/strong//text()').extract() if x.strip()!="")
attrs[title]=value
yield attrs
There is one last thing: some crucial information is displayed in a "structured" way, but it's marked up in a completely unstructured way. Things like the size of the house or its age. These values are in a table, with each row containing a table header cell holding the name of the attribute, followed by table data cells containing the values. We take all the values, join them with a space, and put them in the dictionary we built above, with the key being the text of the table header cell. We do this for all the rows in the table.
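Again as a standalone illustration (the HTML is invented, just shaped like the nested table this loop expects), here is what the row loop does:

from scrapy.selector import Selector

details = Selector(text="""
    <table class="details"><tr><td>
        <table>
            <tr><th>Suprafata utila</th><td><strong>120 m²</strong></td></tr>
            <tr><th>Compartimentare</th><td><strong>Decomandat</strong></td></tr>
        </table>
    </td></tr></table>
""")
attrs = {}
for tr in details.css('.details').xpath('tr/td//tr'):
    title = tr.css('th::text').extract_first()
    value = " ".join(x.strip() for x in tr.xpath('td/strong//text()').extract() if x.strip() != "")
    attrs[title] = value
print(attrs)
# -> {'Suprafata utila': '120 m²', 'Compartimentare': 'Decomandat'}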
And that's it. Easy peasy. Now all we have to do is run the scraper with the following command:
scrapy crawl olx_houses -o houses.csv
We wait a little bit and then that file has all the listings. And if we repeat this process (almost) daily for several months, we can spot trends and see how long houses stay on the market on average. But that's a topic for another post.