Website crawling with node.js
##Introduction
When you need to quickly scrape some data from a website, here’s a simple solution using:
I’ll ignore error checking for simplicity. Don’t ever do that - especially with node.js - it’ll kick your ass. As always - everything using CoffeeScript.
##Crawling a site
We need some initialization:
Initialize cookie jar in case we’re handling cookie-aware site using HTTP redirects:
Now start actual crawling:
We use cheerio to parse request body and find anchor text of each link on site. Results:
CNN Shop -> http://www.turnerstoreonline.com/
Site map -> /sitemap/
CNN Partner Hotels -> http://partners.cnn.com/
CNN en ESPAÑOL -> /espanol/
CNN Chile -> http://www.cnnchile.com
CNN México -> http://www.cnnmexico.com
العربية -> http://arabic.cnn.com/
日本語 -> http://www.cnn.co.jp/
Türkçe -> http://www.cnnturk.com/
Turner Broadcasting System, Inc. -> http://www.turner.com/
Terms of service -> /interactive_legal.html
Privacy guidelines -> /privacy.html
Ad choices -> /services/ad.choices/
Advertise with us -> http://www.cnnmediainfo.com/
License our content -> /intlsyndication/
About us -> /about/
Contact us -> /feedback/
Work for us -> http://www.turner.com/careers/
Help -> /help/
CNN TV -> /CNNI/
HLN -> /HLN/
Transcripts -> http://transcripts.cnn.com/TRANSCRIPTS/