Scraping Craigslist From 2006 to 2023

by stink - on Mar 3, 2023 1:11pm to craigslist scraping comments(2)
I've been scraping Craigslist for freelance programming jobs since 2006. This post details the various setups over the years, and how I arrived at being able to scrape their 2022 AJAX pages.

So, initially, believe it or not, I was using the CL RSS feeds in Google Reader. And it was glorious!

Then Google Reader closed, and I used RSSOwl and another feed reader for the next ten years.

Then, I believe, Craigslist blocked all the popular RSS readers. So I think the first scraper I wrote simply scraped the RSS feeds directly.

Around 2020, Craigslist removed all the RSS feeds. That is when I wrote a scraper for the actual web pages.

And that brings us to now. In late 2022, Craigslist slowly rolled out a new AJAX page structure that broke my scraper.

Now, most developers will probably laugh at me, but I started to dissect the front-end code in order to scrape the data from the AJAX calls. I found the AJAX URL and it had most of the data I was after, but some necessary data was missing.

I stopped working on it for a number of days, and while I was doing something else I happened to come across Selenium. I had looked into Selenium a couple of years back, but it wasn't really in my toolbox. So, when I bumped into it again, it immediately gave me the idea to load the page in a browser, wait for the AJAX to finish, and then scrape the rendered page as if it were static HTML. And it worked like a charm! I didn't have to learn their AJAX!
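
For anyone curious, here is a minimal sketch of that idea in Python. It assumes Selenium 4 with Chrome installed. The function names, the URL, and the `posting-title` class are made up for illustration; they are not Craigslist's actual markup, and this is not my exact scraper.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, wait_selector, timeout=15):
    """Load url in a real browser, wait for the AJAX content, return the final HTML."""
    # Imported lazily so the parsing half below also works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one element matching the selector is in the DOM,
        # i.e. the AJAX call has filled in the listing.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source  # the DOM as it stands after the scripts ran
    finally:
        driver.quit()


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags that carry a given class and an href."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, css_class):
    """Parse already-rendered HTML exactly like a static page."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return parser.links


# Hypothetical usage (placeholder URL and selector):
#   html = fetch_rendered_html("https://example.org/search", "a.posting-title")
#   for href, title in extract_links(html, "posting-title"):
#       print(title, href)
```

The point of splitting it in two is that only `fetch_rendered_html` needs a browser. Once it hands back the page source, the rest is the same kind of static HTML scraping as before.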

I just wonder whether the new CL AJAX pages are mainly meant to prevent scraping, or whether they did it for a different reason.
  • by reol - on Mar 13, 2023 1:09am
    Glad to hear that it worked well for you! I had a side project in a similar vein, but it also needed visuals. I somehow got into the weeds of getting Selenium and Tesseract to screenshot and OCR the pages I needed, and everything became such a hassle...