Scraping Craigslist From 2006-2023

by stink on Mar 3, 2023 to before comments(2)

I've been scraping Craigslist for freelance programming jobs since 2006. This post details the various setups over the years, and how I arrived at being able to scrape their 2022 AJAX pages.

So, initially, believe it or not, I was using the CL RSS feeds in Google Reader. And it was glorious!

Then Google Reader closed and I used RSSOwl and another feed reader for the next ten years.

Then, I believe, Craigslist blocked all the popular RSS readers. So I think the first scraper I wrote simply scraped the RSS feeds.
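
Scraping a feed mostly means pulling the title and link out of each item. Here is a minimal Python sketch of that idea, not the scraper described in this post. It assumes an RSS-style feed with `<item>` elements; the `parse_feed` helper and the sample feed are hypothetical, and the real Craigslist feed layout may have differed.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def parse_feed(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    posts = []
    for element in root.iter():
        if local_name(element.tag) != "item":
            continue
        # Map each child tag (title, link, ...) to its text content.
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        posts.append({"title": fields.get("title", ""), "link": fields.get("link", "")})
    return posts


# Hypothetical sample feed, for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item><title>PHP developer needed</title><link>https://example.org/post/1</link></item>
  <item><title>Python scraper gig</title><link>https://example.org/post/2</link></item>
</channel></rss>"""

for post in parse_feed(SAMPLE_FEED):
    print(post["title"], post["link"])
```

Matching on the local tag name keeps the sketch working whether or not the feed puts its items in an XML namespace.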

Around 2020 Craigslist removed all the RSS feeds. This is when I wrote a scraper to scrape the actual webpages.

And that brings us to now. In late 2022, Craigslist slowly rolled out a new AJAX page structure that broke my scraper.

Now, most developers will probably laugh at me, but I started to dissect the front-end code in order to scrape the data from the AJAX calls. I found the AJAX URL and it had most of the data I was after, but some necessary data was missing.

I stopped working on it for a number of days, and while I was doing something else I happened to come across Selenium. I had looked into Selenium a couple of years back, but it wasn't really in my toolbox. So, when I bumped into it here, it immediately gave me the idea to load the page in a real browser, wait for the AJAX to finish, and then basically scrape static HTML. And it worked like a charm! I didn't have to learn their AJAX!
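
The load, wait, then scrape idea can be sketched in Python roughly as follows. This is an illustration under assumptions, not the actual scraper: `fetch_rendered_html`, `extract_links`, and the CSS selector passed as `wait_css` are hypothetical names, Firefox is an arbitrary browser choice, and the right selector depends on whatever markup the page renders.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link we are tracking.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Scrape the rendered page as if it were plain static HTML."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch_rendered_html(url, wait_css, timeout=20):
    """Load url in a browser, wait for wait_css to appear, return the final HTML."""
    # Imported here so the parsing helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until the AJAX-rendered element exists (or raise on timeout).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()
```

A call would look something like `extract_links(fetch_rendered_html(url, "li.result"))`, where `li.result` stands in for whichever element only shows up once the AJAX data has loaded. Waiting on a specific element is usually more reliable than sleeping for a fixed number of seconds.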

I just wonder if the new CL AJAX pages are mainly to prevent scraping, or did they do it for a different reason?

by reol on Mar 13, 2023
Glad to hear that it worked well for you! I had a side project in a similar vein, but it also needed visuals. I somehow got into the weeds of getting Selenium and Tesseract to screenshot and OCR the pages I needed, and everything became such a hassle...
by mozillaj on Sep 21, 2023
Have you taken a look at the CL mobile app? I do web scraping for work and most of the time web apps are a lot less secure when it comes to exposing APIs or raw data.