What We Learned Building a Browser-Based Scraper

When we started building Signalstac, the obvious question was: why not use APIs? Every major platform has one. Reddit has a well-documented API. Hacker News has a public Firebase endpoint. GitHub has GraphQL. Why run a browser when you can just fetch JSON?

The API gap

APIs give you data. They do not give you context. A Reddit API response tells you the title, score, and comment text of a thread. It does not tell you the tone, the community dynamics, or whether the conversation has moved to a different subthread. More importantly, APIs change, get rate-limited, and sometimes disappear — Reddit's API pricing changes in 2023 were a wake-up call for everyone who relied on it.

A browser-based approach is more work upfront, but it decouples us from API availability. If a platform changes its frontend, we update the adapter. If it changes its API, we do not care because we are not using it.

The browser tax

Running a headless browser is significantly more expensive than making an API call. Each page load consumes memory, CPU, and time. A full browser instance can take hundreds of megabytes of RAM. Scaling this to browse dozens of communities every 30 minutes is not trivial.

We use a pool of browser instances managed through a queue. Each adapter requests a browsing slot, loads the page, extracts what it needs, and releases the slot. The pool is sized to the number of communities we need to visit within the 30-minute cycle. We use Deno for the scraper workers, which gives us native SOCKS5 proxy support and low overhead per instance.
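The request-a-slot, use-it, release-it pattern can be sketched as a small counting semaphore. This is a minimal illustration, not the production code: `SlotPool`, `withSlot`, and `slotsNeeded` are hypothetical names, and the sizing formula is an assumed rule of thumb (total browsing time divided by cycle length).

```typescript
// Hypothetical slot pool: caps how many pages are open at the same time.
class SlotPool {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(size: number) {
    if (size < 1) throw new Error("pool size must be at least 1");
    this.available = size;
  }

  // Resolves once a slot is free; callers wait in FIFO order.
  acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => {
      this.waiters.push(() => resolve());
    });
  }

  // Hands the slot straight to the next waiter, or returns it to the pool.
  release(): void {
    const next = this.waiters.shift();
    if (next) next();
    else this.available++;
  }

  // Runs a task inside a slot and always releases it, even if the task throws.
  async withSlot<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }

  get free(): number {
    return this.available;
  }

  get queued(): number {
    return this.waiters.length;
  }
}

// Assumed sizing rule: enough slots that every visit fits inside one cycle.
function slotsNeeded(
  communities: number,
  secondsPerVisit: number,
  cycleSeconds: number,
): number {
  return Math.max(1, Math.ceil((communities * secondsPerVisit) / cycleSeconds));
}
```

An adapter would wrap its page visit in `pool.withSlot(...)`, so a crash in extraction cannot leak a slot. Handing the slot directly to the next waiter keeps the queue fair when the pool is saturated.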

Proxies and anti-bot measures

Some platforms are more welcoming to automated browsers than others. We maintain a pool of residential proxies and route requests through them. Each adapter has a configurable request profile — how fast to scroll, how long to wait between pages, what headers to send — that mimics human behaviour closely enough to avoid triggering rate limits.
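A request profile of this kind might look roughly like the following. The field names and values are assumptions made for illustration; the post only says that a profile covers scroll speed, wait times, and headers.

```typescript
// Hypothetical shape of a per-adapter request profile.
interface RequestProfile {
  scrollStepPx: number;              // pixels scrolled per step
  scrollPauseMs: [number, number];   // min/max pause between scroll steps
  pageDelayMs: [number, number];     // min/max wait between page loads
  headers: Record<string, string>;   // extra headers sent with each request
}

// Picks a delay inside the range. The random source is injectable so the
// behaviour can be made deterministic in tests.
function jitteredDelay(
  range: [number, number],
  random: () => number = Math.random,
): number {
  const [min, max] = range;
  return Math.round(min + random() * (max - min));
}

// Illustrative values only, not tuned for any real platform.
const exampleProfile: RequestProfile = {
  scrollStepPx: 400,
  scrollPauseMs: [300, 900],
  pageDelayMs: [2000, 6000],
  headers: { "Accept-Language": "en-GB,en;q=0.9" },
};
```

Expressing waits as ranges rather than fixed values avoids the perfectly regular timing that a fixed sleep would produce.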

The ratcheting of anti-bot measures has been one of the bigger surprises of this project. Every platform is in an arms race against scrapers, and that race affects us even though we are not doing anything malicious. We have had to implement session management, cookie rotation, and browser fingerprint normalisation just to stay reliable.

The thing that surprised us most

The biggest unexpected insight was not technical. It was that browsing through a real browser gives us access to the presentation of content — how threads are laid out, what is visually prominent, what gets collapsed behind "load more" buttons. That layout information turns out to be useful for scoring. A thread that a platform itself highlights (stickied, pinned, featured) is often more relevant than one buried on page three. We factor this into our relevance scoring, and it is something an API would never expose.
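One way to fold presentation into a score is as a multiplier on a content-only relevance value. The sketch below assumes that approach; the signal names and weights (1.5 for pinned, 0.5 for collapsed, decay by page number) are invented for illustration and are not Signalstac's actual scoring.

```typescript
// Hypothetical presentation signals read from the rendered page.
interface ThreadSignals {
  baseScore: number;   // relevance from content alone, in the range 0..1
  pinned: boolean;     // stickied, pinned, or featured by the platform
  collapsed: boolean;  // hidden behind a "load more" control
  pageNumber: number;  // 1-based page on which the thread appears
}

// Assumed weights: boost what the platform highlights, discount what it buries.
function prominenceMultiplier(t: ThreadSignals): number {
  let multiplier = 1;
  if (t.pinned) multiplier *= 1.5;
  if (t.collapsed) multiplier *= 0.5;
  multiplier /= Math.max(1, t.pageNumber);
  return multiplier;
}

// Final score stays within 0..1 so it remains comparable across threads.
function relevanceScore(t: ThreadSignals): number {
  return Math.min(1, t.baseScore * prominenceMultiplier(t));
}
```

With these weights, a pinned thread on page one outranks an otherwise identical thread on page three by a factor of 4.5.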

What we would do differently

If we were starting today, we would invest more in browser state management earlier. Early on, we treated each browsing session as stateless: open a page, read it, close. We later realised that maintaining warm browser sessions with pre-loaded cookies, cached resources, and already-resolved DNS drastically improved speed and reduced the chance of being flagged.
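The difference between stateless and warm sessions can be sketched as a small cache keyed by host. This is a simplified illustration: `SessionCache`, `WarmSession`, and the idle timeout are hypothetical, and a real session would hold a live browser context rather than a plain cookie map.

```typescript
// Hypothetical record of the state that is kept between visits to a host.
interface WarmSession {
  host: string;
  cookies: Record<string, string>;
  lastUsedMs: number;
}

class SessionCache {
  private sessions = new Map<string, WarmSession>();
  private maxIdleMs: number;
  private now: () => number;

  // The clock is injectable so that expiry can be tested without waiting.
  constructor(maxIdleMs: number, now: () => number = () => Date.now()) {
    this.maxIdleMs = maxIdleMs;
    this.now = now;
  }

  // Reuses the session for a host if it is still fresh; otherwise starts cold.
  checkout(host: string): { session: WarmSession; warm: boolean } {
    const t = this.now();
    const existing = this.sessions.get(host);
    if (existing && t - existing.lastUsedMs <= this.maxIdleMs) {
      existing.lastUsedMs = t;
      return { session: existing, warm: true };
    }
    const fresh: WarmSession = { host, cookies: {}, lastUsedMs: t };
    this.sessions.set(host, fresh);
    return { session: fresh, warm: false };
  }
}
```

Expiring idle sessions is a design choice in this sketch: it bounds how long stale cookies are replayed, at the cost of an occasional cold start.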

We would also have built the adapter abstraction earlier. Our first prototype had Reddit and HN hardcoded. Extracting a common interface — load page, extract threads, extract comments — made adding new platforms an afternoon project instead of a week-long effort.
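The three operations named above (load page, extract threads, extract comments) map naturally onto an interface. The sketch below is one possible shape, with hypothetical type names and an in-memory stand-in adapter in place of a real browser.

```typescript
interface ThreadSummary {
  id: string;
  title: string;
  url: string;
}

interface ThreadComment {
  threadId: string;
  text: string;
}

// Each platform supplies its own page type and extraction logic.
interface PlatformAdapter<Page> {
  name: string;
  loadPage(url: string): Promise<Page>;
  extractThreads(page: Page): ThreadSummary[];
  extractComments(page: Page, thread: ThreadSummary): ThreadComment[];
}

// Stand-in page type and adapter, used only to show the interface in use.
type FakePage = {
  url: string;
  items: Array<{ id: string; title: string; replies: string[] }>;
};

const inMemoryAdapter: PlatformAdapter<FakePage> = {
  name: "in-memory",
  async loadPage(url) {
    return {
      url,
      items: [{ id: "t1", title: "Example thread", replies: ["First", "Second"] }],
    };
  },
  extractThreads(page) {
    return page.items.map((item) => ({
      id: item.id,
      title: item.title,
      url: `${page.url}#${item.id}`,
    }));
  },
  extractComments(page, thread) {
    const item = page.items.find((candidate) => candidate.id === thread.id);
    return (item?.replies ?? []).map((text) => ({ threadId: thread.id, text }));
  },
};

// Platform-agnostic driver: works with any adapter that satisfies the interface.
async function collect<Page>(
  adapter: PlatformAdapter<Page>,
  url: string,
): Promise<Array<{ thread: ThreadSummary; comments: ThreadComment[] }>> {
  const page = await adapter.loadPage(url);
  return adapter.extractThreads(page).map((thread) => ({
    thread,
    comments: adapter.extractComments(page, thread),
  }));
}
```

Because `collect` depends only on the interface, adding a platform means writing one new adapter object and leaving the driver untouched.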

The scraper is not the product. The product is what you do with the threads — scoring, drafting, routing. But the scraper has to work, and it has to work reliably, or nothing else matters.
