In the five days between Christmas and New Year, I set myself a challenge: what could I actually ship in that time?
I ended up building a tool that lets AI models run for 2-10 hours researching a single topic. Not the usual 30-second ChatGPT response. Actual deep research that produces 10,000+ word reports with citations.
What happens if you let AI actually have time to think?
Not the 30-second ChatGPT answer. Not the quick summary. What if you gave a model hours instead of seconds? Thousands of tokens instead of hundreds? I wanted to open the tap and see what came out.
This isn't my first AI project, so the scaffolding went fast. Cursor running Sonnet 4 for the routine stuff, Opus 4.5 when I needed to think through architecture. Environment, auth, database — about an hour.
Integrated Claude and GPT. Built the core user flows.
Then I hit the first surprise: the UI for generating and editing a report's table of contents took longer than both API integrations combined. The human part — letting users shape what they actually want researched — that's where the time went.
By midnight: a working prototype that could take a question and produce an outline of a report.
Woke up feeling good. Read through some of the outputs properly.
Hallucinations everywhere. Fabricated citations. Sources that didn't exist. The model would write with complete confidence about papers it had invented. This wasn't surprising — it's a known problem — but seeing how rampant it was hit differently.
Spent the day reading about multi-agent verification architectures. The idea: what if a second model acted as an adversarial reviewer? One AI writes, another AI challenges. Disagreements get flagged.
Seemed promising. Hadn't built it yet.
This was the hardest day.
Getting two AI models to challenge each other productively is harder than it sounds. My first attempts were disasters. Sometimes the reviewer would rubber-stamp everything — "Looks great!" — which defeated the purpose. Other times they'd spiral into endless disagreement, nitpicking each other into oblivion.
The models needed to be skeptical but not obstructionist. Critical but constructive. Finding that balance took all day.
Around 4pm I considered scrapping the whole approach.
Didn't.
By evening I had something that worked. Two models, genuinely checking each other, catching hallucinations I would have missed.
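The loop that finally worked looks roughly like this. This is a minimal sketch, not the production code: `verified_draft`, the prompt wording, the APPROVED convention, and the round cap are all my illustration, and `call_model(role, prompt)` stands in for whatever chat-completion API you use.

```python
MAX_ROUNDS = 3  # hard cap so the two models can't argue forever

def verified_draft(topic, call_model):
    """One model writes, a second model challenges it.

    `call_model(role, prompt) -> str` wraps any chat-completion API.
    Returns (draft, unresolved_issues): if the reviewer never signs off
    within MAX_ROUNDS, the leftover objections are surfaced to the reader
    instead of being silently dropped.
    """
    draft = call_model("writer", f"Write a sourced section on: {topic}")
    unresolved = []
    for _ in range(MAX_ROUNDS):
        review = call_model(
            "reviewer",
            "Challenge every factual claim and citation below. Reply "
            "APPROVED if nothing is wrong, otherwise list the issues:\n"
            + draft,
        )
        if review.strip().startswith("APPROVED"):
            return draft, []
        unresolved = [line for line in review.splitlines() if line.strip()]
        draft = call_model(
            "writer",
            f"Revise the draft to fix these issues:\n{review}\n\n{draft}",
        )
    return draft, unresolved
```

The round cap is the anti-obstruction half of the balance: the reviewer gets a fixed number of chances to object, and anything it still disputes at the end is returned rather than argued about forever.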
Morning was satisfying. Built the output layer. PDF export. Google Docs integration. Citation formatting.
Added a feature I'm quietly proud of: the report shows you where the verification model still disagrees with the primary model. You can see the contested claims. The reader gets to decide.
Then, around midnight, everything broke.
The longer reports were failing. Silently. I'd get truncated outputs, missing sections, incoherent conclusions. Took me an hour to realise: context windows. A model can't draft and track a 20,000+ word report in a single pass — between the sources, the draft so far, and the reviewer's feedback, I was hitting the ceiling.
First fix: brought in Google Gemini for its larger context window. Helped, but not enough.
2am. Still failing on the biggest reports.
The solution: a hierarchical processing pipeline. Decompose the report into sections. Process each section independently with full context. Then run a consolidation pass — stitching sections together, resolving cross-references, smoothing transitions. Recursive refinement until the output is coherent.
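Sketched in code, under assumed names (the function and prompt wording are my illustration, and `generate(prompt) -> str` stands in for a model call):

```python
def build_report(outline, generate):
    """Hierarchical pipeline sketch: draft sections independently, then
    run a consolidation pass to smooth the seams.

    Each section is written in isolation so no single call has to hold
    the whole 20,000+ word report. The stitch step gives each section
    only short summaries of the others, so it too fits in one window.
    """
    # Pass 1: draft every section with the full context budget to itself.
    drafts = [generate(f"Write the report section: {title}")
              for title in outline]

    # Pass 2: summarize each draft, then revise each section against
    # its neighbours' summaries (not their full text).
    summaries = [generate(f"Summarize in two sentences:\n{d}")
                 for d in drafts]
    stitched = []
    for i, draft in enumerate(drafts):
        neighbours = "\n".join(s for j, s in enumerate(summaries) if j != i)
        stitched.append(generate(
            "Revise this section so transitions and cross-references "
            f"agree with the rest of the report:\n{neighbours}\n---\n{draft}"))
    return "\n\n".join(stitched)
```

In practice the consolidation pass repeats until the output stops changing; one round is shown here for brevity.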
Then another problem: what happens when Claude rate-limits you at 2am mid-report? Built a failover system that automatically switches to GPT-4 or Gemini if the primary provider fails. No lost work.
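The failover itself is simple in outline: an ordered provider list and a retry loop. A sketch with hypothetical names (`ProviderError`, the wrapper signature), not the real implementation:

```python
import time

class ProviderError(Exception):
    """Raised by a provider wrapper on rate limits or outages."""

def complete_with_failover(prompt, providers, retries_per_provider=2):
    """Try each provider in order; retry briefly, then move on.

    `providers` is an ordered list of (name, call) pairs, where
    call(prompt) -> str wraps one vendor's API. The in-progress report
    is never lost: the caller just gets the first completion that
    succeeds, tagged with the provider that produced it.
    """
    last_error = None
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(prompt)
            except ProviderError as exc:
                last_error = exc
                time.sleep(0)  # real code: exponential backoff, 2 ** attempt
    raise RuntimeError(f"all providers failed: {last_error}")
```

Tagging the result with the provider name matters downstream: a report stitched from three vendors needs to record who wrote what.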
I went to bed at 3am.
The home stretch.
Stripe integration for pay-per-use billing. Connected the research layer to Semantic Scholar, OpenAlex, CORE, and arXiv — over 200 million academic papers now searchable. Server setup. Hosting. A few hours of testing.
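Fanning a query out across those indexes is mostly a matter of building the right request per provider. A sketch of the query construction, using the publicly documented search endpoints for three of the four (CORE is omitted here because its API requires a key; verify all endpoints against each provider's current docs before relying on them):

```python
from urllib.parse import quote_plus

# Public search endpoints, as publicly documented at time of writing.
ENDPOINTS = {
    "semantic_scholar":
        "https://api.semanticscholar.org/graph/v1/paper/search?query={q}",
    "openalex": "https://api.openalex.org/works?search={q}",
    "arxiv": "http://export.arxiv.org/api/query?search_query=all:{q}",
}

def search_urls(query):
    """Return one ready-to-fetch search URL per provider."""
    q = quote_plus(query)  # URL-encode spaces and punctuation
    return {name: tmpl.format(q=q) for name, tmpl in ENDPOINTS.items()}
```

Each index returns a different response shape (JSON for Semantic Scholar and OpenAlex, Atom XML for arXiv), so the real work is normalising results into one citation format — which is exactly what the verification layer then checks against.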
Launched that evening. Shared it with a few friends.
Then people started using it. Then more people. Then people I didn't know.
It's still got bugs. I'm not fixing any of them.
But if you're curious what happens when you give AI time to actually think — give it a try. Pick a question you've been sitting on. Something that deserves more than a quick answer. See what comes back.