Skip to main content
Data Collection Automation

What if AI ran the crawler end to end โ€” from design to incident response?

Reselling alert platform

When a site changes and collection stops, AI finds the cause and fixes it; a person only approves the change. AI takes over the work people used to do. The story of rebuilding a system that watched dozens of shopping sites 24/7 so it could run on its own.

Concept art of a self-healing crawler: an emerald AI-agent core re-stitching a broken data link and producing a fix PR

The crawler ran fine โ€” as long as a person was attached to it

The next wall of a custom crawler: every site change still needed a developer

This customer already ran a custom system watching dozens of Korean shopping sites (Nike, Musinsa, Adidas, and more) in real time. Alerting members about limited-edition drops and restocks faster than competitors, it was the customer's core system. The problem was operations. Shopping sites change their pages constantly, and each time collection quietly comes back empty. Only after someone noticed 'hey, no alerts are coming' would a staffer open the site, find what changed, fix it, check it, and bring it back up โ€” again and again, every time a site changed.

At first, every new site meant copying a program and bolting it on. Past 30 sites and three servers, the upkeep grew. Each server ran dozens โ€” nearly 70 programs in total โ€” and when one stopped, someone had to see the alert and restart it by hand. When a site changed and collection broke, only the person who built that site could fix it, and that one person's time held everything back. A structure with less human dependence was needed.

The specific difficulties

  • โ€ขWhen a shopping site changes its page, collection quietly stops โ€” noticed only much later
  • โ€ขEvery break needs the responsible developer to dig into the site and fix it by hand
  • โ€ขAbout 70 programs with no auto-recovery โ€” a person had to restart each one
  • โ€ขOnly the person who knew a site could fix it, so work piled on one person
  • โ€ขEven adding one new site meant copying, editing, and registering by hand
  • โ€ขAn old structure limited how much each server could handle

โ€œWe had built the crawler itself well. But every time a site changed, a person still had to step in. Someone had to notice alerts weren't coming, a developer had to fix the code โ€” and by then we were already late. We asked whether this could run without people.โ€

โ€” Customer operations team

A crawler that detects its own failures and opens a PR with the fix

Replacing the human 'analyze โ†’ fix โ†’ test โ†’ deploy' loop with an AI Agent

The idea was simple: move the failure response people used to do into the system itself. When collection comes back empty several times in a row, the system declares a failure on its own, AI re-examines the site to find a fix and drafts the change, then asks a person only to approve. To get there, we first set the boundary AI could safely touch and pulled the per-site bits out of the program into separate 'settings.'

The first thing we changed was the only-a-human-can-fix-it structure. Each site used to have its collection method buried inside the program, so any change meant a developer opening the code. We moved that out into 'settings' outside the code. Now what to collect, and how, is decided in settings, not in the program โ€” a safe surface for AI to touch, set up first.

This system doesn't run on one machine but across three servers. The strongest hub server carries 28 collectors plus alerts, self-recovery, and monitoring, while the other two run 22 and 20 collectors and send their results to the hub. Responsibilities are split so a problem on one server doesn't stop the whole thing.

โ† Scroll horizontally to view the chart

Three servers split the work โ€” collection results and recovery signals converge on the hub

Crawler monitoring dashboard: per-domain zero-result, heal triggers, crawl errors, and Discord 429 rate

Dashboard โ€” per-site collection status and auto-recovery events at a glance

On top of that we added a 'fixes itself' flow. When a site comes back empty several times in a row (five by default), the system treats it as a failure and first alerts the ops channel: 'collection on site X is broken.' It usually takes just a few minutes โ€” the system raises its hand before a person even notices.

Automated healer warning sent to the ops Discord when a selector fails

An automatic warning to the ops channel the moment collection breaks โ€” the system reports before anyone notices

It doesn't stop at alerting. AI doesn't fix by guessing. It goes into the site itself to see how the page changed and whether there's a faster, more reliable way to get the data, then re-tunes the collection settings the best way it can. It does what a human would during an incident โ€” except it doesn't wait for a person, starting at midnight or on a weekend all the same.

What we cared about most was keeping AI's fixes from reaching production carelessly. A fix AI makes must first pass an automatic check. On pass, it only files a review request saying 'here's how I fixed it' โ€” it never goes straight to production. We built both a fully automatic path that runs from detection to review request with no human touch, and a semi-automatic path an operator starts with one button. Either way, a human makes the final call. If fixes keep failing, it stops at a set limit and calls a person.

โ† Scroll horizontally to view the chart

A fix becomes a review request only after passing the automatic check โ€” a human decides production

GitHub pull request opened automatically by the crawler-healer bot, switching the strategy from html to api after a selector break

A review request AI filed automatically โ€” a person reviews just this and decides whether to apply it (never straight to production)

One more thing. Giving AI access doesn't mean letting it do anything. The more autonomously AI acts, the more you have to fence off 'what it can touch, and how far' first โ€” the currently recommended approach, which we followed. The only thing AI can change is a site's collection settings; the system's core and sensitive information are locked away entirely. Content scraped from a site is treated as something to read, never a command, so even if a page hides 'ignore your previous instructions,' AI isn't swayed. And who changed what, when, and who approved it is all kept on record, so it can be retraced later.

We also reworked the invisible plumbing underneath. We swapped the old, slow internals for modern ones so the same servers handle more sites, faster. Programs used to need a person to restart them one by one; now the system revives them the moment they stop. Adding a server or more sites just takes a settings change. And whatever happened in between is all there on the dashboard, site by site.

Server resource dashboard: CPU, memory, disk, and load average across main/sub/sub2

Server dashboard โ€” the state of all three servers in real time

If everything so far was about making it run, the next part was seeing โ€” at a glance โ€” whether it runs well. We gather the activity records pouring out of dozens of programs into one place and pull out just the signals worth worrying about. When something looks off, we drill straight down to which site had what problem, right down to the raw record.

Log monitoring dashboard: log lines per minute, errors/warnings, 76 active containers, per-server log volume

Activity monitoring โ€” every program's activity on one screen

Log detail view: structured error/warning logs and the raw full log stream

Detailed records โ€” trace a failing request down to its time and count, from the raw record

Finally, we wrote out the rules for how AI should handle this system in advance. We noted each site's quirks and past changes, and bundled the frequent jobs (analyze a site, fix it, add a new site, run a full check) so each can be called with a single command. Adding a new site takes one line โ€” 'add X' โ€” and AI finds the site, confirms it, lays out the settings, checks, docs, and review request, then waits, switched off, for a human's final approval.

The heart of the system we built

A system that detects its own failures

If collection comes back empty several times in a row, it treats it as a failure and starts recovery on its own โ€” usually alerting the ops channel within minutes.

AI does the repair

AI examines the live site, finds a better way to collect, fixes the settings, and gets it through the automatic check.

A review request only after the check

Even an AI-made fix must pass an automatic check to become a review request. A person decides whether to apply it.

Fully automatic and semi-automatic

We built both a path that reaches the review request with no human touch, and one an operator starts with a single button.

Three servers that don't die

When any of ~70 programs stops, the system โ€” not a person โ€” revives it first. Adding a server or more sites just takes a settings change.

One click for a new site

'Add X' is all it takes โ€” AI prepares the settings, checks, docs, and review request, then waits for approval.

Human involvement shifted from 'editing code' to 'approving'

The owner of failure response moved from developers to an AI Agent

The real win here is less about 'how much faster' and more about 'who does the work.' Before, when a site changed, a developer carried it from start to finish. Now the system reports its own failure, AI examines and fixes it and clears the check, then shows a person 'here's how I fixed it.' The work that used to pile on one person is gone, and people only decide whether to apply it.

Human โ†’ AI
Site-change response
Examining, fixing, and checking โ€” all done automatically by AI
2โ€“3 min
Failure detection
Self-detects repeated collection failures and starts recovery
~70 programs
Concurrent ops
Split across three servers; auto-recovers when one stops
One click
New-site onboarding
Drop a URL and the settings, checks, and review request are all prepared for you

What actually changed

The system knows about failures before people do

Before, response started only after someone noticed 'no alerts.' Now the system catches the collection failure itself, alerts the ops channel within minutes, and begins recovery.

Selector fixes are no longer tied to one developer

The bottleneck where only the person who knew a site could fix it is gone. AI examines and fixes; the operator just reviews the request and decides.

Even a fix doesn't reach production carelessly

Even an AI-made fix must pass an automatic check to become a review request. Production only changes by a human's approval, and if failures pile up it stops and calls a person. Automated, but the safety switch stays in human hands.

Nobody runs to a dead process anymore

Programs used to need a person to restart them; now the system revives them first. Whatever happened is all there on the dashboard, site by site.

A system that only runs when a person is watching it โ€” can AI guard it instead?

Is it really a given that 'a developer must step in every time a site changes'? This customer thought so too. A structure that detects and fixes its own failures โ€” let's talk, and you'll see how.