Crawling at scale

Seed a queue, drain it with bounded workers, and store every record in a local DuckDB database.

The single-shot commands are enough for most work. When you want to collect a lot, amz has a queue and an optional local store so a crawl survives restarts and never loses what it already fetched.

The store

The store is a DuckDB database amz drives by shelling out to the duckdb binary, never through cgo. It is optional: install duckdb and the db and crawl commands light up; leave it out and every fetch command still works.

amz db path                # where the database lives
amz db stats               # row counts per table
amz db query "select asin, data->>'price' price from products order by price desc limit 10"
amz db vacuum              # compact
amz db reset               # delete the file

Each surface has its own table (products, reviews, qa, offers, bestsellers, categories, brands, sellers, authors) plus the queue. Every table keeps a few key columns typed for fast filtering and the full record in a data JSON column, so any field is reachable with DuckDB's JSON arrow: data->>'brand', data->>'rating', and so on.

Seeding the queue

amz seed pushes work onto the queue. Give it ASINs and URLs as arguments or a file:

amz seed B084DWG2VQ B07XJ8C8F5
amz seed --file asins.txt              # one ASIN/URL per line
cat asins.txt | amz seed --file -      # from stdin

Pick what to fetch for each seed with --entity, and order the queue with --priority:

amz seed --file asins.txt --entity reviews --priority 10

search --enqueue is the other way in: it seeds the queue with every result of a search.

amz search "mechanical keyboard" --enqueue -n 200

Draining the queue

amz crawl pulls items off the queue and writes the resulting records into the store, with bounded concurrency from the global --workers:

amz crawl                  # drain everything
amz crawl --kinds product,reviews   # only these entity kinds
amz crawl -j 4             # four workers

A crawl is polite by construction: it shares the rate limiter and retry/backoff with every other command. When a page hits the bot wall, that item goes back to the queue with a short backoff instead of failing the run, so the crawl rides out a temporary block and keeps its place.

A full pipeline

Collect a category's bestsellers, fetch every product, and read the result back with SQL:

amz bestsellers electronics -n 100 -o url \
  | sed 's#.*/dp/##; s#/.*##' \
  | amz seed --file -
amz crawl
amz db query "select data->>'brand' brand, count(*) n,
                     avg((data->>'price')::double) p
              from products group by brand order by n desc limit 20"