Multi-Chain Block Indexer
Full-stack blockchain data ingestion pipeline with real-time analytics.
The Multi-Chain Block Indexer connects to Ethereum, Polygon, and Arbitrum via JSON-RPC, fetches blocks and transactions concurrently, persists per-block analytics to PostgreSQL, writes raw data to Apache Parquet, and streams live updates to a React dashboard over WebSocket. It tackles the same class of problem as the ingestion layer at companies like Dune Analytics: getting blockchain data off-chain and into queryable storage, reliably and at scale.
Check out the latest code here -> https://github.com/peterstringer/blockchain-indexer
Each chain sustains approximately 10 blocks per second, governed by a per-chain RPC rate limiter that prevents the indexer from exceeding the providers' free-tier limits. Demo mode, which generates synthetic data without any RPC calls, bypasses rate limiting entirely and runs significantly faster (approximately 600–740 blocks/s per chain).
Throughput was measured locally on a MacBook Pro with an Apple M4 Pro processor.
Each chain runs an independent indexing loop in its own thread. Blocks are fetched in batches of 50 by 10 concurrent worker threads, but a semaphore-based rate limiter throttles the actual RPC calls to 10 per second, so a batch takes ~5 seconds regardless of concurrency. Two separate thread pools prevent deadlocks between chain orchestration and block fetching.
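The semaphore-based limiter can be sketched roughly like this (class and method names are illustrative, not the project's actual API): a scheduler refills the semaphore to the full quota once per second, and every worker thread must acquire a permit before issuing an RPC call.

```java
import java.util.concurrent.Semaphore;

// Minimal sketch of a per-chain RPC rate limiter, assuming a
// once-per-second refill driven by an external scheduler. Worker
// threads block on acquire() before each RPC call, so at most
// `permitsPerSecond` calls go out in any refill interval.
class RpcRateLimiter {
    private final Semaphore permits;
    private final int permitsPerSecond;

    RpcRateLimiter(int permitsPerSecond) {
        this.permitsPerSecond = permitsPerSecond;
        this.permits = new Semaphore(permitsPerSecond);
    }

    /** Blocks until a permit is available. */
    void acquire() {
        permits.acquireUninterruptibly();
    }

    /** Called once per second by a scheduler; restores the full quota. */
    void refill() {
        int missing = permitsPerSecond - permits.availablePermits();
        if (missing > 0) permits.release(missing);
    }

    int available() {
        return permits.availablePermits();
    }
}
```

With 10 permits/second, 10 fetch threads each spend most of their time parked on `acquire()`, which is exactly why batch size has no effect on throughput.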
Two indexing modes are supported. Backfill processes historical block ranges with an ETA display. Incremental live-tails the chain head, polling every 5 seconds and comparing parentHash values to detect reorganisations. Both can run simultaneously on the same chain.
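The parentHash comparison in incremental mode amounts to remembering the hash stored for each height and checking that every new block's parentHash matches the hash recorded one height below. A minimal sketch, with illustrative types in place of the project's Web3j block objects:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of parentHash-based reorg detection for the
// incremental poller. If the new block's parentHash disagrees with
// the hash we stored for the previous height, the chain reorganised
// and the indexer must rewind and re-fetch.
class ReorgDetector {
    private final Map<Long, String> hashByHeight = new HashMap<>();

    /** Records a block; returns true if a reorg was detected at this height. */
    boolean onBlock(long height, String hash, String parentHash) {
        String expectedParent = hashByHeight.get(height - 1);
        boolean reorg = expectedParent != null && !expectedParent.equals(parentHash);
        hashByHeight.put(height, hash);
        return reorg;
    }
}
```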
Each chain has two RPC providers (Alchemy + Infura) managed by a circuit breaker. After 5 consecutive failures a provider is marked OPEN; after 30 seconds it transitions to HALF_OPEN for a probe request. When all providers are down, the least-recently-failed is promoted first. Checkpoints are updated atomically after each batch for crash-safe resume.
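The state machine behind each provider's circuit breaker can be sketched as follows (a simplified illustration with an injectable clock for testability, not the project's actual class):

```java
import java.util.function.LongSupplier;

// Minimal sketch of the per-provider circuit breaker: 5 consecutive
// failures trip it OPEN; after 30 seconds it moves to HALF_OPEN and
// allows one probe. A success closes it; a failure in HALF_OPEN
// re-opens it immediately.
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private static final int FAILURE_THRESHOLD = 5;
    private static final long OPEN_TIMEOUT_MS = 30_000;

    private final LongSupplier clock; // injectable for testing
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMs;

    CircuitBreaker(LongSupplier clock) { this.clock = clock; }

    /** May this provider receive a request right now? */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAtMs >= OPEN_TIMEOUT_MS) {
            state = State.HALF_OPEN; // permit a single probe request
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= FAILURE_THRESHOLD) {
            state = State.OPEN;
            openedAtMs = clock.getAsLong();
        }
    }

    synchronized State state() { return state; }
}
```

Injecting the clock keeps the 30-second OPEN window testable without real waits.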
The dashboard has four tabs: live indexing status with per-chain throughput, lag, and RPC health; historical analytics with five interactive charts (gas market trends, block space demand, transaction type evolution, failure analysis, transaction density heatmap); a data export wizard supporting CSV and Parquet with 19 selectable columns; and a settings page for chain configuration.
The rate limiter is the throughput ceiling, not the batch size. With 10 threads and a batch size of 50, you might expect 50 blocks/second. The per-chain rate limiter grants 10 permits/second, so the batch size controls checkpoint granularity, not speed.
Measuring throughput required three iterations. A lifetime average was dragged down by startup. A 10-second sliding window oscillated between 5.0 and 10.0 because 50-block batches arrive every ~5 seconds, so the window sometimes caught one batch, sometimes two. An EWMA with alpha=0.3 smoothing resolved it, holding steady at 9.6–10.2. The dashboard passes this value straight through with no additional transformation.
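The EWMA itself is a one-liner per sample. A minimal sketch (illustrative names; the first observation seeds the average directly):

```java
// Minimal sketch of the EWMA throughput metric with alpha = 0.3.
// Each completed batch reports its instantaneous rate in blocks/s;
// the smoothed value is what the dashboard displays unchanged.
class EwmaThroughput {
    private final double alpha;
    private double value = Double.NaN;

    EwmaThroughput(double alpha) { this.alpha = alpha; }

    /** Feed one observation; returns the updated smoothed rate. */
    double update(double sample) {
        value = Double.isNaN(value) ? sample : alpha * sample + (1 - alpha) * value;
        return value;
    }

    double value() { return value; }
}
```

With alpha = 0.3, a single outlier batch moves the displayed rate by only 30% of its deviation, which is what damps the 5.0–10.0 oscillation the sliding window produced.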
The two-pool executor is not optional. A single shared pool caused thread starvation: chain loop tasks consumed all threads waiting on block-fetch futures queued behind them. Splitting into a cached pool for orchestration and a fixed pool for fetching resolved it.
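The split can be sketched like this (pool sizes and names are illustrative): orchestration tasks block on fetch futures, but because the fetches run on a different pool, the waiters can never starve the work they are waiting for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch of the two-pool executor model: a cached pool for
// per-chain orchestration loops and a fixed pool for block fetches.
// A single shared pool deadlocks once every thread is an orchestration
// task blocked on a fetch future that is queued behind it.
class TwoPoolIndexer {
    private final ExecutorService orchestration = Executors.newCachedThreadPool();
    private final ExecutorService fetchers = Executors.newFixedThreadPool(10);

    /** Submits one batch; fetches run concurrently on the fetch pool. */
    Future<List<Long>> indexBatch(List<Long> heights) {
        return orchestration.submit(() -> {
            List<Future<Long>> futures = heights.stream()
                    .map(h -> fetchers.submit(() -> fetchBlock(h)))
                    .toList();
            List<Long> done = new ArrayList<>();
            for (Future<Long> f : futures) {
                done.add(f.get()); // blocks on the *other* pool, so no starvation
            }
            return done;
        });
    }

    private Long fetchBlock(long height) {
        return height; // placeholder for the real RPC fetch
    }

    void shutdown() {
        orchestration.shutdown();
        fetchers.shutdown();
    }
}
```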
Parquet writing on JDK 25 has a compatibility issue. Subject.getSubject() was removed in JDK 23 and Hadoop still calls it during reads. Writing works; reading requires DuckDB or pandas. Not a problem in practice since all analytical queries go through PostgreSQL.
Synthetic data needs realistic structure. The demo mode generates sinusoidal gas price cycles, weekend reduction, congestion spikes, and Pareto-distributed transaction values. Without these patterns the analytics charts produce flat lines that tell you nothing about whether the aggregation logic is correct.
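The shape of that synthetic data can be sketched roughly as follows (all constants are illustrative, not the project's tuned values): a daily sinusoid for gas prices, a flat weekend discount, and inverse-transform sampling for the Pareto-distributed transaction values.

```java
import java.util.Random;

// Minimal sketch of demo-mode data shaping: a 24-hour sinusoidal
// gas-price cycle, a weekend reduction, and Pareto-distributed
// transaction values sampled via the inverse CDF.
class SyntheticDataGenerator {
    private final Random rng = new Random(42); // fixed seed for reproducible demos

    /** Base gas price (gwei) as a function of hour-of-week (0-167). */
    double gasPriceGwei(int hourOfWeek) {
        double daily = 30 + 15 * Math.sin(2 * Math.PI * (hourOfWeek % 24) / 24.0);
        boolean weekend = hourOfWeek >= 120; // hours 120-167 = final two days
        return weekend ? daily * 0.7 : daily;
    }

    /** Pareto(alpha)-distributed transaction value with minimum xMin. */
    double txValueEth(double xMin, double alpha) {
        double u = rng.nextDouble();
        return xMin / Math.pow(1 - u, 1.0 / alpha); // inverse CDF of Pareto
    }
}
```

Congestion spikes would layer on top of the base sinusoid; the point is that each analytics chart has a distinct pattern to recover, so a wrong aggregation shows up visually.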
This project demonstrates the ingestion, transformation, and storage pipeline that sits between raw chain data and queryable analytical tables. The engineering priorities mirror production concerns: fault tolerance via circuit breakers and crash-safe checkpoints, concurrent processing via the two-pool executor model, and storage efficiency via Parquet columnar compression with chain/date partitioning.
Java 25, Spring Boot 4.0.2, Web3j, Apache Parquet (SNAPPY), PostgreSQL, Flyway, React 19, TypeScript, Vite 7, Tailwind CSS v4, Recharts, TanStack Query, STOMP/SockJS. Docker multi-stage builds, GitHub Actions CI/CD, Testcontainers, Prometheus/Micrometer.