Staff Engineer Deep Dive: Building a Modern eCommerce Payments Platform (Java + Spring Boot + Kafka)

Part 1 — Gateway architecture, PCI-DSS, tokenization, idempotent checkout, and SAGA-based refunds.

By Alera Infotech Engineering | October 2025

Payments systems live at the intersection of throughput, irreversibility, and regulation. This guide walks Staff-level engineers through building a fault-tolerant, PCI-aware eCommerce payments platform using Java + Spring Boot + Kafka + PostgreSQL (AWS). We focus on determinism (exactly-once outcomes), auditability, and evolution (how the system scales and changes safely).


1) Responsibilities at Staff Level (Fintech/eCommerce)

  • Correctness over cleverness: payment state machines with provable invariants.
  • Reliability: API p99 < 300 ms, monthly availability ≥ 99.95%, RPO ≤ 1 s, RTO ≤ 3 min.
  • PCI-DSS boundary design: keep PAN (card number) out of your core systems using tokenization.
  • Observability: distributed tracing, ledger/audit correlation, error budget discipline.
  • Evolution: migrations, feature flags, backward/forward compatibility, multi-gateway routing.

2) System Overview (High-Level Flow)

Checkout Service → Payments API → Orchestrator → Gateway Adapters (Stripe/Adyen/Razorpay/etc.) → Risk/Fraud → Ledger → Webhooks & Notifications → Reconciliation. Kafka connects the services; the Outbox pattern guarantees that every committed state change is published at least once, and idempotent consumers turn that into exactly-once effects.

Core Services

  • Payments API (Spring MVC/WebFlux): REST entry point; validates requests, authorizes callers, and enforces idempotency.
  • Orchestrator: state machine for authorize/capture/void/refund; invokes Gateway Adapters.
  • Gateway Adapters: one per PSP; maps unified API to vendor-specific schemas.
  • Risk/Fraud: velocity, device, geolocation; fast rules + async ML (optional Part 2).
  • Ledger: append-only accounting; transactions & entries with double-entry invariants.
  • Reconciliation: match gateway settlement files to ledger; raise diffs.

3) PCI-DSS & Tokenization Boundary

Goal: keep Primary Account Number (PAN) out of your environment. Let the Payment Service Provider (PSP) or a PCI vault handle card data. Your systems operate on tokens.

  • Hosted fields/checkout: card data posted directly to PSP → returns a card_token.
  • Your DB: store only card_token, last4, brand, expiry, BIN metadata (non-sensitive).
  • Encryption: HSM/KMS for any secrets; rotate ≤ 90 days.
  • Vaulting: for network tokens (Visa/MC) and card-on-file recurring payments.

Design the PCI boundary so that a breach of your app does not disclose PAN. This is the single highest-leverage architectural decision in eCommerce.
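One way to make the boundary visible in code is to give the core system a card type that can only hold a token and display metadata, and to reject anything that looks like a raw card number. The sketch below is illustrative only: `StoredCard` and `PanGuard` are hypothetical names, and the Luhn-based check is a heuristic guard, not a substitute for a PCI scoping review.

```java
import java.time.YearMonth;

/** Card-on-file reference as the core system might store it: a PSP token plus display metadata, never a PAN. */
record StoredCard(String cardToken, String last4, String brand, YearMonth expiry) {
  StoredCard {
    if (PanGuard.looksLikePan(cardToken)) {
      throw new IllegalArgumentException("cardToken looks like a raw card number");
    }
    if (last4 == null || !last4.matches("\\d{4}")) {
      throw new IllegalArgumentException("last4 must be exactly four digits");
    }
  }
}

/** Heuristic guard: 13 to 19 digits that pass the Luhn checksum are treated as a PAN. */
final class PanGuard {
  private PanGuard() {}

  static boolean looksLikePan(String value) {
    if (value == null) return false;
    String digits = value.replaceAll("[ -]", ""); // tolerate spaces and dashes between digit groups
    if (!digits.matches("\\d{13,19}")) return false;
    int sum = 0;
    boolean doubleIt = false; // Luhn: double every second digit, starting from the right
    for (int i = digits.length() - 1; i >= 0; i--) {
      int d = digits.charAt(i) - '0';
      if (doubleIt) {
        d *= 2;
        if (d > 9) d -= 9;
      }
      sum += d;
      doubleIt = !doubleIt;
    }
    return sum % 10 == 0;
  }
}
```

A guard like this is most useful at logging and persistence edges, where a PAN could leak in by accident through a free-text field.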

4) Idempotent Checkout API (Spring Boot)

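Idempotency ensures retries (client/network) don’t double-charge. We rely on a client-supplied Idempotency-Key spanning the entire authorization or capture request. Store a hash of the request together with the response, keyed by the idempotency key, so that a replay returns the original response and a key reused with a different payload can be rejected.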

Controller + Idempotency Filter (Java)

// build.gradle (snippets)
// implementation 'org.springframework.boot:spring-boot-starter-web'
// implementation 'org.springframework.boot:spring-boot-starter-validation'
// implementation 'org.springframework.kafka:spring-kafka'
// implementation 'org.springframework.boot:spring-boot-starter-data-jdbc'
// implementation 'org.postgresql:postgresql'

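// IdempotencyFilter.java
// Imports assume Spring Boot 3.x (jakarta.servlet); on 2.x use javax.servlet.
import jakarta.servlet.*;
import jakarta.servlet.http.*;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.annotation.Order;
import org.springframework.http.MediaType;
import org.springframework.stereotype.Component;
import org.springframework.web.util.ContentCachingResponseWrapper;

@Component
@Order(1)
public class IdempotencyFilter implements Filter {
  @Autowired IdempotencyStore store; // e.g., Redis or Postgres

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest http = (HttpServletRequest) req;
    HttpServletResponse httpRes = (HttpServletResponse) res;
    String key = http.getHeader("Idempotency-Key");
    // No key: pass through without replay protection.
    if (key == null || key.isBlank()) { chain.doFilter(req, res); return; }

    // Replay: return the stored body instead of executing the request again.
    Optional<String> prior = store.getResponse(key);
    if (prior.isPresent()) {
      httpRes.setHeader("Idempotency-Replayed", "true");
      httpRes.setContentType(MediaType.APPLICATION_JSON_VALUE);
      httpRes.getWriter().write(prior.get());
      return;
    }

    // First attempt: buffer the response so it can be stored and then sent.
    // Caveat: two concurrent first attempts can both reach this point; production code
    // needs an atomic "reserve key" step (e.g., a unique-constraint insert) before the call.
    ContentCachingResponseWrapper wrapper = new ContentCachingResponseWrapper(httpRes);
    try {
      chain.doFilter(req, wrapper);
      // Do not store 5xx responses, so the client can retry a transient failure.
      if (wrapper.getStatus() < 500) {
        String body = new String(wrapper.getContentAsByteArray(), StandardCharsets.UTF_8);
        store.save(key, body);
      }
    } finally {
      wrapper.copyBodyToResponse(); // always flush the buffered body to the client
    }
  }
}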

// PaymentsController.java
@RestController
@RequestMapping("/payments")
public class PaymentsController {

  @Autowired PaymentService service;

  @PostMapping("/authorize")
  public ResponseEntity<PaymentAuthResponse> authorize(@Valid @RequestBody PaymentAuthRequest req,
                                       @RequestHeader(value="Idempotency-Key", required=false) String idemKey) {
    return ResponseEntity.ok(service.authorize(req, idemKey));
  }
}

Notes: Persist the response JSON (plus request hash) under Idempotency-Key. For safety, scope keys to merchantId + customerId. Expire keys (e.g., 24–48h) to bound storage.
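The notes above (request hash, key scoping, expiry) can be sketched as a small in-memory store. This is an illustration under assumptions, not the `IdempotencyStore` used by the filter: the class name, method signatures, and the choice to throw on a payload mismatch are all hypothetical, and a real deployment would back it with Redis or Postgres.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.time.Instant;
import java.util.HexFormat;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class InMemoryIdempotencyStore {
  record Entry(String requestHash, String responseBody, Instant expiresAt) {}

  private final Map<String, Entry> entries = new ConcurrentHashMap<>();
  private final Duration ttl;

  InMemoryIdempotencyStore(Duration ttl) {
    this.ttl = ttl;
  }

  /** Scopes the client key so two merchants or customers cannot collide on the same value. */
  static String scopedKey(String merchantId, String customerId, String idempotencyKey) {
    return merchantId + ":" + customerId + ":" + idempotencyKey;
  }

  static String sha256(String body) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-256").digest(body.getBytes(StandardCharsets.UTF_8));
      return HexFormat.of().formatHex(digest);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // SHA-256 is mandatory on every Java platform
    }
  }

  /** Returns the stored response for a replay; throws if the key is reused with a different body. */
  Optional<String> lookup(String scopedKey, String requestBody, Instant now) {
    Entry e = entries.get(scopedKey);
    if (e == null || !e.expiresAt().isAfter(now)) return Optional.empty(); // absent or expired
    if (!e.requestHash().equals(sha256(requestBody))) {
      throw new IllegalStateException("Idempotency-Key reused with a different request");
    }
    return Optional.of(e.responseBody());
  }

  /** Keeps the first response for a live key; an expired entry is replaced. */
  void save(String scopedKey, String requestBody, String responseBody, Instant now) {
    Entry fresh = new Entry(sha256(requestBody), responseBody, now.plus(ttl));
    entries.compute(scopedKey, (k, old) -> (old == null || !old.expiresAt().isAfter(now)) ? fresh : old);
  }
}
```

Passing `now` explicitly keeps expiry testable; in an HTTP API the mismatch case would typically map to a 4xx response rather than an exception.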

5) Orchestration & Gateway Adapters

The Orchestrator implements the payment state machine and calls Gateway Adapters. Adapters map your internal DTOs to each PSP’s API, handle errors, and normalize responses (auth code, AVS/CVV results, 3DS status).

State Machine (Authorize → Capture → Settle)

  • New → Authorized (hold funds)
  • Authorized → Captured (charge)
  • Captured → Settled (PSP/bank clears)
  • Error paths: Void (release hold), Refund (partial/full), Chargeback
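The transitions listed above can be encoded as an enum that refuses illegal moves. This is a minimal sketch: the state names follow the list, but the exact set of allowed transitions (for example, refunds from both Captured and Settled, chargebacks only after settlement) is an assumption, and partial refunds are not modeled.

```java
import java.util.EnumSet;
import java.util.Set;

enum PaymentState {
  NEW, AUTHORIZED, CAPTURED, SETTLED, VOIDED, REFUNDED, CHARGED_BACK;

  /** States reachable from this one; terminal states return an empty set. */
  Set<PaymentState> next() {
    return switch (this) {
      case NEW -> EnumSet.of(AUTHORIZED);
      case AUTHORIZED -> EnumSet.of(CAPTURED, VOIDED);
      case CAPTURED -> EnumSet.of(SETTLED, REFUNDED);
      case SETTLED -> EnumSet.of(REFUNDED, CHARGED_BACK);
      case VOIDED, REFUNDED, CHARGED_BACK -> EnumSet.noneOf(PaymentState.class);
    };
  }

  /** Returns the target state, or throws if the move is not in the transition table. */
  PaymentState transitionTo(PaymentState target) {
    if (!next().contains(target)) {
      throw new IllegalStateException(this + " -> " + target + " is not allowed");
    }
    return target;
  }
}
```

Keeping the table in one place makes the invariant easy to test and gives the Orchestrator a single choke point for every state change.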

Outbox Event After DB Commit

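// tx is an injected Spring TransactionTemplate (assumed field). The method is not @Transactional,
// so no DB transaction stays open while the PSP call is in flight.
public PaymentAuthResponse authorize(PaymentAuthRequest r, String idemKey) {
  // TX 1: persist the attempt before any external call, so a crash leaves a row to recover.
  Payment p = tx.execute(s -> repo.createAuthorized(r)); // authorizations table
  // call adapter outside any DB transaction
  GatewayAuthResult g = gatewayAdapter.authorize(r);
  // TX 2: the gateway result and the outbox row commit atomically.
  tx.executeWithoutResult(s -> {
    repo.updateWithGateway(p.getId(), g);
    outbox.storeEvent("payment.authorized", p.getId(), g); // same TX commit
  });
  return PaymentAuthResponse.from(p, g);
}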

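The Outbox write occurs in the same DB transaction as the authorization record. A background publisher emits the event to Kafka after commit. Because the publisher can crash between sending and marking the row as sent, delivery is at-least-once; consumers deduplicate on the event ID to get exactly-once effects.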
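A background publisher of this kind can be sketched without Kafka by treating the broker as a callback. Everything here is hypothetical scaffolding (`InMemoryOutbox`, `OutboxRelay`, `DedupingConsumer`); the point is the ordering: a row is marked sent only after the publish call returns, so a crash in between causes a re-publish, and the consumer makes that harmless by deduplicating on the event ID.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

record OutboxEvent(long id, String type, String aggregateId) {}

/** Stand-in for the outbox table: rows stay pending until the relay marks them sent. */
final class InMemoryOutbox {
  private final List<OutboxEvent> pending = new ArrayList<>();
  private long nextId = 1;

  synchronized OutboxEvent store(String type, String aggregateId) {
    OutboxEvent e = new OutboxEvent(nextId++, type, aggregateId);
    pending.add(e);
    return e;
  }

  synchronized List<OutboxEvent> pendingBatch() {
    return List.copyOf(pending);
  }

  synchronized void markSent(long id) {
    pending.removeIf(e -> e.id() == id);
  }
}

/** Publishes pending rows in order; a publish failure leaves the row pending for the next run. */
final class OutboxRelay {
  private final InMemoryOutbox outbox;
  private final Consumer<OutboxEvent> publisher; // e.g., a wrapper around a Kafka producer send

  OutboxRelay(InMemoryOutbox outbox, Consumer<OutboxEvent> publisher) {
    this.outbox = outbox;
    this.publisher = publisher;
  }

  int runOnce() {
    int published = 0;
    for (OutboxEvent e : outbox.pendingBatch()) {
      publisher.accept(e);     // a crash after this call but before markSent causes a re-publish
      outbox.markSent(e.id());
      published++;
    }
    return published;
  }
}

/** Consumer-side dedupe: applying an event twice has the same effect as applying it once. */
final class DedupingConsumer {
  private final Set<Long> seen = new HashSet<>();
  private int applied = 0;

  void onEvent(OutboxEvent e) {
    if (seen.add(e.id())) applied++;
  }

  int applied() {
    return applied;
  }
}
```

In a real system the `seen` set would be a unique-keyed table written in the same transaction as the consumer's own state change.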

6) Refunds & Reversals with SAGA

Refunds span multiple systems (your DB, PSP, notifications). Implement a SAGA with compensations: if the PSP call succeeds but notification fails, you must still record the refund and schedule retries — do not roll back the bank.

Refund Saga Steps

  • Create refund_requested row → emit refund.requested.
  • Call PSP refund() (idempotent with Idempotency-Key).
  • On success: mark refund_completed + ledger entries (negative amount).
  • Emit refund.completed → send email/notification.
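  • Compensations: if PSP declines after ledger write (rare if coded right), create ledger reversal and alert ops.

// tx is an injected Spring TransactionTemplate (assumed field).
public void refund(String paymentId, Money amount) {
  Refund r = repo.createRefund(paymentId, amount); // state = REQUESTED, committed before the PSP call
  try {
    // The refund id doubles as the PSP idempotency key, so a retry cannot refund twice.
    GatewayRefundResult gr = adapter.refund(paymentId, amount, r.getId());
    // Completion, ledger posting, and outbox event commit atomically.
    tx.executeWithoutResult(s -> {
      repo.markRefundCompleted(r.getId(), gr);
      ledger.postRefund(paymentId, amount, r.getId()); // append-only, negative entry
      outbox.storeEvent("refund.completed", r.getId(), gr);
    });
  } catch (GatewayDeclined e) {
    tx.executeWithoutResult(s -> {
      repo.markRefundFailed(r.getId(), e.getCode());
      outbox.storeEvent("refund.failed", r.getId(), Map.of("reason", e.getCode()));
    });
  }
  // Timeouts and other unknown outcomes are deliberately not caught: the row stays REQUESTED
  // so a retry job can re-drive the refund with the same refund id.
}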

SAGA rule: never attempt to “undo” a bank transaction by deleting rows. Use compensating entries in the ledger. Money moves are append-only.
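The retry half of the saga depends on the PSP treating the refund ID as an idempotency key. The sketch below simulates that with a hypothetical `FakePsp` whose first calls time out after the refund has already been applied, which is the case that would double-refund without a stable key. The class names and the use of `IllegalStateException` for a timeout are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical PSP client: the same idempotency key always maps to the same refund. */
final class FakePsp {
  private final Map<String, String> refundsByKey = new HashMap<>();
  private int failuresToInject;

  FakePsp(int failuresToInject) {
    this.failuresToInject = failuresToInject;
  }

  String refund(String idempotencyKey, long amountCents) {
    String ref = refundsByKey.get(idempotencyKey);
    if (ref == null) {
      ref = "psp_refund_" + (refundsByKey.size() + 1);
      refundsByKey.put(idempotencyKey, ref); // the money has moved from this point on
    }
    if (failuresToInject > 0) {
      failuresToInject--;
      throw new IllegalStateException("timeout after the PSP applied the refund");
    }
    return ref;
  }

  int refundCount() {
    return refundsByKey.size();
  }
}

final class RefundSaga {
  private RefundSaga() {}

  /** Retries with the same refund id, so an ambiguous timeout never creates a second refund. */
  static String refundWithRetry(FakePsp psp, String refundId, long amountCents, int maxAttempts) {
    if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
    IllegalStateException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return psp.refund(refundId, amountCents);
      } catch (IllegalStateException e) {
        last = e; // outcome unknown: retry with the same key rather than compensating
      }
    }
    throw last;
  }
}
```

A production saga would add backoff between attempts and hand the refund to an operator queue once the attempts are exhausted.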


Part 2 — Ledger & Reconciliation, Fraud, Observability, SLOs, Scale & DR

7) Ledger: Double-Entry Accounting & Invariants

The ledger is the source of truth for money. Use double-entry (debits/credits) with append-only entries. Never update historical rows; post adjustments. Keep transaction and entry tables separate.

Schema Sketch (PostgreSQL)

-- transactions: logical money movements (auth, capture, refund)
CREATE TABLE transactions (
  id UUID PRIMARY KEY,
  merchant_id UUID NOT NULL,
  payment_id UUID NOT NULL,
  type TEXT CHECK (type IN ('AUTH','CAPTURE','REFUND','FEE','CHARGEBACK')),
  amount_cents BIGINT NOT NULL,
  currency CHAR(3) NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  metadata JSONB
);

-- entries: double-entry postings that must net to zero per transaction
CREATE TABLE entries (
  id UUID PRIMARY KEY,
  transaction_id UUID REFERENCES transactions(id),
  account TEXT NOT NULL,         -- e.g., "psp_receivable", "merchant_payable", "fee_income"
  direction TEXT CHECK (direction IN ('DEBIT','CREDIT')),
  amount_cents BIGINT NOT NULL,
  currency CHAR(3) NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- invariant: sum(entries.amount with sign) = 0 per transaction
-- enforce via trigger or periodic assertion + reconciliation job

Invariants: (1) Entries for a transaction must balance to zero; (2) No delete/update of entries; (3) Currency must match across postings; (4) Referential integrity to payments/refunds.
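Invariants (1) and (3) are cheap to check in application code before a transaction's entries are written, in addition to any trigger or periodic assertion in the database. A minimal sketch, assuming amounts are stored as positive cents with the sign carried by the direction column; `LedgerEntry` and `LedgerInvariants` are hypothetical names:

```java
import java.util.List;

enum Direction { DEBIT, CREDIT }

record LedgerEntry(String account, Direction direction, long amountCents, String currency) {}

final class LedgerInvariants {
  private LedgerInvariants() {}

  /** Throws unless the entries share one currency and debits equal credits. */
  static void assertBalanced(List<LedgerEntry> entries) {
    if (entries.isEmpty()) throw new IllegalArgumentException("a transaction needs entries");
    String currency = entries.get(0).currency();
    long net = 0;
    for (LedgerEntry e : entries) {
      if (!e.currency().equals(currency)) {
        throw new IllegalArgumentException("mixed currencies in one transaction");
      }
      if (e.amountCents() <= 0) {
        throw new IllegalArgumentException("amounts must be positive; direction carries the sign");
      }
      net += e.direction() == Direction.DEBIT ? e.amountCents() : -e.amountCents();
    }
    if (net != 0) throw new IllegalStateException("entries are unbalanced by " + net + " cents");
  }
}
```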

Posting Example (Capture)

  • Capture: Debit psp_receivable, Credit merchant_payable (gross amount)
  • Fee: Debit merchant_payable, Credit fee_income (leaves the merchant balance net of fees)
  • Settlement day: Debit merchant_payable, Credit cash (payout to the merchant)
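A worked example helps check that the postings net out. The sketch below uses made-up numbers (a 100.00 capture with a 2.90 fee) and assumes the merchant is credited the gross amount, with the fee posting reducing what is owed; if your ledger credits the merchant net of fees up front, the fee pair would be posted against a different account. `Posting`, `Side`, and `CaptureExample` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

enum Side { DEBIT, CREDIT }

record Posting(String account, Side side, long amountCents) {}

final class CaptureExample {
  private CaptureExample() {}

  /** Capture, fee, and payout postings for one payment (illustrative amounts and accounts). */
  static List<Posting> postings(long grossCents, long feeCents) {
    long netCents = grossCents - feeCents;
    return List.of(
        new Posting("psp_receivable", Side.DEBIT, grossCents),    // PSP owes us the captured amount
        new Posting("merchant_payable", Side.CREDIT, grossCents), // we owe the merchant
        new Posting("merchant_payable", Side.DEBIT, feeCents),    // fee reduces what we owe
        new Posting("fee_income", Side.CREDIT, feeCents),
        new Posting("merchant_payable", Side.DEBIT, netCents),    // payout on settlement day
        new Posting("cash", Side.CREDIT, netCents));
  }

  /** Net balance per account, with debits counted as positive and credits as negative. */
  static Map<String, Long> balances(List<Posting> postings) {
    Map<String, Long> balances = new LinkedHashMap<>();
    for (Posting p : postings) {
      long signed = p.side() == Side.DEBIT ? p.amountCents() : -p.amountCents();
      balances.merge(p.account(), signed, Long::sum);
    }
    return balances;
  }
}
```

After all three steps the merchant_payable balance returns to zero and the balances across all accounts sum to zero, which is the double-entry invariant restated per account.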

The ledger protects you when gateways misbehave. If settlement files disagree, you can prove exactly what you believe about the money and why.

8) Reconciliation: Matching PSP Files to Your Ledger

Each PSP provides daily settlement/reconciliation files (CSV/JSON/SFTP). A reconciliation job ingests the file, normalizes rows, and matches them to your transactions and entries. Differences generate a variance report for ops, plus recon.adjustment entries if needed.

Workflow

  • Ingest file → Staging table
  • Join on payment_id / PSP reference / amount / currency
  • Mark matched, missing_in_psp, or missing_in_ledger
  • Create adjustments as transactions with entries (append-only)
  • Emit Kafka events → dashboards & alerts
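The join-and-classify step of the workflow can be sketched as a pure function over two lists. The record shapes are hypothetical, matching is done on the PSP reference only, and the `amount_mismatch` status is an addition to the three statuses named above, included because a reference can match while the amount or currency does not.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

record LedgerTxn(String pspReference, long amountCents, String currency) {}

record SettlementRow(String pspReference, long amountCents, String currency) {}

record Variance(String pspReference, String status) {}

final class Reconciler {
  private Reconciler() {}

  /** Classifies every ledger transaction and every unmatched settlement row. */
  static List<Variance> reconcile(List<LedgerTxn> ledger, List<SettlementRow> file) {
    Map<String, SettlementRow> unmatched = new LinkedHashMap<>(); // keeps file order for the report
    for (SettlementRow row : file) unmatched.put(row.pspReference(), row);

    List<Variance> report = new ArrayList<>();
    for (LedgerTxn txn : ledger) {
      SettlementRow row = unmatched.remove(txn.pspReference());
      if (row == null) {
        report.add(new Variance(txn.pspReference(), "missing_in_psp"));
      } else if (row.amountCents() != txn.amountCents() || !row.currency().equals(txn.currency())) {
        report.add(new Variance(txn.pspReference(), "amount_mismatch"));
      } else {
        report.add(new Variance(txn.pspReference(), "matched"));
      }
    }
    // Whatever is left in the file has no counterpart in the ledger.
    for (String ref : unmatched.keySet()) report.add(new Variance(ref, "missing_in_ledger"));
    return report;
  }
}
```

At production volume the same classification is usually a SQL full outer join between the staging table and the ledger; the function above only illustrates the outcomes.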

9) Fraud Signals & Observability (Brief)

Real-time rules: velocity per card/email/IP, BIN risk, country mismatch. Async ML: feature store + gRPC model. Log reason codes for every decision to support disputes.
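A velocity rule is the simplest of the real-time checks: count attempts per key inside a sliding window and return a reason code. The sketch below keeps state in memory and uses hypothetical names and reason codes (`VelocityRule`, `OK`, `VELOCITY_EXCEEDED`); a shared store such as Redis would be needed once more than one API instance evaluates the rule.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class VelocityRule {
  private final int maxAttempts;
  private final Duration window;
  private final Map<String, Deque<Instant>> attempts = new HashMap<>();

  VelocityRule(int maxAttempts, Duration window) {
    this.maxAttempts = maxAttempts;
    this.window = window;
  }

  /** Records an attempt for the key (card token, email, or IP) and returns a reason code. */
  synchronized String check(String key, Instant now) {
    Deque<Instant> times = attempts.computeIfAbsent(key, k -> new ArrayDeque<>());
    Instant cutoff = now.minus(window);
    // Drop attempts that have fallen out of the window.
    while (!times.isEmpty() && !times.peekFirst().isAfter(cutoff)) {
      times.pollFirst();
    }
    times.addLast(now);
    return times.size() > maxAttempts ? "VELOCITY_EXCEEDED" : "OK";
  }
}
```

Returning a code instead of a boolean keeps the decision loggable, which is what dispute handling needs later.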

RED Metrics & Tracing

  • Rate: TPS (transactions/second) per endpoint and per gateway
  • Errors: 5xx rate, decline codes, idempotency replays
  • Duration: p50/p90/p99 latency for auth/capture/refund
  • Tracing (OpenTelemetry): propagate trace_id across API → Kafka → workers → ledger

10) Reliability Targets & Error Budgets

  • SLOs: Auth p99 ≤ 300 ms, Capture p99 ≤ 350 ms, API availability ≥ 99.95 % (≈ 22 min downtime/month).
  • Error Budget: the allowed failure fraction, 100 % − SLO (0.05 % for a 99.95 % target). When the budget is exhausted, freeze risky changes and prioritize reliability work.
  • RPO/RTO: Recovery Point Objective ≤ 1 s (max data loss), Recovery Time Objective ≤ 3 min (time to restore).
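The downtime figure quoted for the availability SLO follows from simple arithmetic, sketched here so it can be recomputed for other targets or windows. The 30-day month is an assumption; `ErrorBudget` is a hypothetical helper name.

```java
final class ErrorBudget {
  private ErrorBudget() {}

  /** Allowed downtime in minutes for an availability SLO over a window of the given length. */
  static double allowedDowntimeMinutes(double sloPercent, int days) {
    double budgetFraction = (100.0 - sloPercent) / 100.0; // e.g., 0.0005 for 99.95 %
    return budgetFraction * days * 24 * 60;
  }

  /** Fraction of the budget still unspent; negative once the budget is exhausted. */
  static double remainingFraction(double sloPercent, int days, double downtimeMinutesSoFar) {
    return 1.0 - downtimeMinutesSoFar / allowedDowntimeMinutes(sloPercent, days);
  }
}
```

For 99.95 % over 30 days this gives about 21.6 minutes, which is where the "≈ 22 min downtime/month" figure comes from.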

Outbox, DLQ, and Replay

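Outbox: events are written in the same transaction as the state change. DLQ: holds messages that fail permanently, after retries are exhausted. Replay: re-consume safely by deduplicating on event idempotency keys. Expose alarms on DLQ age and depth.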

11) Throughput & Cost Modeling

Assume an average request of ≈ 2 KB JSON and ≈ 300 ms processing time. A single WebFlux instance on a c7g.2xlarge can handle ~300–500 req/s depending on GC & I/O. For 10k RPS sustained that is ~20–34 instances (10,000 ÷ 500 to 10,000 ÷ 300), or ~24–40 with 20 % headroom. Batch workers scale horizontally via Kafka consumer groups.

  • Caching: Authorization lookups in Redis (TTL 60 s) → substantial p99 reduction.
  • Async I/O: Prefer WebFlux + Reactor for gateway calls to use fewer threads.
  • Kafka: Partitions per topic sized to target consumer parallelism (start 3–5× worker count).
  • DB: Partition transactions by month, index by (merchant_id, created_at), use logical replication for analytics.
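The instance estimate in this section is a division plus headroom, and Little's law gives the matching concurrency figure. The helper below is a sketch for redoing the arithmetic with your own measurements; the per-instance throughput is an input you would obtain from load tests, and `CapacityModel` is a hypothetical name.

```java
final class CapacityModel {
  private CapacityModel() {}

  /** Instances needed for a target rate, given measured per-instance throughput and headroom (0.2 = 20 %). */
  static int instancesFor(double targetRps, double perInstanceRps, double headroom) {
    return (int) Math.ceil(targetRps / perInstanceRps * (1.0 + headroom));
  }

  /** Little's law: average requests in flight = arrival rate x time in system. */
  static double inFlight(double rps, double latencySeconds) {
    return rps * latencySeconds;
  }
}
```

For example, an instance serving 400 req/s at 300 ms per request holds about 120 requests in flight on average, which is the number to compare against connection pool and gateway concurrency limits.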

12) Multi-Region & Disaster Recovery

  • Active-Passive: Primary region + warm standby; Aurora Global Database or logical streaming; Route 53 failover; RPO ≤ 1 s, RTO ≤ 3 min.
  • Active-Active: Dual writers require conflict-free design (e.g., ledger with single-writer per money-bucket or CRDTs with deterministic merge). Use only if a 99.99 % SLO or global low latency demands it.
  • Kafka: MirrorMaker 2; checkpoint offsets per region; test consumer failover.
  • Chaos drills: quarterly region failover practice; verify DNS cutover and ledger consistency by checksums.

Design DR so that the same idempotency key produces the same outcome regardless of region. Determinism beats heroics.

13) Putting It All Together (What to Say in a Staff Interview)

  • Boundary: “We place the PCI boundary at the PSP; we store only tokens and metadata.”
  • Determinism: “Idempotency spans request–response; outbox events are in-tx; SAGA handles refunds with compensating ledger entries.”
  • Reliability: “SLO 99.95 %, p99 auth ≤ 300 ms, RPO ≤ 1 s, RTO ≤ 3 min; DLQ age alarm at 15 min.”
  • Evolution: “Adapters hide PSP differences; feature flags + contract tests ensure zero-downtime migrations.”
  • Traceability: “Every payment has a trace_id linking API, Kafka, ledger, and reconciliation.”

© 2025 Alera Infotech Engineering | Tags: payments, java, spring boot, kafka, pci dss, idempotency, ledger, reconciliation, reliability
