Site Reliability Engineer (DevOps + Full Stack)
Donely
Full-time
Colombo, Sri Lanka
USD 500 - 1,700/month
Donely is a managed platform for deploying and operating AI agents for our customers. Our infrastructure spans cloud compute, edge services, multi-tenant containers, and a number of third-party integrations (messaging, billing, LLM providers).
We're hiring you to own reliability. As we scale, reliability becomes a first-class discipline rather than something the product team handles in between features. You become the person whose job is "Donely stays up, and when something does go wrong, we know first and fix it fast."
This is not a pure SRE role. We're a small team, and you'll write product code too. But your primary mandate is the health of the system - uptime, observability, incident response, recovery, and the engineering work needed to make those things real.
What You'll Own:
- Reliability & operations (primary):
  - Define, measure, and defend SLOs across our services.
  - Build the monitoring, dashboards, and alerting that surface issues before customers feel them.
  - On-call: be the first responder. Build the runbooks. Cut the MTTR. Lead incident response and post-mortems.
  - Drive root-cause fixes back into the codebase.
  - Own backups, disaster recovery, and restore drills.
  - Capacity planning and scaling for our compute fleet.
  - Security hygiene: IAM, secrets rotation, dependency CVEs, and access reviews.
- Platform & infrastructure:
  - Cloud infrastructure on AWS (compute, networking, IAM, container registry, identity).
  - Edge platform services (workers, edge databases, object storage, DNS, access).
  - Container images and host-level setup.
  - CI/CD: deploy pipelines, preview environments, and rollback safety.
- Full-stack product work (secondary, ~30%):
  - Ship features and fixes across the stack in TypeScript, Next.js/React on the frontend and edge-runtime services on the backend.
  - Write database migrations carefully - our schema is shared across multiple services.
  - Improve the provisioning pipeline that turns a signup into a running customer instance.
Must have:
- 2+ years in production engineering, with meaningful time spent on reliability, SRE, or DevOps - not as a side duty.
- You've been on-call for a real production system and made it less painful over time.
- Strong AWS chops: compute, IAM, networking, container workflows.
- Comfort operating Docker in production - debugging containers, host issues, and image pipelines.
- Solid TypeScript/Node and working knowledge of React.
- You can ship a feature end-to-end when needed.
- You debug from first principles. You read logs, trace requests, reproduce locally, and find the root cause.
- You write runbooks that other people can actually follow at 3am.
Nice to have:
- Experience with edge/serverless platforms (Cloudflare Workers, Vercel, Deno Deploy).
- Observability stacks (Grafana, Prometheus, OpenTelemetry, Sentry, Datadog, Honeycomb).
- Multi-tenant SaaS infrastructure experience.
- Billing, messaging, or LLM provider integrations.
- Linux sysadmin instincts - systemd, nginx, reverse proxies, networking.
What "Done" Looks Like in 6 Months:
- Published SLOs and dashboards for every service.
- Alerting catches issues before customers do.
- MTTR on common incidents is measured in minutes.
- Customer provisioning and recovery run end-to-end without manual intervention.
- Backups and DR are tested on a regular cadence.
- The product team can focus on shipping, knowing reliability is owned.
How We Work:
- Small team, high trust, low process.
- You ship. Most work happens in PRs against dev; CI deploys automatically.
- We prefer fixing root causes over papering over symptoms.
- We use AI coding tools heavily - comfort working alongside them is a must.