Site Reliability Engineer (DevOps + Full Stack)

Full-time

Colombo, Sri Lanka

USD 500 - 1,700/month

Donely is a managed platform for deploying and operating AI agents for our customers. Our infrastructure spans cloud compute, edge services, multi-tenant containers, and a number of third-party integrations (messaging, billing, LLM providers).

We're hiring you to own reliability. As we scale, reliability becomes a first-class discipline rather than something the product team handles in between features. You become the person whose job is "Donely stays up, and when something does go wrong, we know first and fix it fast."

This is not a pure SRE role. We're a small team, and you'll write product code too. But your primary mandate is the health of the system - uptime, observability, incident response, recovery, and the engineering work needed to make those things real.

What You'll Own:

Reliability & operations (primary):
- Define, measure, and defend SLOs across our services.
- Build the monitoring, dashboards, and alerting that surface issues before customers feel them.
- On-call: be the first responder. Build the runbooks. Cut the MTTR. Lead incident response and post-mortems.
- Drive root-cause fixes back into the codebase.
- Own backups, disaster recovery, and restore drills.
- Capacity planning and scaling for our compute fleet.
- Security hygiene: IAM, secrets rotation, dependency CVEs, and access reviews.
Platform & infrastructure:
- Cloud infrastructure on AWS (compute, networking, IAM, container registry, identity).
- Edge platform services (workers, edge databases, object storage, DNS, access).
- Container images and host-level setup.
- CI/CD: deploy pipelines, preview environments, and rollback safety.
Full-stack product work (secondary, ~30%):
- Ship features and fixes across the stack in TypeScript, Next.js/React on the frontend and edge-runtime services on the backend.
- Write database migrations carefully - our schema is shared across multiple services.
- Improve the provisioning pipeline that turns a signup into a running customer instance.

Must have:

2+ years in production engineering, with meaningful time spent on reliability, SRE, or DevOps - not as a side duty.
You've been on-call for a real production system and made it less painful over time.
Strong AWS chops: compute, IAM, networking, container workflows.
Comfort operating Docker in production - debugging containers, host issues, and image pipelines.
Solid TypeScript/Node and working knowledge of React.
You can ship a feature end-to-end when needed.
You debug from first principles. You read logs, trace requests, reproduce locally, and find the root cause.
You write runbooks that other people can actually follow at 3am.

Nice to have:

Experience with edge/serverless platforms (Cloudflare Workers, Vercel, Deno Deploy).
Observability stacks (Grafana, Prometheus, OpenTelemetry, Sentry, Datadog, Honeycomb).
Multi-tenant SaaS infrastructure experience.
Billing, messaging, or LLM provider integrations.
Linux sysadmin instincts - systemd, nginx, reverse proxies, networking.

What "Done" Looks Like in 6 Months:

Published SLOs and dashboards for every service.
Alerting catches issues before customers do.
MTTR on common incidents is measured in minutes.
Customer provisioning and recovery run end-to-end without manual intervention.
Backups and DR are tested on a regular cadence.
The product team can focus on shipping, knowing reliability is owned.

How We Work:

Small team, high trust, low process.
You ship. Most work happens in PRs against dev; CI deploys automatically.
We prefer fixing root causes over papering over symptoms.
We use AI coding tools heavily - comfort working alongside them is a must.

Apply for this job

How many years of production engineering experience do you have? (number)*

Are you currently based in Colombo, Sri Lanka or willing to relocate? This is a physical (in-office) role.*

Have you been the on-call first responder for a real production system, with a track record of making it less painful over time?*

Link to your GitHub or portfolio (URL).*

Describe one production incident you owned end-to-end: what broke, how you found it, how you fixed it, and what you changed so it could not happen again.*

Do you have hands-on AWS production experience (EC2, IAM, networking, container workflows)?*

Are you comfortable using AI coding tools (Claude Code, Cursor, Copilot, Codex) as part of your daily workflow?*
