Key Responsibilities
- Design, implement, and maintain monitoring, logging, and alerting solutions across production and non-production environments.
- Lead incident response and post-mortem analysis, ensuring best practices for problem resolution.
- Build and test disaster recovery strategies.
- Collaborate with teams to define and implement SLAs for critical services.
- Optimize cloud infrastructure for performance, reliability, and cost.
- Develop automation for deployment, scaling, and recovery procedures.
- Manage infrastructure with Terraform, GitLab CI/CD, and Kubernetes.
- Participate in on-call rotations for incident response.
Must-Have Skills
- 4+ years in SRE, DevOps, or similar roles.
- Strong coding/scripting skills in Python and Bash (or other shell scripting).
- Hands-on experience with Chef (recipes, cookbooks) and Ansible (playbooks).
- Deep expertise in AWS services (EC2, EKS, RDS, Cognito, CloudWatch).
- Solid knowledge of Kubernetes administration in production environments.
- Experience with IaC tools (Terraform / CloudFormation).
- Strong understanding of observability tools and practices: Prometheus, Grafana, ELK, and distributed tracing.
- Skilled in PostgreSQL or similar databases, including replication strategies.
- Familiar with network protocols, load balancing, and security best practices.
- Exposure to Splunk, Datadog, or Dynatrace.
- Strong knowledge of CI/CD pipelines and GitOps workflows.
- Ability to handle multiple incidents under pressure.
Job Type: Full-time
Pay: Up to ₹1,200,000.00 per year
Experience:
- DevOps: 5 years (Preferred)
- Docker, Kubernetes: 5 years (Preferred)
Work Location: In person