Key Responsibilities

Design, implement, and maintain monitoring, logging, and alerting solutions across production and non-production environments.
Lead incident response and post-mortem analysis, ensuring best practices are followed in problem resolution.
Build and test disaster recovery strategies.
Collaborate with teams to define and implement SLAs for critical services.
Optimize cloud infrastructure for performance, reliability, and cost.
Develop automation for deployment, scaling, and recovery procedures.
Manage infrastructure with Terraform, GitLab CI/CD, and Kubernetes.
Participate in on-call rotations for incident response.

Must-Have Skills

4+ years in SRE, DevOps, or similar roles.
Strong coding/scripting skills in Python and Bash/shell.
Hands-on experience with Chef and Ansible (recipes, cookbooks, playbooks).
Deep expertise in AWS services (EC2, EKS, RDS, Cognito, CloudWatch).
Solid knowledge of Kubernetes administration in production environments.
Experience with IaC tools (Terraform / CloudFormation).
Strong understanding of observability tools: Prometheus, Grafana, ELK, distributed tracing.
Skilled in PostgreSQL or similar databases, including replication strategies.
Familiar with network protocols, load balancing, and security best practices.
Exposure to Splunk, Datadog, Dynatrace.
Strong knowledge of CI/CD pipelines and GitOps workflows.
Ability to handle multiple incidents under pressure.

Job Type: Full-time

Pay: Up to ₹1,200,000.00 per year

Experience:

Work Location: In person
