Our US-based client is looking for a Senior Cloud Ops Engineer / Site Reliability Engineer who will be a key player in ensuring the performance, reliability, and scalability of their cloud-based platform. Backed by leading global investors like SoftBank, our client develops cutting-edge technology solutions that are shaping the future of their industry. You will oversee AWS infrastructure operations, manage Terraform deployments, and develop scripts to enhance system troubleshooting and performance.
Proactive monitoring, alerting, and collaboration with engineering teams will be essential as you support our client's mission to deliver industry-leading positioning technology. Operating in a dynamic, growth-focused environment, you will also play a critical role in customer issue resolution and system optimization, ensuring the platform meets the needs of large-scale, 24x7 deployments for safety and efficiency in construction and beyond.
Job Responsibilities
Ensure the smooth operation and reliability of our client's AWS infrastructure, supporting 24x7 cloud platform operations.
Monitor system performance and proactively identify, resolve, and document system issues.
Manage Terraform deployments for integration and production environments, ensuring efficient Infrastructure as Code practices.
Develop and maintain scripts for system troubleshooting and performance enhancement.
Act as an outstanding troubleshooter, leveraging tools like CloudWatch logs, performance metrics, and customer issue reports to identify root causes.
Set up and maintain alerting mechanisms to detect anomalies and performance issues.
Document and communicate system issues effectively to the engineering team.
Be available on-call to support customer issues when needed.
Take ownership of tasks from concept through deployment, ensuring accountability and high-quality execution.
Work in an environment that supports personal and professional growth.
Must-Have Attributes/Skills
Advanced expertise in AWS Cloud, including services such as Cognito, Lambda, DynamoDB, API Gateway, IAM, and CloudWatch.
Strong knowledge of cloud infrastructure deployment using Infrastructure as Code tools like Terraform.
Experience with CI/CD tools such as Jenkins or GitHub Actions.
Proficiency in debugging complex software applications, with skills in performance tuning and profiling.
Coding proficiency, preferably in Python, but other automation-capable languages are acceptable.
Experience with monitoring tools like CloudWatch, New Relic, Prometheus, or Grafana.
Ability to set up robust alerting mechanisms to monitor system health and detect anomalies.
Development experience with Linux/Unix platforms.
Strong communication skills in English, both written and spoken, with the ability to escalate critical issues effectively.
Flexibility to collaborate with a US-based software development team in the Pacific Time Zone.
Should-Have Attributes/Skills
Familiarity with additional AWS services such as Managed Flink, SES, SNS, IoT Core, and Timestream.
Strong understanding of microservice architectures, containerization (e.g., Docker), and container orchestration systems.
Skills to ensure systems are operationally ready and to reduce mean time to recovery (MTTR).
Ability to interface directly with business management and customers to address technical concerns.
Nice-to-Have Attributes/Skills
Bachelor's degree in Computer Science, Engineering, or a related field.
AWS Certified Solutions Architect (Associate or Professional).
AWS Certified DevOps Engineer (Professional).
Experience with web design, development, or software application development (web or mobile).
Knowledge of HTML5, CSS, JavaScript, web services (REST), XML, or OpenAPI.