Site Reliability Engineering

Location: Houston, TX

Primary Responsibilities:

•           Troubleshoots incidents, conducts blameless post-mortems and ensures permanent closure of incidents.

•           Engages with the development team throughout the software life cycle to help build reliable software.

•           Applies analytics to historical data, such as incidents and usage patterns, to predict issues and take proactive action.

•           Drives adoption of self-healing and resiliency patterns such as circuit breaker and bulkhead.

•           Designs and conducts performance tests, identifies bottlenecks and opportunities for optimization.

•           Defines and drives adoption of best-in-class monitoring frameworks to achieve end-to-end flow monitoring and noiseless alerting.

•           Designs, develops, tests and delivers software to automate manual operational work.

•           Deploys software and product upgrades.

•           Contributes to team delivery, works with the team to complete tasks to a high standard, and actively learns new skills.

•           Maximizes speed of delivery while objectively adhering to the service's error budgets.

•           Manages the effort split between manual operational work and engineering work.

•           Coaches other team members and manages teams as needed.

Required Skills:

•           Excellent debugging and troubleshooting skills.

•           Expert in performance monitoring and capacity management of large systems using various tools.

•           Expert in at least one technology stack (Java/J2EE/Python), with experience designing, coding, testing, and delivering software.

•           Expert in at least one relational database (SQL Server, Oracle, DB2, etc.).

•           Hands-on experience with cloud technologies (Cloud Foundry, Kubernetes, AWS).

•           Hands-on experience with big data services (Hadoop, HDFS, Hive, YARN, HBase, Kafka, ZooKeeper).

•           Working knowledge of Groovy, batch scripting, PowerShell or shell scripting.

•           Experience developing, deploying and debugging distributed systems in Linux and Hadoop environments.

•           Experience with monitoring tools such as AppDynamics (AppD), Splunk, ELK, Geneos.

•           Experience analyzing SLI metrics and performance data, and interpreting and correlating them with SLOs and SLAs.

•           Experience with deployment automation, CI/CD, DevOps, Jenkins, Git, Bitbucket.

•           Experience with cloud/container environments, big data, and analytical tools (Tableau, Alteryx).

•           Expert practitioner in one or more technology domains; may be a cross-domain expert able to solve complex, mission-critical problems within a business or across the firm.

•           Working knowledge of infrastructure components like routers, load balancers and networks.

•           Comfortable working in an Agile environment and proficient in continuous integration and continuous delivery.

•           Solid understanding of microservice design methodologies.

•           Solid analytical and problem-solving skills.

•           A proven team lead with excellent communication skills.

•           Attention to detail and time-management skills.

•           Endless curiosity about applications and application stability.

Job Type: Full-time
Job Location: Houston