TechDigital Group

Site Reliability Engineering (SRE)

TechDigital Group, San Leandro


We are seeking an SRE candidate with a strong Java development background and strong hands-on experience building dashboards and setting up alerts using Splunk, Grafana, and GCL.


Required Qualifications:

  • 10+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 10+ years of experience in Production support/Site Reliability Engineering teams with continued focus on improving Platform health
  • Familiar with Agile or other rapid application development practices
  • Hands-on expertise with Automated testing, Process Automation & building dashboards using APM tools.
  • Experience with distributed (multi-tiered) systems, algorithms, relational databases, and NoSQL databases.
  • Knowledge of and exposure to caching tools (Redis, Memcached) or messaging tools such as MQ and Kafka.
  • Must have working knowledge of APM tools such as Splunk, GCL, ELK, Grafana, Prometheus, etc.
  • Able to create dashboards using GCL/Splunk/ELK and set up alerts.
  • Working knowledge of CI/CD is a plus: source control such as Git, and continuous integration with Jenkins, UCD Release, etc.
  • Ability to work with engineering teams across the ecosystem, such as Security, Networking & Infrastructure, on challenges that can impact platform health & resiliency.
  • Shell scripting and DevOps tools such as Ansible, with good knowledge of YAML for writing playbooks.
  • Experience with distributed storage technologies such as NFS, as well as dynamic resource management frameworks such as PCF, Kubernetes/OpenShift, AWS, or Azure.
  • Tech Stack: Java/J2EE (Spring, Spring Boot), Python, Shell Scripting, Kafka, Oracle, MongoDB, etc.
  • Able to work on shift duty in a 12/7 support organization.

Job Expectations:

  • You will be a core member of an SRE support team, using the latest tools to write code and test cases, work with API specs, and automate tasks to maintain the resiliency, performance, and availability of Digital Sales & Marketing platforms.
  • Strong, relevant experience supporting Web/API platforms built on a Java/JavaScript stack (Spring/Spring Boot; JavaScript with Angular/React).
  • Proficiency with legacy infrastructure as well as cloud infrastructure (on-prem & third-party) such as PCF or Azure.
  • Identify opportunities to adopt new technologies, remove toil, and continue to drive efficiency & optimization.
  • Proactively monitor app performance through Splunk, app dashboards, AppDynamics, Dynatrace, etc.
  • Represent Platform engineering teams during production outages and collaborate with engineering teams to resolve them. Work with stakeholders across engineering functions to own/derive RCA and work towards permanent resolution.
  • Plan, support, execute and comply with governance programs/processes in support of a strong control environment in your functional area. Leverage process documentation to improve operational controls and identify and remediate process deficiencies.
  • Proactively identify, communicate, mitigate and escalate risk originating from non-compliance of processes, operational errors, and data integrity issues in all applicable processes.
  • Ability to influence SRE practices within and outside teams to enable a strong DevOps culture within the organization.
  • Responsible for working with Engineering teams to maintain the SLAs & SLOs. Continually look for opportunities to improve platform metrics and communicate them to stakeholders.
  • Exposure to and proficiency in different API styles such as SOAP, REST, microservices, etc.
  • Working knowledge of Unix, Linux and Postman.