TechDigital Group
Seeking an SRE candidate with a strong Java development background and strong hands-on experience building dashboards and setting up alerts using Splunk, Grafana, and GCL.
Required Qualifications:
- 10+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
- 10+ years of experience in Production support/Site Reliability Engineering teams with continued focus on improving Platform health
- Familiar with Agile or other rapid application development practices
- Hands-on expertise with Automated testing, Process Automation & building dashboards using APM tools.
- Experience with distributed (multi-tiered) systems, algorithms, relational databases, and NoSQL databases.
- Knowledge of and exposure to caching tools (Redis, memcache) or messaging tools such as MQ and Kafka.
- Must have working knowledge of APM tools such as Splunk, GCL, ELK, Grafana, Prometheus, etc.
- Able to create dashboards using GCL/Splunk/ELK and set up alerts.
- Working knowledge of CI/CD is a plus: source control such as Git, and continuous integration and release tools such as Jenkins and UCD.
- Ability to work with engineering teams across the ecosystem, such as Security, Networking & Infrastructure, on challenges that can impact platform health & resiliency.
- Shell scripting and DevOps tools such as Ansible, with good knowledge of YAML for writing playbooks.
- Experience with distributed storage technologies such as NFS, as well as dynamic resource management frameworks such as PCF, Kubernetes/OpenShift, AWS, or Azure.
- Tech Stack: Java/J2EE (Spring, Spring Boot), Python, Shell Scripting, Kafka, Oracle, MongoDB, etc.
- Able to work on shift duty in a 12/7 support organization.
Job Expectations:
- You will be a core member of an SRE support team, using the latest technology tools to write code and test cases, work with API specs, and automate tasks to maintain the resiliency, performance, and availability of Digital Sales & Marketing platforms.
- Strong, relevant experience supporting Web/API platforms built on a Java/JavaScript stack (Spring/Spring Boot; JavaScript with Angular/React).
- Proficiency with legacy infrastructure as well as cloud infrastructure (on-prem & third-party) such as PCF or Azure.
- Identify opportunities to adopt new technologies, remove toil, and continue to drive efficiency & optimization.
- Proactively monitor application performance through Splunk, application dashboards, AppDynamics, Dynatrace, etc.
- Represent Platform engineering teams during production outages and collaborate with engineering teams to resolve production outages. Collaborate with stakeholders across engineering functions to own/derive RCA & work towards permanent resolution.
- Plan, support, execute and comply with governance programs/processes in support of a strong control environment in your functional area. Leverage process documentation to improve operational controls and identify and remediate process deficiencies.
- Proactively identify, communicate, mitigate and escalate risk originating from non-compliance of processes, operational errors, and data integrity issues in all applicable processes.
- Ability to influence SRE practices within and outside teams to enable a strong DevOps culture within the organization.
- Responsible for working with engineering teams to maintain SLAs & SLOs, continually looking for opportunities to improve platform metrics and communicating them to stakeholders.
- Exposure to and proficiency in different API styles such as SOAP, REST, and microservices.
- Working knowledge of Unix, Linux and Postman.