Distributed Performance Engineer (DPE)
As a Tier 4 LoB-Facing Internal Consulting Engineer specializing in performance, you will conduct in-depth forensic network and application studies of production issues that have already been investigated by numerous cross-technical Tier 1, 2 and 3 teams yet continue to negatively impact client revenue, profit and/or reputation.
• Independently diagnose the root cause of the production performance issue, relying principally on network packet analysis of business transactions as they cross globally distributed systems in both OnPrem and Public Cloud Data Center tiers, to identify the failed component (software and/or infrastructure) responsible for the failure.
• Author, publish & present detailed formal Findings, Analysis, and Recommendation Reports to the Product Owner and Senior Leadership responsible for the failed component.
• For Infrastructure-based failures, lead the OnPrem and/or Public Cloud (AWS, Azure, Google) Infrastructure Team (compute, network and/or storage) in remediating the failed component (e.g., firewall, circuit, disk).
• For Software-based failures, lead the Application Software Development Team, whether internal or a vendor, in remediating the failed software module (e.g., application code, SQL, Messaging).
1. Interview Customers & Review Prior Incident Reports
• Conduct detailed interviews with the customers to gather information about the poor end user experience and/or slow business transactions.
• Review the incident reports previously written by the infrastructure and/or application teams to understand the initial findings and reported issues.
2. Gather Forensic Evidence Collected by Other Teams
• IP Addresses for each endpoint
• Architecture diagrams of the systems
• Application and infrastructure logs
• Data Center network diagrams
• Performance reports detailing the incident.
3. Create Network Topology
• Identify the Data Center hosting each processing tier.
• Identify the in-between networks (primary & redundant)
• Create a new topology map of the flows for the application.
4. Network Packet Collection Points
• Research network taps and ERSPAN sessions that can capture the traffic of interest.
• Configure packet brokers to collect traffic.
• Collect TCP/IP network packets from strategic network locations where the relevant network traffic traverses the client Backbone Network and/or connects to Public Cloud Platforms (AWS, Azure, GCP).
5. Findings and Analysis Report
• Publish a detailed Findings & Analysis Report for review by the Product Owner of the Processing Tier responsible for the slowdown. The report should include:
• An overview of the profiling results.
• Identified bottlenecks and their locations.
• Any anomalies or irregularities observed during packet analysis.
6. Collaboration for Root Cause Analysis
• Collaborate closely with the Product Team to perform a deeper analysis of the specific circumstances leading to the performance issue.
• Identify the root cause of the problem and recommend remediation steps.
7. Remediation and Resolution
• Work with the responsible Product Owner to implement the recommended remediation steps.
• Ensure that the issue is resolved and the Service Level Agreement (SLA) thresholds are met once again.
• Continue monitoring and adjusting as necessary to maintain performance standards.
Responsibilities
1. Incident Investigation: Lead in-depth investigations into incidents originating from a Line of Business (LoB) Application Team or an Infrastructure Team (e.g., network, compute, or storage), including security breaches, performance degradation, and outages, to uncover the root causes. Utilize forensic methodologies and tools to gather evidence, reconstruct events using time-series data collected in real time during the incident, and analyze packet-based network traffic, NetFlow and/or SNMP data sources.
2. Root Cause Analysis: Conduct thorough examinations of network infrastructure, configurations, logs, and traffic patterns across processing tiers, OnPrem Data Centers, and Public Cloud Platforms (Azure, AWS & Google) to identify underlying issues and vulnerabilities. Collaborate with cross-functional, multi-disciplinary teams to determine the root causes and contributing factors of incidents.
3. Forensic Techniques: Apply advanced forensic techniques, including packet analysis, log analysis, memory forensics, and malware analysis, to extract relevant information and insights from network data.
4. Publish Findings, Analysis & Recommendation Deliverables: Write thorough, clearly written formal papers for customer consumption and produce detailed reports for a broad audience, including infrastructure and application teams as well as business teams and leadership across all Lines of Business such as Treasury, Investment Banking, Digital, Retail or Corporate. The audience will include both technical and non-technical colleagues.
5. Incident Response: Develop and implement incident response procedures and protocols to effectively mitigate and contain network incidents. Provide guidance and support to incident response teams during critical incidents, ensuring timely resolution and minimal impact on business operations.
6. Performance Optimization: Analyze network performance metrics and trends to identify optimization opportunities and areas for improvement. Recommend and implement network enhancements, configurations, and upgrades to enhance performance, reliability, and security.
7. Security Compliance: Stay abreast of industry standards, regulations, and best practices related to network security and forensic analysis. Ensure compliance with relevant security frameworks (e.g., NIST, ISO 27001) and assist in audits and assessments as needed.
Minimum Qualifications
• Strong packet analysis expertise using a real-time application performance monitoring appliance such as Riverbed ARX, LiveAction LiveNX or NetScout InfiniStreamNG, as well as accompanying deep packet forensic tools such as Riverbed SteelCentral Transaction Analyzer or LiveAction LiveCapture.
• Proven experience (5+ years) in network engineering, consulting, and forensic analysis roles, with a focus on incident response and root cause analysis.
• Experience with one or more large-scale enterprise network topologies including LAN, WAN, Wireless, and Network Security. Examples include Routers (Cisco, Juniper, Arista), Switches (Cisco), Wireless (Cisco, Juniper), Transport (Cisco, Ciena), Firewall (Fortinet, Check Point, Cisco), Load Balancers (Cisco, F5), Proxy (Blue Coat), and Public Cloud (AWS, Azure).
• Knowledge of CIDR and subnetting (IPv4 and IPv6); IPv6 transition challenges; and generic solutions for network security features, including AWS WAF, intrusion detection systems (IDS), intrusion prevention systems (IPS), DDoS protection, and economic denial of service/sustainability (EDoS).
• Proficiency in network forensic tools and appliances (e.g., Riverbed SteelCentral Transaction Analyzer and Riverbed ARX, Wireshark, Splunk, Arista Packet Broker) and familiarity with scripting languages (e.g., Python, Perl) for automated analysis.
• Experience with one or more application performance management technologies (AppDynamics, Dynatrace).
• Strong knowledge of statistical techniques/concepts and experience applying them (regression, properties of distributions, statistical tests, etc.).
• Experience with one or more observability, monitoring, and visualization tools (Grafana, Cortex, Splunk, ThousandEyes, Datadog and SevOne).
• Experience with interactive visualization or dashboard frameworks such as Plotly / Dash is an added advantage, as is familiarity with scientific computing Python libraries such as NumPy and data manipulation and analysis libraries such as Pandas.
• Public Cloud Certification required; AWS Advanced Networking Certification preferred.
Salary Range: $100,000-$120,000 a year
Tata Consultancy Services