jobr.pro

Principal, Software Engineer - Cloud Storage

jobr.pro, Sunnyvale, California, United States, 94087

Position Summary

We are seeking a highly skilled Principal Engineer (Ceph/Scale-Out Storage) with 10+ years of deep technical experience in distributed storage systems. This role focuses on hands‑on architecture, operations, performance tuning, and troubleshooting of multi‑petabyte scale storage clusters in mission‑critical environments. The ideal candidate will have strong expertise across Linux, networking, storage internals, and distributed systems, and the ability to diagnose complex issues spanning hardware, kernel, and storage layers.

What You'll Do

Our Private Cloud Storage Engineering team builds and operates some of the largest‑scale Ceph storage clusters in the industry, supporting mission‑critical applications across Walmart’s global ecosystem. With hundreds of PB of data under management across multiple production clusters, we provide the backbone of reliable, secure, and high‑performance storage for business operations, customer platforms, and innovation workloads. We embrace a culture of deep technical expertise, hands‑on problem solving, and continuous learning, while driving adoption of automation, observability, and next‑generation storage technologies.

Key Responsibilities

Scale‑Out Distributed Storage Architecture

Design, architect, and manage scale‑out distributed storage systems in large production environments. Apply expertise in system performance tuning, data durability optimization (replication/erasure coding), and lifecycle management for petabyte‑scale data. Drive evaluation, selection, and deployment of best‑of‑breed software‑defined storage solutions to meet demanding SLAs for latency, throughput, and availability.

Ceph Storage Architecture & Operations

Architect, deploy, and manage large‑scale Ceph clusters across multiple production sites. Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies. Define upgrade strategy, node rebalancing, and hardware refreshes with minimal downtime. Own end‑to‑end lifecycle management of storage clusters, including OS/kernel tuning, firmware upgrades, and hardware integration.

Performance, Debugging & Troubleshooting

Identify, diagnose, and resolve performance bottlenecks across Ceph/Scale‑Out storage, Linux kernel, networking, and hardware layers. Use tools such as perf, blktrace, iostat, tcpdump, bpftrace, and atop for advanced debugging. Perform deep analysis of OSD, MON, MDS, and RGW performance and optimize cluster parameters. Debug network congestion, packet loss, latency, and RDMA/Ethernet issues impacting storage. Drive root cause analysis for critical production issues and provide long‑term remediation.

Automation & Observability

Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and custom Python/Shell scripts. Develop observability views for real‑time monitoring of IOPS, throughput, latency, and cluster health. Automate alerting, log analysis, and anomaly detection for proactive incident response.

Scalability & Innovation

Design storage solutions to scale to hundreds of nodes and multiple petabytes while ensuring high availability and fault tolerance. Collaborate with compute and networking teams to integrate storage clusters with Kubernetes, OpenStack, and VM workloads. Research and implement new features such as CephFS, RGW S3/Swift gateways, BlueStore optimizations, and RocksDB tuning. Evaluate next‑gen hardware (NVMe SSDs, RDMA NICs, high‑density HDDs) and its impact on storage performance. Benchmark and compare next‑gen server SKUs to select the most appropriate storage hardware.

Security & Compliance

Implement encryption (at‑rest and in‑transit), access controls, and audit mechanisms for secure data management. Ensure compliance with enterprise and regulatory standards (e.g., PCI‑DSS, SOC, HIPAA).

Collaboration & Mentorship

Act as technical SME for storage within the organization, mentoring junior engineers. Collaborate with cross‑functional teams (Compute, Networking, Cloud, Security) to ensure seamless infrastructure integration. Partner with hardware and software stakeholders and the Ceph community to drive adoption of best practices and contribute to open‑source improvements.

Qualifications

15–18 years of experience in scale‑out distributed storage systems, infrastructure engineering, and Linux systems.

10+ years hands‑on experience with Ceph, including architecture, operations, and large‑scale production support.

Proven experience managing clusters at petabyte scale with high performance and resiliency requirements.

Linux Systems: Kernel tuning, cgroups, systemd, process/thread debugging.

Networking: TCP/IP, VLANs, BGP/OSPF, bonding, load balancing, RDMA, jumbo frames.

Storage Internals: LVM, OSD design, BlueStore, RocksDB tuning, journaling, caching layers.

Performance Tools: perf, iostat, atop, strace, tcpdump, Wireshark, eBPF.

Debugging: Core dump analysis, kernel crash dump (kdump), system call tracing.

Proficiency in Python and Shell scripting for automation and tooling.

Hands‑on experience with configuration management (Ansible, Salt, Puppet) and IaC tools like Terraform.

Knowledge of containerization (Docker, Kubernetes, LXC) and storage backends (CSI, RBD).

Experience with monitoring and logging stacks (Prometheus, Grafana, ELK, OpenObserve).

Familiarity with cloud platforms (Azure, GCP, OpenStack, AWS) and hybrid cloud storage.

Preferred Skills

Contributions to the Ceph community or other distributed storage projects. Experience with large‑scale data replication, backup, and disaster recovery strategies. Exposure to AI/ML workloads on scale‑out storage and performance optimization for GPU clusters. Familiarity with hardware accelerators (NVMe‑oF, SPDK, DPDK).

Location

1345 Crossman Ave, Sunnyvale, CA 94089-1114, United States of America
