High-Performance Network Engineer (Infiniband)
Aptly Technology
All India, Delhi • 2 months ago
Experience: 5 to 9 Yrs
Job Description
As an InfiniBand Engineer, you will design, deploy, and support high-performance, low-latency network infrastructures. The role requires hands-on experience with InfiniBand fabrics, data center networking, and large-scale distributed computing environments such as HPC, AI, and ML clusters.
**Key Responsibilities:**
- Design, implement, and manage large-scale InfiniBand (IB) fabrics in data center and HPC environments.
- Configure and troubleshoot InfiniBand switches and adapters (e.g., Mellanox / NVIDIA IB platforms).
- Perform fabric bring-up, subnet management (OpenSM), partitioning, and performance tuning.
- Monitor and optimize network performance, latency, throughput, and congestion control.
- Integrate InfiniBand with Ethernet-based networking environments.
- Support RDMA technologies (RoCE, iWARP) and GPUDirect environments.
- Collaborate with system, storage, and compute teams to support AI/ML and distributed workloads.
- Perform firmware upgrades, patching, and capacity planning.
- Troubleshoot Layer 2 / Layer 3 networking issues (BGP, OSPF, VLAN, VXLAN, etc.).
- Maintain documentation, network diagrams, and SOPs.
**Required Skills & Qualifications:**
- 5+ years of networking experience with solid fundamentals (TCP/IP, routing, switching).
- Hands-on experience with InfiniBand technologies (HDR/NDR preferred).
- Experience with NVIDIA / Mellanox Technologies switches and adapters.
- Strong understanding of RDMA, congestion control, QoS, and low-latency tuning.
- Experience with subnet managers (OpenSM) and fabric diagnostic tools.
- Solid understanding of BGP, OSPF, EVPN-VXLAN, and MPLS (good to have).
- Experience in HPC, AI/ML cluster networking environments is highly preferred.
- Familiarity with Linux networking and troubleshooting tools.
- Experience with automation (Python, Ansible) is a plus.
**Preferred Qualifications:**
- Experience supporting large GPU clusters.
- Knowledge of storage networking (NVMe-oF, parallel file systems).
- Experience with monitoring tools and telemetry systems.
- Networking certifications (CCNP/CCIE or equivalent).
In addition to the above, the company values strong analytical and troubleshooting skills, the ability to work in high-performance, mission-critical environments, excellent documentation and communication skills, and a proactive problem-solving mindset.
Skills Required
advanced networking technologies
HPC
TCP/IP
routing
switching
QoS
BGP
OSPF
MPLS
Ansible
monitoring tools
CCNP
CCIE
InfiniBand (IB) fabrics
data center networking
AI/ML cluster networking environments
Mellanox / NVIDIA IB platforms
RDMA technologies
RoCE
iWARP
GPUDirect environments
HDR/NDR
congestion control
low-latency tuning
subnet managers (OpenSM)
fabric diagnostic tools
EVPN-VXLAN
Linux networking
automation (Python)
GPU clusters
storage networking (NVMe-oF)
parallel file systems
telemetry systems
Posted on: March 7, 2026