Vishwanath Kamble
SRE @ iManage
Site Reliability Engineer

Building and scaling infrastructure that powers AI, cloud-native platforms, and enterprise systems — from Kubernetes clusters to GPU-accelerated ML pipelines.

Chicago, IL
LinkedIn
9+ Years Experience
3 Companies
5 Certifications
3 Cloud Platforms
Certifications: RHCSA, AWS Solutions Architect, CKAD, Azure Fundamentals
Career

Experience in depth

Nine years building infrastructure at scale — from Wall Street trading floors to AI-powered enterprise platforms.

iManage
Site Reliability Engineer Specialist
Dec 2023 — Present
Chicago, IL
GPU Infrastructure & AI Model Deployment
Deployed and managed GPU SKUs across the old architecture, testing MIG profiles and multiple SKU configurations. Replaced decommissioned GPUs with newer models, dramatically improving ML inference performance.
300 → 12 req/sec
Multi-Cluster AKS Orchestration
Led comprehensive upgrades of AKS, Istio, and Terraform providers across three distinct architectures. Deployed two new hub clusters from scratch for three environments and onboarded foundational code for our dev team.
Observability & ECK Transformation
Developed comprehensive ECK (Elastic Cloud on Kubernetes) resources: repos, dependent charts, service accounts with minimal permissions, and ArgoCD applications. Implemented Prometheus Pushgateway for ephemeral job metrics, eliminating blind spots in short-lived processes.
OpenAI & AI Platform Enablement
Deployed GPT-4o models across all environments for Data Science teams. Provisioned managed deployments for production, configured Redis cache for ML data acceleration, and set up Grafana OpenAI dashboards.
Reference Architecture Modernization
Redesigned the ServiceBus module, consolidating individual per-team modules into a unified, reusable architecture. Created protected production branches with automated tagging workflows and fixed critical storage class retain policies.
SRE Tooling & Automation
Engineered a debug container image with curated troubleshooting tools, reducing MTTR. Built helm-diff PR scripts, an AKS log shipper for audit logs, Vault troubleshooting scripts, and automated Azure Blob lifecycle management.
Instabase
Product / Site Reliability Engineer
Jun 2022 — Dec 2023
Remote
Multi-Cloud Platform Deployment
Orchestrated the Instabase platform deployment across AWS, Azure, and GCP for three major enterprise clients. Standardized the Terraform modules to support diverse environments (OpenShift vs. Native Cloud), ensuring a consistent delivery pipeline regardless of the underlying infrastructure provider.
3 Enterprise Clients
S3 Storage Cost Optimization
Reduced S3 storage costs by 87% through implementing lifecycle management policies, versioning rules, and archiving dead buckets. This initiative shrank the storage footprint dramatically without impacting data availability for ongoing customer operations.
235 TB → 77 TB
Real-Time Alerting & EKS Upgrades
Built a Slack bot to monitor "zombie" sandboxes, identifying and decommissioning over 15 stale environments to eliminate cloud waste. Led critical EKS upgrades (1.21 → 1.23) and migrated storage classes from EBS to EFS, ensuring zero downtime for developer workflows.
15+ Envs Pruned
Container Image & Security Cleanup
Designed a Python script to identify unused ECR images across active EKS clusters. Implemented ECR lifecycle policies for automated cleanup and addressed vulnerabilities by upgrading base images, significantly reducing the attack surface.
140K → 2K Images
Citadel
Sr. Infrastructure Engineer & SME Linux
Jan 2020 — May 2022
Chicago, IL / London, UK
Automation at Scale
Automated ticket handling on Jira and ServiceNow — creating, updating, and closing over 10,000 tickets in 5 months. Engineered scripts to fix YAML config errors across hundreds of VMs, saving hundreds of hours of manual engineering time.
10,000+ tickets
Linux SME Program & Mentorship
Enabled the team's expansion to a 24/7 "Follow-the-Sun" schedule by standardizing global knowledge. Authored 80+ Wiki pages of deep-dive troubleshooting guides and SOPs. Personally trained 7 new hires across Austin, Chicago, London, and Hong Kong—traveling internationally to ensure consistent operational standards.
Global Training Lead
Server Hardware Lifecycle
Designed a comprehensive lifecycle tracking system that interfaced with datacenter inventory APIs. This automation proactively flagged warranty expirations and end-of-life assets, eliminating manual spreadsheet toil and ensuring 100% compliance with hardware refresh policies for the global fleet.
90% time saved
K8s Rollout & Config Hygiene
Served as the Operations liaison for the "Trailblazer" Kubernetes deployment. Developed automated sanitation scripts to enforce naming conventions and data sovereignty—programmatically detecting and moving misfiled server YAMLs (e.g., relocating JPN assets from NYC folders) to their correct datacenter paths.
Trailblazer L1 Support
Citadel
Infrastructure Operations Developer & Intern
Jun 2016 — Dec 2019
Chicago, IL / New York, NY
Luna — Infra Management Platform
Built the central tooling hub adopted by all infrastructure teams (Windows, Network, Storage). Features included a firm-wide "Server Lookup" (locating specific racks/cabinets) and Zabbix API integration to automate maintenance windows, preventing false alerts during patching.
100% Team Adoption
Network Monitoring & Datacenter Ops
Developed a custom Python application to monitor over 500 network devices (Cisco, Palo Alto, Arista, Meraki). The tool used REST APIs to poll device health in real time, providing a unified dashboard for internet circuits, VPN tunnels, and telecom status across global offices.
500+ Devices
Automated Incident Response
Scripted an automated BGP peering check that detected circuit outages instantly. The system auto-created ServiceNow tickets and emailed ISPs with the exact Circuit ID and downtime logs, drastically reducing Mean Time to Detect (MTTD) and eliminating manual vendor triage.
Auto-Vendor Ticketing
Night Shift Operations
Managed the entire global infrastructure solo for 12 months, ensuring uptime for critical trading systems during off-hours. Following this tenure, authored the night-shift training curriculum and mentored two junior engineers to expand the rotation into a sustainable team.
Sole Operator (1 Year)
Expertise

Technical skills

Infrastructure & Cloud
Kubernetes, Docker, Helm, Terraform, Terragrunt, AWS, Azure, GCP, OpenShift
Observability & DevOps
Prometheus, Grafana, Jaeger, Kibana, Elasticsearch, ArgoCD, HashiCorp Vault, Fluent Bit, Kiali
Languages & Frameworks
Python, Bash, Go, JavaScript, Flask, REST APIs, HTML5 / CSS3, C++
Networking & Linux
Istio, BGP / OSPF, Cisco Switches, Palo Alto Firewalls, Linux Admin, Red Hat, Splunk
AI / ML Infrastructure
GPU Management, OpenAI / GPT-4o, Document Intelligence, Redis Cache, MIG Profiles, Azure Cognitive Services
CI/CD & Automation
GitHub Actions, Jira, ServiceNow, Snyk, ECR Lifecycle, GitOps
Education

Academic background

Master of Science in Computer Science
Stevens Institute of Technology, Hoboken, NJ
GPA: 3.59 — December 2016
Bachelor of Engineering in Computer Engineering
University of Mumbai, India
August 2014
Side Projects

Beyond the day job

Udemy Instructor
Launched and published 5 courses on AWS and Python, building a community of learners in cloud and automation fundamentals.
84,000+ students · 900+ reviews
GeekStartS
Founded a WordPress-based tech blog for beginners, self-teaching WordPress and optimizing performance by deploying on cloud CDNs.