Hi, I'm Chao Li
AI Engineer & Distributed Systems Architect
Building the infrastructure that powers AI. Specializing in LLM training pipelines, distributed systems, and high-performance computing that handles millions of operations per second.
From Distributed Systems to AI Infrastructure
I'm a Principal Engineer with 9+ years of experience building high-performance distributed systems. My journey has taken me from startups to unicorns to global enterprises, where I've consistently delivered systems that push the boundaries of scale and efficiency.
Currently at VMware/Broadcom, I lead LLM infrastructure projects that have achieved 10x improvements in customer onboarding speed. I specialize in the intersection of distributed systems and machine learning—building the infrastructure that makes AI work at scale.
When I'm not at work, I'm building my own ML infrastructure in my homelab—from C++ backtesting engines with microsecond latency to Ray-based distributed training platforms running on Kubernetes.
My Journey
Current Focus
LLM Infrastructure
Building training and inference pipelines for large language models with distributed systems on Ray and Kubernetes.
Post-Training & RLHF
Implementing reinforcement learning from human feedback, PPO, DPO, and alignment techniques for production models.
High-Performance Systems
Designing systems handling 1M+ events/sec with microsecond latency using C++, shared memory, and async patterns.
Agentic AI Systems
Creating self-reflective multi-agent workflows with LangGraph for complex automation and code generation tasks.
Building Systems at Scale
From startups to global enterprises, I've built distributed systems and ML infrastructure that power millions of operations per second.
VMware (Broadcom)
Principal Engineer (Current)
Palo Alto, California
LLM Migration Automation
Self-reflective agentic AI system for automating F5 to Avi load balancer configuration conversion.
Distributed File Object System
Unified file management platform migrated from Python to Golang, enabling centralized file lifecycle control across a multi-cluster control plane.
Web Security CSRF Module
CSRF protection module integrated into the nginx module chain, with a full-stack implementation from control plane to runtime.
Argo AI
Software Engineer
Palo Alto, California
Onboard Logging Infrastructure
High-performance logging system for autonomous vehicles with zero-copy data paths and tiered event handling
Avi Networks → VMware
Staff Engineer
Palo Alto, California
Distributed Analytics Platform
High-throughput log analytics system with async indexing pipeline and custom query DSL, processing 1M+ events/sec
What I've Built
From high-frequency trading systems to LLM infrastructure—engineering solutions that push performance boundaries.
LLM Migration Automation
Featured @ vmware-broadcom
Self-reflective agentic AI system for automating F5 to Avi load balancer configuration conversion.
Stock Strategy Backtesting Platform
Featured
End-to-end ML infrastructure product for quantitative trading strategy development and analysis.
Distributed Analytics Platform @ avi-networks-vmware
High-throughput log analytics system with async indexing pipeline and custom query DSL, processing 1M+ events/sec
Distributed File Object System @ vmware-broadcom
Unified file management platform migrated from Python to Golang, enabling centralized file lifecycle control across a multi-cluster control plane.
Distributed Training Platform
Ray-based ML training infrastructure for stock prediction models, running on a homelab Kubernetes cluster with dynamic resource allocation and fault tolerance.
High-Performance ETL Pipeline
Ray-based data processing pipeline for stock market data ingestion and feature engineering, handling terabyte-scale historical quotes with columnar storage optimization for ML training.
Onboard Logging Infrastructure @ argo-ai
High-performance logging system for autonomous vehicles with zero-copy data paths and tiered event handling
Web Security CSRF Module @ vmware-broadcom
CSRF protection module integrated into the nginx module chain, with a full-stack implementation from control plane to runtime.
Technical Expertise
Nearly a decade of building high-performance systems across the full stack, from low-level C++ to distributed ML infrastructure.
Languages
ML & LLM
Infrastructure
Data & Storage
High Performance
Web & APIs
Let's Connect
Interested in discussing AI infrastructure, distributed systems, or potential collaborations? I'm always happy to chat about technology and engineering challenges.
Have a project in mind?
Whether it's building ML infrastructure, optimizing distributed systems, or exploring new AI applications—let's talk.
Send me an email