Optimizing LLMs for Edge Devices: A GCP & Hugging Face Tutorial
Ashwin Kashyap
This tutorial provides a guide on using Google Cloud Platform (GCP) and Hugging Face Transformers to fine-tune a small Large Language Model (LLM), and then apply techniques like distillation, quantization, and pruning to make it suitable for resource-constrained environments such as mobile phones and web browsers (via JavaScript).
1. Introduction to Google Cloud Platform (GCP)
What is GCP? Google Cloud Platform is a suite of cloud computing services offered by Google. It provides a wide array of services in areas like computing, storage, databases, networking, machine learning, and more, running on the same infrastructure that Google uses to run its end-user products (like Google Search, Gmail, and YouTube).
Key GCP Services for ML: For this workflow, the following GCP services are particularly relevant:
Vertex AI: A unified MLOps platform to build, deploy, and manage ML models.
Vertex AI Workbench: Jupyter notebook-based development environment for ML.
Vertex AI Training: For running custom training jobs at scale, with support for GPUs and TPUs.
Vertex AI Prediction: For deploying models and serving predictions.
Compute Engine: Provides virtual machines (VMs) that can be configured with various CPU, GPU, and memory options. Useful for development and custom training setups.
Cloud Storage: Scalable and durable object storage for your datasets, model checkpoints, and other artifacts.
Artifact Registry: A service to store and manage container images and language packages (like Docker images for your training environment).
Why use GCP for this workflow?
Scalability: Easily scale your compute resources up or down based on the demands of training and experimentation.
Specialized Hardware: Access to powerful GPUs (NVIDIA A100, T4, V100) and TPUs, which can significantly accelerate model training.
Managed Services: Vertex AI simplifies many MLOps tasks, allowing you to focus more on model development.
Integration: Seamless integration between various GCP services (e.g., loading data from Cloud Storage into a Vertex AI training job).
2. Setting Up Your Environment on GCP
Create a GCP Project: If you don’t have one, create a new project in the Google Cloud Console.
Enable APIs: Ensure that the following APIs are enabled for your project:
Compute Engine API
Vertex AI API
Cloud Storage API
Artifact Registry API
Set up Authentication & gcloud CLI:
Install the Google Cloud SDK (which includes the gcloud command-line tool).
Authenticate: gcloud auth login and gcloud auth application-default login.
Configure your project: gcloud config set project YOUR_PROJECT_ID.
Choose your Development Environment:
Vertex AI Workbench:
Navigate to Vertex AI in the Cloud Console.
Go to “Workbench” and create a new “Managed notebooks” or “User-managed notebooks” instance.
Choose an instance type with appropriate CPU, RAM, and optionally a GPU.
Select a suitable environment (e.g., PyTorch, TensorFlow) or customize it.
Compute Engine VM:
Navigate to Compute Engine.
Create a new VM instance.
Select a machine type (e.g., n1-standard-4 or a GPU-enabled instance).
Choose a boot disk image (e.g., “Deep Learning on Linux”).
SSH into your VM and install the necessary libraries, for example:
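A typical setup on a fresh Deep Learning VM might look like the following sketch; the library list is illustrative, and you should pin versions to match your framework and CUDA installation:

```bash
# Illustrative environment setup; the Deep Learning image already ships with
# PyTorch/TensorFlow, so only add what your workflow needs.
pip install --upgrade transformers datasets accelerate evaluate
pip install "optimum[onnxruntime]"
```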
3. Fine-Tuning a Small LLM with Hugging Face Transformers
Choosing a Small Base Model: Start with a pre-trained model that is already relatively small. Examples:
BERT-based: distilbert-base-uncased, google/mobilebert-uncased, prajjwal1/bert-tiny
GPT-2-based: gpt2 (the smallest version), or distilled variants if available.
Other architectures: Models specifically designed for efficiency.
Preparing Your Dataset:
Your dataset should be formatted appropriately for your task (e.g., text classification, question answering).
Use the Hugging Face datasets library to load and preprocess your data.
Upload your dataset to a Cloud Storage bucket for easy access from your GCP environment.
The Fine-Tuning Script: Use the Hugging Face Trainer API for a high-level training loop, or write a custom PyTorch/TensorFlow training loop.
Example Conceptual Structure (using Trainer):
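A minimal sketch, assuming a binary text-classification task; the dataset (imdb), hyperparameters, and output paths below are placeholders rather than recommendations:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load and tokenize the data (swap in your own dataset, e.g. one staged in Cloud Storage).
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)

trainer.train()
trainer.evaluate()

# Save the fine-tuned model and tokenizer for the later optimization steps.
trainer.save_model("./finetuned-model")
tokenizer.save_pretrained("./finetuned-model")
```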
Leveraging GCP for Training:
Vertex AI Training: Package your training script into a Docker container and submit it as a custom training job. This allows you to specify machine types (including GPUs/TPUs) and run training without managing infrastructure directly.
Compute Engine with GPU: If using a VM, ensure you’ve selected an instance with a GPU and have the necessary NVIDIA drivers and CUDA toolkit installed. The accelerate library from Hugging Face can help simplify distributed training.
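As a sketch of the Vertex AI path, assuming you have already pushed a training image to Artifact Registry; the project ID, region, bucket, image URI, and machine/accelerator choices below are placeholders:

```python
# Submitting the containerized training script as a Vertex AI custom job
# via the Python SDK (google-cloud-aiplatform).
from google.cloud import aiplatform

aiplatform.init(
    project="YOUR_PROJECT_ID",
    location="us-central1",
    staging_bucket="gs://your-bucket",
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="llm-finetune",
    # Docker image containing your fine-tuning script, stored in Artifact Registry.
    container_uri="us-docker.pkg.dev/YOUR_PROJECT_ID/llm/train:latest",
)

job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```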
4. Knowledge Distillation
Concept: Knowledge distillation is a model compression technique where a smaller “student” model is trained to mimic the behavior of a larger, pre-trained “teacher” model. The student learns from the teacher’s soft labels (probabilities for each class) or intermediate representations, in addition to the true labels.
How it works with Hugging Face:
You’ll need a fine-tuned teacher model (can be a larger, more accurate model) and a smaller student model architecture.
The loss function is typically a combination of:
Standard cross-entropy loss with the ground truth labels.
A distillation loss (e.g., Kullback-Leibler divergence) between the student’s and teacher’s output probability distributions.
Hugging Face doesn’t have a universal KnowledgeDistillationTrainer for all tasks, but you can implement it by:
Subclassing the Trainer and overriding compute_loss.
Writing a custom training loop.
Example Conceptual compute_loss for Distillation (PyTorch):
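A minimal sketch of the subclassing approach for a sequence-classification task; the class name, temperature, and alpha weighting are illustrative choices, not part of the Transformers API:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, temperature=2.0, alpha=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumes the teacher has already been moved to the same device as the student.
        self.teacher = teacher_model.eval()
        self.temperature = temperature
        self.alpha = alpha

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        student_logits = outputs.logits

        with torch.no_grad():
            teacher_logits = self.teacher(**inputs).logits

        # Soft-target loss: KL divergence between temperature-scaled distributions.
        distill_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction="batchmean",
        ) * (self.temperature ** 2)

        # Hard-target loss: standard cross-entropy with the ground-truth labels.
        ce_loss = F.cross_entropy(student_logits, labels)

        loss = self.alpha * distill_loss + (1.0 - self.alpha) * ce_loss
        return (loss, outputs) if return_outputs else loss
```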
Benefits:
Can significantly reduce model size while retaining a good portion of the teacher’s performance.
Often leads to faster inference.
5. Quantization
Concept: Quantization involves reducing the numerical precision of the model’s weights and/or activations (e.g., from 32-bit floating-point (FP32) to 8-bit integer (INT8) or FP16).
Benefits:
Smaller model size: INT8 models are roughly 4x smaller than FP32.
Faster inference: Integer operations are often faster on CPUs and specialized hardware (like mobile NPUs).
Reduced power consumption.
Tools & Techniques:
Hugging Face Optimum: A library that extends transformers and provides optimization tools, including quantization through backends like ONNX Runtime and Intel Neural Compressor.
Example Post-Training Dynamic Quantization with Optimum & ONNX Runtime:
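A sketch using Optimum’s ONNX Runtime backend, assuming a fine-tuned sequence-classification model; the paths and the AVX-512 VNNI configuration are illustrative and should be matched to your target CPU:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "./finetuned-model"  # placeholder path to your fine-tuned model

# Export the fine-tuned model to ONNX.
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
ort_model.save_pretrained("./onnx_model")

# Apply post-training dynamic INT8 quantization.
quantizer = ORTQuantizer.from_pretrained(ort_model)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./onnx_model_quantized", quantization_config=dqconfig)
```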
Post-Training Static Quantization (PTSQ): Requires a calibration dataset to determine scaling factors. Generally more accurate than dynamic quantization. Optimum supports this with ONNX Runtime and Neural Compressor.
Quantization-Aware Training (QAT): Simulates quantization effects during the fine-tuning process. Can lead to better accuracy but is more complex to implement. PyTorch and TensorFlow have QAT utilities.
PyTorch: torch.quantization module.
TensorFlow: tfmot.quantization.keras (TensorFlow Model Optimization Toolkit).
6. Pruning
Concept: Pruning involves removing redundant or less important weights, neurons, or even larger structures (like attention heads or layers) from the model.
Benefits:
Reduced model size.
Potentially faster inference (especially with structured pruning).
Types & Tools:
Unstructured Pruning (Weight Pruning): Individual weights are set to zero. Can lead to sparse models. Requires specialized hardware or libraries for speedup.
Structured Pruning: Entire neurons, channels, or attention heads are removed. This often results in a smaller, dense model that can run faster on standard hardware.
Hugging Face Optimum: Can leverage neural-compressor for some pruning techniques.
PyTorch: torch.nn.utils.prune module for implementing various pruning techniques (magnitude, L1 unstructured, etc.).
TensorFlow Model Optimization Toolkit (TF MOT): tfmot.sparsity.keras provides tools for pruning during or after training.
Example Conceptual Pruning (PyTorch - Magnitude Pruning):
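A minimal sketch that applies L1 (magnitude) unstructured pruning to every linear layer of a fine-tuned model; the 30% sparsity target and model path are illustrative:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./finetuned-model")

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% of weights with the smallest absolute value.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Make the pruning permanent by folding the mask into the weight tensor.
        prune.remove(module, "weight")

model.save_pretrained("./pruned-model")
```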
After pruning, the model usually needs to be fine-tuned again for a few epochs to recover any lost accuracy; alternating pruning and retraining in this way is known as iterative pruning.
7. Exporting and Running on Client Devices
After applying these optimization techniques, you need to export the model to a format suitable for client-side inference.
Common Formats:
ONNX (Open Neural Network Exchange): An open format for ML models. Widely supported and can be run with ONNX Runtime.
TensorFlow Lite (TFLite): Optimized for mobile and edge devices. Offers good performance and small binary size.
A. For In-Browser (JavaScript):
ONNX Runtime Web:
Convert your PyTorch/TF model to ONNX (as shown in the quantization section).
The quantized ONNX model is ideal here.
Use the onnxruntime-web library in your JavaScript project.
Input preparation (tokenization) needs to be replicated in JavaScript. You can use a JS tokenizer compatible with your Hugging Face tokenizer or implement one yourself. Libraries like tokenizers.js (from Hugging Face) or simpler custom tokenizers can be used.
TensorFlow.js (TFJS):
Convert your TensorFlow model to TFJS format, or TFLite model to TFJS.
Use the tfjs-tflite package to run TFLite models.
Transformers.js (by Xenova/Hugging Face):
This library allows you to run many Hugging Face Transformers models directly in the browser. It handles model conversion (to ONNX) and tokenization automatically for supported models.
It’s the easiest way if your model architecture and tokenizer are supported.
B. For Mobile Phones (Android/iOS):
TensorFlow Lite (TFLite):
Convert your model (PyTorch/TF -> ONNX -> TFLite or TF -> TFLite).
Integrate the .tflite model into your Android (Java/Kotlin) or iOS (Swift/Objective-C) app using the TensorFlow Lite SDK.
Handle tokenization natively or use pre-built libraries.
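As a sketch of the TF -> TFLite step, assuming your fine-tuned model is available as a TensorFlow SavedModel (the paths below are placeholders; a PyTorch model would first need to go through ONNX or a TF re-export):

```python
import tensorflow as tf

# Convert a SavedModel to TFLite with default (e.g., dynamic-range) optimizations.
converter = tf.lite.TFLiteConverter.from_saved_model("./tf_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```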
ONNX Runtime Mobile:
Use the quantized ONNX model.
ONNX Runtime has prebuilt packages for Android and iOS.
Supports various hardware accelerators (NNAPI for Android, Core ML for iOS).
Core ML (iOS):
Convert your model to Core ML format using coremltools.
Integrate it into your iOS app.
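A hedged sketch of the coremltools path, assuming a PyTorch sequence-classification model traced with fixed-shape inputs; the input names, sequence length, and file paths are placeholders:

```python
import coremltools as ct
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "./finetuned-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torchscript=True makes the model return tuples, which torch.jit.trace expects.
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True).eval()

# Trace the model with example inputs at a fixed sequence length.
example = tokenizer("example input", return_tensors="pt", padding="max_length", max_length=128)
traced = torch.jit.trace(model, (example["input_ids"], example["attention_mask"]))

# Convert the traced model to a Core ML program.
mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32),
        ct.TensorType(name="attention_mask", shape=(1, 128), dtype=np.int32),
    ],
)
mlmodel.save("Model.mlpackage")
```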
Considerations for Client-Side Deployment:
Tokenization: This is a critical step. The exact same tokenization logic used during training must be replicated on the client. This can be challenging in JavaScript or mobile native code.
Consider using SentencePiece or byte-pair encoding (BPE) tokenizers that have JavaScript/native implementations.
The Hugging Face tokenizers library has a Rust core with bindings for Node.js, and it can be compiled to WASM for the browser.
Model Size: Even after optimization, ensure the model is small enough for reasonable download times and memory usage.
Inference Speed: Test on target devices.
UI/UX: Provide feedback to the user during model loading and inference.
8. Workflow Summary and Best Practices
The process is often iterative:
Baseline: Fine-tune a small LLM on GCP. Evaluate its performance and size.
Distill (Optional but Recommended): If you have a larger, more accurate teacher model, distill knowledge into your smaller student model. Evaluate.
Prune (Optional): Apply pruning techniques. Evaluate. May require some retraining.
Quantize: Apply post-training quantization (dynamic or static) or QAT. Evaluate accuracy and performance. This is often the most impactful step for size and speed on edge devices.
Export: Convert to ONNX, TFLite, or other suitable formats.
Test on Target: Thoroughly test on target browsers/devices for functionality, speed, and memory usage.
Best Practices:
Evaluate at Each Step: Measure accuracy, model size, and ideally inference speed after each optimization technique. There’s usually a trade-off.
Start Simple: Begin with simpler techniques like dynamic quantization before moving to more complex ones like QAT or extensive pruning.
Calibration Data: For static quantization, use a representative calibration dataset.
Hardware Awareness: The best quantization/export format might depend on the target hardware (e.g., specific mobile SoCs, browser WASM capabilities).
Iterate: Don’t expect perfection on the first try. Optimization is an iterative process of tweaking and re-evaluating.
This tutorial provides a high-level roadmap. Each step involves detailed considerations and code. Refer to the official documentation of Hugging Face (Transformers, Optimum, Tokenizers), PyTorch, TensorFlow, ONNX Runtime, and GCP for specific implementation details.