Optimizing LLMs for Edge Devices: A GCP & Hugging Face Tutorial
Ashwin Kashyap
This tutorial provides a guide on using Google Cloud Platform (GCP) and Hugging Face Transformers to fine-tune a small Large Language Model (LLM), and then apply techniques like distillation, quantization, and pruning to make it suitable for resource-constrained environments such as mobile phones and web browsers (via JavaScript).
1. Introduction to Google Cloud Platform (GCP)
What is GCP? Google Cloud Platform is a suite of cloud computing services offered by Google. It provides a wide array of services in areas like computing, storage, databases, networking, machine learning, and more, running on the same infrastructure that Google uses to run its end-user products (like Google Search, Gmail, and YouTube).
Key GCP Services for ML: For this workflow, the following GCP services are particularly relevant:
Vertex AI: A unified MLOps platform to build, deploy, and manage ML models.
Vertex AI Workbench: Jupyter notebook-based development environment for ML.
Vertex AI Training: For running custom training jobs at scale, with support for GPUs and TPUs.
Vertex AI Prediction: For deploying models and serving predictions.
Compute Engine: Provides virtual machines (VMs) that can be configured with various CPU, GPU, and memory options. Useful for development and custom training setups.
Cloud Storage: Scalable and durable object storage for your datasets, model checkpoints, and other artifacts.
Artifact Registry: A service to store and manage container images and language packages (like Docker images for your training environment).
Why use GCP for this workflow?
Scalability: Easily scale your compute resources up or down based on the demands of training and experimentation.
Specialized Hardware: Access to powerful GPUs (NVIDIA A100, T4, V100) and TPUs, which can significantly accelerate model training.
Managed Services: Vertex AI simplifies many MLOps tasks, allowing you to focus more on model development.
Integration: Seamless integration between various GCP services (e.g., loading data from Cloud Storage into a Vertex AI training job).
2. Setting Up Your Environment on GCP
Create a GCP Project: If you don’t have one, create a new project in the Google Cloud Console.
Enable APIs: Ensure that the following APIs are enabled for your project:
Compute Engine API
Vertex AI API
Cloud Storage API
Artifact Registry API
Set up Authentication & gcloud CLI:
Install the Google Cloud SDK (which includes the gcloud command-line tool).
Authenticate: gcloud auth login and gcloud auth application-default login.
Configure your project: gcloud config set project YOUR_PROJECT_ID.
Choose your Development Environment:
Vertex AI Workbench:
Navigate to Vertex AI in the Cloud Console.
Go to “Workbench” and create a new “Managed notebooks” or “User-managed notebooks” instance.
Choose an instance type with appropriate CPU, RAM, and optionally a GPU.
Select a suitable environment (e.g., PyTorch, TensorFlow) or customize it.
Compute Engine VM:
Navigate to Compute Engine.
Create a new VM instance.
Select a machine type (e.g., n1-standard-4 or a GPU-enabled instance).
Choose a boot disk image (e.g., “Deep Learning on Linux”).
SSH into your VM and install the necessary libraries, for example:
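A typical setup on a fresh Deep Learning VM might look like the following sketch; the library list is illustrative, and you should pin versions to match your framework and CUDA installation:

```bash
# Illustrative environment setup; the Deep Learning image already ships with
# PyTorch/TensorFlow, so only add what your workflow needs.
pip install --upgrade transformers datasets accelerate evaluate
pip install "optimum[onnxruntime]"
```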
3. Fine-Tuning a Small LLM with Hugging Face Transformers
Choosing a Small Base Model: Start with a pre-trained model that is already relatively small. Examples:
BERT-based: distilbert-base-uncased, google/mobilebert-uncased, prajjwal1/bert-tiny
GPT-2-based: gpt2 (the smallest version), or distilled variants if available.
Other architectures: Models specifically designed for efficiency.
Preparing Your Dataset:
Your dataset should be formatted appropriately for your task (e.g., text classification, question answering).
Use the Hugging Face datasets library to load and preprocess your data.
Upload your dataset to a Cloud Storage bucket for easy access from your GCP environment.
The Fine-Tuning Script: Use the Hugging Face Trainer API for a high-level training loop, or write a custom PyTorch/TensorFlow training loop.
Example Conceptual Structure (using Trainer):
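A minimal sketch, assuming a binary text-classification task; the dataset (imdb), hyperparameters, and output paths below are placeholders rather than recommendations:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load and tokenize the data (swap in your own dataset, e.g. one staged in Cloud Storage).
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)

trainer.train()
trainer.evaluate()

# Save the fine-tuned model and tokenizer for the later optimization steps.
trainer.save_model("./finetuned-model")
tokenizer.save_pretrained("./finetuned-model")
```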
Leveraging GCP for Training:
Vertex AI Training: Package your training script into a Docker container and submit it as a custom training job. This allows you to specify machine types (including GPUs/TPUs) and run training without managing infrastructure directly.
Compute Engine with GPU: If using a VM, ensure you’ve selected an instance with a GPU and have the necessary NVIDIA drivers and CUDA toolkit installed. The accelerate library from Hugging Face can help simplify distributed training.
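As a sketch of the Vertex AI path, assuming you have already pushed a training image to Artifact Registry; the project ID, region, bucket, image URI, and machine/accelerator choices below are placeholders:

```python
# Submitting the containerized training script as a Vertex AI custom job
# via the Python SDK (google-cloud-aiplatform).
from google.cloud import aiplatform

aiplatform.init(
    project="YOUR_PROJECT_ID",
    location="us-central1",
    staging_bucket="gs://your-bucket",
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="llm-finetune",
    # Docker image containing your fine-tuning script, stored in Artifact Registry.
    container_uri="us-docker.pkg.dev/YOUR_PROJECT_ID/llm/train:latest",
)

job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```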
4. Knowledge Distillation
Concept: Knowledge distillation is a model compression technique where a smaller “student” model is trained to mimic the behavior of a larger, pre-trained “teacher” model. The student learns from the teacher’s soft labels (probabilities for each class) or intermediate representations, in addition to the true labels.
How it works with Hugging Face:
You’ll need a fine-tuned teacher model (can be a larger, more accurate model) and a smaller student model architecture.
The loss function is typically a combination of:
Standard cross-entropy loss with the ground truth labels.
A distillation loss (e.g., Kullback-Leibler divergence) between the student’s and teacher’s output probability distributions.
Hugging Face doesn’t have a universal KnowledgeDistillationTrainer for all tasks, but you can implement it by:
Subclassing the Trainer and overriding compute_loss.
Writing a custom training loop.
Example Conceptual compute_loss for Distillation (PyTorch):
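A minimal sketch of the subclassing approach for a sequence-classification task; the class name, temperature, and alpha weighting are illustrative choices, not part of the Transformers API:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, temperature=2.0, alpha=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumes the teacher has already been moved to the same device as the student.
        self.teacher = teacher_model.eval()
        self.temperature = temperature
        self.alpha = alpha

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        student_logits = outputs.logits

        with torch.no_grad():
            teacher_logits = self.teacher(**inputs).logits

        # Soft-target loss: KL divergence between temperature-scaled distributions.
        distill_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction="batchmean",
        ) * (self.temperature ** 2)

        # Hard-target loss: standard cross-entropy with the ground-truth labels.
        ce_loss = F.cross_entropy(student_logits, labels)

        loss = self.alpha * distill_loss + (1.0 - self.alpha) * ce_loss
        return (loss, outputs) if return_outputs else loss
```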
Benefits:
Can significantly reduce model size while retaining a good portion of the teacher’s performance.
Often leads to faster inference.
5. Quantization
Concept: Quantization involves reducing the numerical precision of the model’s weights and/or activations (e.g., from 32-bit floating-point (FP32) to 8-bit integer (INT8) or FP16).
Benefits:
Smaller model size: INT8 models are roughly 4x smaller than FP32.
Faster inference: Integer operations are often faster on CPUs and specialized hardware (like mobile NPUs).
Reduced power consumption.
Tools & Techniques:
Hugging Face Optimum: A library that extends transformers and provides optimization tools, including quantization through backends like ONNX Runtime and Intel Neural Compressor.
Example Post-Training Dynamic Quantization with Optimum & ONNX Runtime:
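A sketch using Optimum’s ONNX Runtime backend, assuming a fine-tuned sequence-classification model; the paths and the AVX-512 VNNI configuration are illustrative and should be matched to your target CPU:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "./finetuned-model"  # placeholder path to your fine-tuned model

# Export the fine-tuned model to ONNX.
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
ort_model.save_pretrained("./onnx_model")

# Apply post-training dynamic INT8 quantization.
quantizer = ORTQuantizer.from_pretrained(ort_model)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./onnx_model_quantized", quantization_config=dqconfig)
```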
Post-Training Static Quantization (PTSQ): Requires a calibration dataset to determine scaling factors. Generally more accurate than dynamic quantization. Optimum supports this with ONNX Runtime and Neural Compressor.
Quantization-Aware Training (QAT): Simulates quantization effects during the fine-tuning process. Can lead to better accuracy but is more complex to implement. PyTorch and TensorFlow have QAT utilities.
PyTorch: torch.quantization module.
TensorFlow: tfmot.quantization.keras (TensorFlow Model Optimization Toolkit).
6. Pruning
Concept: Pruning involves removing redundant or less important weights, neurons, or even larger structures (like attention heads or layers) from the model.
Benefits:
Reduced model size.
Potentially faster inference (especially with structured pruning).
Types & Tools:
Unstructured Pruning (Weight Pruning): Individual weights are set to zero. Can lead to sparse models. Requires specialized hardware or libraries for speedup.
Structured Pruning: Entire neurons, channels, or attention heads are removed. This often results in a smaller, dense model that can run faster on standard hardware.
Hugging Face Optimum: Can leverage neural-compressor for some pruning techniques.
PyTorch: torch.nn.utils.prune module for implementing various pruning techniques (magnitude, L1 unstructured, etc.).
TensorFlow Model Optimization Toolkit (TF MOT): tfmot.sparsity.keras provides tools for pruning during or after training.
Example Conceptual Pruning (PyTorch - Magnitude Pruning):
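A minimal sketch that applies L1 (magnitude) unstructured pruning to every linear layer of a fine-tuned model; the 30% sparsity target and model path are illustrative:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./finetuned-model")

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% of weights with the smallest absolute value.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Make the pruning permanent by folding the mask into the weight tensor.
        prune.remove(module, "weight")

model.save_pretrained("./pruned-model")
```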
After pruning, the model usually needs to be fine-tuned again for a few epochs to recover any lost accuracy; alternating pruning and retraining in this way is known as iterative pruning.
7. Exporting and Running on Client Devices
After applying these optimization techniques, you need to export the model to a format suitable for client-side inference.
Common Formats:
ONNX (Open Neural Network Exchange): An open format for ML models. Widely supported and can be run with ONNX Runtime.
TensorFlow Lite (TFLite): Optimized for mobile and edge devices. Offers good performance and small binary size.
A. For In-Browser (JavaScript):
ONNX Runtime Web:
Convert your PyTorch/TF model to ONNX (as shown in the quantization section).
The quantized ONNX model is ideal here.
Use the onnxruntime-web library in your JavaScript project.
Input preparation (tokenization) needs to be replicated in JavaScript. You can use a JS tokenizer compatible with your Hugging Face tokenizer or implement one yourself. Libraries like tokenizers.js (from Hugging Face) or simpler custom tokenizers can be used.
TensorFlow.js (TFJS):
Convert your TensorFlow model to TFJS format, or TFLite model to TFJS.
Use the tfjs-tflite package to run TFLite models.
Transformers.js (by Xenova/Hugging Face):
This library allows you to run many Hugging Face Transformers models directly in the browser. It handles model conversion (to ONNX) and tokenization automatically for supported models.
It’s the easiest way if your model architecture and tokenizer are supported.
B. For Mobile Phones (Android/iOS):
TensorFlow Lite (TFLite):
Convert your model (PyTorch/TF -> ONNX -> TFLite or TF -> TFLite).
Integrate the .tflite model into your Android (Java/Kotlin) or iOS (Swift/Objective-C) app using the TensorFlow Lite SDK.
Handle tokenization natively or use pre-built libraries.
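As a sketch of the TF -> TFLite step, assuming your fine-tuned model is available as a TensorFlow SavedModel (the paths below are placeholders; a PyTorch model would first need to go through ONNX or a TF re-export):

```python
import tensorflow as tf

# Convert a SavedModel to TFLite with default (e.g., dynamic-range) optimizations.
converter = tf.lite.TFLiteConverter.from_saved_model("./tf_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```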
ONNX Runtime Mobile:
Use the quantized ONNX model.
ONNX Runtime has prebuilt packages for Android and iOS.
Supports various hardware accelerators (NNAPI for Android, Core ML for iOS).
Core ML (iOS):
Convert your model to Core ML format using coremltools.
Integrate it into your iOS app.
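A hedged sketch of the coremltools path, assuming a PyTorch sequence-classification model traced with fixed-shape inputs; the input names, sequence length, and file paths are placeholders:

```python
import coremltools as ct
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "./finetuned-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torchscript=True makes the model return tuples, which torch.jit.trace expects.
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True).eval()

# Trace the model with example inputs at a fixed sequence length.
example = tokenizer("example input", return_tensors="pt", padding="max_length", max_length=128)
traced = torch.jit.trace(model, (example["input_ids"], example["attention_mask"]))

# Convert the traced model to a Core ML program.
mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32),
        ct.TensorType(name="attention_mask", shape=(1, 128), dtype=np.int32),
    ],
)
mlmodel.save("Model.mlpackage")
```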
Considerations for Client-Side Deployment:
Tokenization: This is a critical step. The exact same tokenization logic used during training must be replicated on the client. This can be challenging in JavaScript or mobile native code.
Consider using SentencePiece or byte-pair encoding (BPE) tokenizers that have JavaScript/native implementations.
The Hugging Face tokenizers library has a Rust core with bindings for Node.js, and it can be compiled to WASM for the browser.
Model Size: Even after optimization, ensure the model is small enough for reasonable download times and memory usage.
Inference Speed: Test on target devices.
UI/UX: Provide feedback to the user during model loading and inference.
8. Workflow Summary and Best Practices
The process is often iterative:
Baseline: Fine-tune a small LLM on GCP. Evaluate its performance and size.
Distill (Optional but Recommended): If you have a larger, more accurate teacher model, distill knowledge into your smaller student model. Evaluate.
Prune (Optional): Apply pruning techniques. Evaluate. May require some retraining.
Quantize: Apply post-training quantization (dynamic or static) or QAT. Evaluate accuracy and performance. This is often the most impactful step for size and speed on edge devices.
Export: Convert to ONNX, TFLite, or other suitable formats.
Test on Target: Thoroughly test on target browsers/devices for functionality, speed, and memory usage.
Best Practices:
Evaluate at Each Step: Measure accuracy, model size, and ideally inference speed after each optimization technique. There’s usually a trade-off.
Start Simple: Begin with simpler techniques like dynamic quantization before moving to more complex ones like QAT or extensive pruning.
Calibration Data: For static quantization, use a representative calibration dataset.
Hardware Awareness: The best quantization/export format might depend on the target hardware (e.g., specific mobile SoCs, browser WASM capabilities).
Iterate: Don’t expect perfection on the first try. Optimization is an iterative process of tweaking and re-evaluating.
This tutorial provides a high-level roadmap. Each step involves detailed considerations and code. Refer to the official documentation of Hugging Face (Transformers, Optimum, Tokenizers), PyTorch, TensorFlow, ONNX Runtime, and GCP for specific implementation details.