Implementing Zero-Trust Multi-Cloud: A Complete WIF Setup Guide

← Back to Articles

In Part 2A, I showed you why Workload Identity Federation (WIF) is the Zero-Trust solution for multi-cloud MLOps. We covered the business case, the security benefits, the cost savings.

Now let's build it.

I've implemented this pattern multiple times and each time, I hit the same gotchas, solved the same "Permission Denied" errors, and learned new tricks.

This is the implementation guide I wish I'd had for my first setup. It would have saved me 10 hours of debugging. We are going to build a production-ready, keyless authentication pipeline from AWS to Vertex AI, adaptable for Azure and GitHub Actions. We will cover:

The Core Setup: A complete AWS → Vertex AI implementation in 15 minutes.
The Enterprise Pattern: Moving multi-TB datasets using Storage Transfer Service (STS) without proxying data.
The Compliance Pattern: Azure to Vertex AI for HIPAA-regulated workloads.
Troubleshooting: How to debug the dreaded "Permission Denied" errors.

If you just want to test WIF works, skip to Phase 3 and use the Standard Pattern with a small test file. Come back for the enterprise patterns later.

WIF Architecture: Understanding the Core Components

Before we run Terraform, let's visualise the components. When I implemented this for the first time, I made the mistake of jumping straight into creating resources. I ended up with a mess of permissions that took 3 hours to untangle.

Here's the mental model that finally clicked for me:

The 8-Step Trust Dance

AWS Identity Token: Your EC2 instance requests its identity from AWS STS
Token Retrieval: AWS issues a signed token proving "I am IAM role X in account Y"
Token Exchange: Your app presents this AWS token to GCP's Workload Identity Pool
Provider Validation: The AWS Provider validates the token signature
Attribute Checking: Conditions verify the role ARN matches your allowed list
Service Account Impersonation: If checks pass, GCP issues a short-lived token for your service account
Resource Access: Your code uses this token to call Vertex AI APIs
Data Operations: Vertex AI accesses Cloud Storage with the service account's permissions

Key Components

Workload Identity Pool: The container for external identities (e.g., aws-prod-pool).
Pool Provider: The "bouncer" that verifies tokens (e.g., checks if the AWS token is valid).
GCP Service Account: The identity your external workload impersonates after passing the bouncer.
Attribute Conditions: The fine-grained rules (e.g., "Only allow access if the AWS role is ml-training-role").

Notice how zero static credentials flow through this system. Every token is short-lived (1 hour max), and if your AWS role is compromised, you just remove it from the attribute conditions.

Pattern 1: AWS to Vertex AI Authentication (15-Minute Setup)

Scenario: You have training data in AWS S3, model training on Vertex AI. This is the most frequently used pattern implementation.

Time to implement: 15-20 minutes for first-time setup, 10 minutes once you know the steps.

We'll do this in three phases:

GCP Configuration - Tell Google who to trust (5 min)
AWS Configuration - Give AWS permission to request tokens (5 min)
Application Code - Use the credentials seamlessly (5 min)

Let's start with the GCP side.

Phase 1: GCP Configuration (The Trust Side)

First, we tell Google Cloud to trust specific AWS identities.

1. Create the Workload Identity Pool

This acts as a namespace for your external identities.

Bash - Create Workload Identity Pool

export PROJECT_ID="your-ml-project"
export POOL_ID="aws-prod-pool"
export LOCATION="global"

gcloud iam workload-identity-pools create $POOL_ID \
  --project=$PROJECT_ID \
  --location=$LOCATION \
  --display-name="AWS Production ML Workloads" \
  --description="Federated access for AWS-based ML pipelines"

2. Create the AWS Provider

This links your specific AWS Account ID to the pool.

Bash - Create AWS Provider

export PROVIDER_ID="aws-provider"
export AWS_ACCOUNT_ID="123456789012"  # Replace with your AWS Account ID

gcloud iam workload-identity-pools providers create-aws $PROVIDER_ID \
  --project=$PROJECT_ID \
  --location=$LOCATION \
  --workload-identity-pool=$POOL_ID \
  --account-id=$AWS_ACCOUNT_ID

3. Create the Service Account & Grant Permissions

This is the identity your AWS workload will "become."

Bash - Create Service Account

export SA_NAME="vertex-training-sa"

# Create the Service Account
gcloud iam service-accounts create $SA_NAME \
  --project=$PROJECT_ID \
  --display-name="Vertex AI Training Agent"

# Grant it access to Vertex AI and GCS
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

4. Bind the Trust (The Critical Step)

This is where we say: "Allow the ml-training-role from the AWS Account 123... to impersonate this Service Account."

Bash - Bind the Trust

# Get the full pool resource name
POOL_RESOURCE_NAME=$(gcloud iam workload-identity-pools describe $POOL_ID \
  --project=$PROJECT_ID --location=$LOCATION --format="value(name)")

# Allow impersonation ONLY from a specific AWS Role
gcloud iam service-accounts add-iam-policy-binding \
  "$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/$POOL_RESOURCE_NAME/attribute.aws_role/arn:aws:iam::$AWS_ACCOUNT_ID:role/ml-training-role"

Pro Tip: Being specific with attribute.aws_role here is what makes this Zero-Trust. Never use a wildcard (*) in production.

Phase 2: AWS Configuration (The Client Side)

Now, we configure AWS to provide the credentials.

1. Create the IAM Role

This role will be assumed by your EC2 instance or EKS pod. It needs a trust policy that allows it to talk to Google.

Important: This trust policy goes in your AWS IAM role definition, NOT in GCP. This is what allows AWS to issue tokens that GCP will accept.

trust-policy.json:

JSON - AWS Trust Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "accounts.google.com:aud": "http://iam.googleapis.com/projects/${GCP_PROJECT_NUMBER}/locations/global/workloadIdentityPools/aws-prod-pool/providers/aws-provider"
        }
      }
    }
  ]
}

2. Configure the Client Library

Google's client libraries need a configuration file to know how to perform the exchange. You generate this once and bake it into your application image or mount it as a Kubernetes secret.

Bash - Configure Client Library

gcloud iam workload-identity-pools create-cred-config \
  projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$POOL_ID/providers/$PROVIDER_ID \
  --service-account="$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --aws \
  --output-file="credential-config.json"

Phase 3: The Vertex AI Pipeline Code

This is where the magic happens. We will look at two patterns: the Standard Pattern for small data, and the Enterprise Pattern for massive datasets.

Common Setup

In your Python script running on AWS, simply point to the config file.

Python - Common Setup

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/credential-config.json'

Option A: The "Standard" Pattern (Small Data)

For metadata or small files, you can download to the AWS worker and re-upload to GCS.

Python - Standard Pattern

def fetch_data_standard(s3_bucket, gcs_bucket):
    import boto3
    from google.cloud import storage

    # 1. Get data from S3 (using AWS native credentials)
    s3 = boto3.client('s3')
    s3.download_file(s3_bucket, 'data.csv', '/tmp/data.csv')

    # 2. Upload to GCS (using WIF credentials automatically)
    storage_client = storage.Client()
    bucket = storage_client.bucket(gcs_bucket)
    blob = bucket.blob('training/data.csv')
    blob.upload_from_filename('/tmp/data.csv')
    
    return f"gs://{gcs_bucket}/training/data.csv"

Option B: The "Enterprise" Pattern (Multi-TB Data)

If you are moving 2TB of training data, do not proxy it through your worker node. It's slow, expensive, and fragile.

Let me show you why with a visual comparison:

Enterprise Approach: Direct STS Transfer

Naive Approach: Proxy Through Worker

The Cost of Getting This Wrong

Naive Approach (Proxy Through Worker):

Download 2TB from S3 to EC2: 2 hours
Upload 2TB from EC2 to GCS: 3 hours
Total time: 5 hours
AWS egress cost: $180 (S3 → Internet → GCS)
Single point of failure: Your worker node
Network instability: Any disconnect = start over

Enterprise Approach (Storage Transfer Service):

Issue STS transfer command via WIF: 30 seconds
GCP transfers directly with parallel streams: 1.5 hours
Total time: 1.5 hours
AWS egress cost: $90 (S3 → GCP direct peering)
No worker dependency: STS handles retries automatically
Bandwidth optimisation: Google's infrastructure, not your EC2 instance

The Architectural Insight

With STS, your worker node becomes a lightweight orchestrator instead of a data proxy. You use WIF credentials to command GCP to fetch the data, then step aside while the platforms handle the heavy lifting.

This is the architect-level distinction: Knowing when to move data yourself, and when to orchestrate the platform to do it for you.

Python - Enterprise Pattern

def trigger_enterprise_transfer(s3_bucket, gcs_bucket):
    """
    Triggers a server-to-server transfer from AWS S3 to GCS.
    Zero data flows through this script.
    """
    from google.cloud import storage_transfer

    client = storage_transfer.StorageTransferServiceClient()

    transfer_job = {
        "description": "Enterprise Transfer via WIF",
        "project_id": "your-ml-project",
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": s3_bucket,
                # STS requires a federated role ARN on the AWS side
                # This is a SEPARATE AWS role specifically for STS
                # It needs s3:GetObject permissions on source bucket
                # And should be listed in the GCP WIF pool's allowed principals
                "role_arn": "arn:aws:iam::123456789012:role/sts-transfer-role"
            },
            "gcs_data_sink": {"bucket_name": gcs_bucket},
        },
        "status": "ENABLED"
    }

    result = client.create_transfer_job({"transfer_job": transfer_job})
    print(f"✓ STS Job Started: {result.name}")
    return result.name

Cost Impact: Using STS instead of proxying saved us $90 in egress costs on a single 2TB transfer. For weekly transfers, that's $4,680/year.

Implementation Pattern 2: Azure to Vertex AI (Healthcare)

Standard WIF is great. But what about regulated industries?

Let me show you how WIF adapts for healthcare compliance. This is the pattern implemented for a diagnostics startup that needed:

Patient data must stay in Azure (existing HIPAA-compliant environment)
Training must happen on Vertex AI (AutoML capabilities)
Zero PHI (Protected Health Information) can touch public internet
Full audit trail required for compliance

HIPAA Note: CMEK is required for HIPAA workloads on GCP. Combined with WIF, VPC Service Controls, and audit logging, this creates a compliant ML environment. Always consult your compliance team for your specific requirements.

Healthcare compliance adds three layers beyond basic WIF:

1. Data Residency

All patient data stayed in Azure Blob Storage in their compliant region. Only anonymised features crossed the cloud boundary.

2. Network Isolation

We used Cloud Interconnect + VPC Service Controls to ensure data never touched public internet. In retrospect, this was the most complex part of the implementation—it took 2 days to get the routing right.

3. Cryptographic Trail

Every data access had to be auditable. WIF gave us this automatically—each token exchange generated an audit log we could present to regulators.

The Azure Difference

Azure doesn't use the AWS-style federation. It uses OIDC (OpenID Connect).

GCP Side: Create an OIDC Provider

Bash - Create OIDC Provider

gcloud iam workload-identity-pools providers create-oidc "azure-provider" \
  --workload-identity-pool="azure-health-pool" \
  --issuer-uri="https://sts.windows.net/YOUR_TENANT_ID/" \
  --allowed-audiences="api://AzureADTokenExchange"

Azure Side: Enable Managed Identity

Enable Managed Identity on your VM/AKS cluster.

Compliance Hardening: CMEK

CMEK (Customer Managed Encryption Keys): All ML artifacts encrypted with your own keys for full cryptographic control.

First, create your encryption key:

Bash - Create KMS Key

# Create KMS key ring (one-time setup)
gcloud kms keyrings create healthcare-ml-keyring \
  --location=us-central1 \
  --project=$PROJECT_ID

# Create the encryption key
gcloud kms keys create vertex-artifacts-key \
  --keyring=healthcare-ml-keyring \
  --location=us-central1 \
  --purpose=encryption \
  --rotation-period=90d \
  --next-rotation-time=$(date -u -d "+90 days" +%Y-%m-%dT%H:%M:%SZ)

# Grant Vertex AI permission to use this key
gcloud kms keys add-iam-policy-binding vertex-artifacts-key \
  --keyring=healthcare-ml-keyring \
  --location=us-central1 \
  --member="serviceAccount:service-$PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"

Then reference it in your Vertex AI training jobs:

Python - CMEK Configuration

from google.cloud import aiplatform

# Initialize with CMEK configuration
aiplatform.init(
    project='your-healthcare-project',
    location='us-central1',
    encryption_spec_key_name='projects/your-healthcare-project/locations/us-central1/keyRings/healthcare-ml-keyring/cryptoKeys/vertex-artifacts-key'
)

# Training job with CMEK
job = aiplatform.CustomTrainingJob(
    display_name='hipaa-compliant-training',
    container_uri='gcr.io/your-project/training-image:latest',
    model_serving_container_image_uri='gcr.io/your-project/serving-image:latest',
    # This ensures ALL artifacts (model, logs, metadata) use CMEK
    encryption_spec_key_name='projects/your-healthcare-project/locations/us-central1/keyRings/healthcare-ml-keyring/cryptoKeys/vertex-artifacts-key'
)

model = job.run(
    dataset=dataset,
    model_display_name='diagnostic-model-v1',
    # Model registry artifacts also encrypted
    encryption_spec_key_name='projects/your-healthcare-project/locations/us-central1/keyRings/healthcare-ml-keyring/cryptoKeys/vertex-artifacts-key'
)

# Deploy endpoint with CMEK
endpoint = model.deploy(
    deployed_model_display_name='diagnostic-endpoint',
    machine_type='n1-standard-4',
    # Even prediction endpoint artifacts encrypted
    encryption_spec_key_name='projects/your-healthcare-project/locations/us-central1/keyRings/healthcare-ml-keyring/cryptoKeys/vertex-artifacts-key'
)

Why CMEK Matters for Healthcare

Here's how the encryption layers protect your ML pipeline:

CMEK Encryption Layers for Healthcare ML

The Compliance Value Chain

For our healthcare client, this architecture meant:

Cryptographic Control: The customer owns the keys in their KMS, not Google. If they need to revoke access, they disable the key, and all encrypted data becomes immediately inaccessible.
Automated Rotation: 90-day key rotation satisfies HIPAA requirements without manual intervention.
Complete Audit Trail: Every time Vertex AI uses the encryption key, it's logged in Cloud Audit Logs with:
- Which service account requested access
- What artifact was encrypted/decrypted
- Timestamp and source IP
- Success or failure status
Instant Revocation: If a compliance issue arises, disabling the KMS key instantly cuts off all access to training data, models, and predictions.

What Gets Encrypted

✓ Training datasets and preprocessing artifacts
✓ Model binaries and checkpoints
✓ Prediction logs and metadata
✓ Explainability reports and feature attributions

The pattern above ensures that even Google engineers cannot access your ML artifacts without your explicit key permissions.

Pro Tip: Use separate keys for different sensitivity levels. We used one key for anonymised features, another for the diagnostic models containing derived PHI.

VPC Service Controls

Wrap your GCP project in a perimeter that blocks all internet egress, allowing traffic only from your trusted Azure range via Private/Interconnect.

Common WIF Errors: Debugging Permission Denied and Token Issues

You've built it. Now let's make sure it works.

I've debugged WIF implementations dozens of times. Here are the three errors you'll almost certainly encounter, and how to fix them in minutes.

Problem 1: "Permission Denied" on Token Exchange

Symptom: Error: Permission 'iam.serviceAccounts.getAccessToken' denied
The Fix: You likely forgot the Service Account impersonation binding. Check that your AWS role is exactly matching the attribute condition.

Bash - Debug Command

# Debug Command
gcloud iam service-accounts get-iam-policy $SA_EMAIL

Look for roles/iam.workloadIdentityUser.

Problem 2: "Invalid Token"

Symptom: The exchange fails claiming the AWS token is invalid. The error message is completely unhelpful: "Invalid token format."
The Fix: Check IMDSv2. AWS now defaults to requiring a "hop limit" of 2 for containerized workloads. If your hop limit is 1, the token cannot be retrieved inside the container.

Bash - Fix IMDSv2

# Fix on EC2
aws ec2 modify-instance-metadata-options --http-put-response-hop-limit 2 --instance-id ${INSTANCE_ID}

By the third implementation, I just automated this check in Terraform modules.

Problem 3: "Security Token Service API Disabled"

Symptom: A generic 403 error.
The Fix: It sounds obvious, but ensure the sts.googleapis.com and iamcredentials.googleapis.com APIs are enabled in your GCP project.

Conclusion: Start Small, Scale Securely

You now have the blueprint to eliminate static keys from your infrastructure.

Your Next Steps

Week 1: Implement the AWS setup in a sandbox environment.
Week 2: Migrate one non-critical pipeline (e.g., a daily batch job).
Week 3: Enable "Attribute Conditions" to lock down access to specific roles.

Get the Code

I've published the complete Terraform modules and Python scripts for both the AWS and Azure patterns in my GitHub repository. Star it, fork it, and use it as your template.

GitHub Repo: wif-mlops-patterns

What's Next: From Security to Economics

You now have the security foundation in place—zero static credentials, cryptographic trust, and automated compliance. Your multi-cloud MLOps pipeline is secure.

But security is just one dimension of production readiness. The other critical question: What does this infrastructure actually cost?

In Part 3: Cloud Composer vs. Vertex AI Pipelines, we discussed how to choose the right orchestration layer for your ML workflows. Now it's time to make it cost-efficient.

In Part 4, we'll tackle the economics: Cost-optimised MLOps: Reducing Infrastructure Spend by 80%. I'll show you:

How I reduced a $3,200/month ML infrastructure bill to $600/month
The service isolation strategy that unlocked these savings
Preemptible instances, auto-scaling, and scheduling tactics with real ROI numbers
A cost optimisation checklist you can apply to your pipelines this week

Because secure pipelines that bankrupt your team aren't sustainable.

Series Navigation

Part 1: The Hidden Cost of Data Chaos in ML Projects
Part 2A: Why static credentials are catastrophic + WIF strategy
Part 2B: Complete AWS + Azure implementation (you are here)
Part 3: Orchestration decisions - Composer vs Kubeflow
Part 4: Cost-optimised MLOps: Reducing Infrastructure Spend by 80% (coming soon)

This article is part of the "Enterprise MLOps on GCP" series. Follow me on Medium and LinkedIn for Part 4 and the rest of the series.

Also published on Medium - Join the discussion in the comments!