Enterprise MLOps on GCP

Implementing Zero-Trust Multi-Cloud: A Complete WIF Setup Guide

Step-by-step guide to implement Workload Identity Federation for AWS→Vertex AI and Azure→Vertex AI.

November 2025

In Part 2A, I showed you why Workload Identity Federation (WIF) is the Zero-Trust solution for multi-cloud MLOps. We covered the business case, the security benefits, the cost savings.

Now let's build it.

I've implemented this pattern multiple times, and each time I hit the same gotchas, solved the same "Permission Denied" errors, and learned new tricks.

This is the implementation guide I wish I'd had for my first setup. It would have saved me 10 hours of debugging. We are going to build a production-ready, keyless authentication pipeline from AWS to Vertex AI, adaptable for Azure and GitHub Actions. We will cover:

  1. The Core Setup: A complete AWS → Vertex AI implementation in 15 minutes.
  2. The Enterprise Pattern: Moving multi-TB datasets using Storage Transfer Service (STS) without proxying data.
  3. The Compliance Pattern: Azure to Vertex AI for HIPAA-regulated workloads.
  4. Troubleshooting: How to debug the dreaded "Permission Denied" errors.

If you just want to test that WIF works, complete Phases 1 and 2 of Pattern 1, then use the Standard Pattern in Phase 3 with a small test file. Come back for the enterprise patterns later.

WIF Architecture: Understanding the Core Components

Before we create any resources, let's visualise the components. When I implemented this for the first time, I made the mistake of jumping straight into creating resources. I ended up with a mess of permissions that took 3 hours to untangle.

Here's the mental model that finally clicked for me:

[Figure: WIF Trust Flow Architecture]

The 8-Step Trust Dance

  1. AWS Credentials: Your EC2 instance fetches temporary credentials for its IAM role from the instance metadata service
  2. Signed Identity Request: The client library signs an AWS STS GetCallerIdentity request with those credentials; this signed request acts as the token proving "I am IAM role X in account Y"
  3. Token Exchange: Your app presents this signed request to Google's Security Token Service, naming your Workload Identity Pool provider as the audience
  4. Provider Validation: Google replays the request against AWS STS to confirm the caller's identity and account
  5. Attribute Checking: Attribute mappings and conditions verify that the role matches your allowed list
  6. Service Account Impersonation: If checks pass, GCP issues a short-lived token for your service account
  7. Resource Access: Your code uses this token to call Vertex AI APIs
  8. Data Operations: Vertex AI accesses Cloud Storage with the service account's permissions
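
Steps 3 to 6 boil down to one HTTP call to Google's Security Token Service, followed by a service account impersonation call. As a rough sketch of the first call, the helper below only builds the request body. The function name and sample values are illustrative, and in practice Google's client libraries construct and send this for you.

```python
import json
import urllib.parse

STS_TOKEN_URL = "https://sts.googleapis.com/v1/token"


def build_token_exchange_body(project_number, pool_id, provider_id, signed_request):
    """Build the JSON body for the WIF token exchange.

    `signed_request` is the signed AWS GetCallerIdentity request as a dict;
    producing it (SigV4 signing) is out of scope for this sketch.
    """
    audience = (
        f"//iam.googleapis.com/projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )
    return {
        "audience": audience,
        "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
        "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "subjectTokenType": "urn:ietf:params:aws:token-type:aws4_request",
        # The AWS subject token is the URL-encoded JSON of the signed request
        "subjectToken": urllib.parse.quote(json.dumps(signed_request)),
    }


body = build_token_exchange_body(
    "123456789",  # placeholder project number
    "aws-prod-pool",
    "aws-provider",
    {
        "url": "https://sts.amazonaws.com?Action=GetCallerIdentity&Version=2011-06-15",
        "method": "POST",
        "headers": [],  # the real request carries the SigV4 headers here
    },
)
print(body["audience"])
```

The response to this call is a federated token; exchanging it for a service account token is the separate impersonation call in step 6.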

Key Components

Notice how zero static credentials flow through this system. Every token is short-lived (1 hour by default), and if your AWS role is compromised, you cut it off by removing its binding on the service account or tightening the attribute conditions.

Pattern 1: AWS to Vertex AI Authentication (15-Minute Setup)

Scenario: Your training data lives in AWS S3 and your model training runs on Vertex AI. This is the most commonly implemented pattern.

Time to implement: 15-20 minutes for first-time setup, 10 minutes once you know the steps.

We'll do this in three phases:

  1. GCP Configuration - Tell Google who to trust (5 min)
  2. AWS Configuration - Give AWS permission to request tokens (5 min)
  3. Application Code - Use the credentials seamlessly (5 min)

Let's start with the GCP side.

Phase 1: GCP Configuration (The Trust Side)

First, we tell Google Cloud to trust specific AWS identities.

1. Create the Workload Identity Pool

This acts as a namespace for your external identities.

Bash - Create Workload Identity Pool
export PROJECT_ID="your-ml-project"
export POOL_ID="aws-prod-pool"
export LOCATION="global"

gcloud iam workload-identity-pools create $POOL_ID \
  --project=$PROJECT_ID \
  --location=$LOCATION \
  --display-name="AWS Production ML Workloads" \
  --description="Federated access for AWS-based ML pipelines"

2. Create the AWS Provider

This links your specific AWS Account ID to the pool.

Bash - Create AWS Provider
export PROVIDER_ID="aws-provider"
export AWS_ACCOUNT_ID="123456789012"  # Replace with your AWS Account ID

gcloud iam workload-identity-pools providers create-aws $PROVIDER_ID \
  --project=$PROJECT_ID \
  --location=$LOCATION \
  --workload-identity-pool=$POOL_ID \
  --account-id=$AWS_ACCOUNT_ID
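
With no explicit attribute mapping, Google applies a default mapping for AWS providers: google.subject becomes the caller's full ARN, and attribute.aws_role is the same ARN with the session name stripped. The sketch below imitates that extraction in plain Python so you can predict the value your IAM bindings must match. It approximates the documented default rather than reproducing Google's implementation, and the function name is illustrative.

```python
def aws_role_attribute(caller_arn):
    """Approximate the default attribute.aws_role mapping for an AWS caller ARN."""
    marker = ":assumed-role/"
    if marker not in caller_arn:
        # IAM users and other principals map to their ARN unchanged
        return caller_arn
    prefix, rest = caller_arn.split(marker, 1)
    role_name = rest.split("/", 1)[0]  # drop the trailing session name
    return f"{prefix}{marker}{role_name}"


# An EC2 instance using ml-training-role appears to STS as an assumed-role session
print(aws_role_attribute(
    "arn:aws:sts::123456789012:assumed-role/ml-training-role/i-0abc123"
))
```

Note that the result is an STS assumed-role ARN, not the IAM role ARN you see in the AWS console.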

3. Create the Service Account & Grant Permissions

This is the identity your AWS workload will "become."

Bash - Create Service Account
export SA_NAME="vertex-training-sa"

# Create the Service Account
gcloud iam service-accounts create $SA_NAME \
  --project=$PROJECT_ID \
  --display-name="Vertex AI Training Agent"

# Grant it access to Vertex AI and GCS
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

4. Bind the Trust (The Critical Step)

This is where we say: "Allow the ml-training-role from the AWS Account 123... to impersonate this Service Account."

Bash - Bind the Trust
# Get the full pool resource name
POOL_RESOURCE_NAME=$(gcloud iam workload-identity-pools describe $POOL_ID \
  --project=$PROJECT_ID --location=$LOCATION --format="value(name)")

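# Allow impersonation ONLY from a specific AWS Role.
# With the default AWS attribute mapping, attribute.aws_role holds the STS
# assumed-role ARN (arn:aws:sts::ACCOUNT:assumed-role/ROLE), not the IAM role ARN.
gcloud iam service-accounts add-iam-policy-binding \
  "$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/$POOL_RESOURCE_NAME/attribute.aws_role/arn:aws:sts::$AWS_ACCOUNT_ID:assumed-role/ml-training-role"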

Pro Tip: Being specific with attribute.aws_role here is what makes this Zero-Trust. Never use a wildcard (*) in production.
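
One way to keep wildcards out of production is to generate the member string in code and fail fast on anything suspicious. This is a minimal sketch, assuming the provider uses the default AWS attribute mapping (where attribute.aws_role is the STS assumed-role ARN). The helper name and its validation rules are illustrative.

```python
import re


def principal_set_for_role(pool_resource_name, aws_account_id, role_name):
    """Build a principalSet member that matches exactly one AWS role."""
    if not re.fullmatch(r"\d{12}", aws_account_id):
        raise ValueError("AWS account IDs are 12 digits")
    if not re.fullmatch(r"[\w+=,.@-]+", role_name):
        # Rejects '*', '/', and anything else outside IAM role-name characters
        raise ValueError(f"Unexpected role name: {role_name!r}")
    role_arn = f"arn:aws:sts::{aws_account_id}:assumed-role/{role_name}"
    return (
        f"principalSet://iam.googleapis.com/{pool_resource_name}"
        f"/attribute.aws_role/{role_arn}"
    )


member = principal_set_for_role(
    "projects/123456789/locations/global/workloadIdentityPools/aws-prod-pool",
    "123456789012",
    "ml-training-role",
)
print(member)
```

Passing "*" as the role name raises a ValueError instead of silently producing a binding that trusts every identity in the pool.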

Phase 2: AWS Configuration (The Client Side)

Now, we configure AWS to provide the credentials.

1. Create the IAM Role

This role will be assumed by your EC2 instance or EKS pod. For AWS → GCP federation it needs no Google-specific trust: Google verifies the caller by checking its signed GetCallerIdentity request with AWS STS, so the role only has to be assumable by your compute.

Important: This trust policy goes in your AWS IAM role definition, NOT in GCP. The example below is for EC2; for EKS pods using IAM Roles for Service Accounts, the principal is your cluster's OIDC provider instead. A policy that federates accounts.google.com with sts:AssumeRoleWithWebIdentity is for the opposite direction (a Google identity assuming an AWS role), which is what the Storage Transfer Service role in Option B relies on.

trust-policy.json:

JSON - AWS Trust Policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

2. Configure the Client Library

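Google's client libraries need a configuration file to know how to perform the exchange. The file holds no secrets, only the pool, provider, and service account to use. You generate this once and bake it into your application image or mount it into your pods as a Kubernetes ConfigMap or secret.

Bash - Configure Client Library
# The provider path uses the project NUMBER, not the project ID
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")

gcloud iam workload-identity-pools create-cred-config \
  projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$POOL_ID/providers/$PROVIDER_ID \
  --service-account="$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --aws \
  --output-file="credential-config.json"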
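
Before baking the file into an image, it can be worth checking that it really is a keyless federation config. The sketch below inspects the JSON using field names as they appear in the files gcloud generates (type, audience, subject_token_type). Treat the exact checks as assumptions to adapt, and the function name as illustrative.

```python
import json

AWS_SUBJECT_TOKEN_TYPE = "urn:ietf:params:aws:token-type:aws4_request"


def check_wif_config(config):
    """Return a list of problems found in a parsed credential-config.json."""
    problems = []
    if config.get("type") != "external_account":
        problems.append("type should be 'external_account'")
    if not config.get("audience", "").startswith("//iam.googleapis.com/projects/"):
        problems.append("audience should name a workload identity pool provider")
    if config.get("subject_token_type") != AWS_SUBJECT_TOKEN_TYPE:
        problems.append("subject_token_type is not the AWS request type")
    if "private_key" in config:
        problems.append("file contains a private key; this is a service account key")
    return problems


# Trimmed sample with placeholder values, shaped like a generated config
sample = json.loads("""
{
  "type": "external_account",
  "audience": "//iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/aws-prod-pool/providers/aws-provider",
  "subject_token_type": "urn:ietf:params:aws:token-type:aws4_request",
  "token_url": "https://sts.googleapis.com/v1/token"
}
""")
print(check_wif_config(sample) or "config looks like a keyless WIF config")
```

To check the real file, parse credential-config.json with json.load and pass the result to check_wif_config.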

Phase 3: The Vertex AI Pipeline Code

This is where the magic happens. We will look at two patterns: the Standard Pattern for small data, and the Enterprise Pattern for massive datasets.

Common Setup

In your Python script running on AWS, simply point to the config file.

Python - Common Setup
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/credential-config.json'

Option A: The "Standard" Pattern (Small Data)

For metadata or small files, you can download to the AWS worker and re-upload to GCS.

Python - Standard Pattern
import boto3
from google.cloud import storage

def fetch_data_standard(s3_bucket, gcs_bucket):
    """Copy data.csv from S3 to GCS through this worker (small files only)."""
    # 1. Get data from S3 (using AWS native credentials)
    s3 = boto3.client('s3')
    s3.download_file(s3_bucket, 'data.csv', '/tmp/data.csv')

    # 2. Upload to GCS (using WIF credentials automatically)
    storage_client = storage.Client()
    bucket = storage_client.bucket(gcs_bucket)
    blob = bucket.blob('training/data.csv')
    blob.upload_from_filename('/tmp/data.csv')

    return f"gs://{gcs_bucket}/training/data.csv"

Option B: The "Enterprise" Pattern (Multi-TB Data)

If you are moving 2TB of training data, do not proxy it through your worker node. It's slow, expensive, and fragile.

Let me show you why with a visual comparison:

[Figure: Naive Approach: Proxy Through Worker]
[Figure: Enterprise Approach: Direct STS Transfer]

The Cost of Getting This Wrong

Naive Approach (Proxy Through Worker):

Enterprise Approach (Storage Transfer Service):

The Architectural Insight

With STS, your worker node becomes a lightweight orchestrator instead of a data proxy. You use WIF credentials to command GCP to fetch the data, then step aside while the platforms handle the heavy lifting.

This is the architect-level distinction: Knowing when to move data yourself, and when to orchestrate the platform to do it for you.

Python - Enterprise Pattern
def trigger_enterprise_transfer(s3_bucket, gcs_bucket, project_id="your-ml-project"):
    """
    Triggers a server-to-server transfer from AWS S3 to GCS.
    Zero data flows through this script.
    """
    from google.cloud import storage_transfer

    client = storage_transfer.StorageTransferServiceClient()

    transfer_job = {
        "description": "Enterprise Transfer via WIF",
        "project_id": project_id,
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": s3_bucket,
                # STS assumes a SEPARATE AWS role via sts:AssumeRoleWithWebIdentity.
                # Its trust policy must federate accounts.google.com for the
                # Google-managed Storage Transfer Service account of this project.
                # It needs s3:ListBucket, s3:GetBucketLocation and s3:GetObject
                # on the source bucket.
                # It is NOT a member of the WIF pool your workload uses.
                "role_arn": "arn:aws:iam::123456789012:role/sts-transfer-role"
            },
            "gcs_data_sink": {"bucket_name": gcs_bucket},
        },
        "status": storage_transfer.TransferJob.Status.ENABLED,
    }

    job = client.create_transfer_job({"transfer_job": transfer_job})
    # A job without a schedule does not run on its own, so start it explicitly
    client.run_transfer_job({"job_name": job.name, "project_id": project_id})
    print(f"✓ STS Job Started: {job.name}")
    return job.name

Cost Impact: Using STS instead of proxying saved us $90 in egress costs on a single 2TB transfer. For weekly transfers, that's $4,680/year.
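
The annual figure is simply the per-transfer saving multiplied by the number of transfers. As a quick sanity check, using the $90 figure quoted above and assuming 52 weekly transfers a year:

```python
def annual_savings(saving_per_transfer_usd, transfers_per_year):
    """Project a per-transfer saving over a year of transfers."""
    return saving_per_transfer_usd * transfers_per_year


yearly = annual_savings(90, 52)
print(f"${yearly:,}/year")  # → $4,680/year
```

Swap in your own per-transfer saving and cadence to see whether the extra STS setup is worth it for your workload.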

Pattern 2: Azure to Vertex AI (Healthcare)

Standard WIF is great. But what about regulated industries?

Let me show you how WIF adapts for healthcare compliance. This is the pattern implemented for a diagnostics startup that needed:

HIPAA Note: CMEK is required for HIPAA workloads on GCP. Combined with WIF, VPC Service Controls, and audit logging, this creates a compliant ML environment. Always consult your compliance team for your specific requirements.

Healthcare compliance adds three layers beyond basic WIF:

1. Data Residency

All patient data stayed in Azure Blob Storage in their compliant region. Only anonymised features crossed the cloud boundary.

2. Network Isolation

We used Cloud Interconnect + VPC Service Controls to ensure data never touched the public internet. In retrospect, this was the most complex part of the implementation: it took 2 days to get the routing right.

3. Cryptographic Trail

Every data access had to be auditable. WIF gave us this almost for free: with Data Access audit logs enabled for the IAM and Security Token Service APIs, each token exchange generates an audit log entry you can present to regulators.

The Azure Difference

Azure doesn't use the AWS-style federation. It uses OIDC (OpenID Connect).

GCP Side: Create an OIDC Provider

Bash - Create OIDC Provider
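# Assumes the pool "azure-health-pool" already exists (create it as in Pattern 1)
gcloud iam workload-identity-pools providers create-oidc "azure-provider" \
  --location="global" \
  --workload-identity-pool="azure-health-pool" \
  --issuer-uri="https://sts.windows.net/YOUR_TENANT_ID/" \
  --allowed-audiences="api://AzureADTokenExchange" \
  --attribute-mapping="google.subject=assertion.sub"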

Azure Side: Enable Managed Identity

Enable Managed Identity on your VM/AKS cluster.
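
Once a managed identity is enabled, the workload obtains its token from the Azure Instance Metadata Service (IMDS). As a hedged sketch, the helper below builds that HTTP request without sending it. The audience value reuses the one configured on the GCP provider above, and the function name is illustrative. Google's client libraries can make this call for you when the credential config points at the IMDS URL.

```python
import urllib.parse
import urllib.request

AZURE_IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_managed_identity_request(audience, api_version="2018-02-01"):
    """Build (but do not send) the IMDS request for a managed identity token."""
    query = urllib.parse.urlencode({"api-version": api_version, "resource": audience})
    request = urllib.request.Request(f"{AZURE_IMDS_TOKEN_URL}?{query}")
    # IMDS rejects requests that lack this header
    request.add_header("Metadata", "true")
    return request


req = build_managed_identity_request("api://AzureADTokenExchange")
print(req.full_url)
```

On a VM with a managed identity, sending this request with urllib.request.urlopen(req) returns JSON whose access_token field is the JWT that GCP validates against the OIDC provider.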

Compliance Hardening: CMEK

CMEK (Customer Managed Encryption Keys): All ML artifacts encrypted with your own keys for full cryptographic control.

First, create your encryption key:

Bash - Create KMS Key
# Create KMS key ring (one-time setup)
gcloud kms keyrings create healthcare-ml-keyring \
  --location=us-central1 \
  --project=$PROJECT_ID

# Create the encryption key
gcloud kms keys create vertex-artifacts-key \
  --keyring=healthcare-ml-keyring \
  --location=us-central1 \
  --purpose=encryption \
  --rotation-period=90d \
  --next-rotation-time=$(date -u -d "+90 days" +%Y-%m-%dT%H:%M:%SZ)

# Grant Vertex AI permission to use this key
gcloud kms keys add-iam-policy-binding vertex-artifacts-key \
  --keyring=healthcare-ml-keyring \
  --location=us-central1 \
  --member="serviceAccount:service-$PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"

Then reference it in your Vertex AI training jobs:

Python - CMEK Configuration
from google.cloud import aiplatform

# Full resource name of the key created above
KMS_KEY = (
    'projects/your-healthcare-project/locations/us-central1/'
    'keyRings/healthcare-ml-keyring/cryptoKeys/vertex-artifacts-key'
)

# Initialize with CMEK configuration (the default for resources created below)
aiplatform.init(
    project='your-healthcare-project',
    location='us-central1',
    encryption_spec_key_name=KMS_KEY
)

# Training job with CMEK (container-based, so no local training script is needed)
job = aiplatform.CustomContainerTrainingJob(
    display_name='hipaa-compliant-training',
    container_uri='gcr.io/your-project/training-image:latest',
    model_serving_container_image_uri='gcr.io/your-project/serving-image:latest',
    # Training pipeline artifacts and the resulting model both use CMEK
    training_encryption_spec_key_name=KMS_KEY,
    model_encryption_spec_key_name=KMS_KEY
)

# Pass dataset=... here if you train from a Vertex AI managed dataset
model = job.run(
    model_display_name='diagnostic-model-v1'
)

# Deploy endpoint with CMEK
endpoint = model.deploy(
    deployed_model_display_name='diagnostic-endpoint',
    machine_type='n1-standard-4',
    # Even prediction endpoint artifacts encrypted
    encryption_spec_key_name=KMS_KEY
)
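
A CMEK key has to live in the same location as the Vertex AI resources it protects, and a mismatch only surfaces when the job is submitted. A small pre-flight check can catch it earlier. This is a sketch with an illustrative helper name; it only parses the key's resource name and makes no API calls.

```python
import re

KMS_KEY_PATTERN = re.compile(
    r"projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/keyRings/(?P<key_ring>[^/]+)/cryptoKeys/(?P<key>[^/]+)"
)


def check_cmek_location(kms_key_name, vertex_location):
    """Raise ValueError unless the key name is well formed and co-located."""
    match = KMS_KEY_PATTERN.fullmatch(kms_key_name)
    if match is None:
        raise ValueError(f"Not a full KMS key resource name: {kms_key_name!r}")
    if match["location"] != vertex_location:
        raise ValueError(
            f"Key is in {match['location']} but Vertex AI runs in {vertex_location}"
        )
    return match.groupdict()


parts = check_cmek_location(
    "projects/your-healthcare-project/locations/us-central1/"
    "keyRings/healthcare-ml-keyring/cryptoKeys/vertex-artifacts-key",
    "us-central1",
)
print(parts["key_ring"])  # → healthcare-ml-keyring
```

Running this at the top of a training script turns a late, opaque API error into an immediate and readable one.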

Why CMEK Matters for Healthcare

Here's how the encryption layers protect your ML pipeline:

[Figure: CMEK Encryption Layers for Healthcare ML]

The Compliance Value Chain

For our healthcare client, this architecture meant:

What Gets Encrypted

The pattern above ensures that even Google engineers cannot access your ML artifacts without your explicit key permissions.

Pro Tip: Use separate keys for different sensitivity levels. We used one key for anonymised features, another for the diagnostic models containing derived PHI.

VPC Service Controls

Wrap your GCP project in a service perimeter that blocks API access and data movement across its boundary, allowing traffic only from your trusted Azure range via Private/Interconnect.

Common WIF Errors: Debugging Permission Denied and Token Issues

You've built it. Now let's make sure it works.

I've debugged WIF implementations dozens of times. Here are the three errors you'll almost certainly encounter, and how to fix them in minutes.

Problem 1: "Permission Denied" on Token Exchange

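This usually means the federated identity is not allowed to impersonate the service account: either the binding is missing, or its member does not match the attribute value Google mapped from your AWS role.

Bash - Debug Command
# Debug Command
export SA_EMAIL="$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com"
gcloud iam service-accounts get-iam-policy $SA_EMAIL

Look for roles/iam.workloadIdentityUser and check that its member is the exact principalSet you bound in Phase 1.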
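
If you save the policy with --format=json, the same check can be scripted. The sketch below scans a parsed policy for roles/iam.workloadIdentityUser bindings and flags members that end in a wildcard. The function name and the sample policy are illustrative.

```python
WIF_ROLE = "roles/iam.workloadIdentityUser"


def audit_wif_bindings(policy):
    """Return (members, wildcard_members) bound to the WIF impersonation role."""
    members = [
        member
        for binding in policy.get("bindings", [])
        if binding.get("role") == WIF_ROLE
        for member in binding.get("members", [])
    ]
    wildcards = [m for m in members if m.endswith("/*")]
    return members, wildcards


# Sample policy with a deliberately over-broad member
sample_policy = {
    "bindings": [
        {
            "role": WIF_ROLE,
            "members": [
                "principalSet://iam.googleapis.com/projects/123456789/locations/global/"
                "workloadIdentityPools/aws-prod-pool/*"
            ],
        }
    ]
}
members, wildcards = audit_wif_bindings(sample_policy)
if not members:
    print("No workloadIdentityUser binding: token exchange will be denied")
for member in wildcards:
    print(f"Wildcard binding, tighten before production: {member}")
```

An empty members list explains a "Permission Denied"; a non-empty wildcards list is the opposite problem, where the exchange works but trusts too much.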

Problem 2: "Invalid Token"

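The fix below targets EC2 instances that enforce IMDSv2. When your code runs in a container, the default hop limit of 1 stops the metadata token response from reaching it, so the client library cannot obtain the AWS credentials it needs to sign its request. If IMDSv2 is required, also regenerate credential-config.json with the --enable-imdsv2 flag.

Bash - Fix IMDSv2
# Fix on EC2
aws ec2 modify-instance-metadata-options --http-put-response-hop-limit 2 --instance-id ${INSTANCE_ID}

By the third implementation, I just automated this check in Terraform modules.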
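
To confirm the fix from inside the container, you can request an IMDSv2 session token yourself. The sketch below builds that request with the standard library and only sends it when you call the second helper. Function names are illustrative; the endpoint and header are the ones AWS documents for IMDSv2.

```python
import urllib.request

IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"


def build_imdsv2_token_request(ttl_seconds=60):
    """Build the PUT request that asks IMDSv2 for a session token."""
    return urllib.request.Request(
        IMDS_TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def imdsv2_reachable(timeout=2):
    """Return True if this process can obtain an IMDSv2 token."""
    try:
        with urllib.request.urlopen(build_imdsv2_token_request(), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and socket timeouts are both OSError subclasses
        return False


request = build_imdsv2_token_request()
print(request.get_method(), request.full_url)
```

If imdsv2_reachable() returns False inside a container but True on the host, the hop limit is the likely culprit.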

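Problem 3: "Security Token Service API Disabled"

Enable the Security Token Service API (sts.googleapis.com) and the IAM Service Account Credentials API (iamcredentials.googleapis.com) in your GCP project. The first handles the token exchange, the second the service account impersonation that follows it.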

Conclusion: Start Small, Scale Securely

You now have the blueprint to eliminate static keys from your infrastructure.

Your Next Steps

  1. Week 1: Implement the AWS setup in a sandbox environment.
  2. Week 2: Migrate one non-critical pipeline (e.g., a daily batch job).
  3. Week 3: Enable "Attribute Conditions" to lock down access to specific roles.

Get the Code

I've published the complete Terraform modules and Python scripts for both the AWS and Azure patterns in my GitHub repository. Star it, fork it, and use it as your template.

GitHub Repo: wif-mlops-patterns

What's Next: From Security to Economics

You now have the security foundation in place—zero static credentials, cryptographic trust, and automated compliance. Your multi-cloud MLOps pipeline is secure.

But security is just one dimension of production readiness. The other critical question: What does this infrastructure actually cost?

In Part 3: Cloud Composer vs. Vertex AI Pipelines, we discussed how to choose the right orchestration layer for your ML workflows. Now it's time to make it cost-efficient.

In Part 4, we'll tackle the economics: Cost-optimised MLOps: Reducing Infrastructure Spend by 80%.

Because secure pipelines that bankrupt your team aren't sustainable.

Series Navigation

This article is part of the "Enterprise MLOps on GCP" series. Follow me on Medium and LinkedIn for Part 4 and the rest of the series.
