Recipe 56: GPU Generation Targeting

Situation

A research team runs ML inference on NVIDIA Hopper-class GPUs (H100, H200) for the latency-critical path, and batch training on Ampere-class GPUs (A100, A40) where throughput matters more than per-token latency. The team’s quota envelope shouldn’t be a single fleet-wide “N GPUs”; it should be “up to 4 Hopper and up to 8 Ampere,” with training requests refused if they accidentally target Hopper and inference refused if they accidentally land on Ampere.

In grafOS the tenant’s quota table carries per-generation GpuGenerationQuota entries. The scheduler treats those entries as an allow-list: when the table is non-empty, every GPU lease request MUST declare its HardwareGeneration and unlisted generations have an effective limit of zero. Missing generation and exceeded generation are distinct typed denials, so policy code and SIEM alerts can react to each cleanly.

What You Build

A quota policy + admission check that:

  • Builds a typed GpuFleetPolicy from per-generation allowances (Hopper, Ampere, Blackwell, MI300, etc.);
  • Rejects duplicate-generation policy entries at construction time;
  • Commits the policy onto a QuotaManager;
  • Checks an incoming GpuLeaseRequest against the committed envelope and returns a typed AdmissionResult::{Approved, Denied(QuotaDenied)};
  • Surfaces the exact QuotaDenied shape the scheduler emits, so callers can pattern-match on GpuGenerationRequired vs GpuGenerationLimitExceeded.

The compiled recipe lives in cookbook/recipe-56-gpu-generation-targeting.

Core grafOS API Path

use grafos_core::{GpuGenerationQuota, HardwareGeneration};
use grafos_scheduler::{QuotaDenied, QuotaManager, TenantId};

let mut mgr = QuotaManager::new();
let tenant = TenantId(0xa1);

// Configure the per-generation allow-list: 4 Hopper, 8 Ampere.
mgr.set_gpu_generation_limits(
    tenant,
    &[
        GpuGenerationQuota {
            generation: HardwareGeneration::NvidiaHopper,
            count: 4,
        },
        GpuGenerationQuota {
            generation: HardwareGeneration::NvidiaAmpere,
            count: 8,
        },
    ],
)?;

// Approved: declared, within envelope.
mgr.check_gpu_generation(tenant, Some(HardwareGeneration::NvidiaHopper), 2)?;

// Denied: generation required when limits are configured.
let err = mgr
    .check_gpu_generation(tenant, None, 1)
    .unwrap_err();
assert_eq!(err, QuotaDenied::GpuGenerationRequired { requested: 1 });

// Denied: unlisted generation has limit=0.
let err = mgr
    .check_gpu_generation(tenant, Some(HardwareGeneration::NvidiaBlackwell), 1)
    .unwrap_err();
assert!(matches!(
    err,
    QuotaDenied::GpuGenerationLimitExceeded {
        generation: HardwareGeneration::NvidiaBlackwell,
        limit: 0,
        used: 0,
        requested: 1,
    }
));
# Ok::<(), Box<dyn std::error::Error>>(())

Program

use cookbook_recipe_56_gpu_generation_targeting::{
    check_gpu_request, AdmissionResult, GpuFleetPolicy, GpuLeaseRequest,
};
use grafos_core::{GpuGenerationQuota, HardwareGeneration};
use grafos_scheduler::{QuotaManager, TenantId};

let tenant = TenantId(0xa1);
let mut mgr = QuotaManager::new();

// Build the typed policy; duplicate generations are rejected here.
let policy = GpuFleetPolicy::new(
    tenant,
    vec![
        GpuGenerationQuota {
            generation: HardwareGeneration::NvidiaHopper,
            count: 4,
        },
        GpuGenerationQuota {
            generation: HardwareGeneration::NvidiaAmpere,
            count: 8,
        },
    ],
)?;
policy.commit(&mut mgr).expect("commit");

// Inference path requests Hopper.
let inference = GpuLeaseRequest {
    generation: Some(HardwareGeneration::NvidiaHopper),
    count: 2,
};
assert_eq!(check_gpu_request(&mgr, tenant, inference), AdmissionResult::Approved);

// Training path forgot to tag generation: fail closed.
let untagged = GpuLeaseRequest {
    generation: None,
    count: 1,
};
match check_gpu_request(&mgr, tenant, untagged) {
    AdmissionResult::Denied(_) => {}
    AdmissionResult::Approved => unreachable!("untagged request must be denied"),
}
# Ok::<(), cookbook_recipe_56_gpu_generation_targeting::GpuFleetPolicyError>(())
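
The program above collapses every denial into Denied(_). A caller that wants to react to each shape separately might match on the inner variant, roughly as sketched below. The enums here are local stand-ins that mirror the variant and field names quoted in this recipe; they are not the grafos types themselves, and describe is a hypothetical helper, not part of the recipe crate.

```rust
/// Local stand-in for grafos_core::HardwareGeneration (assumption: only the
/// variants used in this recipe are modelled).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Generation {
    NvidiaHopper,
    NvidiaAmpere,
    NvidiaBlackwell,
}

/// Local stand-in mirroring the two denial shapes named in this recipe.
#[derive(Debug, PartialEq, Eq)]
pub enum QuotaDenied {
    GpuGenerationRequired { requested: u32 },
    GpuGenerationLimitExceeded {
        generation: Generation,
        limit: u32,
        used: u32,
        requested: u32,
    },
}

/// Local stand-in for the recipe's AdmissionResult.
#[derive(Debug, PartialEq, Eq)]
pub enum AdmissionResult {
    Approved,
    Denied(QuotaDenied),
}

/// Hypothetical caller-side helper: turn an admission result into an
/// operator-facing message, handling each denial shape on its own arm.
pub fn describe(result: &AdmissionResult) -> String {
    match result {
        AdmissionResult::Approved => "approved".to_string(),
        AdmissionResult::Denied(QuotaDenied::GpuGenerationRequired { requested }) => {
            format!("denied: tag a generation on this request for {} GPU(s)", requested)
        }
        // limit == 0 is how an unlisted generation shows up.
        AdmissionResult::Denied(QuotaDenied::GpuGenerationLimitExceeded {
            generation,
            limit: 0,
            ..
        }) => format!("denied: {:?} is not in the tenant's allow-list", generation),
        AdmissionResult::Denied(QuotaDenied::GpuGenerationLimitExceeded {
            generation,
            limit,
            used,
            requested,
        }) => format!(
            "denied: {:?} limit {}, used {}, requested {}",
            generation, limit, used, requested
        ),
    }
}
```

Putting the limit: 0 arm before the general arm keeps the "wrong generation entirely" case distinct from "right generation, too many GPUs" without adding a third variant.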

Design

The per-generation table is an allow-list, not a refinement. That choice has two operator-visible consequences:

  1. Untagged GPU requests fail closed. When the table has any entries, requests without a HardwareGeneration are rejected with QuotaDenied::GpuGenerationRequired. Tenants that mix targeted and untargeted workloads need a policy entry for every generation they want to land on — including a HardwareGeneration::Other row if they want a catch-all bucket.
  2. Unlisted generations have effective limit zero. A tenant that lists Hopper + Ampere implicitly forbids Blackwell, MI300, etc. The typed denial carries the unlisted generation, a limit = 0, and the requested count — so a SIEM operator can distinguish “asked for the wrong generation entirely” from “asked for a generation we have, but exceeded the count.”
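
The two consequences above can be sketched as a toy model. This is not the scheduler's implementation: Generation, Denied, and GenerationTable are hypothetical stand-ins, and the behaviour for an empty table (every request passes this check) is an assumption drawn from the "when the table is non-empty" wording.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for grafos_core::HardwareGeneration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Generation {
    NvidiaHopper,
    NvidiaAmpere,
    NvidiaBlackwell,
}

/// Hypothetical stand-in mirroring the two denial shapes in this recipe.
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    GenerationRequired { requested: u32 },
    GenerationLimitExceeded {
        generation: Generation,
        limit: u32,
        used: u32,
        requested: u32,
    },
}

/// Toy per-tenant table: generation -> (limit, used).
#[derive(Default)]
pub struct GenerationTable {
    rows: HashMap<Generation, (u32, u32)>,
}

impl GenerationTable {
    /// Build a table with the given limits and zero recorded usage.
    pub fn with_limits(limits: &[(Generation, u32)]) -> Self {
        let rows = limits.iter().map(|&(g, limit)| (g, (limit, 0))).collect();
        Self { rows }
    }

    pub fn check(&self, generation: Option<Generation>, requested: u32) -> Result<(), Denied> {
        // Assumption: an empty table means no per-generation policy applies.
        if self.rows.is_empty() {
            return Ok(());
        }
        // Consequence 1: untagged requests fail closed.
        let generation = generation.ok_or(Denied::GenerationRequired { requested })?;
        // Consequence 2: an unlisted generation behaves like limit = 0, used = 0.
        let (limit, used) = self.rows.get(&generation).copied().unwrap_or((0, 0));
        if used.saturating_add(requested) > limit {
            return Err(Denied::GenerationLimitExceeded {
                generation,
                limit,
                used,
                requested,
            });
        }
        Ok(())
    }
}
```

In this model the unlisted case needs no special branch: the unwrap_or((0, 0)) default produces the same denial shape as an exceeded limit, which is the property the allow-list design relies on.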

The duplicate-generation check at construction time mirrors the scheduler-side check at set_gpu_generation_limits. Catching it in the builder shortens the failure path: a misconfigured policy file is rejected before any scheduler state mutates.
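
A builder-side duplicate check of this kind could look like the following sketch. FleetPolicy, GenerationQuota, and PolicyError::DuplicateGeneration are hypothetical names chosen for illustration; the recipe's real GpuFleetPolicy::new and GpuFleetPolicyError may be shaped differently.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for grafos_core::HardwareGeneration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Generation {
    NvidiaHopper,
    NvidiaAmpere,
}

/// Hypothetical stand-in for GpuGenerationQuota: variant kind and count only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GenerationQuota {
    pub generation: Generation,
    pub count: u32,
}

/// Hypothetical construction error (the real error type is not shown here).
#[derive(Debug, PartialEq, Eq)]
pub enum PolicyError {
    DuplicateGeneration(Generation),
}

#[derive(Debug)]
pub struct FleetPolicy {
    entries: Vec<GenerationQuota>,
}

impl FleetPolicy {
    /// Reject the policy before it can reach any scheduler state.
    pub fn new(entries: Vec<GenerationQuota>) -> Result<Self, PolicyError> {
        let mut seen = HashSet::new();
        for entry in &entries {
            // HashSet::insert returns false when the value was already present.
            if !seen.insert(entry.generation) {
                return Err(PolicyError::DuplicateGeneration(entry.generation));
            }
        }
        Ok(Self { entries })
    }

    pub fn entries(&self) -> &[GenerationQuota] {
        &self.entries
    }
}
```

Because new is the only way to obtain a FleetPolicy in this sketch, any value that reaches a commit step has already passed the duplicate check.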

GpuGenerationQuota carries only the variant kind and a count. Details such as driver version, NVLink topology, or per-SM config are out of band: they belong on the inventory side, not the quota side. Two Hopper { count: 4 } entries with different driver versions therefore aggregate under the same SIEM bucket; quota attribution does not split on driver micro-version.

Failure Modes

  • Generation required: tenant has per-generation limits configured and the request omitted generation. Typed QuotaDenied::GpuGenerationRequired { requested }.
  • Generation limit exceeded: the request’s (generation, count) would push usage past the listed limit. Typed QuotaDenied::GpuGenerationLimitExceeded { generation, limit, used, requested }. Operators read all four fields to size the next policy revision.
  • Unlisted generation: the request named a generation not in the policy. Same shape as GpuGenerationLimitExceeded with limit = 0, used = 0 — the SIEM bucket can be the same (“wrong generation”) or split by inspecting limit.
  • Duplicate generation in policy: rejected at policy build time via GpuFleetPolicy::new, before any scheduler state mutates.
  • Release lowers usage: record_gpu_generation_free is the counterpart of record_gpu_generation_alloc. Recipes are responsible for pairing them; a release that exceeds the recorded usage saturates at zero.
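
The alloc/free pairing and the saturate-at-zero behaviour in the last bullet can be illustrated with a small ledger. This is a sketch only: UsageLedger and its methods are hypothetical and stand in for whatever bookkeeping record_gpu_generation_alloc and record_gpu_generation_free perform inside QuotaManager.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for grafos_core::HardwareGeneration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Generation {
    NvidiaHopper,
    NvidiaAmpere,
}

/// Toy usage ledger: generation -> GPUs currently recorded as in use.
#[derive(Default)]
pub struct UsageLedger {
    used: HashMap<Generation, u32>,
}

impl UsageLedger {
    /// Counterpart of an allocation: raise recorded usage.
    pub fn record_alloc(&mut self, generation: Generation, count: u32) {
        let used = self.used.entry(generation).or_insert(0);
        *used = used.saturating_add(count);
    }

    /// Counterpart of a release: lower recorded usage, never below zero.
    pub fn record_free(&mut self, generation: Generation, count: u32) {
        let used = self.used.entry(generation).or_insert(0);
        *used = used.saturating_sub(count);
    }

    pub fn used(&self, generation: Generation) -> u32 {
        self.used.get(&generation).copied().unwrap_or(0)
    }

    /// Would `requested` more GPUs stay within `limit`?
    pub fn fits(&self, generation: Generation, limit: u32, requested: u32) -> bool {
        self.used(generation).saturating_add(requested) <= limit
    }
}
```

Saturating at zero keeps an unpaired or double release from underflowing the counter, but it also hides the pairing bug, so callers may still want to log a release that exceeds recorded usage.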

Tests

Run it with:

cargo test -p cookbook-recipe-56-gpu-generation-targeting

The tests cover declared-and-fits (Approved), missing generation (GpuGenerationRequired), exceeding the envelope (GpuGenerationLimitExceeded with typed limit/used/requested), unlisted generation (limit = 0 denial), duplicate-generation policy rejection at construction time, and release lowering usage so a subsequent request fits again.

Adaptation Notes

  • Allow-list vs additive: this recipe uses the allow-list semantic (unlisted = forbidden). If you want a “soft target” semantic where unlisted generations fall back to a fleet-wide pool, leave per-generation limits unset and use the gpu_count field on QuotaSchema instead. The two surfaces compose: a tenant with both has the per-generation table as the strict ceiling and the total gpu_count as an additional cap.
  • Adding a new GPU family: extend HardwareGeneration (in grafos-core) with the new variant and its snake_case as_str(). Existing policies that don’t list the new variant continue to forbid it (allow-list semantic stays correct).
  • Per-generation rate-card pricing: the typed HardwareGeneration value flows into the billing surface; a tenant with mixed Hopper + Ampere allowances sees per-generation cost rows in their invoice. See docs/operations/scheduler-features.md § “Resource taxonomy” for the billing-side mapping.
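
The composition described in the first note, where the per-generation table is the strict ceiling and gpu_count is an additional cap, might be modelled as two checks that must both pass. Everything in this sketch is an assumption made for illustration: the order of the checks, the Row type, and especially the TotalGpuCountExceeded variant, which is a made-up name and not a documented QuotaDenied shape.

```rust
/// Hypothetical stand-in for grafos_core::HardwareGeneration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Generation {
    NvidiaHopper,
    NvidiaAmpere,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    GenerationLimitExceeded {
        generation: Generation,
        limit: u32,
        used: u32,
        requested: u32,
    },
    /// Hypothetical variant for the fleet-wide cap; the real name is not given here.
    TotalGpuCountExceeded { limit: u32, used: u32, requested: u32 },
}

/// One per-generation row: its limit and current recorded usage.
pub struct Row {
    pub generation: Generation,
    pub limit: u32,
    pub used: u32,
}

/// Admit only if the request fits the per-generation ceiling AND the total cap.
pub fn admit(
    rows: &[Row],
    total_limit: u32,
    generation: Generation,
    requested: u32,
) -> Result<(), Denied> {
    // Per-generation ceiling first; unlisted generations get limit 0.
    let (limit, used) = rows
        .iter()
        .find(|r| r.generation == generation)
        .map(|r| (r.limit, r.used))
        .unwrap_or((0, 0));
    if used.saturating_add(requested) > limit {
        return Err(Denied::GenerationLimitExceeded { generation, limit, used, requested });
    }
    // Then the fleet-wide cap across all generations.
    let total_used = rows.iter().fold(0u32, |acc, r| acc.saturating_add(r.used));
    if total_used.saturating_add(requested) > total_limit {
        return Err(Denied::TotalGpuCountExceeded {
            limit: total_limit,
            used: total_used,
            requested,
        });
    }
    Ok(())
}
```

With Hopper at 2 of 4 and Ampere at 3 of 8 under a total cap of 6, a request for 2 Hopper fits its own row but is refused by the total cap, which is the "additional cap" case.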

See also:

  • crates/grafos-core/src/policy_vocab.rs: HardwareGeneration, GpuGenerationQuota, QuotaSchema.
  • crates/grafos-scheduler/src/quota.rs: QuotaManager, check_gpu_generation, record_gpu_generation_alloc.
  • docs/operations/scheduler-features.md § “Quota schema” and “HardwareGeneration”.
  • docs/operations/siem-vocabulary-cookbook.md — SIEM filter recipes for quota_violation == "gpu_generation_required" and quota_violation == "gpu_generation_limit_exceeded".