feat(infrastructure): add conversion pipeline Terraform module#542
Conversation
- add diagnostic settings for storage account and blob service
- create private endpoints for blob and dfs subresources
- define outputs for storage account, containers, and event grid
- set up role assignments for Event Grid and Fabric workspace
- add tests for conversion pipeline module functionality
🔧 - Generated by Copilot
Dependency Review — The following issues were found:
Snapshot Warnings — Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.
License Issues — .github/workflows/terraform-security.yml
OpenSSF Scorecard
Scanned Files
Codecov Report — ✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #542 +/- ##
==========================================
- Coverage 63.91% 63.13% -0.78%
==========================================
Files 250 265 +15
Lines 15409 16850 +1441
Branches 2163 2266 +103
==========================================
+ Hits 9848 10638 +790
- Misses 5274 5926 +652
+ Partials 287 286 -1
*This pull request uses carry forward flags.
rezatnoMsirhC
left a comment
Good module structure and conventions throughout. Four comments below: one high-severity bug in the staging tfvars that will fail at plan time, one least-privilege RBAC issue, one Event Grid reliability concern, and one stale TODO to remove.
WilliamBerryiii
left a comment
Review Summary — Conversion Pipeline IaC
Thanks for this; the module structure (Edge → Raw blob → Event Grid → Function App → Converted blob → Fabric Lakehouse) is coherent and easy to follow. Requesting changes for the items below — most are small, but RI-03 and RI-04 are blockers in my view.
Acknowledgements
- Existing reviewer @rezatnoMsirhC has already covered E1–E4 (Function App SKU/runtime, identity, network ACLs, retention defaults). I've intentionally left those alone to avoid duplicate noise; please address them in the same revision pass.
- I withdrew several internal candidates (provider version pinning, blob CORS, etc.) after confirming they were either already raised or non-issues against repo conventions.
Blocking items (inline)
- RI-03 🟧 Medium — main.tf:64 — Four checkov:skip directives reference an unresolved WI-01 placeholder. File the real follow-up issue(s) and replace WI-01 with the issue reference before merge.
- RI-04 🟧 Medium — infrastructure/terraform/variables.tf (root, ~lines 547–573) and infrastructure/terraform/main.tf (~lines 217–260) — The conversion_pipeline_config object exposed at the root does not surface every meaningful module variable. Operators today cannot, for example, override the Function App SKU, blob retention windows, or Event Grid retry policy without editing the module directly. Either (a) widen the root config object to mirror the module's full variable surface (see the sketch after this list), or (b) document explicitly which knobs are intentionally fixed at the module layer and why. As-is, the partial passthrough is a hidden footgun.
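A minimal sketch of direction (a), assuming Terraform ≥ 1.3 optional() object attributes; the attribute names below are illustrative, not the module's actual variable names, and leaving defaults off the root declaration avoids re-declaring what the module already owns:

variable "conversion_pipeline_config" {
  description = "Optional overrides passed through to the conversion-pipeline module."
  type = object({
    function_app_sku         = optional(string) # illustrative names; module default applies when null
    storage_replication_type = optional(string)
    raw_retention_days       = optional(number)
    converted_retention_days = optional(number)
    event_grid_retry = optional(object({
      max_delivery_attempts = optional(number)
      event_ttl_minutes     = optional(number)
    }))
  })
  default = {}
}

In the module call, each attribute would then be forwarded as-is; passing a null value behaves the same as omitting the argument, so the module's own default still wins.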
Non-blocking items (inline)
- RI-02 🟨 Low — fabric.tf:23 — Fabric capacity name lacks separators; align with hyphenated convention (fc-${local.resource_name_suffix}).
- RI-06 🟨 Low — README.md:121 — Two-pass deployment with manual GUID copy is UX-fragile; suggest -target first pass or wrapper script.
- RI-09 🟨 Low — versions.tf:43 — Add precondition or doc note for FABRIC_TENANT_ID / FABRIC_CLIENT_ID / FABRIC_CLIENT_SECRET env vars so missing values fail fast.
- RI-07 🟦 Trivial — README.md:4 — author: Edge AI Team is inconsistent with sibling module READMEs.
- RI-10 ℹ️ Verify-only — event-grid.tf:11 — Confirm evgt (vs. egst) is the intended Event Grid System Topic abbreviation.
Body-only finding
- RI-05 🟨 Low — infrastructure/terraform/variables.tf (root, ~lines 547–573) duplicates module-level defaults (Function App SKU F2, storage ZRS, raw retention 30 days, converted retention 90 days). Each default now lives in two places and will drift. Recommendation: drop the defaults from the root variable block and let the module own them, or remove them from the module and centralize at the root — either is fine, but not both.
Validation evidence
- All 6 inline anchors verified against the PR diff hunks for commit 01de39ebf3db4eff0eba0cf7966b44e4d71c1f22 before posting.
- Findings cross-checked against .github/copilot-instructions.md (Terraform Conventions section) and the existing PR conversation to avoid duplication of @rezatnoMsirhC's review.
Happy to re-review once RI-03 and RI-04 are addressed.
…d letter support
- implement dynamic dead letter identity based on configuration
- update role assignment scope for event grid dead letter queue
- modify staging variables for PostgreSQL high availability
🔒 - Generated by Copilot
…loyment details
- correct author attribution to Microsoft Robotics-AI Team
- clarify two-pass deployment process in README
- add lifecycle precondition for fabric workspace in fabric.tf
- update Checkov control tracking references in main.tf
🔒 - Generated by Copilot
… pipeline module
- eliminate unused virtual network variable and references
- update tests to reflect changes in module dependencies
🔧 - Generated by Copilot
🔒 - Generated by Copilot
…ile paths 🔧 - Generated by Copilot
You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.
What Enabling Code Scanning Means:
For more information about GitHub Code Scanning, check out the documentation.
katriendg
left a comment
Thanks @nguyena2 for this great addition and for integrating Fabric into the picture.
Reviewing the work, I have two general notes that came out of reflections as I looked through your updates. I'll let you decide whether to integrate them: I believe the first one is valuable; the second I'll leave to you to assess for relevance.
Comment 1 — Can we drop the two-pass deploy and FABRIC_* env vars
Files referenced:
- infrastructure/terraform/modules/conversion-pipeline/fabric.tf
- infrastructure/terraform/modules/conversion-pipeline/variables.tf
- infrastructure/terraform/modules/conversion-pipeline/README.md
- infrastructure/terraform/versions.tf
- infrastructure/README.md
Comment body
The two-pass workflow and the FABRIC_TENANT_ID / FABRIC_CLIENT_ID / FABRIC_CLIENT_SECRET prerequisite are avoidable. microsoft/edge-ai already solves the same azurerm_fabric_capacity "GUID not exposed" problem in a single apply and uses Azure CLI auth — no SP secret required. We should align with that pattern, if possible.
Wdyt?
Reference: edge-ai/src/000-cloud/031-fabric/terraform/main.tf and versions.tf.
Edge-ai resolves the capacity GUID via a deferred data "fabric_capacity" lookup keyed by display_name. The terraform_data indirection prevents plan-time queries (handles the same testing/build-system constraint that prompted the two-pass design here):
resource "terraform_data" "defer_fabric_capacity_created" {
count = var.should_create_fabric_capacity ? 1 : 0
input = { display_name = local.fabric_capacity_name }
depends_on = [module.fabric_capacity]
}
data "fabric_capacity" "created" {
count = length(terraform_data.defer_fabric_capacity_created)
display_name = terraform_data.defer_fabric_capacity_created[0].output.display_name
}
locals {
capacity_id = try(data.fabric_capacity.created[0].id, data.fabric_capacity.existing[0].id, null)
}
resource "fabric_workspace" "this" {
count = var.should_create_fabric_workspace ? 1 : 0
capacity_id = local.capacity_id
display_name = "fws-${local.resource_name_suffix}"
description = "Conversion pipeline workspace (${var.environment})"
}Edge-ai's versions.tf also has no provider "fabric" {} block and the blueprint README documents no FABRIC_* env vars. The microsoft/fabric provider supports Azure CLI auth (use_cli defaults to true) and picks up the same az login already used by azurerm. The SP-secret env-var path is one auth mode of several, not a requirement.
Proposed changes:
- Replace fabric_capacity_uuid (variable + workspace gating + lifecycle.precondition) with the deferred data "fabric_capacity" pattern above. Drop the entire "Two-pass deployment" README section.
- Remove the provider "fabric" {} block and its comment from infrastructure/terraform/versions.tf. The provider auto-discovers via Azure CLI when no provider block is declared.
- Remove the FABRIC_TENANT_ID / FABRIC_CLIENT_ID / FABRIC_CLIENT_SECRET prerequisite from both infrastructure/README.md and modules/conversion-pipeline/README.md. Replace with a note that az login is sufficient and that the operator's identity (or an opt-in SP, if the user explicitly chooses SP auth) must be in the "Service principals can use Fabric APIs" tenant allow-list.
- Pin the fabric provider to 1.3.0 (matching edge-ai) instead of ~> 1.0, which would let any new 1.x minor drift in.
This collapses the workflow to a single terraform apply, removes the secret-handling burden in CI, and aligns with the established Microsoft pattern.
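For the provider-block and pinning items, a minimal versions.tf sketch of what I have in mind (the 1.3.0 version is taken from edge-ai; treat it as an assumption to verify against the registry):

terraform {
  required_providers {
    fabric = {
      source  = "microsoft/fabric"
      version = "1.3.0" # exact pin, matching edge-ai, instead of ~> 1.0
    }
  }
}

# No provider "fabric" {} block: with use_cli defaulting to true, the provider
# authenticates through the operator's existing az login session.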
Comment 2 — Storage overlap with the platform data-lake account
Files referenced:
- infrastructure/terraform/modules/platform/storage.tf
- infrastructure/terraform/modules/conversion-pipeline/main.tf
- infrastructure/terraform/modules/conversion-pipeline/lifecycle.tf
- infrastructure/terraform/main.tf
Comment body
The platform module already creates an HNS Gen2 data-lake account (stdl{prefix}{env}{instance}) with datasets, models, evaluation containers and lifecycle rules on datasets/raw/ (delete), datasets/converted/ (cool), and evaluation/reports/ (cool → archive). See modules/platform/storage.tf lines 61–120 and 200+.
The conversion-pipeline module creates a second HNS Gen2 account (stcp{prefix}{env}{instance}) with raw, converted, event-grid-dlq containers and the same lifecycle semantics on raw/ and converted/.
With the defaults should_create_data_lake_storage = true and should_deploy_conversion_pipeline = true, the deployment provisions:
- Two HNS Gen2 storage accounts in the same resource group
- Two raw/converted lanes with duplicate lifecycle policies
- Two sets of blob + dfs private endpoints in the same PE subnet
The README's blast-radius rationale — "The AzureML extension storage account is managed separately by modules/platform" — refers to the st... ML-workspace account, not stdl... (the data-lake account). The data-lake account already serves the raw/converted use case and is silently duplicated.
There is no cross-gating in infrastructure/terraform/main.tf (should_create_data_lake_storage and should_deploy_conversion_pipeline are independent flags), and the conversion-pipeline README/variables make no mention of stdl.
Requested resolution (pick one and document it):
- Option A — Reuse the platform data-lake account (preferred). Add data_lake_storage_account to modules/conversion-pipeline/variables.deps.tf as a typed object dependency, and create the raw/converted/event-grid-dlq containers (or use the existing datasets/ prefix) inside the platform-owned account. Drop stcp... and the duplicate private endpoints and lifecycle policy. This matches the variables.deps.tf pattern already used by modules/sil/ and modules/dataviewer/.
- Option B — Mutually exclude. Add a root-level precondition that fails when both should_create_data_lake_storage and should_deploy_conversion_pipeline are true, and update the dev/staging/prod tfvars examples to flip should_create_data_lake_storage = false when the conversion pipeline is enabled. Document the rationale in both READMEs.
Option A is preferred — it eliminates the overlap rather than papering over it, and it preserves the intended platform/conversion-pipeline split (compute and event plumbing stay in conversion-pipeline; durable storage stays platform-owned).
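If Option B is chosen instead, a minimal plan-time guard could look like the sketch below (resource name and error message are illustrative; any resource with a lifecycle precondition in the root module would do):

resource "terraform_data" "storage_overlap_guard" {
  lifecycle {
    precondition {
      condition     = !(var.should_create_data_lake_storage && var.should_deploy_conversion_pipeline)
      error_message = "should_create_data_lake_storage and should_deploy_conversion_pipeline are mutually exclusive; keep durable storage in exactly one module."
    }
  }
}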
* Provisions an azurerm_fabric_capacity (gated by should_create_fabric_capacity)
* and a fabric_workspace via the microsoft/fabric provider.
*
* The fabric_workspace resource's capacity_id is the Fabric capacity GUID (not
Please see my note in the PR named "Comment 1 — Can we drop the two-pass deploy and FABRIC_* env vars"
@@ -0,0 +1,57 @@
name: Terraform Security Scan
nit: could you expand the PR description (or add a short "CI / Security" section) covering the scope of files scanned, the soft-fail default, where results land, and any rollout plan to flip soft-fail to hard-fail later? Reviewers and downstream owners (security, repo admins) should not have to read the workflow YAML to learn that a new code-scanning surface is being introduced. Alternatively, at least reference it in the originally associated issue so we know this was extended from that work. Thanks.
Summary
Adds a new conversion-pipeline Terraform module that provisions the Azure infrastructure for the dataset conversion pipeline, along with supporting examples, root-level wiring, and CI updates.
Fixes #39
Changes
- infrastructure/terraform/modules/conversion-pipeline/: new module with documentation (README.md, TERRAFORM.md) and terraform test coverage
- Root (infrastructure/terraform/): wire the new module into main.tf, expose outputs, add variables, and pin provider versions
- terraform.tfvars.dev, terraform.tfvars.staging, and terraform.tfvars.prod samples
- Update pr-validation.yml and add a new terraform-security.yml workflow
- Update TERRAFORM.md files across modules
Validation
- terraform test cases under modules/conversion-pipeline/tests/
- lint:tf:validate and the new security scan
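For orientation, a minimal sketch of the shape one of those terraform test cases could take (run name, variables, and resource address are hypothetical, not the module's actual identifiers):

run "plans_raw_container" {
  command = plan

  variables {
    environment = "dev" # hypothetical required input
  }

  assert {
    # hypothetical resource address; adjust to the module's real container resource
    condition     = azurerm_storage_container.raw.name == "raw"
    error_message = "Expected the raw container to be planned with name 'raw'."
  }
}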