feat(infrastructure): add conversion pipeline Terraform module#542

Open
nguyena2 wants to merge 7 commits into main from feat/issue-39-conversion-pipeline-iac

Conversation

@nguyena2
Contributor

Summary

Adds a new conversion-pipeline Terraform module that provisions the Azure infrastructure for the dataset conversion pipeline, along with supporting examples, root-level wiring, and CI updates.

Fixes #39

Changes

  • New module infrastructure/terraform/modules/conversion-pipeline/:
    • Storage account with blob/dfs private endpoints, lifecycle policies, and diagnostic settings
    • Event Grid topic with role assignments
    • Fabric workspace integration and role assignments
    • Full module documentation (README.md, TERRAFORM.md) and terraform test coverage
  • Root deployment (infrastructure/terraform/): wire the new module into main.tf, expose outputs, add variables, and pin provider versions
  • Examples: add terraform.tfvars.dev, terraform.tfvars.staging, and terraform.tfvars.prod samples
  • CI: extend pr-validation.yml and add a new terraform-security.yml workflow
  • Docs: regenerate TERRAFORM.md files across modules

Validation

  • terraform test cases under modules/conversion-pipeline/tests/
  • Updated PR validation workflow runs lint:tf:validate and the new security scan

- add diagnostic settings for storage account and blob service
- create private endpoints for blob and dfs subresources
- define outputs for storage account, containers, and event grid
- set up role assignments for Event Grid and Fabric workspace
- add tests for conversion pipeline module functionality

🔧 - Generated by Copilot
@nguyena2 nguyena2 requested a review from a team as a code owner April 22, 2026 15:37
@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Dependency Review

The following issues were found:
  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 1 package(s) with unknown licenses.
See the Details below.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 13cb9c0.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

License Issues

.github/workflows/terraform-security.yml

| Package | Version | License | Issue Type |
|---|---|---|---|
| bridgecrewio/checkov-action | 99bb2caf247dfd9f03cf984373bc6043d4e32ebf | Null | Unknown License |

OpenSSF Scorecard

| Package | Version | Score |
|---|---|---|
| actions/actions/checkout | de0fac2e4500dabe0009e67214ff5f5447ce83dd | 🟢 5.2 |

Details
| Check | Score | Reason |
|---|---|---|
| Code-Review | 🟢 5 | Found 16/29 approved changesets -- score normalized to 5 |
| Maintained | ⚠️ 0 | 0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0 |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| License | 🟢 10 | license file detected |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Pinned-Dependencies | 🟢 3 | dependency not pinned by hash detected -- score normalized to 3 |
| Signed-Releases | ⚠️ -1 | no releases found |
| Security-Policy | 🟢 9 | security policy file detected |
| Branch-Protection | 🟢 5 | branch protection is not maximal on development and all release branches |
| SAST | 🟢 9 | SAST tool detected but not run on all commits |
| Package | Version | Score |
|---|---|---|
| actions/actions/upload-artifact | 043fb46d1a93c77aae656e7c1c64a875d1fc6a0a | 🟢 5.7 |

Details
| Check | Score | Reason |
|---|---|---|
| Code-Review | 🟢 7 | Found 7/9 approved changesets -- score normalized to 7 |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Maintained | 🟢 8 | 8 commit(s) and 2 issue activity found in the last 90 days -- score normalized to 8 |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| Pinned-Dependencies | ⚠️ 1 | dependency not pinned by hash detected -- score normalized to 1 |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| License | 🟢 10 | license file detected |
| Signed-Releases | ⚠️ -1 | no releases found |
| Security-Policy | 🟢 9 | security policy file detected |
| Branch-Protection | ⚠️ 0 | branch protection not enabled on development/release branches |
| SAST | 🟢 10 | SAST tool is run on all commits |
actions/bridgecrewio/checkov-action 99bb2caf247dfd9f03cf984373bc6043d4e32ebf (score: Unknown)
actions/github/codeql-action/upload-sarif 95e58e9a2cdfd71adc6e0353d5c52f41a045d225 (score: Unknown)

Scanned Files

  • .github/workflows/terraform-security.yml

@codecov-commenter

codecov-commenter commented Apr 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.13%. Comparing base (d77c167) to head (13cb9c0).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #542      +/-   ##
==========================================
- Coverage   63.91%   63.13%   -0.78%     
==========================================
  Files         250      265      +15     
  Lines       15409    16850    +1441     
  Branches     2163     2266     +103     
==========================================
+ Hits         9848    10638     +790     
- Misses       5274     5926     +652     
+ Partials      287      286       -1     
| Flag | Coverage Δ | Carryforward flag * |
|---|---|---|
| pester | 81.11% <ø> (-2.02%) ⬇️ | Carriedforward from 65c7df9 |
| pytest-data-pipeline | ? | |
| pytest-dataviewer | 65.12% <ø> (ø) | Carriedforward from 65c7df9 |
| pytest-dm-tools | ? | |
| pytest-evaluation | 99.83% <ø> (?) | |
| pytest-fuzz | 1.56% <ø> (-3.42%) ⬇️ | Carriedforward from 65c7df9 |
| pytest-inference | ? | |
| pytest-training | ? | |
| vitest | 51.08% <ø> (ø) | Carriedforward from 65c7df9 |

*This pull request uses carry forward flags.
see 32 files with indirect coverage changes


Contributor

@rezatnoMsirhC rezatnoMsirhC left a comment


Good module structure and conventions throughout. Four comments below: one high-severity bug in the staging tfvars that will fail at plan time, one least-privilege RBAC issue, one Event Grid reliability concern, and one stale TODO to remove.

Comment thread infrastructure/examples/terraform.tfvars.staging
Comment thread infrastructure/terraform/modules/conversion-pipeline/event-grid.tf
Comment thread .github/workflows/terraform-security.yml
Member

@WilliamBerryiii WilliamBerryiii left a comment


Review Summary — Conversion Pipeline IaC

Thanks for this; the module structure (Edge → Raw blob → Event Grid → Function App → Converted blob → Fabric Lakehouse) is coherent and easy to follow. Requesting changes for the items below — most are small, but RI-03 and RI-04 are blockers in my view.

Acknowledgements

  • Existing reviewer @rezatnoMsirhC has already covered E1–E4 (Function App SKU/runtime, identity, network ACLs, retention defaults). I've intentionally left those alone to avoid duplicate noise; please address them in the same revision pass.
  • I withdrew several internal candidates (provider version pinning, blob CORS, etc.) after confirming they were either already raised or non-issues against repo conventions.

Blocking items (inline)

  • RI-03 (🟧 Medium) main.tf:64 — Four checkov:skip directives reference the unresolved WI-01 placeholder. File the real follow-up issue(s) and replace WI-01 with the issue reference before merge.
  • RI-04 (🟧 Medium) infrastructure/terraform/variables.tf (root, ~lines 547–573) and infrastructure/terraform/main.tf (~lines 217–260) — The conversion_pipeline_config object exposed at the root does not surface every meaningful module variable. Operators today cannot, for example, override the Function App SKU, blob retention windows, or Event Grid retry policy without editing the module directly. Either: (a) widen the root config object to mirror the module's full variable surface, or (b) document explicitly which knobs are intentionally fixed at the module layer and why. As-is, the partial passthrough is a hidden footgun.
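A minimal sketch of option (a), purely illustrative — the attribute names (function_app_sku, raw_retention_days, event_grid_max_retries) are hypothetical placeholders for whatever the module actually declares. The root object gains optional attributes that default to null, and Terraform treats an explicit null argument as "use the module's own default" (optional() attributes need Terraform >= 1.3):

```hcl
# Root variables.tf — hypothetical widened passthrough (attribute names illustrative).
variable "conversion_pipeline_config" {
  type = object({
    enabled                = optional(bool, true)
    function_app_sku       = optional(string) # null → module default applies
    raw_retention_days     = optional(number)
    event_grid_max_retries = optional(number)
  })
  default = {}
}

# Root main.tf — unset attributes arrive as null, so the module defaults win.
module "conversion_pipeline" {
  source                 = "./modules/conversion-pipeline"
  function_app_sku       = var.conversion_pipeline_config.function_app_sku
  raw_retention_days     = var.conversion_pipeline_config.raw_retention_days
  event_grid_max_retries = var.conversion_pipeline_config.event_grid_max_retries
  # ...remaining passthrough omitted
}
```

This keeps defaults in exactly one place (the module) while still exposing every knob at the root.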

Non-blocking items (inline)

  • RI-02 (🟨 Low) fabric.tf:23 — Fabric capacity name lacks separators; align with the hyphenated convention (fc-${local.resource_name_suffix}).
  • RI-06 (🟨 Low) README.md:121 — Two-pass deployment with manual GUID copy is UX-fragile; suggest a -target first pass or a wrapper script.
  • RI-09 (🟨 Low) versions.tf:43 — Add a precondition or doc note for the FABRIC_TENANT_ID / FABRIC_CLIENT_ID / FABRIC_CLIENT_SECRET env vars so missing values fail fast.
  • RI-07 (🟦 Trivial) README.md:4 — author: Edge AI Team is inconsistent with sibling module READMEs.
  • RI-10 (ℹ️ Verify-only) event-grid.tf:11 — Confirm evgt (vs. egst) is the intended Event Grid System Topic abbreviation.

Body-only finding

  • RI-05 (🟨 Low) infrastructure/terraform/variables.tf (root, ~lines 547–573) duplicates module-level defaults (Function App SKU F2, storage ZRS, raw retention 30 days, converted retention 90 days). Each default now lives in two places and will drift. Recommendation: drop the defaults from the root variable block and let the module own them, or remove them from the module and centralize at the root — either is fine, but not both.

Validation evidence

  • All 6 inline anchors verified against the PR diff hunks for commit 01de39ebf3db4eff0eba0cf7966b44e4d71c1f22 before posting.
  • Findings cross-checked against .github/copilot-instructions.md (Terraform Conventions section) and existing PR conversation to avoid duplication of @rezatnoMsirhC's review.

Happy to re-review once RI-03 and RI-04 are addressed.

Comment thread infrastructure/terraform/versions.tf
Comment thread infrastructure/terraform/modules/conversion-pipeline/README.md
Comment thread infrastructure/terraform/modules/conversion-pipeline/fabric.tf
Comment thread infrastructure/terraform/modules/conversion-pipeline/README.md Outdated
Comment thread infrastructure/terraform/modules/conversion-pipeline/main.tf Outdated
Comment thread infrastructure/terraform/modules/conversion-pipeline/event-grid.tf
…d letter support

- implement dynamic dead letter identity based on configuration
- update role assignment scope for event grid dead letter queue
- modify staging variables for PostgreSQL high availability

🔒 - Generated by Copilot
…loyment details

- correct author attribution to Microsoft Robotics-AI Team
- clarify two-pass deployment process in README
- add lifecycle precondition for fabric workspace in fabric.tf
- update Checkov control tracking references in main.tf

🔒 - Generated by Copilot
… pipeline module

- eliminate unused virtual network variable and references
- update tests to reflect changes in module dependencies

🔧 - Generated by Copilot
@github-advanced-security

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

Collaborator

@katriendg katriendg left a comment


Thanks @nguyena2 for this great addition and integrating Fabric into the picture.
Reviewing the work, I have two general notes based on reflections as I looked through your updates. I'll let you decide whether to integrate them: I believe the first one is valuable; the second I'll leave to you to assess for relevance.

Comment 1 — Can we drop the two-pass deploy and FABRIC_* env vars

Files referenced:

  • infrastructure/terraform/modules/conversion-pipeline/fabric.tf
  • infrastructure/terraform/modules/conversion-pipeline/variables.tf
  • infrastructure/terraform/modules/conversion-pipeline/README.md
  • infrastructure/terraform/versions.tf
  • infrastructure/README.md

Comment body

The two-pass workflow and the FABRIC_TENANT_ID / FABRIC_CLIENT_ID / FABRIC_CLIENT_SECRET prerequisite are avoidable. microsoft/edge-ai already solves the same azurerm_fabric_capacity "GUID not exposed" problem in a single apply and uses Azure CLI auth — no SP secret required. We should align with that pattern, if possible.

Wdyt?

Reference: edge-ai/src/000-cloud/031-fabric/terraform/main.tf and versions.tf.

Edge-ai resolves the capacity GUID via a deferred data "fabric_capacity" lookup keyed by display_name. The terraform_data indirection prevents plan-time queries (handles the same testing/build-system constraint that prompted the two-pass design here):

```hcl
# Defer the capacity lookup: the data source's input only becomes known after
# the capacity module is applied, so nothing is queried at plan time.
resource "terraform_data" "defer_fabric_capacity_created" {
  count      = var.should_create_fabric_capacity ? 1 : 0
  input      = { display_name = local.fabric_capacity_name }
  depends_on = [module.fabric_capacity]
}

# Resolve the Fabric capacity GUID by display name once the capacity exists.
data "fabric_capacity" "created" {
  count        = length(terraform_data.defer_fabric_capacity_created)
  display_name = terraform_data.defer_fabric_capacity_created[0].output.display_name
}

locals {
  # Falls back to data.fabric_capacity.existing (defined elsewhere in the
  # edge-ai module) when an existing capacity is referenced instead of created.
  capacity_id = try(data.fabric_capacity.created[0].id, data.fabric_capacity.existing[0].id, null)
}

resource "fabric_workspace" "this" {
  count        = var.should_create_fabric_workspace ? 1 : 0
  capacity_id  = local.capacity_id
  display_name = "fws-${local.resource_name_suffix}"
  description  = "Conversion pipeline workspace (${var.environment})"
}
```

Edge-ai's versions.tf also has no provider "fabric" {} block and the blueprint README documents no FABRIC_* env vars. The microsoft/fabric provider supports Azure CLI auth (use_cli defaults to true) and picks up the same az login already used by azurerm. The SP-secret env-var path is one auth mode of several, not a requirement.

Proposed changes:

  1. Replace fabric_capacity_uuid (variable + workspace gating + lifecycle.precondition) with the deferred data "fabric_capacity" pattern above. Drop the entire "Two-pass deployment" README section.
  2. Remove the provider "fabric" {} block and its comment from infrastructure/terraform/versions.tf. The provider auto-discovers via Azure CLI when no provider block is declared.
  3. Remove the FABRIC_TENANT_ID / FABRIC_CLIENT_ID / FABRIC_CLIENT_SECRET prerequisite from both infrastructure/README.md and modules/conversion-pipeline/README.md. Replace with a note that az login is sufficient and that the operator's identity (or an opt-in SP, if the user explicitly chooses SP auth) must be in the "Service principals can use Fabric APIs" tenant allow-list.
  4. Pin the fabric provider to 1.3.0 (matching edge-ai) instead of ~> 1.0, which would let any new 1.x minor drift in.

This collapses the workflow to a single terraform apply, removes the secret-handling burden in CI, and aligns with the established Microsoft pattern.
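Items 2 and 4 together would leave the fabric entry in infrastructure/terraform/versions.tf looking roughly like this sketch (surrounding required_providers content elided; assumes no other fabric configuration is needed):

```hcl
terraform {
  required_providers {
    fabric = {
      source  = "microsoft/fabric"
      version = "1.3.0" # exact pin, matching edge-ai; prevents 1.x minor drift
    }
  }
}

# Intentionally no provider "fabric" {} block: with no explicit configuration
# the provider authenticates via Azure CLI (use_cli defaults to true),
# reusing the same az login session as azurerm.
```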

Comment 2 — Storage overlap with the platform data-lake account

Files referenced:

  • infrastructure/terraform/modules/platform/storage.tf
  • infrastructure/terraform/modules/conversion-pipeline/main.tf
  • infrastructure/terraform/modules/conversion-pipeline/lifecycle.tf
  • infrastructure/terraform/main.tf

Comment body

The platform module already creates an HNS Gen2 data-lake account (stdl{prefix}{env}{instance}) with datasets, models, evaluation containers and lifecycle rules on datasets/raw/ (delete), datasets/converted/ (cool), and evaluation/reports/ (cool → archive). See modules/platform/storage.tf lines 61–120 and 200+.

The conversion-pipeline module creates a second HNS Gen2 account (stcp{prefix}{env}{instance}) with raw, converted, event-grid-dlq containers and the same lifecycle semantics on raw/ and converted/.

With the defaults should_create_data_lake_storage = true and should_deploy_conversion_pipeline = true, the deployment provisions:

  • Two HNS Gen2 storage accounts in the same resource group
  • Two raw / converted lanes with duplicate lifecycle policies
  • Two sets of blob + dfs private endpoints in the same PE subnet

The README's blast-radius rationale — "The AzureML extension storage account is managed separately by modules/platform" — refers to the st... ML-workspace account, not stdl... (the data-lake account). The data-lake account already serves the raw/converted use case and is silently duplicated.

There is no cross-gating in infrastructure/terraform/main.tf (should_create_data_lake_storage and should_deploy_conversion_pipeline are independent flags), and the conversion-pipeline README/variables make no mention of stdl.

Requested resolution (pick one and document it):

  • Option A — Reuse the platform data-lake account (preferred). Add data_lake_storage_account to modules/conversion-pipeline/variables.deps.tf as a typed object dependency, and create the raw / converted / event-grid-dlq containers (or use the existing datasets/ prefix) inside the platform-owned account. Drop stcp... and the duplicate private endpoints and lifecycle policy. This matches the variables.deps.tf pattern already used by modules/sil/ and modules/dataviewer/.
  • Option B — Mutually exclude. Add a root-level precondition that fails when both should_create_data_lake_storage and should_deploy_conversion_pipeline are true, and update the dev/staging/prod tfvars examples to flip should_create_data_lake_storage = false when the conversion pipeline is enabled. Document the rationale in both READMEs.

Option A is preferred — it eliminates the overlap rather than papering over it, and it preserves the intended platform/conversion-pipeline split (compute and event plumbing stay in conversion-pipeline; durable storage stays platform-owned).
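For Option B, the root-level guard could be a plan-time precondition along these lines (resource name hypothetical; assumes both flags are root variables as described above):

```hcl
# infrastructure/terraform/main.tf — fail at plan time if both storage lanes are enabled.
resource "terraform_data" "storage_overlap_guard" {
  lifecycle {
    precondition {
      condition     = !(var.should_create_data_lake_storage && var.should_deploy_conversion_pipeline)
      error_message = "Enable either the platform data-lake account or the conversion pipeline (which provisions its own storage account), not both. Set should_create_data_lake_storage = false in your tfvars when the conversion pipeline is enabled."
    }
  }
}
```

A precondition (rather than a check block) is deliberate here: check blocks only warn, whereas a failed precondition aborts the plan, which is the mutually-exclusive behavior Option B requires.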

* Provisions an azurerm_fabric_capacity (gated by should_create_fabric_capacity)
* and a fabric_workspace via the microsoft/fabric provider.
*
* The fabric_workspace resource's capacity_id is the Fabric capacity GUID (not
Collaborator


Please see my note in the PR named "Comment 1 — Can we drop the two-pass deploy and FABRIC_* env vars"

@@ -0,0 +1,57 @@
name: Terraform Security Scan
Collaborator


nit: could you expand the PR description (or add a short "CI / Security" section) covering: the scope of files scanned, the soft-fail default, where results land, and any rollout plan to flip soft-fail to hard-fail later? Reviewers and downstream owners (security, repo admins) should not have to read the workflow YAML to learn that a new code-scanning surface is being introduced. Or at least reference it in the associated issue, so we know this was an extension of the original scope. Thanks.



Development

Successfully merging this pull request may close these issues.

infra(cloud): conversion pipeline Bicep/Terraform template

6 participants