@pggPL pggPL commented Jan 30, 2026

Description

This PR fixes the following issues:

  • The nightly docs deployment fails because of incompatible packages. I tested the version changes in my own fork and they fix the issue.
  • Build jobs are red because of OOM: the MAX_JOBS=1 environment variable was not propagated correctly into the containers.
  • The PyTorch build job needed more disk space, so I switched its container to the JAX one and install PyTorch manually, which takes much less space than any other option.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL pggPL changed the title PR to debug github workflows fails PR to debug github workflows failures Jan 30, 2026
greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

Fixed critical CI/CD infrastructure issues including OOM failures in build jobs and incompatible package versions in docs deployment.

Build workflow fixes:

  • Fixed MAX_JOBS=1 environment variable not being passed into Docker containers by using docker exec -e flag
  • Added aggressive disk space cleanup steps (removed boost, toolchains, Swift, GHC) so the build jobs do not run out of disk space
  • Reduced memory reserves from 5120MB to 4096MB and swap from 10240MB to 4096MB for more build space
  • Standardized PyTorch job on ghcr.io/nvidia/jax:jax image (previously used separate CUDA image)
  • Added pip cache purge after dependency installation to save disk space
  • Simplified dependency installation by relying on pre-installed packages in JAX image
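
The MAX_JOBS fix listed above comes down to how `docker exec` treats the environment: a variable set in the runner's shell is not visible inside an already-running container unless it is passed explicitly with `-e`. A minimal shell sketch of that pattern, assuming the container is named `builder` as in the workflow. The `run_in_builder` helper and the `DOCKER` override are hypothetical names added for illustration and are not taken from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): run a command inside the
# long-lived "builder" container, forwarding the build variables
# explicitly through `docker exec -e`.
DOCKER="${DOCKER:-docker}"   # overridable so the sketch can be dry-run

run_in_builder() {
  # Without the -e flags, MAX_JOBS set on the host would not reach
  # the process started inside the container.
  "$DOCKER" exec \
    -e MAX_JOBS="${MAX_JOBS:-1}" \
    -e NVTE_FRAMEWORK="${NVTE_FRAMEWORK:-pytorch}" \
    builder bash -c "$1"
}

# Example (needs a running container named "builder"):
#   run_in_builder 'echo "MAX_JOBS inside container: $MAX_JOBS"'
```

Setting `DOCKER=echo` before calling the helper prints the command line instead of running it, which is a cheap way to check what would be forwarded.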

Docs deployment fixes:

  • Upgraded actions/upload-pages-artifact from v1.0.7 to v3 to fix compatibility issues
  • Upgraded actions/deploy-pages from v2.0.0 to v4 for latest features
  • Removed deprecated name parameter from upload-pages-artifact (v3 uses fixed github-pages name)
  • Added workflow_dispatch trigger for manual deployment runs
  • Added id: deployment to deploy step for proper output reference

Confidence Score: 4/5

  • Safe to merge with low risk - fixes critical CI/CD failures with well-tested infrastructure changes
  • The changes address real production issues (OOM failures, package incompatibilities) with proven solutions. The environment variable fix using docker exec -e is the correct approach, and the GitHub Actions upgrades follow official deprecation guidance. The container image change standardizes on an image already used successfully in the "All" job. Minor risk from simplified dependencies in PyTorch job, but author confirmed testing in their fork.
  • No files require special attention - changes are straightforward infrastructure fixes

Important Files Changed

Filename Overview
.github/workflows/build.yml Fixed MAX_JOBS environment variable propagation, added disk space cleanup steps, switched PyTorch job to JAX image, reduced memory reserves
.github/workflows/deploy_nightly_docs.yml Updated GitHub Actions versions (upload-pages-artifact v1→v3, deploy-pages v2→v4), removed deprecated name parameter, added workflow_dispatch trigger

Sequence Diagram

sequenceDiagram
    participant GH as GitHub Event
    participant Build as Build Workflow
    participant Docker as Docker Container
    participant Disk as Disk Space
    participant Docs as Docs Workflow
    participant Pages as GitHub Pages

    Note over GH,Build: build.yml workflow
    GH->>Build: PR or workflow_dispatch
    Build->>Disk: Free up disk space (rm boost, tools, swift, etc.)
    Build->>Disk: Maximize build space (reduce reserves)
    Build->>Docker: Start container (ghcr.io/nvidia/jax:jax)
    Docker->>Docker: Install dependencies (cmake, torch, etc.)
    Docker->>Docker: pip cache purge
    Build->>Docker: docker exec -e MAX_JOBS=1 -e NVTE_FRAMEWORK=pytorch
    Docker->>Docker: Build TransformerEngine
    Docker->>Docker: Run sanity checks

    Note over GH,Pages: deploy_nightly_docs.yml workflow
    GH->>Docs: Push to main or workflow_dispatch
    Docs->>Docs: Build documentation
    Docs->>Pages: Upload artifact (v4)
    Pages->>Pages: Prepare pages artifact (v3)
    Pages->>Pages: Deploy to GitHub Pages (v4)
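
The "Free up disk space" step in the diagram can be sketched in shell roughly as follows. The paths are common locations of preinstalled software on GitHub-hosted Ubuntu runners and are assumptions, not the list used in the PR. `free_disk_space` and `CLEANUP_PATHS` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical cleanup helper (not from the PR). The paths below are
# assumed locations of boost, Swift and GHC on a hosted Ubuntu runner.
CLEANUP_PATHS="/usr/local/share/boost /usr/share/swift /opt/ghc"

free_disk_space() {
  # $1: optional command prefix (default "sudo"); pass "echo" for a dry run.
  _prefix="${1:-sudo}"
  for _path in $CLEANUP_PATHS; do
    $_prefix rm -rf "$_path"
  done
  # AGENT_TOOLSDIRECTORY, when set, is assumed to point at the hosted
  # tool cache (preinstalled toolchains).
  if [ -n "${AGENT_TOOLSDIRECTORY:-}" ]; then
    $_prefix rm -rf "$AGENT_TOOLSDIRECTORY"
  fi
}

# Example on a runner: compare `df -h /` before and after
#   free_disk_space
```

Calling `free_disk_space echo` lists the `rm` commands without deleting anything, which is worth doing before pointing the helper at a real machine.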

@pggPL pggPL changed the title PR to debug github workflows failures Fix Github workflows issues Jan 30, 2026

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

  - name: Start named container
    run: |
-     docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 sleep infinity
+     docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d ghcr.io/nvidia/jax:jax sleep infinity

Switched from nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 to ghcr.io/nvidia/jax:jax base image
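
For context, a rough shell sketch of how the switched image fits with the manual PyTorch install and the `pip cache purge` mentioned in the PR. The package list (`cmake torch`) is an illustrative assumption based on the summary, not the exact list from the workflow. `start_builder`, `install_build_deps` and the `DOCKER` override are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the PR) around the "builder" container.
DOCKER="${DOCKER:-docker}"   # overridable so the sketch can be dry-run

start_builder() {
  # Same invocation as the new line in the diff, using the JAX image.
  "$DOCKER" run -v "$(pwd)":"$(pwd)" -w "$(pwd)" --name builder -d \
    ghcr.io/nvidia/jax:jax sleep infinity
}

install_build_deps() {
  # Assumed package list. The cache purge afterwards removes pip's
  # downloaded files so they do not eat into the space left for the build.
  "$DOCKER" exec builder bash -c 'pip install cmake torch && pip cache purge'
}
```

Keeping the install and the cache purge in a single `bash -c` command means the purge only runs when the install succeeded.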
