feat: improve maintainers detection [CM-1033]#3908
Conversation
|
|
There was a problem hiding this comment.
Pull request overview
This PR improves maintainer file detection in the git integration service by adding a multi-step discovery and analysis flow that combines static filename matching, dynamic ripgrep-based content search, and an AI fallback, while also surfacing more metadata about what was tried.
Changes:
- Added ripgrep-based repo scanning (
rg --filesand keyword search) with fallback toos.walk, plus scoring/filtering of dynamic candidates. - Refactored maintainer extraction to prioritize a previously saved maintainer file, then analyze top candidates, then use AI file suggestion as a last resort.
- Extended
MaintainerResultand service execution metrics to includecandidate_filesandai_suggested_file; addedripgrepto the Docker image.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py | New candidate discovery + fallback extraction flow; logs and metrics now include candidate/AI-suggested file metadata. |
| services/apps/git_integration/src/crowdgit/models/maintainer_info.py | Adds new result metadata fields (candidate_files, ai_suggested_file). |
| scripts/services/docker/Dockerfile.git_integration | Installs ripgrep in the runner image to support dynamic search. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
services/apps/git_integration/src/crowdgit/models/maintainer_info.py
Outdated
Show resolved
Hide resolved
services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py
Show resolved
Hide resolved
services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py
Show resolved
Hide resolved
services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py
Outdated
Show resolved
Hide resolved
services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py
Show resolved
Hide resolved
services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
… detection Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…rd in content Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
bc8e3df to
b4dd488
Compare
…improve prompt Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
| KNOWN_PATHS = { | ||
| "maintainers", | ||
| "maintainers.md", | ||
| "maintainer.md", | ||
| "codeowners", | ||
| "codeowners.md", | ||
| "contributors", | ||
| "contributors.md", | ||
| "owners", | ||
| "owners.md", | ||
| "authors", | ||
| "authors.md", | ||
| "governance.md", | ||
| "docs/maintainers.md", | ||
| ".github/maintainers.md", | ||
| ".github/contributors.md", | ||
| ".github/codeowners", | ||
| } | ||
|
|
||
| # Governance stems (basename without extension, lowercased) for filename search | ||
| GOVERNANCE_STEMS = { | ||
| "maintainers", | ||
| "maintainer", | ||
| "codeowners", | ||
| "codeowner", | ||
| "contributors", | ||
| "contributor", | ||
| "owners", | ||
| "owners_aliases", | ||
| "authors", | ||
| "committers", | ||
| "commiters", | ||
| "reviewers", | ||
| "approvers", | ||
| "administrators", | ||
| "stewards", | ||
| "credits", | ||
| "governance", | ||
| "core_team", | ||
| "code_owners", | ||
| "emeritus", | ||
| } | ||
|
|
||
| VALID_EXTENSIONS = { | ||
| "", | ||
| ".md", | ||
| ".markdown", | ||
| ".txt", | ||
| ".rst", | ||
| ".yaml", | ||
| ".yml", | ||
| ".toml", | ||
| ".adoc", | ||
| ".csv", | ||
| ".rdoc", | ||
| } | ||
|
|
||
| SCORING_KEYWORDS = [ | ||
| "maintainer", | ||
| "codeowner", | ||
| "owner", | ||
| "contributor", | ||
| "governance", | ||
| "steward", | ||
| "emeritus", | ||
| "approver", | ||
| "reviewer", | ||
| ] | ||
|
|
||
| EXCLUDED_FILENAMES = { | ||
| "contributing.md", | ||
| "contributing", | ||
| "code_of_conduct.md", | ||
| "code-of-conduct.md", | ||
| } |
There was a problem hiding this comment.
Those were mainly inferred from our processing history.
joanagmaia
left a comment
There was a problem hiding this comment.
This looks great, I have a couple of questions and requests to make sure that we have some more metrics given that these are big changes on the current process.
Questions:
- With the mechanism of only picking one file for analysis we are assuming that all maintainers information will only be in 1 file right? I'm not sure if we should make sure that we won't lose data because of it.
Requests:
- Can we run the new mechanism in like 10 repos and see the accuracy? I would even say on the current issues we have opened on Insights as well to see if we have improved coverage https://github.com/linuxfoundation/insights/issues?q=is%3Aissue%20state%3Aopen%20maintainer
- Can we prepare a monitor in metaplane that covers the amount of repositories where we can get maintainers data for? And also the amount of projects?
- Can we test using the Haiku model for find_maintainer_file_with_ai since it would be a simpler task then the rest of the work?
| MAX_AI_FILE_LIST_SIZE = 300 | ||
|
|
||
| # Full paths that get the highest score bonus when matched exactly | ||
| KNOWN_PATHS = { |
There was a problem hiding this comment.
We should also include SECURITY-INSIGHTS.md. It was supported before as well.
E.g. https://github.com/open-telemetry/opentelemetry-dotnet/blob/d54379e28c07db783452a33e119f1cdf8e7d96a6/SECURITY-INSIGHTS.yml#L13
| } | ||
|
|
||
| # Governance stems (basename without extension, lowercased) for filename search | ||
| GOVERNANCE_STEMS = { |
There was a problem hiding this comment.
Should we also add:
- workgroup (e.g. https://github.com/open-feature/community/blob/d2f54702a4bca67cd7781a8fed91e9809ecc4a0a/config/open-feature/sdk-ruby/workgroup.yaml#L15)
My only concern here is that it seems that they use the community repo to manage some maintainers data. So here we might need to infer the repository based on the directory structure. Maybe it's too complex for us to want to support at least for now
There was a problem hiding this comment.
It's tricky when repo and maintainers are in different places, will check how we can support this easily
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
| role = maintainer.normalized_title | ||
| original_role = self.make_role(maintainer.title) | ||
| if github_username == "unknown": | ||
| if github_username == "unknown" and maintainer.email in ("unknown", None): |
There was a problem hiding this comment.
Dict dedup silently drops "unknown" username maintainers
High Severity
new_maintainers_dict is built with {m.github_username: m for m in maintainers}, so when multiple maintainers have github_username="unknown", only the last one survives. Previously all "unknown" entries were unconditionally skipped, so the dedup was harmless. Now that the skip guard allows "unknown" usernames with valid emails through for email-based identity lookup, all but the last "unknown" maintainer are silently dropped before processing. Contrast with insert_new_maintainers, which iterates the list directly and processes every entry.
Additional Locations (1)
| "docs/maintainers.md", | ||
| ".github/maintainers.md", | ||
| ".github/contributors.md", | ||
| ".github/codeowners", |
There was a problem hiding this comment.
Case mismatch prevents SECURITY-INSIGHTS.md from matching
Medium Severity
KNOWN_PATHS contains "SECURITY-INSIGHTS.md" in uppercase, but _score_filename lowercases candidate_path before checking membership. The lowercased "security-insights.md" will never match the uppercase entry. Additionally, no GOVERNANCE_STEMS entry matches "security-insights", so _ripgrep_search won't discover the file either. This file type is effectively unsupported despite being listed.
Additional Locations (1)
| line[2:] if line.startswith("./") else line | ||
| for line in output.strip().split("\n") | ||
| if line.strip() | ||
| ] |
There was a problem hiding this comment.
Empty extension glob matches all files in listing
Medium Severity
VALID_EXTENSIONS includes "" (empty string), so _list_repo_files generates --iglob "*" which matches all files. Since multiple ripgrep --iglob include patterns use OR logic, this wildcard makes all other extension-specific patterns redundant. The function returns every file in the repo instead of filtering to document-like extensions. This degrades the Step 4 AI fallback: when no files score above zero, the first 300 files sent to AI will be arbitrary (likely source code) rather than text/document files.
| ai_cost = 0.0 | ||
| maintainers_found = 0 | ||
| maintainers_skipped = 0 | ||
| candidate_files: list[str] = [] |
There was a problem hiding this comment.
Type annotation mismatch for candidate_files variable
Low Severity
candidate_files is declared as list[str] but is later assigned maintainers.candidate_files which is list[tuple[str, int]] (path and score pairs). The annotation is misleading and inconsistent with MaintainerResult.candidate_files. This won't crash at runtime since Python doesn't enforce type hints, but it obscures the actual data shape written into the metrics dict.


What changed
Before
MAINTAINER_FILES: 13 entries, root-only, no recursion).README.mdwas in the candidate list and required a simple content check for the word "maintainer".extract_maintainersalways started from scratch — no reuse of a previously found file.compare_and_update_maintainersskipped all maintainers withgithub_username == "unknown", including those with a valid email; no email fallback for identity lookup.candidate_filesandai_suggested_filedid not exist inMaintainerResultor execution metrics.After
Detection pipeline (4-step with fallback)
rgscans the full repo for files matching 20 governance stems (MAINTAINERS,OWNERS,CODEOWNERS,GOVERNANCE,EMERITUS, etc.) across all depths and valid extensions. Each file is scored: exact known path (100), exact stem match (50), partial stem (25), plus +1 per governance keyword found in content. All candidates are returned sorted by score; the top one is analyzed.maintainer.(filename, score)tuples. The prompt instructs the model to prefer higher scores, shallower paths, and to reject files insidevendor/,node_modules/,third_party/,external/, and similar third-party directories.Bug fixes
compare_and_update_maintainers: the skip guard now only fires when bothgithub_usernameandemailare unknown/None (previously skipped all"unknown"usernames unconditionally). New maintainers identified by email now go throughfind_maintainer_identity_by_emailas a fallback, matchinginsert_new_maintainersbehaviour.elsebranch, avoiding a wasted string allocation on every large file.Observability
MaintainerResultgainscandidate_files: list[tuple[str, int]]andai_suggested_file: str | None.ServiceExecutionmetrics now recordcandidate_files(top-100 by score) andai_suggested_fileon every run.Note
Medium Risk
Refactors the maintainer extraction pipeline to rely on recursive
ripgrep-based discovery, scoring, and AI fallback, which can change which governance file is selected and affects runtime behavior/cost. Also adds a new runtime dependency (ripgrep) and persists more execution metadata, so failures or environment mismatches could impact maintainer processing.Overview
Improves maintainer extraction by replacing the prior hard-coded maintainer filename scan with a multi-step pipeline: reuse the previously saved maintainer file when available, otherwise perform recursive
ripgrep-based candidate discovery with filename/content scoring, then fall back to an AI-driven file picker fed with scored candidates.Adds guards and fixes around maintainer ingestion: README candidates are rejected unless they mention maintainer, unknown usernames are only skipped when email is also unknown, and identity lookup now falls back to
find_maintainer_identity_by_emailwhengithub_usernameisunknown. ExtendsMaintainerResultandServiceExecutionmetrics to record topcandidate_filesand theai_suggested_file, and updates the git-integration Docker image to includeripgrep.Written by Cursor Bugbot for commit b38abdc. This will update automatically on new commits. Configure here.