ergonomic message constructors by hallerite · Pull Request #1002 · PrimeIntellect-ai/verifiers

hallerite · 2026-03-10T01:10:38Z

Description

Env authors no longer need raw dicts to build messages. UserMessage("describe", Image.from_pil(img)) just works. ToolCall accepts dict arguments, ToolMessage accepts a ToolCall for tool_call_id. Migrated browser env, mcp env, and gym env.
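As a rough illustration of what a helper like `Image.from_pil` presumably has to produce, the sketch below turns raw PNG bytes into the `data:image/png;base64,...` URI form that appears elsewhere in this PR. The function name `png_bytes_to_data_uri` is hypothetical and is not part of verifiers; only the standard library is used.

```python
import base64


def png_bytes_to_data_uri(png_bytes: bytes) -> str:
    """Hypothetical helper: encode PNG bytes as a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


# The 8-byte PNG signature stands in for a real image payload.
uri = png_bytes_to_data_uri(b"\x89PNG\r\n\x1a\n")
```

A string of this shape is what the docs examples in this PR pass to `vf.Image(...)`.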

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Medium Risk
Touches core message/tool typing and serialization paths plus multiple environment integrations; subtle provider formatting or JSON-serialization regressions are possible without broad test coverage.

Overview
Adds typed, ergonomic constructors for building prompts and tool results: vf.SystemMessage/vf.UserMessage now accept either a plain string or multipart content via vf.Text/vf.Image/vf.Audio, vf.ToolCall auto-serializes dict arguments, and vf.ToolMessage accepts a ToolCall for tool_call_id.
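The constructor behaviour described above can be sketched with plain stand-in classes. Everything below is an assumption for illustration: the class bodies are not the verifiers implementation, and the promotion of bare strings to text parts is inferred from the `vf.UserMessage("Look at this", vf.Image(...))` usage shown in the docs diff.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Text:
    """Hypothetical stand-in for vf.Text."""
    text: str
    type: str = "text"


@dataclass
class Image:
    """Hypothetical stand-in for vf.Image (a URL or data URI)."""
    url: str
    type: str = "image_url"


class UserMessage:
    """Stand-in: accepts one plain string or several mixed parts."""

    def __init__(self, *parts: Any) -> None:
        self.role = "user"
        if len(parts) == 1 and isinstance(parts[0], str):
            # A lone string stays a plain string.
            self.content: Any = parts[0]
        else:
            # Assumption: bare strings among the parts become Text parts.
            self.content = [Text(p) if isinstance(p, str) else p for p in parts]


class ToolCall:
    """Stand-in: dict arguments are serialized to a JSON string."""

    def __init__(self, id: str, name: str, arguments: Any) -> None:
        self.id = id
        self.name = name
        self.arguments = (
            arguments if isinstance(arguments, str) else json.dumps(arguments)
        )


class ToolMessage:
    """Stand-in: tool_call_id may be a string or a ToolCall."""

    def __init__(self, tool_call_id: Any, content: Any) -> None:
        self.role = "tool"
        self.tool_call_id = (
            tool_call_id.id if isinstance(tool_call_id, ToolCall) else tool_call_id
        )
        self.content = content


call = ToolCall(id="call_0", name="search", arguments={"q": "verifiers"})
result = ToolMessage(tool_call_id=call, content=[Text("done")])
user = UserMessage("Look at this", Image("data:image/png;base64,..."))
```

The point of the sketch is the call-site ergonomics: the caller passes a dict and a `ToolCall` object, and the constructors derive the JSON string and the ID.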

Updates ToolEnv/tool utilities and integrations to handle structured tool outputs: tools may return a list of content parts (text/image) that is preserved, and the browser CUA mode, MCPEnv, and GymEnv are migrated off raw message dicts to these constructors. Public exports and docs are expanded to document and expose the new types (vf.Text, vf.Image, vf.Audio) and usage patterns.
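A minimal sketch of the "preserved instead of stringified" behaviour, assuming content parts are either mappings or objects exposing `model_dump()`. The names `TextPart`, `ImagePart`, `dump_part`, and `to_tool_message` are hypothetical; the dict shapes mirror the assertions in the new `tests/test_tool_env.py` test, and the string fallback for invalid lists is inferred from the neighbouring `test_call_tool_casts_invalid_list_to_str` test name.

```python
from collections.abc import Mapping
from typing import Any, Optional

# Part types named in the is_valid_tool_content_parts docstring.
VALID_TYPES = {"text", "image_url"}


class TextPart:
    """Hypothetical stand-in for a pydantic text content part."""

    def __init__(self, text: str) -> None:
        self.type = "text"
        self.text = text

    def model_dump(self) -> dict:
        return {"type": self.type, "text": self.text}


class ImagePart:
    """Hypothetical stand-in for a pydantic image content part."""

    def __init__(self, url: str) -> None:
        self.type = "image_url"
        self.url = url

    def model_dump(self) -> dict:
        return {"type": self.type, "image_url": {"url": self.url}}


def dump_part(part: Any) -> Optional[dict]:
    """Return a plain dict for a content part, or None if unsupported."""
    if isinstance(part, Mapping):
        return dict(part)
    if hasattr(part, "model_dump"):
        return part.model_dump()
    return None


def to_tool_message(tool_call_id: str, output: Any) -> dict:
    """Keep valid multipart output as a list; stringify anything else."""
    content: Any = str(output)
    if isinstance(output, list):
        dumped = [dump_part(p) for p in output]
        if all(d is not None and d.get("type") in VALID_TYPES for d in dumped):
            content = dumped
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}


msg = to_tool_message(
    "call_0",
    [TextPart("Here's the screenshot"), ImagePart("data:image/png;base64,abc")],
)
fallback = to_tool_message("call_1", [1, 2])
```

Here `msg["content"]` stays a two-element list of dicts, while `fallback["content"]` is the string form of the list because its items are not content parts.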

Written by Cursor Bugbot for commit 778b1e9. This will update automatically on new commits.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Documentation not updated for new user-facing types
- I documented the new vf.Text/vf.Image/vf.Audio types and ergonomic message constructors in docs/reference.md and added constructor usage guidance in docs/environments.md (with synced generated AGENTS docs).
✅ Fixed: Return type annotation mismatches actual returned types
- I updated CUA mode return annotations to list[vf.ContentPart] and made is_valid_tool_content_parts accept pydantic content-part models so multipart screenshot payloads are preserved instead of stringified.

Or push these changes by commenting:

@cursor push a7c23af157

Preview (a7c23af157)

diff --git a/assets/lab/environments/AGENTS.md b/assets/lab/environments/AGENTS.md
--- a/assets/lab/environments/AGENTS.md
+++ b/assets/lab/environments/AGENTS.md
@@ -121,6 +121,17 @@
 ]

+If you prefer typed constructors over raw dicts, you can build the same prompt with:
+
+```python
+[
+    vf.SystemMessage("You are a helpful math tutor."),
+    vf.UserMessage("What is 2+2?"),
+]
+```
+
+`vf.UserMessage` / `vf.SystemMessage` also support multipart content via `vf.Text`, `vf.Image`, and `vf.Audio` parts.
+
If your dataset already has a `prompt` column, `question` is ignored. However, if a `system_prompt` is provided, it will be prepended to existing prompts that don't already start with a system message.

### Evaluation Datasets

diff --git a/docs/environments.md b/docs/environments.md
--- a/docs/environments.md
+++ b/docs/environments.md
@@ -115,6 +115,17 @@
]


+If you prefer typed constructors over raw dicts, you can build the same prompt with:
+
+```python
+[
+    vf.SystemMessage("You are a helpful math tutor."),
+    vf.UserMessage("What is 2+2?"),
+]
+```
+
+`vf.UserMessage` / `vf.SystemMessage` also support multipart content via `vf.Text`, `vf.Image`, and `vf.Audio` parts.
+
If your dataset already has a `prompt` column, `question` is ignored. However, if a `system_prompt` is provided, it will be prepended to existing prompts that don't already start with a system message.

### Evaluation Datasets

diff --git a/docs/reference.md b/docs/reference.md
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -21,19 +21,41 @@
### Messages

```python
-Messages = str | list[ChatMessage]
+ContentPart = vf.Text | vf.Image | vf.Audio | dict[str, Any]
+MessageContent = str | list[ContentPart]
+Message = (
+    vf.SystemMessage
+    | vf.UserMessage
+    | vf.AssistantMessage
+    | vf.ToolMessage
+    | vf.TextMessage
+)
+Messages = list[Message]

-The primary message type. Either a plain string (completion mode) or a list of chat messages (chat mode).
+Provider-agnostic message types used across environments and clients.

-### ChatMessage
+### Content Parts (vf.Text, vf.Image, vf.Audio)

-ChatMessage = ChatCompletionMessageParam  # from openai.types.chat
+vf.Text("hello")
+vf.Image("data:image/png;base64,...")
+vf.Audio(data="...", format="wav")

-OpenAI's chat message type with role, content, and optional tool_calls / tool_call_id fields.
+vf.Text, vf.Image, and vf.Audio are aliases for content-part models and can be used directly when building multipart message content.

+### Ergonomic Message Constructors
+
+```python
+user = vf.UserMessage("Look at this", vf.Image("data:image/png;base64,..."))
+system = vf.SystemMessage("You are a helpful assistant.")
+tool_call = vf.ToolCall(id="call_0", name="search", arguments={"q": "verifiers"})
+tool_result = vf.ToolMessage(tool_call_id=tool_call, content=[vf.Text("done")])
+```
+
+These constructors are optional conveniences for environment authors; raw dict-based messages are still supported.
+

Info

@@ -264,7 +286,7 @@
        dataset: Dataset | None = None,
        eval_dataset: Dataset | None = None,
        system_prompt: str | None = None,
-        few_shot: list[ChatMessage] | None = None,
+        few_shot: list[Message] | None = None,
        parser: Parser | None = None,
        rubric: Rubric | None = None,
        sampling_args: SamplingArgs | None = None,
@@ -433,7 +455,7 @@
        num_train_examples: int = 100,
        num_eval_examples: int = 50,
        seed: int = 0,
-        prompt_renderer: Callable[..., ChatMessages] | None = None,
+        prompt_renderer: Callable[..., Messages] | None = None,
        max_turns: int = -1,
        rubric: Rubric | None = None,
        **kwargs,

diff --git a/environments/AGENTS.md b/environments/AGENTS.md
--- a/environments/AGENTS.md
+++ b/environments/AGENTS.md
@@ -121,6 +121,17 @@
]

+If you prefer typed constructors over raw dicts, you can build the same prompt with:
+
+```python
+[
+    vf.SystemMessage("You are a helpful math tutor."),
+    vf.UserMessage("What is 2+2?"),
+]
+```
+
+`vf.UserMessage` / `vf.SystemMessage` also support multipart content via `vf.Text`, `vf.Image`, and `vf.Audio` parts.
+
If your dataset already has a `prompt` column, `question` is ignored. However, if a `system_prompt` is provided, it will be prepended to existing prompts that don't already start with a system message.

### Evaluation Datasets

diff --git a/tests/test_tool_env.py b/tests/test_tool_env.py
--- a/tests/test_tool_env.py
+++ b/tests/test_tool_env.py
@@ -34,6 +34,14 @@
         ]
         assert is_valid_tool_content_parts(content) is True
 
+    def test_valid_pydantic_content_parts(self):
+        """Valid list with pydantic text/image content parts."""
+        content = [
+            vf.Text("Here's the screenshot"),
+            vf.Image("data:image/png;base64,abc123"),
+        ]
+        assert is_valid_tool_content_parts(content) is True
+
     def test_empty_list_is_valid(self):
         """Empty list is valid (no invalid parts)."""
         assert is_valid_tool_content_parts([]) is True
@@ -372,6 +380,33 @@
         ]
 
+    @pytest.mark.asyncio
+    async def test_call_tool_returns_pydantic_content_parts(
+        self, mock_client, sample_chat_dataset
+    ):
+        """Test that call_tool preserves pydantic text/image content parts."""
+
+        def pydantic_parts_tool() -> list:
+            return [
+                vf.Text("Here's the screenshot"),
+                vf.Image("data:image/png;base64,abc"),
+            ]
+
+        env = vf.ToolEnv(
+            tools=[pydantic_parts_tool],
+            client=mock_client,
+            model="test-model",
+            dataset=sample_chat_dataset,
+        )
+
+        result = await env.call_tool("pydantic_parts_tool", {}, "call_0")
+
+        assert isinstance(result["content"], list)
+        assert result["content"][0] == {"type": "text", "text": "Here's the screenshot"}
+        assert result["content"][1] == {
+            "type": "image_url",
+            "image_url": {"url": "data:image/png;base64,abc"},
+        }
+
     @pytest.mark.asyncio
     async def test_call_tool_casts_invalid_list_to_str(
         self, mock_client, sample_chat_dataset
     ):

diff --git a/verifiers/envs/integrations/browser_env/modes/cua_mode.py b/verifiers/envs/integrations/browser_env/modes/cua_mode.py
--- a/verifiers/envs/integrations/browser_env/modes/cua_mode.py
+++ b/verifiers/envs/integrations/browser_env/modes/cua_mode.py
@@ -741,7 +741,9 @@
             self.logger.warning(f"Failed to save screenshot: {e}")
             return None
 
-    def _format_response(self, response: dict, session_id: str = "") -> list[dict]:
+    def _format_response(
+        self, response: dict, session_id: str = ""
+    ) -> list[vf.ContentPart]:
         """Format action response as multipart content with text and image."""
         success = response.get("success", False)
         error = response.get("error")
@@ -763,7 +765,7 @@
                 f"Viewport: {viewport.get('width', 0)}x{viewport.get('height', 0)}"
             )
 
-        content: list = [vf.Text("\n".join(text_parts))]
+        content: list[vf.ContentPart] = [vf.Text("\n".join(text_parts))]
 
         if screenshot_b64 and session_id:
             self._save_screenshot(session_id, screenshot_b64, url)

@@ -1029,7 +1031,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Click at coordinates (x, y) on the page."""
         response = await self._execute_action(
             session_id,
@@ -1046,7 +1048,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Double-click at coordinates (x, y) on the page."""
         response = await self._execute_action(
             session_id,
@@ -1062,7 +1064,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Type text into the currently focused element."""
         response = await self._execute_action(
             session_id,
@@ -1078,7 +1080,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Press keyboard key(s)."""
         response = await self._execute_action(
             session_id,
@@ -1097,7 +1099,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Scroll the page at a specific position."""
         response = await self._execute_action(
             session_id,
@@ -1119,7 +1121,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Navigate to a URL."""
         try:
             response = await self._execute_action(
@@ -1141,7 +1143,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Navigate back in browser history."""
         response = await self._execute_action(
             session_id,
@@ -1156,7 +1158,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Navigate forward in browser history."""
         response = await self._execute_action(
             session_id,
@@ -1172,7 +1174,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Wait for a specified amount of time."""
         try:
             response = await self._execute_action(
@@ -1194,7 +1196,7 @@
         session_id: str = "",
         sandbox_id: str = "",
         tool_call_id: str = "",
-    ) -> list[dict]:
+    ) -> list[vf.ContentPart]:
         """Capture a screenshot of the current page state."""
         response = await self._execute_action(
             session_id,

diff --git a/verifiers/utils/tool_utils.py b/verifiers/utils/tool_utils.py
--- a/verifiers/utils/tool_utils.py
+++ b/verifiers/utils/tool_utils.py
@@ -1,3 +1,4 @@
+from collections.abc import Mapping
from typing import Any

from agents.function_schema import function_schema
@@ -10,14 +11,19 @@
 def is_valid_tool_content_parts(value: Any) -> bool:
     """Check if value is a valid list of tool content parts.
 
-    Valid content parts have a "type" field with value "text" or "image_url".
+    Valid content parts have a "type" field with value "text" or "image_url",
+    and can be either dict-like objects or pydantic models.
     """
     if not isinstance(value, list):
         return False
     for item in value:
-        if not isinstance(item, dict):
+        if isinstance(item, Mapping):
+            content_type = item.get("type")
+        elif hasattr(item, "model_dump"):
+            content_type = getattr(item, "type", None)
+        else:
             return False
-        if item.get("type") not in VALID_TOOL_CONTENT_PART_TYPES:
+        if content_type not in VALID_TOOL_CONTENT_PART_TYPES:
             return False
     return True



verifiers/types.py

verifiers/envs/integrations/browser_env/modes/cua_mode.py

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


docs/environments.md

hide dicts from env dev

3d05e32

cursor bot reviewed Mar 10, 2026

verifiers/types.py (resolved)

verifiers/envs/integrations/browser_env/modes/cua_mode.py (resolved)

minor fixes

779a896

cursor bot reviewed Mar 10, 2026

docs/environments.md (resolved)

hallerite added 2 commits March 10, 2026 03:41

update skill

89b5a05

Merge remote-tracking branch 'origin/main' into hallerite/constructors

778b1e9

hallerite requested a review from willccbb March 10, 2026 04:07

snimu mentioned this pull request Mar 10, 2026

fix CI for datasets v4.7.0 #1004

Closed

13 tasks


hallerite wants to merge 4 commits into main from hallerite/constructors
