fix: Validate the embeddings size to catch silent embeddings batch failures #103
aperepel wants to merge 2 commits into googleapis:main
Conversation
With a larger batch size for `add_documents` (e.g. 1000), the embeddings service may silently fail and return nothing for some entries. This leads to a rather cryptic error:
```
[values_dict[key][i] for key in values_dict]
~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
```
Add size validation and a suggestion on how to remedy the failure.
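For context, the check is along these lines (a minimal sketch, not the exact diff; the parallel `texts`/`embeddings` lists and the helper name are illustrative):

```python
from typing import List


def _validate_embedding_count(texts: List[str], embeddings: List[List[float]]) -> None:
    """Fail fast with an actionable message instead of a later IndexError."""
    # Illustrative sketch only: assumes texts and embeddings are parallel lists
    # collected inside add_documents.
    if len(embeddings) != len(texts):
        raise ValueError(
            f"The embeddings service returned {len(embeddings)} vectors for "
            f"{len(texts)} documents. Large batches can silently drop entries; "
            "try calling add_documents with a smaller batch size (e.g. 100)."
        )
```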
/gcbrun
@aperepel, thank you for raising the pull request. Question: in your PR you are adding a validation check, which is certainly required. On a separate note: have you tried dividing the input into batches and generating the embeddings per batch when it fails for large batches?
Yes, that's what we've been doing. With a larger batch (usually around 1000 items), the embeddings service dropped roughly 20-25% of the entries. Batches of 100 have been perfectly stable, however. We already use retry logic wherever we run micro-batches, but if you are asking about proactively splitting a batch after a failure, that is probably too many LLM calls, since we would have to retry the whole split.
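Roughly, our micro-batching wrapper looks like the sketch below (names are illustrative; `store` is anything exposing `add_documents`):

```python
import time

BATCH_SIZE = 100    # the batch size that has been stable for us
MAX_RETRIES = 3


def add_in_micro_batches(store, docs, batch_size=BATCH_SIZE):
    """Split a large document list into micro-batches and retry each one independently."""
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                store.add_documents(batch)
                break
            except ValueError:
                if attempt == MAX_RETRIES:
                    raise
                time.sleep(2 ** attempt)  # simple backoff before retrying this batch only
```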
Tests are failing due to code coverage. Can you add tests for your changes?
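Something like this would exercise the new check (a sketch only; `_validate_embedding_count` stands in for however the validation helper ends up being named):

```python
import pytest


def test_mismatched_embedding_count_raises():
    # The embeddings service "silently" returned fewer vectors than documents.
    texts = ["doc-1", "doc-2", "doc-3"]
    embeddings = [[0.1, 0.2]]

    with pytest.raises(ValueError, match="smaller batch size"):
        _validate_embedding_count(texts, embeddings)
```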