[Fix] OPFS Multitab Deadlocks #786
Conversation
…e edge cases where multiple tabs with OPFS/Safari can cause deadlocks.
🦋 Changeset detected. Latest commit: b48a685. The changes in this PR will be included in the next version bump. This PR includes changesets to release 8 packages.
…protected clients. Perform reconnect operations inside shared port mutex.
```diff
    async () => fn(this.generateDBHelpers({ execute: this._execute, executeRaw: this._executeRaw })),
    {
-     timeoutMs: options?.timeoutMs
+     timeoutMs: options?.timeoutMs ?? this.options.defaultLockTimeoutMs
```
The timeouts here are meant to help expose potentially held locks. E.g. in the past we've seen that some DBAdapters created in the shared sync worker - ones which were not completely closed - could hold on to a lock indefinitely.
Without a timeout it just seems like the whole process is stuck. With a timeout, we'll at least get some indication of this.
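For illustration, a minimal sketch of the idea (a hypothetical helper, not the SDK's actual code): race the lock acquisition against a timer, so a lock held by a stale connection surfaces as an error instead of an indefinite hang.

```ts
// Hypothetical sketch: fail lock acquisition after a timeout instead of
// hanging forever, so a lock held by a stale DBAdapter becomes visible.
async function acquireWithTimeout<T>(acquire: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Lock not acquired within ${timeoutMs}ms - it may be held by a stale connection`)),
      timeoutMs
    );
  });
  try {
    return await Promise.race([acquire(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```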
```ts
// will also attempt to acquire it. Since the lock is released when the tab is closed, this allows the shared worker
// to free resources associated with this tab.
// We take hold of this lock as soon as possible in order to cater for potentially closed tabs.
getNavigatorLocks().request(`tab-close-signal-${crypto.randomUUID()}`, async (lock) => {
```
The initialisation flow here is meant to assist the Shared Sync worker to only use clients which have a navigator-lock based close detection established. We don't want to use clients which might have been registered, but closed before the lock has been configured.
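A simplified sketch of that flow (names and message shapes are illustrative; the SDK uses its `getNavigatorLocks()` helper rather than `navigator.locks` directly): the tab holds a uniquely named Web Lock for its lifetime and only registers with the shared worker once the lock is confirmed held; the worker then requests the same lock, which only resolves once the tab dies.

```ts
declare const sharedWorkerPort: MessagePort; // e.g. from new SharedWorker(...).port

// Tab side: hold the close-signal lock before registering with the worker.
const lockName = `tab-close-signal-${crypto.randomUUID()}`;
const lockHeld = new Promise<void>((markHeld) => {
  navigator.locks.request(lockName, () => {
    markHeld(); // we now hold the lock
    return new Promise<never>(() => {}); // never resolve: held until the tab closes
  });
});
await lockHeld; // only register once close detection is actually in place
sharedWorkerPort.postMessage({ type: 'registerClient', lockName });

// Shared worker side: the callback runs only once the browser releases the
// tab's lock, i.e. once the tab has actually closed.
function watchTabClose(lockName: string, onTabClosed: () => void) {
  navigator.locks.request(lockName, async () => {
    onTabClosed(); // free resources associated with this tab
  });
}
```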
…ning internal worker connections.
…t in SharedSyncImplementation. This should make locking more fair.
```diff
    schema: AppSchema,
-   database: {
-     dbFilename: 'example.db'
+   database: new WASQLiteOpenFactory({
```
Do we want to check this in or have a commented out example?
I think it would be nice to have an example/demo of OPFS being used.
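Something like the following could work as a checked-in example (a sketch only; it assumes the web SDK's `PowerSyncDatabase`, `WASQLiteOpenFactory`, and `WASQLiteVFS.OPFSCoopSyncVFS` exports, and an `AppSchema` defined elsewhere in the demo):

```ts
import { PowerSyncDatabase, WASQLiteOpenFactory, WASQLiteVFS } from '@powersync/web';
import { AppSchema } from './AppSchema'; // the demo's existing schema

// Open the database through the OPFS-backed VFS instead of the default IndexedDB VFS.
export const db = new PowerSyncDatabase({
  schema: AppSchema,
  database: new WASQLiteOpenFactory({
    dbFilename: 'example.db',
    vfs: WASQLiteVFS.OPFSCoopSyncVFS
  })
});
```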
simolus3 left a comment
The fixes here are clearly a significant improvement, so I'm very happy with them 👍 The iframe / mock sync worker setup also looks impressive.
One thing I am a bit worried about is the increasing complexity in the sync worker setup. E.g. the Dart implementation in comparison uses an actor-like pattern for the worker to handle all events in order, which makes that code a bit easier to reason about and potentially decreases the likelihood of race conditions (although I'm certain Dart would deadlock on sleeping tabs as well, it's just not seeing enough usage for us to get reports). To ensure all our web SDKs work correctly here, it might be worth eventually restructuring the worker setup and communication protocol a bit to align the Dart and JS implementations (or to use the JS sync worker for the Dart SDK as well, if we have a stable protocol).
```ts
    };
  }

  protected serializeError(error?: Error) {
```
Would perhaps be nice to mention that as a reason in a comment on this method or on line 255 where it's used.
```ts
 * This is particularly relevant when worker connections are marked as closed while
 * operations are still in progress.
 */
export class ConnectionClosedError extends Error {
```
nit: We could consider moving and exporting this from common, since we also throw a similar error in the Node.JS SDK here.
```ts
databaseReOpened: () => {
  // We may have missed some table updates while the database was closed.
  // We can poke the crud in case we missed any updates.
  this.connectionManager.syncStreamImplementation?.triggerCrudUpload();
```
The implementation should be fine as-is since it reconnects on errors, but we could eventually use this to reconnect if we're using the Rust client implementation.
As discussed offline, sync state with the Rust client is inherently bound to a connection. So if the database is reopened, we should ideally reconnect because the client won't know about the connection state otherwise. I don't think this needs to be fixed in this PR, but we could add a TODO comment mentioning the issue.
Summary
This PR fixes potential deadlocks in the shared sync implementation when using OPFS (Origin Private File System) with multiple browser tabs. The solution introduces a distributed database adapter that automatically handles connection reconfiguration when tabs close, eliminating the need for explicit reconnection logic in many scenarios.
The Problem
When using OPFS with multiple tabs, we use a shared web worker for processing sync events. The shared web worker borrows a SQLite connection from one of the tabs: for OPFS we have a connection per tab, and tabs connect to the shared sync worker, which requests a connection from a tab when needed.
Historically, we requested a connection from the last connected tab. When that tab closed, it would inform the shared sync worker (or a navigator lock would detect the tab closed), and we would then disconnect and reconnect using a DB connection from another tab. This approach worked in normal scenarios, but proved to be very racy when tabs were opening and closing at extremely fast rates. The race conditions could lead to deadlocks, especially in Safari with OPFS, where the timing of tab lifecycle events could cause the sync worker to get stuck waiting for a connection that would never respond.
Related issues
The Solution
This PR removes the direct need for reconnection in many circumstances when a tab closes. Instead of the sync implementation managing connections directly, we now use a distributed DB adapter (`LockedAsyncDatabaseAdapter`) in the shared sync worker. This distributed adapter serves as the `DBAdapter` API surface for the sync implementation and handles reconfiguring the internal connection from another tab if the current tab closes.

DB reconnection is now transparent to the sync implementation. If a reconnect happens during an idle period of sync, the sync implementation will be completely unaware and can proceed normally. If an operation is attempted while reconnecting, the sync manager will detect the error and retry after a delay, just as it does for all errors. This eliminates many of the race conditions that occurred during rapid tab open/close scenarios and provides better error handling, with improved error serialization for connection-closed scenarios.
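Conceptually, the retry behaviour looks something like this (a sketch, not the actual sync manager code):

```ts
// Sketch: treat an operation attempted mid-reconnect like any other
// transient error - wait, then try again on the (re-opened) connection.
async function syncIterationWithRetry(operation: () => Promise<void>, retryDelayMs = 5000): Promise<void> {
  while (true) {
    try {
      await operation();
      return;
    } catch (error) {
      console.warn('Sync operation failed, retrying after delay', error);
      await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
    }
  }
}
```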
It's important to note that this PR does not yet address suspected frozen tab issues in Safari. While random connection selection helps avoid relying on potentially frozen tabs, additional work may be needed to fully handle Safari's tab freezing behavior.
Technical Implementation
The Distributed Database Adapter
At the heart of the solution is the `LockedAsyncDatabaseAdapter`, which was already in use and had a pluggable mechanism for opening connections. We've now extended this adapter to support automatically reopening the database connection on specific errors. The adapter detects `ConnectionClosedError` and `NoModificationAllowedError` during operations and automatically re-opens the database connection from another available tab when needed. This provides a stable API surface that hides connection lifecycle details from the sync implementation, allowing it to operate as if it always has a reliable connection.
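The pattern, roughly (an illustrative sketch rather than the adapter's actual implementation):

```ts
// Illustrative: run an operation, and if it fails because the providing tab's
// connection is gone, re-open from another tab before surfacing the error.
async function executeWithReopen<T>(
  run: () => Promise<T>,
  reopenFromAnotherTab: () => Promise<void>
): Promise<T> {
  try {
    return await run();
  } catch (error) {
    const connectionLost =
      error instanceof Error &&
      (error.name === 'ConnectionClosedError' || error.name === 'NoModificationAllowedError');
    if (!connectionLost) {
      throw error;
    }
    // Source a fresh connection from another available tab, then rethrow so
    // the sync manager's normal retry logic re-attempts the operation.
    await reopenFromAnotherTab();
    throw error;
  }
}
```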
It's important to note that the distributed database adapter's reconnect logic (`reOpenOnConnectionClosed: true`) is only used by the shared sync implementation. Single-tab implementations and other database adapters do not use this automatic reconnection feature, as they don't face the same multi-tab connection challenges.

Shared Sync Implementation
The shared sync implementation has been refactored to use `LockedAsyncDatabaseAdapter` instead of directly managing connections. The adapter is initialized with `reOpenOnConnectionClosed: true` to enable automatic reconnection, and connection selection is now randomized using `getRandomWrappedPort()` instead of always using the last port. This randomization is particularly important because in some cases, like Safari, tabs can be frozen by the browser; random selection ensures we don't rely on a potentially frozen tab, improving reliability across different tab scenarios.

We've also tightened the setup flow for the shared sync worker. The sync manager now only uses tabs once the navigator-lock-based close detection has been configured. This ensures that tabs which are opened and closed very quickly (closed before the lock could have been configured) will never be used by the sync implementation, since we can't reliably detect if they died. The client is only added to the `SharedSyncImplementation` after the navigator lock has been successfully registered with the shared worker, creating a safety gate that prevents dead tabs from entering the system.
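The selection itself can be as simple as the following (a hypothetical shape; the real method operates on the worker's registered client ports):

```ts
// Hypothetical sketch of random port selection across registered tabs.
function getRandomWrappedPort<Port>(ports: readonly Port[]): Port {
  if (ports.length === 0) {
    throw new Error('No tab connections are available');
  }
  return ports[Math.floor(Math.random() * ports.length)];
}
```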
Connection Management Improvements
The `WorkerWrappedAsyncDatabaseConnection` now properly handles unexpected remote closures. While `markRemoteClosed()` existed before to immediately abort pending operations when a tab closes, we've now added detection for the case where the tab was closed while we were busy configuring or opening the database. This prevents operations from hanging indefinitely during the critical database initialization phase. We've also added better timeout handling for database connection operations with a 10-second timeout, and improved abort-controller management for pending operations to ensure resources are properly cleaned up.

Additionally, we've now put the DB open operations in the same navigator lock as is used by the specific write/read operations. This prevents write conflicts when opening databases while another tab might be writing, ensuring that database initialization and operations are properly serialized across tabs.
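As a sketch of the timeout/abort handling described here (illustrative; the actual connection code manages more state than this):

```ts
// Illustrative: bound a database-open call with a 10s timeout via AbortController,
// so a tab that closed mid-open aborts cleanly instead of hanging.
const OPEN_TIMEOUT_MS = 10_000;

async function openWithTimeout<T>(open: (signal: AbortSignal) => Promise<T>): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(new Error('Timed out opening database')), OPEN_TIMEOUT_MS);
  try {
    return await open(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```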
We've also created a temporary fork of the OPFS Coop VFS. It appears that in some circumstances the initialization logic could grab hold of a navigator lock and not release it, which could contribute to deadlock scenarios. The fork addresses this potential lock retention issue.
Error Handling Enhancements
Error handling has been significantly improved throughout the system. Previously, some errors such as DOM exceptions were not serializable and so were not reported from the shared sync worker to the main tabs. We now handle these cases better when doing JSON serialization of errors in sync statuses, ensuring all errors can be properly transmitted across worker boundaries. Upload and download errors are now serialized for `SyncStatus` events, and we've implemented better error propagation and retry logic. Connection closed errors are now properly caught and trigger reconnection automatically.
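The core of the fix, sketched (illustrative; the actual serialization handles more cases): `Error` properties like `name`, `message`, and `stack` are non-enumerable, so plain `JSON.stringify(error)` produces `{}`. Copying them onto a plain object lets them survive the worker boundary.

```ts
// Illustrative sketch: copy non-enumerable error fields onto a plain object
// so they survive JSON serialization across the worker boundary.
function serializeError(error?: unknown): { name: string; message: string; stack?: string } | undefined {
  if (error == null) {
    return undefined;
  }
  if (error instanceof Error) {
    // Covers DOMException too in modern browsers, where it inherits from Error.
    return { name: error.name, message: error.message, stack: error.stack };
  }
  return { name: 'Error', message: String(error) };
}
```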
Key Files Modified
The main changes are spread across several key files:
- `packages/web/src/worker/sync/SharedSyncImplementation.ts` - Main sync implementation using the distributed adapter
- `packages/web/src/db/adapters/LockedAsyncDatabaseAdapter.ts` - Distributed adapter with auto-reconnection
- `packages/web/src/db/adapters/WorkerWrappedAsyncDatabaseConnection.ts` - Improved remote closure handling
- `packages/web/src/worker/db/opfs.ts` - Temporary fork of the OPFS Coop VFS to fix potential navigator lock retention issues
- `packages/common/src/db/crud/SyncStatus.ts` - Error serialization improvements
- `packages/common/src/client/sync/stream/AbstractStreamingSyncImplementation.ts` - Retry logic improvements

Testing
This PR includes comprehensive test coverage to ensure the solution works correctly. We've added a multiple-tabs test (`multiple_tabs_iframe.test.ts`) that exercises multi-tab scenarios with rapid open/close operations. We've also created new mock sync service tests with testing utilities for isolated sync service testing, and added error serialization tests to ensure errors are properly serialized across worker boundaries. The existing multi-instance tests have been updated to use the new mock sync service.