Tour of the main features/design decisions in the updated HNSW design spec:
- Hybrid index: the design uses a long-term HNSW index plus a short-term full scan index for initial database insertions; the full scan index can also serve as a temporary fallback when the HNSW index needs maintenance
- Uniqueness pipeline:
- New enrollments are processed in batches to check for uniqueness; each entry is compared against the current batch, the full scan index, and the HNSW index, under 31 rotations plus 31 mirrored rotations (62 query variants total)
- Unique entries are added to the full scan index
- Eventually, entries that are "old enough" are migrated from the full scan index to the HNSW index, which requires only a single insertion operation for the default middle alignment of the iris code
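A minimal sketch of this pipeline, assuming exact-match comparison and illustrative helper and variable names (the real system compares secret-shared, masked iris codes under MPC, which is abstracted away here):

```python
from collections import deque

ROTATIONS = range(-15, 16)  # 31 column rotations of the iris code

def rotate(code, r):
    d = deque(code)
    d.rotate(r)
    return tuple(d)

def mirror(code):
    return tuple(reversed(code))

def variants(code):
    """The 62 query variants: 31 rotations plus 31 mirrored rotations."""
    m = mirror(code)
    for r in ROTATIONS:
        yield rotate(code, r)
        yield rotate(m, r)

def process_batch(batch, full_scan, hnsw_search):
    """Check each batch entry against the batch itself, the short-term
    full scan index, and the HNSW index; unique entries are appended to
    the full scan index (long-term migration to HNSW happens later)."""
    unique = []
    for code in batch:
        dup = any(v in full_scan or hnsw_search(v) or v in unique
                  for v in variants(code))
        if not dup:
            unique.append(code)
    full_scan.extend(unique)
    return unique
```

Since the checks themselves are read-only, the per-entry loop could be parallelized, with only the final append to the full scan index serialized.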
- Parallelism:
- Uniqueness checks within a batch operation are read-only, so they can be fully parallelized
- Batch insertion into the HNSW index requires the search at each level to complete before the edges at that level are updated
- Insertion can proceed level by level, alternating reads (search) with writes (new vertices and edges), or all graph updates can be deferred until the searches at every level have finished
- Graph updates for a batch insertion op do not depend on ordering, as the resulting neighbor sets always consist of the nearest neighbors among the prior neighbors and the newly inserted vertices
- Rare edge case: if multiple vertices in a batch are inserted at a level higher than the current maximum, connect all of the batch's vertices inserted at the new higher levels
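The order-independence claim can be checked with a small sketch. This models neighbor updates as plain nearest-M selection (the HNSW diversity heuristic is not modeled, and all names and the value of M are illustrative): applying insertions one at a time, in any order, yields the same final neighbor set as a single batch merge.

```python
import heapq

M = 4  # max neighbors kept per vertex at a layer (illustrative value)

def merged_neighbors(old, inserted, dist, v):
    """Neighbor set of v after a whole batch: the M nearest among the
    prior neighbors and all newly inserted candidates."""
    return set(heapq.nsmallest(M, set(old) | set(inserted),
                               key=lambda u: dist(v, u)))

def insert_one(neighbors, new, dist, v):
    """One incremental step: keep the M nearest of the current neighbors
    plus the single new candidate."""
    return set(heapq.nsmallest(M, set(neighbors) | {new},
                               key=lambda u: dist(v, u)))
```

An element dropped by a top-M selection can never re-enter later, and it is only dropped when M closer candidates exist, so incremental and batch merging agree regardless of insertion order.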
- Deletion:
- Deletion of recent enrollments is served from the full scan index, so it doesn't touch the HNSW graph; to the extent that deletion requests target recent enrollments, this reduces the deletion strain on the HNSW index
- For deletion of entries in HNSW index, first:
- Mark deleted entry as "tombstoned" in a persistent set
- Remove iris code data shares from DB
- Remove vertex, outgoing edges, and any associated incoming edges from HNSW index
- If vertex is the current HNSW entry point, choose new entry point as "earliest index in highest nonempty layer" after deletion
- Then:
- Some incoming edges to a tombstoned vertex may remain in the graph, and there is no easy (local or O(1)) way to find them; maintain an ongoing process that finds and removes directed edges leading to tombstoned vertices
- This process is non-interactive as long as neighbor lists are maintained as fully sorted lists rather than priority queues
- If a "tombstoned edges" check is also implemented when adding new edges, then scheduling of this process also does not affect graph connectivity
- Maintaining an ongoing process of this kind makes the time at which all graph-connection traces of a deleted vertex are eliminated deterministic rather than probabilistic, which may be worthwhile from a data privacy perspective
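The two-phase deletion flow above can be sketched as follows. All structure and method names are illustrative: each layer is modeled as a dict from vertex to sorted neighbor list, and `db` stands in for the iris code share store. Phase 1 removes the vertex, its outgoing edges, and the incoming edges discoverable locally through them; the background sweep is phase 2.

```python
class TombstoneIndex:
    def __init__(self, layers, db, entry_point):
        self.layers = layers          # layers[0] is the bottom layer
        self.db = db                  # vertex -> iris code shares
        self.entry_point = entry_point
        self.tombstoned = set()       # persistent tombstone set

    def delete(self, v):
        # Phase 1: tombstone, drop data shares, remove the vertex, its
        # outgoing edges, and incoming edges reachable from its neighbors.
        self.tombstoned.add(v)
        self.db.pop(v, None)
        for layer in self.layers:
            out = layer.pop(v, None)
            for w in out or []:
                if w in layer and v in layer[w]:
                    layer[w].remove(v)
        if v == self.entry_point:
            self.entry_point = self._new_entry_point()

    def _new_entry_point(self):
        # "Earliest index in the highest nonempty layer."
        for layer in reversed(self.layers):
            if layer:
                return min(layer)
        return None

    def sweep(self):
        # Phase 2 (ongoing, non-interactive): remove remaining directed
        # edges that still point at tombstoned vertices.
        for layer in self.layers:
            for nbrs in layer.values():
                nbrs[:] = [u for u in nbrs if u not in self.tombstoned]
```

A directed edge from a vertex that is not among the deleted vertex's own neighbors (e.g. `3 -> 0` where `0` does not link back) survives phase 1 and is only cleaned up by the sweep, which is why the ongoing process is needed.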
- To greatly reduce the impact of deletions on graph connectivity, the nearest neighbor list of each graph vertex is expanded by some constant size, retaining neighbors that standard HNSW construction would have "pruned"; these serve as replacements when deletions remove active neighbors
- The extra neighbors are not used in searches, in order to keep the graph search algorithm efficient
- As long as the "threshold" of pruned edges (the end of the list) doesn't reach the "active neighbor set", deletions have not impacted the structure of the graph
- How large this added buffer should be depends on the rate of deletions versus insertions
- This approach is unusual for HNSW indexes because they are typically considered a "memory-heavy but performant" indexing solution, so increasing the memory cost is not usually acceptable
- However, in the WC uniqueness check system the secret-shared iris codes themselves are significantly heavier than the index, so adding a moderate percentage to the HNSW index size in exchange for sustained index accuracy and minimized maintenance needs should be a worthwhile tradeoff
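A sketch of the padded neighbor list, with illustrative values for the active size M and buffer size B (neither is specified above). The list is kept fully sorted by distance; search only visits the first M entries, and buffered entries slide into the active set as deletions occur.

```python
M = 3  # active neighbors visited by search (illustrative)
B = 2  # extra buffer of would-be-pruned neighbors (illustrative)

class PaddedNeighbors:
    def __init__(self, candidates, dist):
        # Keep the M + B nearest candidates, sorted by distance.
        self.items = sorted(candidates, key=dist)[:M + B]

    def active(self):
        # Only these are traversed during graph search.
        return self.items[:M]

    def delete(self, v):
        if v in self.items:
            self.items.remove(v)  # a buffered entry slides into the active set

    def intact(self):
        # Graph structure is unaffected by deletions while the buffer
        # still covers the full active set of M neighbors.
        return len(self.items) >= M
```

Once `intact()` turns false, the pruned-edge "threshold" has reached the active neighbor set and the vertex's connectivity has genuinely degraded, which is the condition the buffer size should be chosen to make rare.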
- Scheduler for background tasks can be pretty simple at the moment:
- Four tasks to schedule:
- uniqueness checks (should be the majority of the work)
- migration of entries from short-term full scan index to HNSW index
- initial processing of deletion requests
- ongoing cleanup of stale edges in HNSW graph caused by deletions
- Time is read from each new query, and new operations are added to the queue deterministically based on time and system state
- Uniqueness queries are scheduled when the batch size is large enough (or with dummy data if too long has passed since the last batch; dummies can be generated server-side by each MPC node without interaction)
- Deletion operations (both initial deletions and HNSW graph maintenance) scheduled at regular time intervals -- both are relatively lightweight operations without communication
- Migration of full scan index entries to the HNSW index (the insertion procedure) is also scheduled at regular time intervals, possibly with a constraint on the overall size of the full scan buffer to maintain query performance -- this is only a moderately heavy process (1 search operation versus 62 for a uniqueness check, plus the graph insertion itself, which requires additional comparisons and communication), and it has immediate performance benefits
- For all of the operations considered, it looks worthwhile to run them immediately rather than trying to schedule around the variable enrollment baseload
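The scheduling rule above can be sketched as a pure function of the current time and system state, so every MPC node derives the same schedule without extra interaction. All thresholds and field names here are assumptions for illustration, not values from the spec:

```python
BATCH_SIZE = 64          # enrollments per uniqueness batch (illustrative)
BATCH_TIMEOUT = 10.0     # seconds before a short batch is padded with dummies
DELETE_PERIOD = 60.0     # period for the lightweight deletion tasks
MIGRATE_PERIOD = 300.0   # period for full-scan -> HNSW migration
MAX_FULL_SCAN = 10_000   # cap on the short-term index size

def due_tasks(now, state):
    """Deterministic scheduling decision from (time, state) alone."""
    tasks = []
    if (len(state["pending"]) >= BATCH_SIZE
            or now - state["last_batch"] >= BATCH_TIMEOUT):
        tasks.append("uniqueness_batch")   # pad with server-side dummies if short
    if now - state["last_delete"] >= DELETE_PERIOD:
        tasks.append("process_deletions")  # initial deletion requests
        tasks.append("stale_edge_cleanup") # ongoing tombstoned-edge sweep
    if (now - state["last_migrate"] >= MIGRATE_PERIOD
            or state["full_scan_size"] > MAX_FULL_SCAN):
        tasks.append("migrate_to_hnsw")
    return tasks
```

The size cap on the full scan index triggers migration early when the short-term buffer grows faster than the timer alone would drain it, keeping query cost bounded.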