Overview
The HNSW algorithm for approximate k-nearest neighbor graph search provides a scalable approach to identifying a close match to a query vector in a very large database. It requires many fewer comparison operations than a full database scan, at the cost of a controllable degree of random error in the recall accuracy of the system. (See here for the original paper of Malkov and Yashunin, first published as a preprint in 2016, and here for an accessible summary of the technique as implemented in Facebook Research’s Faiss library.) For the MPCv3 design, we aim to implement the HNSW algorithm on top of an MPC privacy layer, so that all iris biometric data is stored in secret-shared form by several independent server nodes, and all comparisons between biometric templates are performed via a secure multiparty computation protocol, as in the MPCv2 design.
In the following, we give a more detailed description of the system architecture we plan to implement for the MPCv3 release to integrate an HNSW index which supports scaling of the private iris uniqueness system to a very large iris code database. In particular, we outline the high-level design of the system, and discuss changes and extensions to the HNSW algorithm compared to the specification given in the original HNSW paper. The most important details are the following:
- Hybrid index — the system makes use of a long-term HNSW index plus a short-term full scan index for initial insertions. New enrollments are first inserted into the full scan index for 1 to 2 days, and are then migrated to the HNSW index.
- Deletions — deletion requests which arrive shortly after enrollment, while the corresponding biometric template is still located in the full scan index, are simple: deletion is just removal of the record from the data store. Deletion requests for entries in the HNSW index use a “tombstoning” approach to account for residual graph edges referencing deleted index entries. A background process cleans up such stale edges over time, and a small extra buffer of previously pruned neighbors is maintained for each item in the index to help preserve performance in the presence of moderate levels of deletions.
- Scheduler — four main processes need to be maintained over time to support the operation of this hybrid design: uniqueness checks, migration of entries from the full scan index to the HNSW index, initial processing of deletion requests, and cleanup of stale edges in the HNSW index. A simple scheduler which initiates background tasks at regular intervals based on a clock measurement derived from entries in the request queue appears to be appropriate for the current design.
- Parallelism — the architecture supports high parallelism for both uniqueness queries and HNSW index insertions, some details of which are outlined below in the pseudocode for the functions `process_query_batch` and `insert_batch`.
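As one way to realize the scheduler summarized above, the following sketch (task names and interval values are hypothetical, not final system parameters) fires each background task once its interval has elapsed on a logical clock derived from timestamps on entries in the request queue:

```python
import heapq

# Hypothetical task names and intervals (in seconds); the real values are
# system parameters to be tuned.
TASK_INTERVALS = {
    "migrate_to_hnsw": 3600,      # move oldest full-scan entries into HNSW
    "process_deletions": 600,     # apply pending deletion requests
    "cleanup_stale_edges": 7200,  # garbage-collect edges to tombstoned IDs
}

class Scheduler:
    """Fires background tasks at regular intervals, driven by a logical
    clock derived from timestamps on request-queue entries."""

    def __init__(self, tasks):
        # Min-heap of (next_due_time, task_name); everything is due at t=0.
        self.due = [(0, name) for name in tasks]
        heapq.heapify(self.due)
        self.intervals = dict(tasks)

    def on_request(self, queue_timestamp, run_task):
        # Any task whose deadline has passed (per the queue-derived clock)
        # is run and rescheduled one interval later.
        while self.due and self.due[0][0] <= queue_timestamp:
            _, name = heapq.heappop(self.due)
            run_task(name)
            heapq.heappush(self.due, (queue_timestamp + self.intervals[name], name))
```

Driving the clock from queue entries rather than wall time keeps the MPC nodes' task schedules in step with the request stream they all observe.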
A somewhat more in-depth tour of the design features can be found on this page. The remainder of this document provides more details and justification for our design choices, a list of the primary system parameters, and detailed pseudocode of the main functions which will need to be implemented.
Algorithm Specification
Design Choices
- Iris code resolution — use full resolution iris codes for comparisons to simplify the initial system design.
- Iris rotations — run a different HNSW lookup for each relative rotation of the query iris code. If no match is found, insert only the unrotated iris code into the database and HNSW index.
- Hybrid indexes — maintain a short-term linear scan index along with a long-term HNSW index.
- The short-term index is likely to absorb a reasonable number of deletion requests which would otherwise have to be processed by the HNSW index, where deletion is more expensive
- The short-term index can also serve as a holding space for new enrollments in the event that maintenance or a rebuild is required for the HNSW index
- Batch uniqueness queries — deduplicate queries within a batch by pairwise comparison, run the search procedures against database entries for all queries in parallel, and then insert verified unique entries into the short-term linear scan buffer.
- Deduplicate between entries in a batch by direct pairwise comparisons against static matching threshold
- In case of pairwise match, only the earlier ID of the matching pair may be inserted
- Periodically, the oldest entries in the short-term linear scan buffer are migrated to the HNSW index; only the central rotation of the iris code is inserted
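The in-batch deduplication step can be sketched as follows. Here `distance` is a plaintext stand-in for the MPC comparison protocol, and the threshold value is a hypothetical placeholder, not the production matching threshold:

```python
# Illustrative sketch of in-batch deduplication; `distance` stands in for the
# MPC comparison protocol, and the threshold value is a hypothetical parameter.
MATCH_THRESHOLD = 0.36

def dedup_batch(batch, distance, threshold=MATCH_THRESHOLD):
    """Pairwise-compare all queries in a batch against the static matching
    threshold; when two queries match, only the earlier ID may be inserted."""
    ids = sorted(batch)  # batch maps query ID -> iris code
    rejected = set()
    for i, a in enumerate(ids):
        if a in rejected:
            continue
        for b in ids[i + 1:]:
            if b not in rejected and distance(batch[a], batch[b]) < threshold:
                rejected.add(b)  # the later ID of the matching pair is dropped
    return [q for q in ids if q not in rejected]
```

Because each pairwise comparison is independent, the inner comparisons can run in parallel; only the final accept/reject resolution needs the deterministic earlier-ID rule.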
- Parallel search — search logic is independent among queries in a batch, and likewise individual comparisons in a batch deduplication process are independent
- Parallel insertion — search logic for each batch has to be completed for a particular HNSW layer before new vertices are inserted at that layer.
- Search is again fully parallelizable over queries in the batch since it is read-only and search logic is independent between queries
- Insertion of new edges can be parallelized over source vertices, but insertion of new neighbors should be executed in increasing order of the neighbor vertex index. This ordering is needed to avoid a potential race condition when two inserted neighbors in a batch have equal distance from the source vertex.
- In the very rare edge case that more than one new vertex is inserted at a higher layer than the existing entry point, the first vertex at the highest layer becomes the new entry point, and all vertices inserted at new layers are set as neighbors
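A sketch of this layer-synchronized insertion order follows. The graph is simplified to per-layer adjacency dictionaries, and `search_layer` / `select_neighbors` are stand-ins for the real HNSW subroutines, which are specified later:

```python
# Sketch of layer-synchronized batch insertion; the graph is simplified to
# per-layer adjacency dictionaries, and `search_layer` / `select_neighbors`
# are illustrative stand-ins for the real HNSW subroutines.
def insert_batch(graph, batch, search_layer, select_neighbors, top_layer):
    """Insert a batch of (vertex_index, height) pairs layer by layer: all
    searches at a layer complete before any edges are added at that layer."""
    for layer in range(top_layer, -1, -1):
        active = [v for v, height in batch if height >= layer]
        # Phase 1 (read-only, parallelizable over queries): find candidate
        # neighbors for every active vertex at this layer.
        candidates = {v: search_layer(graph, v, layer) for v in active}
        # Phase 2: link edges. Parallelizable over source vertices, but each
        # source links its new neighbors in increasing vertex-index order, so
        # that distance ties are broken identically on every node.
        for v in active:
            for u in sorted(select_neighbors(candidates[v])):
                graph.setdefault(layer, {}).setdefault(v, []).append(u)
                graph[layer].setdefault(u, []).append(v)
```

The two-phase split is what allows the read-only searches to proceed in parallel while keeping edge insertion deterministic across nodes.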
- Height selection — will need to use a PRNG which is synchronized between nodes so that identical heights are generated for identical queries
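One possible synchronization scheme, shown purely for illustration (deriving the seed from a query identifier is an assumption, not a fixed design decision), is to hash the query ID into the uniform sample used by the standard HNSW geometric layer assignment:

```python
import hashlib
import math

# Level-normalization factor mL = 1/ln(M), shown here for M = 32 (illustrative).
M_L = 1 / math.log(32)

def select_height(query_id: bytes, m_l: float = M_L) -> int:
    """Derive the HNSW layer assignment deterministically from the query ID,
    so every MPC node computes the same height without exchanging randomness.
    (Seeding from the query ID is one possible scheme, assumed for this sketch.)"""
    digest = hashlib.sha256(query_id).digest()
    # Map the first 8 bytes of the digest to a uniform sample in (0, 1].
    u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
    # Standard HNSW geometric height: floor(-ln(u) * mL).
    return math.floor(-math.log(u) * m_l)
```

Any node evaluating this function on the same query ID obtains the same height, which is the synchronization property required.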
- Candidate and neighbor lists — transient lists of candidates maintained during search can be implemented as a heap-based priority queue to save some comparisons, but neighbor lists needing granular sorted access should be implemented as sorted lists
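A minimal illustration of the two data-structure choices, using Python's standard library as a stand-in for the eventual implementation:

```python
import bisect
import heapq

# Transient candidate set during search: a min-heap gives cheap access to the
# closest unexplored candidate without keeping the whole set sorted.
candidates = []
heapq.heappush(candidates, (0.42, "v1"))  # (distance, vertex)
heapq.heappush(candidates, (0.17, "v2"))
nearest = heapq.heappop(candidates)       # (0.17, "v2")

# Neighbor list of a graph vertex: kept fully sorted by distance, so ranked
# access (e.g. "drop the worst active edge") is a direct index operation.
neighbors = [(0.11, "a"), (0.35, "b")]
bisect.insort(neighbors, (0.20, "c"))     # list stays sorted by distance
worst = neighbors[-1]                     # (0.35, "b")
```

The heap saves comparisons because only the minimum needs to be maintained, while the sorted neighbor list pays for full ordering once per update in exchange for constant-time ranked reads.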
- Deletion — immediate deletion of iris data and outgoing HNSW graph edges, deletion of incoming edges by garbage collection process with “tombstoned” flag per ID
- In order to maintain connectivity of the graph, the neighbor list of a node includes a buffer of extra edges that are not used in the search procedure, allowing deleted active edges to be replaced from the outstanding inactive edges
- The size of this buffer should be tuned based on the expected rate of deletions so that the buffer of inactive edges is exhausted only rarely
- The search entry point for the index can be defined as “the first inserted entry in the highest layer of the index”, so this gives a rule for selecting a new entry point in case the existing entry point is deleted
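The deletion flow above can be sketched as follows; field names, the active degree, and the buffer size are illustrative rather than final parameters:

```python
# Sketch of tombstone-based deletion with an inactive-edge buffer; the
# constants and field names are illustrative, not final system parameters.
ACTIVE_DEGREE = 32    # number of edges used by the search procedure
BUFFER_SIZE = 8       # extra previously-pruned edges kept in reserve

class Vertex:
    def __init__(self, active, inactive):
        self.active = list(active)      # edges visible to search
        self.inactive = list(inactive)  # pruned neighbors, best-ranked first
        self.tombstoned = False

def delete(index, vertex_id):
    """Immediately drop the vertex's iris data and outgoing edges; incoming
    edges are left for the background cleanup pass."""
    v = index[vertex_id]
    v.tombstoned = True
    v.active, v.inactive = [], []

def cleanup_edges(index, vertex_id):
    """Background pass: remove active edges pointing at tombstoned vertices
    and refill from the best-ranked live entries of the inactive buffer."""
    v = index[vertex_id]
    v.active = [u for u in v.active if not index[u].tombstoned]
    while v.inactive and len(v.active) < ACTIVE_DEGREE:
        u = v.inactive.pop(0)
        if not index[u].tombstoned:
            v.active.append(u)
```

Refilling from the inactive buffer is what keeps graph connectivity intact under moderate deletion rates; only when the buffer is exhausted does a vertex's effective degree drop.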