Bulk indexing for new embedding tables
In case we want to create a new embedding table, it's easier on the DB to first insert the data, then create the HNSW indexes.
On a DB that is not used by the application in production with pg_vector >= 0.7.0, we can use the following settings to speed up the process:
set maintenance_work_mem = '6GB'; -- this fits on db.r6g.xlarge, use a bigger instance if needed
-- if we want to use more then 8 workers, we need to increase max_worker_processes with requires a restart
-- this is a config with 8 workers:
set max_parallel_maintenance_workers = 7; -- plus leader
set max_parallel_workers = 8;
set max_parallel_workers_per_gather = 8;
set work_mem = '128MB';
Then create an index like this:
CREATE INDEX "hnsw_m24_ef256_embeddings_embeddings_text_large_cc_xmo_en"
ON embeddings_text_large
USING hnsw ((embedding_vector::halfvec(1536)) halfvec_cosine_ops)
WITH (m='24', ef_construction='256')
WHERE (content_type = 'xmo_page' and language = 'en') or content_type != 'xmo_page';
The indexing process will be fast until the graph does not fit in memory anymore, then Postgres will print a warning. If this happens too soon, try a bigger instance.