Hybrid Image Search at Scale: Lessons in Accuracy, Latency, and Cost
9:15 - François Gaillard, Adeo Services & Guilherme de Freitas Guitte, Adeo Services
- they use elastic (in GCP)
- they use a multimodal approach: image encoding and text encoding
- they use vertexAI for image embeddings
- the idea is to query with an image (e.g. of adapters and utilities for gardening) and to find the matching product
- they evaluated lexical vs a hybrid approach with knn search
- their catalog comprises around 10M docs
- due to the images, their volume reaches up to 207 GB of disk storage
- they use BBQ = Better Binary Quantization to shrink the stored image embeddings (to reduce disk storage volume)
- bringing it from 207 GB down to 53 GB
- BBQ is very effective, because they lose almost no relevance while slimming storage space
- their advice for reducing vector search latency: reduce the embedding dimension
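The storage win from BBQ comes from collapsing each float32 dimension to a single bit. A minimal stdlib-only sketch of that core idea (the real BBQ keeps correction factors for re-scoring, so this is only the intuition, not Elastic's implementation):

```python
# Illustrative 1-bit (binary) quantization: each float32 dimension becomes a
# single sign bit, cutting raw vector storage by roughly 32x.

def binary_quantize(vector: list[float]) -> bytes:
    """Pack the sign of each dimension into a bit (1 if >= 0, else 0)."""
    bits = 0
    for i, x in enumerate(vector):
        if x >= 0:
            bits |= 1 << i
    n_bytes = (len(vector) + 7) // 8
    return bits.to_bytes(n_bytes, "little")

def hamming_distance(a: bytes, b: bytes) -> int:
    """Distance between two binary-quantized vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

vec = [0.3, -1.2, 0.7, 0.0, -0.5, 0.9, -0.1, 2.0]
q = binary_quantize(vec)
print(len(vec) * 4, "bytes as float32 ->", len(q), "byte quantized")
```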
Women of Search Present: AI Agents: From Hype to Reality
10:00 - Angeley Mullins, Aetheris Ventures & Olena Gorbatiuk, Independent & Atita Arora, Voyager Search
- companies like Klarna and Duolingo messed up by trying to replace customer service workers with AI agents
- "Human authenticity will become the new luxury"
- "AI agents need search" - but that has upsides and downsides
- they advocate that search for AI needs a different tuning than search for humans
AI search will:
- perform way more searches
- run longer and more detailed queries
- be evaluated differently: was the task completed successfully?
- be optimized for recall, context, and data openness
From LLM-as-a-Judge to Human-in-the-Loop: Rethinking Evaluation in RAG and Search
11:00 - Fernando Rejon Barrera, Zeta Alpha & Daniel Wrigley, OpenSource Connections
- this is about eyeballing (a human-in-the-loop approach):
- it's part of the relevance workbench in OS
- we utilize 2 search configurations, where we can compare two sets of queries to see the differences in the result sets
- we then see the unique and the common results in the two sets
- the idea is to combine eyeballing and RAG evaluation
- with the current approach, we query an LLM, which does a (vector) search and generates an answer
- we could optimize the prompts and the (vector) search retrieval
- they're leveraging pairwise comparison on results for LLMaaJ evaluation
- an idea is to use the Elo rating system, like in chess tournaments
- they run a "tournament between all agents"
- introduced and demoed "RAGElo"
- currently a simple Streamlit app where one can select agent configurations and benchmarks, and then collect answers
- one can then evaluate pair- or pointwise
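The Elo-tournament idea from the talk can be sketched with the standard chess update rule: the LLM judge does a pairwise comparison, and the winner's rating goes up. Names and the K-factor here are my own illustrative choices, not RAGElo's API:

```python
K = 32  # standard chess K-factor

def expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a is 1.0 if A wins the pairwise judgment, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1 - score_a) - (1 - e_a))

ratings = {"agent_a": 1000.0, "agent_b": 1000.0}
# One round: the LLM judge prefers agent_a's answer.
ratings["agent_a"], ratings["agent_b"] = update(ratings["agent_a"], ratings["agent_b"], 1.0)
print(ratings)  # agent_a gains 16 points, agent_b loses 16
```

Running all pairwise judgments between agent configurations ("a tournament between all agents") converges the ratings into a leaderboard.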
Beyond Keywords: Measuring Multimodal Search Quality
11:45 - Philippe Bouzaglou, Vectra
- eyeballing embeddings is not possible, but embedding models are important for search relevance in vector search
- though they are hard to evaluate
- we usually just google whichever model makes the most "sense" (is newest, performs best on benchmarks)
- current method to evaluate: install model => embed docs => run queries => measure => repeat
- that's good but slow
- they propose a new method "document-query similarity evaluation"
- flips the eval upside down
- we take one document, get many queries and then get the cosine similarity scores between each query and the document
- then we rank the queries by similarity and then we get an LLM judge (or human) to rank the queries
- then we compare these
- we could do this with a lot of models to see their differences
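The flipped evaluation can be sketched end to end: rank many queries by cosine similarity against one document, then compare that ranking to a judge's ranking via rank correlation. The toy vectors, query texts, and the Spearman implementation below are illustrative only, not the speaker's code:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def spearman(rank_a: list[str], rank_b: list[str]) -> float:
    """Spearman rank correlation between two orderings of the same items."""
    n = len(rank_a)
    pos_b = {q: i for i, q in enumerate(rank_b)}
    d2 = sum((i - pos_b[q]) ** 2 for i, q in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))

doc = [0.9, 0.1, 0.2]  # pretend document embedding
queries = {
    "drill bits":  [0.8, 0.2, 0.1],
    "garden hose": [0.1, 0.9, 0.3],
    "power drill": [0.7, 0.1, 0.4],
}
# Rank queries by similarity to the document ...
model_rank = sorted(queries, key=lambda q: cosine(doc, queries[q]), reverse=True)
# ... and compare against a (hypothetical) LLM-judge or human ordering.
judge_rank = ["power drill", "drill bits", "garden hose"]
print(model_rank, spearman(model_rank, judge_rank))
```

Repeating this across several embedding models gives a cheap, query-side comparison without re-indexing the whole corpus each time.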
Future-Proofing E-commerce Search Architecture for Conversational Commerce and Beyond
13:30 - Jens Kürsten, OTTO GmbH & Co. KGaA
- they use a hybrid approach of lexical and semantic search
- they do a facet selection (classical filters) and then do rank fusion
- functionally, they care about latency, maintainability, infra cost, etc.
- they also did a gradual rollout
- started with 0 result searches, then also did high volume queries, then a low recall until finally going for high recall
- experimented a lot with different rank fusions, with only serving specific pages, only offering partial filters, etc.
- they manage a latency of around 600 ms
- OTTO's stack doesn't look that different from ours (it's just way more mature)
- OTTO uses Solr with tweaks for lexical search
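They experimented a lot with rank fusion variants; Reciprocal Rank Fusion (RRF) is one widely used option. A generic sketch, not OTTO's actual implementation (k=60 is the constant from the original RRF paper):

```python
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one by summed reciprocal ranks."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["p1", "p2", "p3"]
semantic = ["p3", "p1", "p4"]
print(rrf([lexical, semantic]))  # -> ['p1', 'p3', 'p2', 'p4']
```

Because RRF only uses ranks, it sidesteps the problem of lexical (BM25) and vector (cosine) scores living on incomparable scales.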
- "Search stack better be:
- API accessible
- lightning fast
- dead cheap
- scalable to infinity"
- they also do LLMaaJ:
- they're simulating a customer
- they reduced infra cost by 25%
- His takeaways are:
- "Hybrid search is hard, but worth the money"
- "Make it work, make it right, make it fast", applicable at scale
- There are a couple of teams working on search
Smart Recall: Enhancing Local LLM Conversations with Embedding-Aware Context Retrieval
14:15 - Lucas Jeanniot, Eliatra
- business reality sometimes might bring you into the situation of having to host LLMs yourself locally for privacy
- challenges:
- no persistent memory
- context window filling up
- expensive redundant searches
- they leverage the OpenSearch conversational interface (he gave examples in Python)
- focus on token management, so they're tracking the tokens (it's local, so we aim to keep it slim)
- they have different strategies like remove examples, summarize middle, keep recent only, aggressive summary
- the token limits they set apply to all generations
- they encourage to implement fallbacks
- they use a three layer cache strategy, for embedding-, search result- and conversation summary cache
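One of the strategies named above, "keep recent only", is easy to sketch: drop the oldest turns until the conversation fits the token budget. The whitespace token count is a crude stand-in for a real tokenizer, and the function names are my own, not Eliatra's:

```python
def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for the model's tokenizer

def keep_recent_only(messages: list[str], budget: int) -> list[str]:
    """Keep as many of the most recent messages as fit within the token budget."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [
    "hello there",
    "how do I reset my password",
    "click forgot password",
    "thanks that worked",
]
print(keep_recent_only(history, budget=8))
# -> ['click forgot password', 'thanks that worked']
```

The other strategies (summarize middle, aggressive summary) would replace the dropped turns with an LLM-generated summary instead of discarding them outright.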
Commoditizing Inference: Why Your Query Language Should Speak AI
15:15 - Aurélien Foucret, Elastic
- (elastic also offers an MCP server)
- they propose an inference API
- it'll be agnostic to different LLM provider formats
- currently inference runs on the ML nodes => from now on it's no longer tied to the ML nodes of the cluster
- also has better throughput
- currently only available in one AWS region (as it's in tech preview)
- they'll introduce a new elastic inference service (EIS)
- ES|QL gets new inference parameters
Hybrid Search: Lessons Learned
16:00 - Tom Burgmans, Wolters Kluwer & Mohit Sidana, Wolters Kluwer
- Wolters Kluwer is an information service provider for experts in legal, business, tax, etc.
- they use Solr for lexical search
- they also use matryoshka embeddings with scalar quantization
- they also struggled with throughput from 3rd party embedding models
- they optimized indexing and tried to re-use vectors if the core text is not changing; for that they check whether the metadata was changed
- they sum the scores of vector search & lexical search
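The Matryoshka + scalar quantization combination mentioned above can be sketched in a few lines: Matryoshka embeddings can be truncated to a prefix of their dimensions (with re-normalization), and scalar quantization maps each float to an int8. Both functions are generic illustrations, not Wolters Kluwer's implementation:

```python
import math

def truncate_matryoshka(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def scalar_quantize(vec: list[float]) -> list[int]:
    """Map floats in [-1, 1] linearly onto int8 values in [-127, 127]."""
    return [max(-127, min(127, round(x * 127))) for x in vec]

vec = [0.6, 0.8, 0.05, -0.02]        # pretend 4-dim Matryoshka embedding
small = truncate_matryoshka(vec, 2)  # first two dims, re-normalized
print(small, scalar_quantize(small))
```

Together the two tricks multiply: halving the dimensions and going float32 -> int8 shrinks vector storage by 8x.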
- "Blending lexical results with vector results is like mixing oil and water"
- keyword matching is an exact match
- embeddings are an approximation
- a problem with hybrid: you can apply boosting to the whole lexical result set, but with vector search you can only apply boosting to the top k
- they base the balance of hybrid to lexical search depending on the type of the query:
- e.g. for citations they use lexical search; for case nicknames they blend in some vector results, etc.
- for case summaries they rely on vector search, for typos as well, but they blend in some lexical results
- their query understanding also considers history and context info
- lots of challenges they face seem very similar to ours
- vector search comes at a price -> one should definitely check how big the relevancy benefit is
- cool idea they had: building a prototype interface that shows (with colors) the origin of each search result (hybrid vs. lexical vs. both) -> they then also implemented a slider showing the impact of each search type
- hybrid search sometimes also leads to unexpected situations (e.g. with a filter that shows aggregation counts, the numbers might change when the filter is applied, because vector search may surface more documents than before)
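The per-query-type balancing they described can be sketched as a small router: detect the query type and return lexical/vector blend weights. The detection rules and weights below are illustrative stand-ins, not Wolters Kluwer's actual logic:

```python
import re

def blend_weights(query: str) -> tuple[float, float]:
    """Return (lexical_weight, vector_weight) for a query."""
    # Looks like a legal citation, e.g. "410 U.S. 113": exact matching only.
    if re.search(r"\d+\s+[A-Za-z.]+\s+\d+", query):
        return 1.0, 0.0
    # Long, summary-style query: rely mostly on vector search.
    if len(query.split()) > 8:
        return 0.2, 0.8
    # Default hybrid blend.
    return 0.6, 0.4

print(blend_weights("410 U.S. 113"))  # citation -> pure lexical
print(blend_weights("cases about landlords withholding security deposits after move out"))
```

A real version would also fold in the history and context signals their query understanding uses.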
Key Conference Findings
- Research agentic relevance framework => this might have big impact in the future
- Discuss interleaved A/B testing => this could reduce the A/B test times a lot
- When we're live we should increase our focus on query understanding