Hybrid Image Search at Scale: Lessons in Accuracy, Latency, and Cost

9:15 - François Gaillard, Adeo Services & Guilherme de Freitas Guitte, Adeo Services

  • they use Elastic (on GCP)
  • they use a multimodal approach: image encoding and text encoding
  • they use Vertex AI for image embeddings
  • the idea is to query with an image (e.g. of adapters and utils for gardening) and find the matching product
  • they evaluated lexical vs. a hybrid approach with kNN search
  • their catalog comprises around 10M docs
  • due to the images, they have volumes of up to 207 GB disk storage
  • they use BBQ (Better Binary Quantization) to shrink the stored image embeddings and reduce the disk storage volume
    • bringing it down from 207 GB to 53 GB
    • BBQ is very effective, because they lose almost no relevance while slimming storage space
  • their advice for reducing vector-search latency: reduce the embedding dimension
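The setup above can be sketched as Elasticsearch request bodies (as plain Python dicts); the field names ("title", "image_vector") and sizes are assumptions, while "bbq_hnsw" is the `dense_vector` index option Elasticsearch exposes for BBQ:

```python
def bbq_mapping(dims):
    """Index mapping storing image embeddings with Better Binary Quantization
    (BBQ), which cut their disk usage from 207 GB to 53 GB."""
    return {
        "properties": {
            "title": {"type": "text"},
            "image_vector": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "index_options": {"type": "bbq_hnsw"},
            },
        }
    }

def hybrid_query(text, image_embedding, k=10):
    """Request body combining a lexical match with kNN over the image vector."""
    return {
        "query": {"match": {"title": text}},
        "knn": {
            "field": "image_vector",
            "query_vector": image_embedding,
            "k": k,
            "num_candidates": 5 * k,  # widen the ANN candidate pool beyond k
        },
        "size": k,
    }
```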

Women of Search Present: AI Agents: From Hype to Reality

  • Companies like Klarna and Duolingo messed up by trying to replace customer-service workers with AI agents
  • "Human authenticity will become the new luxury"
  • "AI agents need search" - but that had goods and bads
  • they advocate that search for AI needs a different tuning than search for humans
  • AI search will:

    • perform way more searches
    • do longer and more detailed queries
    • the evaluation is rather: was the task completed successfully?
    • it's optimized for recall, context, and data openness

From LLM-as-a-Judge to Human-in-the-Loop: Rethinking Evaluation in RAG and Search

11:00 - Fernando Rejon Barrera, Zeta Alpha & Daniel Wrigley, OpenSource Connections

  • this is about eyeballing (a HITL approach):
    • it's part of the relevance workbench in OS
    • we use two search configurations and compare their result sets for the same queries to see the differences
      • we then see the unique and the common results in the two sets
  • the idea is to combine eyeballing and RAG evaluation
  • with the current approach, we query an LLM, which does a (vector) search and generates an answer
  • we could optimize the prompts and the (vector) search retrieval
  • they're leveraging pairwise comparison on results for LLMaaJ evaluation
  • an idea is to use the Elo rating system, like in chess tournaments
    • they run a "tournament between all agents"
  • introduced and demoed "Ragelo"
    • currently a simple streamlit app, where one can select agent configurations, benchmarks and then can collect answers
  • one can then evaluate pair- or pointwise
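The Elo tournament idea can be sketched like this; the K-factor, the starting rating, and the judge function are illustrative stand-ins, not Ragelo's actual API:

```python
from itertools import combinations

def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update. score_a: 1.0 if A wins the pairwise judgment,
    0.5 for a tie, 0.0 if B wins; returns the updated ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def run_tournament(agents, judge):
    """Round-robin over all agent pairs; judge(a, b) returns the score
    from a's perspective (e.g. an LLM-as-a-Judge pairwise verdict)."""
    ratings = {a: 1000.0 for a in agents}
    for a, b in combinations(agents, 2):
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], judge(a, b))
    return ratings
```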

Beyond Keywords: Measuring Multimodal Search Quality

11:45 - Philippe Bouzaglou, Vectra

  • eyeballing embeddings is not possible, but embedding models are important for search relevance in vector search
  • though they are hard to evaluate
    • we usually just google whichever model makes the most "sense" (is newest, performs best on benchmarks)
    • current method to evaluate: install model => embed docs => run queries => measure and => repeat
      • that's good but slow
  • they propose a new method "document-query similarity evaluation"
  • flips the eval upside down
  • we take one document, get many queries and then get the cosine similarity scores between each query and the document
  • then we rank the queries by similarity and then we get an LLM judge (or human) to rank the queries
  • then we compare these
  • we could do this with a lot of models to see their differences
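A minimal sketch of the document-query similarity evaluation described above, using Spearman rank correlation as the comparison step (the talk only said "compare"; the correlation measure is my assumption). The vectors would come from whatever embedding model is under test:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_queries(doc_vec, query_vecs):
    """Rank query indices from most to least similar to the one document."""
    sims = [cosine(doc_vec, q) for q in query_vecs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)

def spearman(ranking_a, ranking_b):
    """Spearman rank correlation between two rankings of the same items,
    e.g. the model's ranking vs. the LLM/human judge's ranking."""
    n = len(ranking_a)
    pos_a = {item: r for r, item in enumerate(ranking_a)}
    pos_b = {item: r for r, item in enumerate(ranking_b)}
    d2 = sum((pos_a[i] - pos_b[i]) ** 2 for i in pos_a)
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Running this per model and comparing the correlations shows their differences without a full index-and-query cycle.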

Future-Proofing E-commerce Search Architecture for Conversational Commerce and Beyond

13:30 - Jens Kürsten, OTTO GmbH & Co. KGaA

  • they use a hybrid approach of lexical and semantic search
  • they do a facet selection (classical filters) and then do rank fusion
  • functionally, they care about latency, maintainability, infra cost, etc.
  • they also did a gradual rollout
    • started with 0-result searches, then also high-volume queries, then low recall, until finally going for high recall
    • experimented a lot with different rank fusions, with only serving specific pages, only offering partial filters, etc.
    • they manage a latency of around 600 ms
  • OTTOs stack doesn't look that different from ours (it's just way more mature)
  • OTTO uses Solr with tweaks for lexical search
  • "Search stack better be:

    • API accessible
    • lightning fast
    • dead cheap
    • scalable to infinity"
  • they also do LLMaaJ:
    • they're simulating a customer
  • they reduced infra cost by 25%
  • His takeaways are:
    • "Hybrid search is hard, but worth the money"
    • "Make it work, make it right, make it fast", applicable at scale
  • There are a couple of teams working on search
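The rank-fusion step above can be sketched as Reciprocal Rank Fusion (RRF), a common way to merge lexical and semantic result lists; the talk did not name OTTO's exact fusion method, so treat this as illustrative:

```python
def rrf(result_lists, k=60):
    """Fuse several ranked lists of doc ids; k=60 is the constant
    suggested in the original RRF paper."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Documents near the top of any list get the largest contribution.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem of lexical and semantic scores living on different scales.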

Smart Recall: Enhancing Local LLM Conversations with Embedding-Aware Context Retrieval

14:15 - Lucas Jeanniot, Eliatra

  • business reality sometimes might bring you into the situation of having to host LLMs yourself locally for privacy
  • challenges:
    • no persistent memory
    • context window filling up
    • expensive redundant searches
  • they leverage the OpenSearch conversational interface (he gave examples in Python)
  • focus on token management, so they're tracking the tokens (it's local, so they aim to keep it slim)
    • they have different strategies like remove examples, summarize middle, keep recent only, aggressive summary
  • the token limits they set apply to all generations
  • they encourage implementing fallbacks
  • they use a three-layer cache strategy: an embedding cache, a search-result cache, and a conversation-summary cache
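A minimal sketch of the three-layer cache idea; the class shape and key scheme are assumptions for illustration, not Eliatra's implementation:

```python
import hashlib

class ThreeLayerCache:
    """Three caches: embeddings, search results, conversation summaries."""

    def __init__(self):
        self.embeddings = {}   # text key -> vector
        self.results = {}      # query key -> search hits
        self.summaries = {}    # conversation id -> summary text

    @staticmethod
    def _key(text):
        # Hash the text so arbitrarily long inputs get fixed-size keys.
        return hashlib.sha256(text.encode()).hexdigest()

    def get_embedding(self, text, embed_fn):
        """Return a cached vector, calling the model only on a miss."""
        k = self._key(text)
        if k not in self.embeddings:
            self.embeddings[k] = embed_fn(text)
        return self.embeddings[k]

    def get_results(self, query, search_fn):
        """Return cached hits, avoiding expensive redundant searches."""
        k = self._key(query)
        if k not in self.results:
            self.results[k] = search_fn(query)
        return self.results[k]

    def get_summary(self, conversation_id, summarize_fn):
        """Return a cached summary to keep the context window slim."""
        if conversation_id not in self.summaries:
            self.summaries[conversation_id] = summarize_fn(conversation_id)
        return self.summaries[conversation_id]
```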

Commoditizing Inference: Why Your Query Language Should Speak AI

15:15 - Aurélien Foucret, Elastic

  • (elastic also offers an MCP server)
  • they propose an inference API
    • it'll be agnostic to different LLM provider formats
      • currently inference runs on the ML nodes => from now on it's no longer tied to the ML nodes of the cluster
      • also has better throughput
      • currently only available in one AWS region (as it's in tech preview)
  • they'll introduce a new Elastic Inference Service (EIS)
    • ES|QL gets new inference parameters

Hybrid Search: Lessons Learned

16:00 - Tom Burgmans, Wolters Kluwer & Mohit Sidana, Wolters Kluwer

  • Wolters Kluwer is an information service provider for experts in legal, business, tax, etc.
  • they use Solr for lexical search
  • they also use matryoshka embeddings with scalar quantization
  • they also struggled with throughput from 3rd party embedding models
  • they optimized indexing and tried to re-use vectors
    • if the core text is not changing; for that, they check whether the metadata was changed
  • they sum the scores of vector search & lexical search
  • "Blending lexical results with vector results is like mixing oil and water“

    • keyword matching is an exact match
    • embeddings are an approximation
    • a problem with hybrid: you can apply boosting to the whole lexical result set, but with vector search you can only boost the top k
  • they base the balance between vector and lexical search on the type of the query:
  • e.g. for citations they go lexical; for case nicknames they blend in some vector results, etc.
  • for case summaries they rely on vector search, likewise for typos, but they blend in some lexical results
  • their query understanding also considers history and context info
  • lots of challenges they face seem very similar to ours
  • vector search comes at a price -> one should definitely check how big the relevancy benefit is
  • cool idea they had: building a prototype interface that shows (with colors) the origin of each search result (hybrid vs lexical vs both) -> they then also implemented a slider showing the impact of the specific search type
  • hybrid search sometimes also leads to unexpected situations (e.g. a filter that shows aggregations: the aggregation count might change when the filter is applied, because the vector search might surface more documents than before)
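The "oil and water" problem comes from BM25 and cosine scores living on different scales. They sum the scores; a common refinement, sketched here as an assumption rather than Wolters Kluwer's exact method, is to min-max normalize each result set before a weighted sum:

```python
def normalize(scores):
    """Min-max normalize a doc-id -> score mapping into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {doc: (s - lo) / span for doc, s in scores.items()}

def blend(lexical, vector, vector_weight=0.5):
    """Weighted sum of normalized lexical and vector scores.
    vector_weight could vary per query type, as in their approach
    (lexical for citations, more vector for case summaries, etc.)."""
    lex, vec = normalize(lexical), normalize(vector)
    docs = set(lex) | set(vec)
    fused = {d: (1 - vector_weight) * lex.get(d, 0.0)
                + vector_weight * vec.get(d, 0.0)
             for d in docs}
    return sorted(docs, key=fused.get, reverse=True)
```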

Key Conference Findings

  • Research the agentic relevance framework => this might have a big impact in the future
  • Discuss interleaved A/B testing => this could reduce A/B test times a lot
  • When we're live, we should increase our focus on query understanding