How to configure and customize a vectorizer?

If you are not fully satisfied with the default vectorizer, you can experiment with other off-the-shelf vectorizers. You can also customize a vectorizer by fine-tuning it on a particular business domain or by adding more context to each segment.

  • Go to the Processing view
  • Create a new vectorizer
  • Give a name to your vectorizer
  • Select an off-the-shelf vectorizer
    • all-MiniLM-L6-v2 (English)
    • OpenAI embeddings (English)
    • Paraphrase-multilingual-MiniLM (multilingual)
  • By default, we use the following vectorizers:
    • all-MiniLM-L6-v2 for English content
    • CamemBERT for French content
    • paraphrase-multilingual-MiniLM-L12-v2 for all other languages
  • Save the vectorizer and, if desired, activate it on your project
  • Active vectorizers are marked with a green tick
  • The default vectorizer used in the semantic search is marked with a yellow cross
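
The default selection described above can be sketched as a simple lookup by language. This is only an illustration, not the platform's actual implementation: the function name, the constants, and the use of ISO 639-1 language codes are assumptions, and the strings are the labels from the list above rather than verified model identifiers.

```python
# Labels copied from the list above; treat them as display names only.
DEFAULT_VECTORIZERS = {
    "en": "all-MiniLM-L6-v2",  # English content
    "fr": "CamemBERT",         # French content
}

# Used for every language that has no dedicated default.
FALLBACK_VECTORIZER = "paraphrase-multilingual-MiniLM-L12-v2"


def default_vectorizer(language_code: str) -> str:
    """Return the default vectorizer label for a language code (hypothetical helper)."""
    return DEFAULT_VECTORIZERS.get(language_code.lower(), FALLBACK_VECTORIZER)


if __name__ == "__main__":
    for code in ("en", "fr", "de"):
        print(code, "->", default_vectorizer(code))
```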

Customization

  • You can add context to each segment to improve its vectorization and, as a result, the quality of the Retriever.
    • Go to the Processing menu
    • Create a new Vectorizer
    • Select “Advanced vectorizer”
    • Select “Web template engine Jinja” in the off-the-shelf component list.
    • The parameters allow you to create a Jinja script (see here for instance).
      • To add the document title to each text segment, write “{{ title }} > {{ text }}” in the Jinja template.
      • To add the title and document metadata to each text segment, write “{{ title }} > {{ metadata.name }} > {{ text }}” in the Jinja template.
      • Kairntech professional services can assist you.
  • A vectorizer can be fine-tuned for a particular business domain or language.
    • Kairntech professional services can do it for you.
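
To see what the two Jinja templates above produce, you can render them locally with the Python jinja2 package. This is a minimal sketch, not the platform's own code: the sample segment, its values, and the add_context helper are made up for illustration, and it assumes that title, metadata and text are exposed to the template as top-level variables.

```python
from jinja2 import Template

# Hypothetical segment; the keys mirror the variables used in the templates above.
segment = {
    "title": "Annual Report 2023",
    "metadata": {"name": "finance"},
    "text": "Revenue grew in the third quarter.",
}

# The two templates quoted in the Customization steps.
title_template = Template("{{ title }} > {{ text }}")
full_template = Template("{{ title }} > {{ metadata.name }} > {{ text }}")


def add_context(template: Template, seg: dict) -> str:
    """Render a segment through a template to get the text that would be vectorized."""
    return template.render(**seg)


if __name__ == "__main__":
    print(add_context(title_template, segment))
    print(add_context(full_template, segment))
```

The rendered string, rather than the raw segment text, is what gets passed to the vectorizer, so the title and metadata become part of the embedded content.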