Before we can discuss vector databases, we need to recap a prerequisite concept: "embeddings".
Embeddings are a fundamental concept in Natural Language Understanding (NLU): they represent text in a numerical form that machines can process efficiently. A single embedding can stand in for text of almost any length, from a single word to a lengthy paragraph or an entire chapter.
At its core, an embedding algorithm transforms text into a numerical format, specifically a list of numbers known as a vector. Each number in the vector is referred to as a “dimension”. These vectors often contain a large number of dimensions, sometimes over a thousand, allowing them to capture the nuance and complexity of the text.
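To make this concrete, here is a minimal sketch using the open-source sentence-transformers library; the model name and sample text are illustrative assumptions, not prescriptions from this article:

```python
# Minimal sketch: turning text into an embedding vector.
# Assumes the `sentence-transformers` package is installed
# (pip install sentence-transformers); the model choice is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

text = "Vector databases store embeddings for fast similarity search."
vector = model.encode(text)  # returns a numpy array of floats

print(len(vector))  # number of dimensions (384 for this model)
print(vector[:5])   # the first five dimensions of the vector
```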
What makes embeddings particularly flexible is that each algorithm produces vectors in its own way, so the same text ends up with a different representation under each one. This diversity allows you to select the embedding algorithm that best suits the specific needs of an application. Dozens of these algorithms exist, each with its own strengths and ideal use cases. A great resource for reviewing and selecting an embedding algorithm is Hugging Face, which provides comprehensive information about different models.
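As a hedged illustration of this variability, the sketch below embeds the same sentence with two different models available on Hugging Face; the model names are examples, not recommendations:

```python
# Illustrative sketch: the same text embedded by two different models
# yields vectors of different sizes and values. Model names are examples
# from Hugging Face's model hub, not recommendations.
from sentence_transformers import SentenceTransformer

text = "Embeddings map text to points in a high-dimensional space."

model_a = SentenceTransformer("all-MiniLM-L6-v2")   # 384 dimensions
model_b = SentenceTransformer("all-mpnet-base-v2")  # 768 dimensions

vec_a = model_a.encode(text)
vec_b = model_b.encode(text)

print(len(vec_a), len(vec_b))  # 384 768 -- incompatible vector spaces
```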
One notable example of an embedding model is “text-embedding-ada-002” from OpenAI. This model, however, is proprietary and accessible only via an API on the OpenAI platform. The vectors it generates (1,536 dimensions each) are specific to this model: they cannot be meaningfully compared with vectors produced by a different algorithm.
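A minimal sketch of calling this model through OpenAI's Python SDK (v1-style client) is shown below; it assumes the `openai` package is installed and an `OPENAI_API_KEY` environment variable is set:

```python
# Minimal sketch: requesting an embedding from text-embedding-ada-002.
# Assumes the `openai` package (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="What is a vector database?",
)

vector = response.data[0].embedding  # a list of 1,536 floats
print(len(vector))
```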
These generated vectors are typically stored in a specialized database called a vector database. Alongside each vector, additional information such as the original text, headings, or pre-computed classification labels is also saved. This combined storage enables efficient and sophisticated operations like searching, comparing, and analyzing text: a query can be matched against stored content, or different pieces of text can be compared directly, enhancing the capabilities of language-based applications.
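To illustrate what a vector database does at its core, here is a deliberately simplified in-memory sketch: it stores each vector next to its original text and metadata, and answers a query by cosine similarity. Real vector databases add indexing, persistence, and approximate nearest-neighbor search on top of this idea; the class and field names here are hypothetical.

```python
# Deliberately simplified, in-memory sketch of what a vector database does:
# store vectors alongside their source text and metadata, then answer a
# query by similarity. Real systems add indexing, persistence, and
# approximate nearest-neighbor search; this is only a conceptual model.
import numpy as np

class ToyVectorStore:
    def __init__(self):
        self.vectors = []   # list of numpy arrays
        self.records = []   # parallel list of {"text": ..., "label": ...}

    def add(self, vector, text, label=None):
        self.vectors.append(np.asarray(vector, dtype=float))
        self.records.append({"text": text, "label": label})

    def search(self, query_vector, k=3):
        q = np.asarray(query_vector, dtype=float)
        # Cosine similarity between the query and every stored vector.
        sims = [
            float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        return [(sims[i], self.records[i]) for i in top]
```

In practice, the stored vectors and the query vector must come from the same embedding model; as noted above, vectors from different models live in incompatible spaces and cannot be compared meaningfully.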
See Also: Vector Space, Word level embedding, Word2vec