ChatGPT and other generative AI programs spit out “hallucinations,” assertions of falsehoods as fact, because the programs are not built to “know” anything; they are simply built to produce a string of characters that is a plausible continuation of whatever you’ve just typed.
“If I ask a question about medicine or legal or some technical question, the LLM [large language model] will not have that information, especially if that information is proprietary,” said Edo Liberty, CEO and founder of startup Pinecone, in an interview recently with ZDNET. “So, it will just make up something, what we call hallucinations.”
Liberty’s company, a four-year-old, venture-backed software maker based in New York City, specializes in what’s called a vector database. The company has received $138 million in financing for the quest to ground the merely plausible output of GenAI in something more authoritative, something resembling actual knowledge.
“The right thing to do, is, when you have the query, the prompt, go and fetch the relevant information from the vector database, put that into the context window, and suddenly your query or your interaction with the language model is a lot more effective,” explained Liberty.
Vector databases are one corner of a rapidly expanding effort called “retrieval-augmented generation,” or RAG, whereby LLMs seek outside input in the midst of forming their outputs in order to amplify what the neural network can do on its own.
Of all the RAG approaches, the vector database is among those with the deepest background in both research and industry. It has been around in a crude form for over a decade.
In his prior roles at huge tech companies, Liberty helped pioneer vector databases as an under-the-hood, skunkworks affair. He has served as head of research for Yahoo!, and as senior manager of research for the Amazon AWS SageMaker platform, and, later, head of Amazon AI Labs.
“If you look at shopping recommendations at Amazon or feed ranking at Facebook, or ad recommendations, or search at Google, they’re all working behind the scenes with something that is effectively a vector database,” Liberty told ZDNET.
For many years, vector databases were “still a kind of a well-kept secret” even within the database community, said Liberty. Such early vector databases weren’t off-the-shelf products. “Every company had to build something internally to do this,” he said. “I myself participated in building quite a few different platforms that require some vector database capabilities.”
Liberty’s insight in those years at Amazon was that vectors couldn’t simply be stuffed inside an existing database. “It is a separate architecture, it is a separate database, a service, it is a new kind of database,” he said.
It was clear, he said, “where the puck was going” with AI even before ChatGPT. “With language models such as Google’s BERT, that was the first language model that started picking up steam with the average developer,” he said, referring to Google’s language model, introduced in 2018, a precursor to ChatGPT.
“When that starts happening, that’s a phase transition in the market.” It was a transition that he had to jump on, he said.
“I knew how hard it is, and how long it takes, to build foundational database layers, and that we had to start ahead of time, because we only had a couple of years before this would become used by thousands of companies.”
Any database is defined by the way its data are organized, such as the rows and columns of relational databases, and by its means of access, such as the structured query language (SQL) of relational databases.
In the case of a vector database, each piece of data is represented by what’s called a vector embedding, a group of numbers that place the data in an abstract space — an “embedding space” — based on similarity. For example, the cities London and Paris are closer together in a space of geographic proximity than either is to New York. Vector embeddings are just an efficient numeric way to represent the relative similarity.
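The city example can be sketched in a few lines of Python. This is only a toy illustration: it assumes rough latitude and longitude pairs stand in for a two-dimensional embedding, and it uses plain straight-line distance between those pairs rather than true distance over the globe. Real embeddings typically have hundreds or thousands of dimensions and are produced by a trained model.

```python
import math

# Rough (latitude, longitude) pairs used as toy 2-D "embeddings".
# These are approximations chosen for illustration only.
cities = {
    "London": (51.5, -0.1),
    "Paris": (48.9, 2.4),
    "New York": (40.7, -74.0),
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return math.dist(a, b)

london_paris = distance(cities["London"], cities["Paris"])
london_ny = distance(cities["London"], cities["New York"])
paris_ny = distance(cities["Paris"], cities["New York"])

# London and Paris sit closer to each other than either does to New York.
print(london_paris < london_ny and london_paris < paris_ny)
```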
In an embedding space, any kind of data can be represented as closer or farther based on similarity. Text, for example, can be thought of as words that are close, such as “occupies” and “located,” which are closer to each other than either is to a word such as “founded.” Images, sounds, and program code can all be reduced to numeric vectors that are then embedded by their similarity.
To access the database, the vector database turns the query into a vector, and that vector is compared with the vectors in the database based on how close it is to them in the embedding space, what’s known as a “similarity search.” The closest match is then the output, the answer to a query.
You can see how this has obvious relevance for recommender engines: two kinds of vacuum cleaners might be closer to each other than either is to a third type of vacuum. A query for a vacuum cleaner might be matched for how close it is to any of the descriptions of the three vacuums. Broadening or narrowing the query can lead to a broader or finer search for similarity throughout the embedding space.
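A similarity search of this kind can be sketched as follows. The sketch rests on loud assumptions: the `embed` function is a hypothetical stand-in that simply counts words from a tiny fixed vocabulary, where a real system would use a trained embedding model, and the product descriptions are invented. Cosine similarity is one common closeness measure; the article does not say which measure any particular product uses.

```python
import math
from collections import Counter

# Hypothetical fixed vocabulary; a real embedding model learns its own space.
VOCAB = ["vacuum", "cordless", "robot", "upright", "blender", "kitchen"]

def embed(text):
    """Toy embedding: one count per vocabulary word."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented product descriptions, stored alongside their vectors.
documents = ["cordless upright vacuum", "robot vacuum", "kitchen blender"]
index = [(doc, embed(doc)) for doc in documents]

def search(query, k=1):
    """Turn the query into a vector and return the k most similar documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(search("quiet robot vacuum"))
```

Raising `k` broadens the search to include less similar matches, such as the other vacuum.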
But similarity search across vector embeddings is not itself sufficient to make a database. At best, it is a simple index of vectors for very basic retrieval.
A vector database, Liberty contends, has to have a management system, just like a relational database, something to handle numerous challenges of which a user isn’t even aware. That includes how to store the various vectors across the available storage media, and how to scale the storage across distributed systems, and how to update, add and delete vectors within the system.
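To make the distinction concrete, here is a minimal sketch of the add, update, and delete operations wrapped around a similarity query. The class name `TinyVectorStore` and its methods are hypothetical and are not Pinecone's API. Everything lives in one in-memory dictionary, so the sketch deliberately leaves out the hard parts Liberty describes: placing vectors across storage media and scaling across distributed systems.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store: a dictionary of id -> vector."""

    def __init__(self):
        self._vectors = {}

    def __len__(self):
        return len(self._vectors)

    def upsert(self, item_id, vector):
        """Add a new vector, or overwrite the one already stored under item_id."""
        self._vectors[item_id] = list(vector)

    def delete(self, item_id):
        """Remove a vector; silently ignore ids that are not present."""
        self._vectors.pop(item_id, None)

    def query(self, vector, top_k=1):
        """Return the ids of the top_k nearest vectors by Euclidean distance."""
        ranked = sorted(
            self._vectors.items(),
            key=lambda pair: math.dist(vector, pair[1]),
        )
        return [item_id for item_id, _ in ranked[:top_k]]

store = TinyVectorStore()
store.upsert("red", [1.0, 0.0, 0.0])
store.upsert("green", [0.0, 1.0, 0.0])
store.upsert("blue", [0.0, 0.0, 1.0])
store.upsert("red", [0.9, 0.1, 0.0])  # update an existing entry
store.delete("blue")                  # remove an entry

nearest = store.query([1.0, 0.0, 0.0], top_k=2)
print(nearest)
```

The query here scans every stored vector, which is workable for a handful of items but not at the scale the article discusses.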
“Those are very, very unique queries, and very hard to do, and when you do that at scale, you have to build the system to be highly specialized for that,” said Liberty.
“And it has to be built from the ground up, in terms of algorithms and data structures and everything, and it has to be cloud-native, otherwise, honestly, you can’t really get the cost, scale, performance trade-offs that make it feasible and reasonable in production.”
Matching queries to vectors stored in a database obviously dovetails well with large language models such as GPT-4. Their main function is to match a query in vector form to their amassed training data, summarized as vectors, and to what you’ve previously typed, also represented as vectors.
“The way LLMs [large language models] access data, they actually access the data with the vector itself,” explained Liberty. “It’s not metadata, it’s not an added field, that is the primary way that the information is represented.”
For example, “If you want to say, give me everything that looks like this, and I see an image — maybe I crop a face and say, okay, fetch everybody from the database that looks like that, out of all my images,” explained Liberty.
“Or if it’s audio, something that sounds like this, or if it’s text, it’s something that’s relevant from this document.” Those sorts of combined queries can all be a matter of different similarity searches across different vector embedding spaces. That could be particularly useful for the multi-modal future that is coming to GenAI, as ZDNET has related.
The whole point, again, is to reduce hallucinations.
“Say you are building an application for technical support: the LLM might have been trained on some random products, but not your product, and it definitely won’t have the new release that you have coming up, the documentation that’s not public yet.” As a consequence, “It will just make up something.” Instead, with a vector database, a prompt pertaining to the new product will be matched to that particular information.
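The retrieve-then-prompt flow Liberty describes can be sketched roughly as below. The documents, the word-count `embed` function, and the prompt wording are all assumptions made for illustration, and no language model is actually called: the sketch stops at assembling the text that would be placed in the context window.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy embedding: word counts over a vocabulary (a stand-in for a real model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented private documentation the model was never trained on.
docs = [
    "release 2.0 adds offline mode to the mobile app",
    "the billing page supports annual invoices",
]
vocab = sorted({word for doc in docs for word in doc.lower().split()})
index = [(doc, embed(doc, vocab)) for doc in docs]

def retrieve(question, top_k=1):
    """Fetch the stored documents most similar to the question."""
    q = embed(question, vocab)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def build_prompt(question):
    """Place the retrieved text in front of the question, as context for the model."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# The question is written without punctuation because the toy tokenizer
# splits on whitespace only.
prompt = build_prompt("does release 2.0 have offline mode")
print(prompt)
```

Only the passage about the new release reaches the prompt; the unrelated billing note is left out, which is the point of retrieving by similarity rather than pasting in everything.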
There are other promising avenues being explored in the overall RAG effort. AI scientists, aware of the limitations of large language models, have been trying to approximate what a database can do. Numerous parties, including Microsoft, have experimented with directly attaching to the LLMs something like a primitive memory, as ZDNET has previously reported.
By expanding the “context window,” the term for the amount of text a program such as ChatGPT can take into account from the prompt and the preceding conversation, more can be recalled with each turn of a chat session.
That approach can only go so far, Liberty told ZDNET. “That context window might or might not contain the information needed to actually produce the right answer,” he said, and in practice, he argues, “It almost certainly will not.”
“If you’re asking a question about medicine, you’re not going to put in the context window all of the knowledge of medicine,” he pointed out. In the worst-case scenario, such “context stuffing,” as it’s called, can actually exacerbate hallucinations, said Liberty, “because you’re adding noise.”
Of course, other database software and tools vendors have seen the virtues of searching for similarities between vectors, and are adding capabilities to their existing wares. That includes MongoDB, one of the most popular non-relational database systems, which has added “vector search” to its Atlas cloud-managed database platform. It also includes small-footprint database vendor Couchbase.
“They don’t work,” said Liberty of the me-too efforts, “because they don’t even have the right mechanisms in place.”
The means of access of other database systems can’t be bolted to vector similarity search, in his view. Liberty offered an example of recall. “If I ask you what is your most recent interview you’ve done, what happens in your brain is not an SQL query,” he said, referring to the structured retrieval language of relational databases.
“You have connotations, you can fetch relevant information by context; that similarity or analogy is something vector databases can do because of the way they represent data,” he said, something other databases can’t do because of their structure.
“We are highly specialized to do vector search extremely well, and we are built from the ground up, from algorithms, to data structures, to the data layout and query planning, to the architecture in the cloud, to do that extremely well.”
What MongoDB, Couchbase, and the rest “are trying to do, and, in some sense, successfully, is to muddy the waters on what a vector database even is,” he said. “They know that, at scale, when it comes to building real-world applications with vector databases, there’s going to be no competition.”
The momentum is with Pinecone, argues Liberty, by virtue of having pursued his original insight with great focus.
“We have today thousands of companies using our product,” said Liberty, “hundreds of thousands of developers have built stuff on Pinecone, our clients are being downloaded millions of times and used all over the place.” Pinecone is “ranked as number one by God knows how many different surveys.”
Going forward, said Liberty, the next several years for Pinecone will be about building a system that comes closer to what knowledge actually means.
“I think the interesting question is how do we represent knowledge?” Liberty told ZDNET. “If you have an AI system that needs to be truly intelligent, it needs to know stuff.”
The path to representing knowledge for AI, said Liberty, is definitely a vector database. “But that is not the end answer,” he said. “That is the initial part of the answer.” There’s another “two, three, five, ten years worth of investment in the technology to make those systems integrate with one another better to represent data more accurately,” he said.
“There is a huge roadmap ahead of us of making knowledge an integral part of every application.”