Frank Liu, Zilliz - what initially attracted you to machine learning?

Updated September 14, 2022

Frank Liu is Director of Operations at Zilliz, a leading provider of vector database and artificial intelligence technologies. Zilliz's engineers and scientists also created Milvus®, an LF AI & Data Foundation project and the world's most popular open source vector database.

What initially attracted you to machine learning?

I was first introduced to the possibilities of ML/AI when I was an undergraduate at Stanford, even though it was somewhat removed from my major (electrical engineering). I was initially attracted to electrical engineering because the ability to translate complex electrical and physical systems into mathematical approximations seemed very powerful to me, and statistics and machine learning seemed just as powerful. I took more classes on computer vision and machine learning in graduate school and wrote a master's thesis on using machine vision to evaluate the aesthetic beauty of images. All of this led to my first job in the computer vision and machine learning group at Yahoo, where I did a mix of research and software development. At the time, we were still in the pre-transformer days of AlexNet and VGG, and watching an entire field and industry evolve so rapidly - from data preparation to massively parallel model training and production deployment - was amazing. In many ways, using the phrase "back then" to refer to something that happened less than 10 years ago seems a bit ridiculous, but such is the progress that has been made in the field.

After Yahoo, I was CTO of a startup I co-founded, where we used ML for indoor localization. There we had to optimize sequential models for very small microcontrollers - a very different but nevertheless related engineering challenge compared to today's massive LLMs and diffusion models. We also built hardware, visualization dashboards, and simple cloud-native applications, but ML was always a core component of our work.

Even though I have been in or adjacent to the ML field for 7-8 years now, I still have a great love for circuit and digital logic design. My electrical engineering background helps me a lot in the work I do today. Many important computer architecture and digital design concepts - virtual memory, branch prediction, concurrent execution in HDLs - carry over directly to modern ML and distributed systems. While I understand the appeal of CS, I hope the next couple of years will see a resurgence of the more traditional engineering specialties - EE, MechE, ChemE, and so on.

For readers unfamiliar with the term, could you explain what unstructured data is?

Unstructured data refers to "complex" data, that is, data that cannot be stored in a predefined format or fit into an existing data model. By contrast, structured data is any type of data that has a predefined structure: numeric data, strings, tables, objects, and key/value stores are all examples of structured data.

To better understand what unstructured data is and why it is traditionally difficult to process computationally, we need to compare it to structured data. Simply put, traditional structured data can be stored using a relational model. Take, for example, a relational database with a table for storing information about books: each row of the table could represent a specific book indexed by ISBN, and the columns could represent the corresponding categories of information, such as title, author, publication date, and so on. Nowadays there are much more flexible data models - wide-column stores, object databases, graph databases, and so on - but the general idea remains the same: these databases are designed to store data that conforms to a particular form or data model.

On the other hand, unstructured data can be thought of as a pseudo-random blob of binary data. It can represent anything, be arbitrarily large or small, and be transformed and read in countless different ways. It is therefore impossible to fit it into any data model, let alone a table in a relational database.

What are some examples of this type of data?

Human-created data - images, video, audio, natural language, etc. - is a great example of unstructured data. But there are many less obvious examples as well: user profiles, protein structures, genome sequences, and even human-readable code are also unstructured data. The main reason unstructured data has traditionally been so difficult to manage is that it can take any form and can require very different processing to interpret.

Images are a good example: two photos of the same scene can have completely different pixel values but similar overall content. Natural language is another example I like to refer to. The phrases "electrical engineering" and "computer science" are very closely related - so much so that the EE and CS buildings at Stanford are right next to each other - but without a way to encode the semantic meaning of these phrases, a computer might naively conclude that "computer science" and "social sciences" are more closely related.
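To make this concrete, here is a minimal sketch of how semantic closeness is measured once phrases have been mapped to vectors. The three small vectors below are invented purely for illustration; real embeddings produced by a language model would have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings, invented for illustration only.
embeddings = {
    "electrical engineering": [0.9, 0.8, 0.1, 0.0],
    "computer science":       [0.8, 0.9, 0.2, 0.1],
    "social sciences":        [0.1, 0.2, 0.9, 0.8],
}

ee_cs = cosine_similarity(embeddings["electrical engineering"],
                          embeddings["computer science"])
cs_ss = cosine_similarity(embeddings["computer science"],
                          embeddings["social sciences"])
print(ee_cs > cs_ss)  # prints True: the EE/CS pair is far closer than CS/social sciences
```

With a good embedding model, this comparison captures exactly the semantic relatedness that raw string comparison misses.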

What is a vector database?

To understand what a vector database is, we must first understand what an embedding is. I'll come back to this question later, but in short, an embedding is a high-dimensional vector that captures the semantics of unstructured data. In general, two embeddings that are close to each other are very likely to correspond to semantically similar input data. Modern ML allows a wide variety of unstructured data types - such as images and text - to be encoded into semantically rich embedding vectors.

From an organizational perspective, managing unstructured data becomes incredibly difficult once the amount of data exceeds a certain limit. This is where vector databases such as Zilliz Cloud come to the rescue. A vector database is designed to store, index, and search huge amounts of unstructured data by using embeddings as the underlying representation. Search in a vector database is usually performed with a query vector, and the results are the N most similar vectors by distance.
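At its core, the search a vector database performs can be sketched as nearest-neighbor retrieval. The brute-force version below, with a toy index invented for illustration, shows the idea in plain Python; a real vector database replaces the linear scan with approximate indexes and distributed storage so it scales to billions of vectors.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(index, query, top_n):
    """Return the top_n (id, distance) pairs closest to the query vector."""
    scored = [(item_id, euclidean(vec, query)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]

# Toy "database" of 2-D vectors, invented for illustration.
index = {
    "doc_a": [0.0, 0.0],
    "doc_b": [1.0, 1.0],
    "doc_c": [5.0, 5.0],
}

results = search(index, query=[0.9, 1.1], top_n=2)
print([item_id for item_id, _ in results])  # prints ['doc_b', 'doc_a']
```

Everything else a production vector database adds - indexing, sharding, replication - exists to make this one operation fast and reliable at scale.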

The best vector databases have many of the features of traditional relational databases: horizontal scaling, caching, replication, failover, and query execution are just a few of the many features a true vector database should implement. As pioneers of the category, we are also active in academia, publishing papers at SIGMOD 2021 and VLDB 2022, two leading database conferences.

Could you discuss what an embedding is?

In general, an embedding is a high-dimensional vector taken from the activations of an intermediate layer of a multilayer neural network. Many neural networks are trained to produce embeddings directly, and some applications use vectors composed from multiple intermediate layers as embeddings, but I won't delve too deeply into those details now. Another, less common but no less important way of generating embeddings is manual feature engineering. Rather than training an ML model to learn the right representations for the input data automatically, good old-fashioned feature generation techniques can also work well in many applications. Regardless of the underlying method, embeddings of semantically similar objects end up close to one another, and it is this property that is the foundation of vector databases.
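As a toy illustration of the manual feature engineering route, the sketch below embeds a string as a fixed-length vector of letter counts. It is far too crude for real use, but it shows the defining property of any embedding function: a deterministic mapping from arbitrary input to a vector of fixed dimension.

```python
import string

def char_count_embedding(text):
    """Map a string to a 26-dimensional vector of lowercase letter counts.

    A hand-crafted feature vector: no learning involved, just a fixed
    rule that turns variable-length text into a fixed-length vector.
    """
    text = text.lower()
    return [text.count(letter) for letter in string.ascii_lowercase]

vec = char_count_embedding("Milvus")
print(len(vec))  # prints 26: fixed dimension regardless of input length
```

A learned embedding model plays the same role as this function, but its mapping is trained so that distance in the output space tracks semantic similarity rather than surface spelling.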

What are the most popular uses of this technology?

Vector databases are great for any application that requires semantic search - product recommendation, video analytics, document retrieval, threat and fraud detection, and AI-based chatbots are some of the most popular use cases for vector databases today. As an example, Milvus, the open source vector database created by Zilliz and underlying Zilliz Cloud, is used by more than a thousand enterprise users in a wide variety of applications.

I'm always happy to talk about these applications and help people understand how they work, but I also really enjoy talking about some of the lesser-known use cases for vector databases. Drug discovery is one of my favorite "niche" use cases. This particular application searches a database of 800 million compounds for candidate drugs to treat a particular disease or symptom. One of the pharmaceutical companies we spoke with was able to significantly speed up its drug discovery process, while reducing hardware resources, by combining Milvus with the RDKit cheminformatics library.

Another example I would like to cite is the Cleveland Museum of Art's (CMA) AI ArtLens. AI ArtLens is an interactive tool that, taking a query image as input, retrieves visually similar images from the museum's database. This is commonly referred to as reverse image search and is a fairly common use case for vector databases, but the unique value proposition Milvus provided to the CMA was the ability to get the application up and running within a week with a very small team.

Could you tell us what the open source platform Towhee is?

In talking to the Milvus community, we found that many users would like a unified way of generating embeddings for Milvus. This is true for almost every organization we've talked to, but it's especially true for companies that don't have many machine learning engineers. With Towhee, we aim to solve this problem with what we call "vector data ETL." While traditional ETL pipelines focus on combining and transforming structured data from various sources into a usable format, Towhee is designed to work with unstructured data and explicitly incorporates ML into the resulting ETL pipeline. Towhee does this by providing hundreds of models, algorithms, and transformations that can be used as building blocks in a vector data ETL pipeline. In addition, Towhee provides an easy-to-use Python API that lets developers create and test such pipelines in a single line of code.

While Towhee is a standalone project, it is also part of the broader Milvus-based vector database ecosystem that Zilliz is building. We believe Milvus and Towhee are two highly complementary projects that, used together, can truly democratize the processing of unstructured data.

Zilliz recently raised $60 million in Series B funding. How will this accelerate the realization of Zilliz's mission?

First of all, I would like to thank Prosperity7 Ventures, Pavilion Capital, Hillhouse Capital, 5Y Capital, Yunqi Capital, and others for believing in Zilliz's mission and supporting us with a Series B extension. We have now raised a total of $113 million in funding, and this latest round will support our efforts to scale our engineering and marketing teams. In particular, we will refine our managed cloud offering, which is currently in early access but is scheduled to become generally available later this year. We will also continue to invest in cutting-edge research in databases and artificial intelligence, as we have done over the past four years.

Is there anything else you would like to share about the company Zilliz?

As a company we are growing rapidly, but what really sets our current team apart from others in the database and ML space is our extreme passion for what we create. Our mission is to democratize the processing of unstructured data, and it's absolutely amazing to see so many talented people at Zilliz working towards a single goal. If anything we do sounds interesting to you, don't hesitate to get in touch. We'd love to have you on our team.

If you'd like to learn a bit more, I'm also happy to chat about Zilliz, vector databases, or embedding-related advances in AI/ML. My (virtual) door is always open, so feel free to contact me directly on Twitter/LinkedIn.
