Background — Everything is data-driven

Would you agree with this statement? Technology is deeply integrated into every aspect of our daily lives. I certainly would! When you shop on Amazon, you search millions of products and find the ones that meet your expectations. When you browse Facebook, TikTok, or other social media apps, the feed is so interesting to you that it is hard to stop scrolling. Have you ever wondered how Tinder finds matches and recommends them to its millions of users? I recently read a few articles on recommendation engines from the Tinder Tech Blog, and from the software perspective, a fascinating amount of effort goes into generating friend recommendations at large scale.

Behind the scenes, all of these applications depend on retrieving relevant data efficiently to deliver seamless user experiences. As data volume grows exponentially, the ability to search through data quickly, accurately, and intelligently becomes a critical component of modern software engineering. As software engineers and machine learning engineers, we have tools we can leverage to apply the right search strategies and meet user expectations for speed and relevance. Whether it's looking up a user profile by email, searching for text tags, or performing semantic searches with AI-based models, choosing the right tools and techniques can make all the difference in performance and scalability.

This article explores the evolution of search strategies, from fundamental index seeks to modern vector searches powered by machine learning. Using relatable examples, we'll walk through how these approaches work, their use cases, and example cloud solutions that support them. By the end, you'll have a clearer understanding of how to design efficient search systems in the era of big data and AI.

Image created by author.
Icons from flaticon.com.

Common scenarios

Imagine we have an app similar to Tinder. The top data-retrieval functionalities might include:

- User login by email and password
- Find all users who are at least age 28
- Find all users who are tagged as a "pet owner"
- Find all users who have an interest in sports (it does not matter which sport) — this one is particularly interesting: if I were a sports lover, I would want to find them.

Let's use these scenarios as examples and see how we can efficiently serve each query.

Sample user data set

For illustration purposes, below is a tiny set of fictional user data. This is fake data. In the real world, each user could be a row in a SQL table, a JSON document in a NoSQL container, or a hybrid. Every column below is straightforward except "Introduction Embedding", which is the numerical representation of "Introduction" in the embedding space. You can think of it as the format in which a machine learning algorithm reads the introduction text.

Full data set for the sample users

Scenario 1 — User login by email

The app needs to look up the user by email address and password. A pseudo query may look like:

select * from users u where u.Email = {email} and u.PasswordHash = Hash({password})

How do we efficiently serve this query? The answer is an index seek! Index seek is the go-to solution for exact-match queries.

How to implement: Create an index on the Email field, which allows the database to perform an index seek to directly locate the record matching the email.

How an index seek works: The database consults the index, which is commonly organized as a B-tree, and navigates the tree structure directly to the exact matching value. This operation is highly efficient — logarithmic in the number of indexed rows.

Result: Search result from exact match

Scenario 2 — Find all users who are at least age 28

The app needs to look up users with age ≥ 28.
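The seek behavior of a B-tree index — and the range scan we are about to use for the age query — can be sketched in a few lines of Python with a sorted index and binary search. This is a toy illustration, not a real B-tree; the user records and field names here are invented:

```python
import bisect

# Toy user table: row id -> record (values invented for illustration).
users = {
    1: {"email": "amy@example.com", "age": 31},
    2: {"email": "bob@example.com", "age": 27},
    3: {"email": "cat@example.com", "age": 28},
}

# An index is conceptually a sorted list of (key, row_id) pairs.
email_index = sorted((u["email"], rid) for rid, u in users.items())
age_index = sorted((u["age"], rid) for rid, u in users.items())

def index_seek(index, key):
    """Exact-match lookup: O(log n), like walking a B-tree to one leaf."""
    i = bisect.bisect_left(index, (key,))
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None

def index_range_scan(index, low):
    """Locate the start of the range, then scan forward through the entries."""
    i = bisect.bisect_left(index, (low,))
    return [rid for _, rid in index[i:]]

print(index_seek(email_index, "bob@example.com"))  # -> 2
print(index_range_scan(age_index, 28))             # row ids with age >= 28
```

The seek touches a single entry, while the range scan keeps reading adjacent entries after finding the starting point — which is exactly why a range scan is slightly more expensive than an exact match.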
Note that in practice the query would usually include other criteria to narrow down the results; for illustration purposes, we will use only this single criterion. A pseudo query may look like:

select * from users u where u.Age >= {age}

How do we efficiently serve this query? The answer is an index range scan!

How to implement: Similar to the index seek above, we create an index on the Age field, which allows the database to perform a range scan over the index to retrieve users aged 28 or above.

How an index range scan works: In a range query (e.g., find all users where Age >= 28), the database uses the B-tree index to locate the start of the range (e.g., Age = 28) and then scans through adjacent leaf nodes to retrieve all matching values within the range. While this still leverages the efficiency of the B-tree structure, the range scan reads multiple entries, which makes it slightly less efficient than an exact-match query.

Expected Result: Search result from range query

Scenario 3 — Find all users who are tagged as a "pet owner"

The app needs to look up users who are tagged as "pet owner". This is a kind of keyword search. A pseudo query may look like:

select * from users u where contains(u.Tags, "pet owner")

How do we efficiently serve this query? The answer is full-text search! For more detail on full-text search and the relevant terminology, see the Microsoft documentation on Full-Text Search as an example.

How to implement: Tags are stored as a text field. We can set up full-text search on the users table. For example, SQL Server supports full-text queries in two quick steps: create a full-text catalog, then create a full-text index. See detailed instructions in the Microsoft documentation Get Started with Full-Text Search.

How it works: The full-text index tokenizes the Tags values (pet owner, student, etc.) and maps each token to its corresponding rows. The query looks up the term "pet owner" and retrieves the relevant records.
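The tokenize-and-map step behind a full-text index can be sketched as a simple inverted index. This is a toy version — real engines add stemming, stop words, thesaurus lookups, and relevance ranking — and the tag values below are invented for illustration:

```python
from collections import defaultdict

# Toy rows: row id -> free-text Tags field (values invented for illustration).
tags = {
    1: "pet owner, student",
    2: "gamer, foodie",
    3: "pet owner, runner",
}

# Build the inverted index: token -> set of row ids containing that token.
inverted = defaultdict(set)
for rid, text in tags.items():
    for token in text.replace(",", " ").lower().split():
        inverted[token].add(rid)

def contains(phrase):
    """Rows whose tags contain every token of the phrase (AND semantics)."""
    token_sets = [inverted.get(t, set()) for t in phrase.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

print(sorted(contains("pet owner")))  # -> [1, 3]
print(sorted(contains("animal")))     # -> [] : no row has the literal token
```

Notice that searching for "animal" finds nothing, even though the pet owners clearly have animals — the keyword-search limitation discussed next.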
Expected Result:

Full-text search is a strategy for keyword search, and it supports synonyms and variants of a word (e.g., stemming). The downside: if you search user tags for "have an animal", you may not find the pet owners above, because no user has the tag "animal". This is where semantic search can help!

Scenario 4 — Find all users who also have an interest in sports

The app needs to look up users whose introduction indicates they may have an interest in some sort of sport. The query string can be something like "sports lover", "have interest in sports", or anything semantically equivalent. A pseudo query may look like:

select top 3 u.Id from users u order by VectorDistance(u.IntroductionEmbedding, [0.0492,0.3952,....,0.8325])

Where:

- VectorDistance is an example of a system function that calculates a similarity score between the embedding of the search query (i.e., from the user who is running the query) and the embeddings of the other users.
- [0.0492,0.3952,....,0.8325] is the embedding for the query string "have interest in sports".

How do we efficiently serve this query? The answer is vector search!

How to implement: In this example, each user contains not only traditional data like name, email, and introduction text, but also a high-dimensional vector property called Introduction Embedding. This co-location of the original data (introduction text) and the corresponding vectors (introduction embedding) allows for efficient indexing and searching, as the vectors are stored in the same logical storage unit. An alternative approach is a separate database dedicated to storing vectors — a vector store. In this example, we keep the original data and the vectors integrated in the same storage; the benefits are greater data consistency and less complexity. Fun fact: the ChatGPT service is built on top of Azure Cosmos DB for its chat history, as noted in the Microsoft documentation Vector database.
The implementation steps:

1. Enable vector indexing on the Introduction Embedding field. The vector indexing algorithm differs from the B-tree index above: it can be k-nearest neighbors (k-NN) for full accuracy or approximate nearest neighbors (ANN) for better efficiency. For more options on index algorithms and their tradeoffs, the Microsoft documentation Vectors in Azure AI Search is a good introduction.
2. Calculate each user's introduction into its numeric embedding using a chosen embedding model — for example, the OpenAI text-embedding-ada-002 model. This is the user's introduction embedding.
3. Calculate the embedding for the query string using exactly the same embedding model. This ensures the data vectors being searched and the query vector have consistent dimensions and are represented the same way in the vector space, so you can compute their distance or similarity score in that multidimensional space!
4. Perform the vector search: use a similarity metric like cosine similarity to find the records whose embeddings are closest to the query. The vector index makes this search efficient.

How it works: An embedding is a mathematical representation of text (or another modality such as image or video), often a vector of floats. The vector has some dimension N, i.e., size = N, where each dimension corresponds to a feature of the original data. Representing data as vectors of floats lets machine learning algorithms run calculations such as cosine similarity between two vectors. The higher the cosine similarity score, the more similar the two vectors are in the vector space — in other words, two users may share more similar interests if their introduction embeddings have a higher cosine similarity score.

Expected Result: Search result from vector search
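The steps above — embed, compare, rank — can be sketched as a brute-force nearest-neighbor search over cosine similarity. This is a toy illustration in pure Python: the 3-dimensional embeddings and user names are invented (a real model such as text-embedding-ada-002 produces ~1536 dimensions, and a real system would use a vector index rather than scoring every row):

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy introduction embeddings (values invented for illustration).
user_embeddings = {
    "alice": [0.9, 0.1, 0.0],   # pretend: "loves soccer and tennis"
    "bob":   [0.8, 0.2, 0.1],   # pretend: "plays basketball on weekends"
    "carol": [0.0, 0.1, 0.9],   # pretend: "enjoys painting and museums"
}

def vector_search(query_embedding, k=2):
    """Brute-force k-NN: score every row, return the top-k most similar."""
    scored = [(cosine_similarity(query_embedding, emb), uid)
              for uid, emb in user_embeddings.items()]
    scored.sort(reverse=True)
    return [uid for _, uid in scored[:k]]

query = [0.85, 0.15, 0.05]  # pretend embedding of "have interest in sports"
print(vector_search(query))  # -> ['alice', 'bob']
```

The two sporty users rank highest even though their introductions share no literal keywords with the query — the semantic match lives entirely in the geometry of the vectors, which is what the full-text approach in Scenario 3 could not deliver.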