Office Data

Posted: **Sat May 24, 2025 6:38 am**

Similarity analysis: Identify texts with similar meanings. We can calculate the distance between the vectors of two texts to determine their semantic similarity. This is useful for finding related documents, recommending similar products, identifying duplicate content, or grouping texts by topic.
Content Optimization: Improving the semantic relevance of content to target keywords. We can compare a text's embedding with a keyword's embedding to determine if the text is relevant to that keyword.
Clustering: Grouping texts with similar meanings into clusters.
Manual: We can analyze the clusters generated by an algorithm to understand the structure of information and discover hidden patterns.
Automatic: We can use algorithms such as k-means to automate cluster creation.
Outlier Detection: Identify texts that stray from the main topic. We can calculate the distance of each text from the centroid of its cluster. Texts that stray too far from the centroid can be considered outliers and therefore content that is not as focused as it should be.
Advanced search: Search based on meaning, not just keywords. We can use embeddings to find documents that express similar ideas, even if they don't contain the same keywords.
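
To make a few of these ideas concrete, here is a minimal sketch of similarity, meaning-based search, and centroid-based outlier detection. Everything in it is an assumption for illustration: the texts, the 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions and come from a model), the helper names, and the 0.5 distance threshold.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def search(query_vec, corpus, top_k=2):
    """Return the top_k (text, score) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query_vec, vec))
              for text, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

def find_outliers(corpus, threshold=0.5):
    """Texts whose cosine distance to the centroid exceeds the threshold."""
    center = centroid(list(corpus.values()))
    return [text for text, vec in corpus.items()
            if 1 - cosine_similarity(vec, center) > threshold]

# Made-up toy vectors standing in for real embeddings (hypothetical data).
corpus = {
    "how to train a puppy": [0.9, 0.1, 0.0],
    "dog obedience tips": [0.8, 0.2, 0.1],
    "puppy crate training": [0.85, 0.15, 0.05],
    "quarterly tax deadlines": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a query such as "puppy training".
query = [0.9, 0.1, 0.05]

results = search(query, corpus)
outliers = find_outliers(corpus)

print(results)
print(outliers)
```

With these toy values, the dog-related texts rank highest for the query and the tax text is the only one flagged as straying from the centroid. The threshold is arbitrary here; in practice it would have to be tuned for each model and each collection of content.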
So it's clear: we want to work with embeddings. But which model should we use to create them?
The problem with embeddings is that while the definition of what they are and what is expected of them is fairly standardized, the process for generating them and, above all, the results of the models that generate them are not.

In other words, embeddings aren't comparable across models. In fact, they have nothing to do with each other. If I compare a vector generated with an OpenAI model with another vector generated with a Google model, the result is meaningless: it may tell me that completely unrelated words are similar, since the dimensions in each case have different meanings. This isn't even related to the company; it has to do with the training itself, so different embedding models from the same company won't match either.
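
A toy sketch can show why mixing models goes wrong. Here "model B" is just "model A" with its dimensions shuffled, which is a deliberate simplification I'm assuming for illustration: real models differ far more than this, and often don't even produce vectors of the same length. The words and the 3-dimensional vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def shuffle_dimensions(vec):
    """Stand-in for a different model: same geometry, dimensions reordered."""
    return [vec[2], vec[0], vec[1]]

# Hypothetical embeddings from "model A".
model_a = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "invoice": [0.0, 0.1, 0.9],
}

# "Model B" encodes the same words, but each dimension means something else.
model_b = {word: shuffle_dimensions(vec) for word, vec in model_a.items()}

# Within a single model, related words are close.
same_model_a = cosine_similarity(model_a["cat"], model_a["kitten"])
same_model_b = cosine_similarity(model_b["cat"], model_b["kitten"])

# Across models, the comparison is meaningless.
cross_related = cosine_similarity(model_a["cat"], model_b["kitten"])
cross_unrelated = cosine_similarity(model_a["cat"], model_b["invoice"])

print(f"A: cat vs kitten          {same_model_a:.2f}")
print(f"B: cat vs kitten          {same_model_b:.2f}")
print(f"A cat vs B kitten         {cross_related:.2f}")
print(f"A cat vs B invoice        {cross_unrelated:.2f}")
```

Inside each model, "cat" and "kitten" stay close. Across models, "cat" looks unrelated to "kitten" and almost identical to "invoice", which is exactly the kind of nonsense you get when vectors from two models are mixed.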

Therefore, one of the things you should know is what embedding models exist and what their characteristics are. Summarizing them all is impossible, especially since at the rate we're evolving with AI, new models emerge every day. But I wanted to summarize the most important and stable ones so you can have them at hand.

So I created this table, so that it can serve as a reference and you can start deciding which one you're going to marry (because yes, once you have the contents in an embedding model, you're going to have to stick with it for everything).


Examples of analysis that we can do with embeddings
