
Imagine you have a huge box of Lego bricks in front of you

How would you start sorting it to understand what you really have? You'd probably make little piles, right? One pile for the large red bricks, another for the small wheels, another for the minifigures... Intuitively, you'd be grouping similar pieces together.



Well, in the world of marketing data and web analytics, we often find ourselves facing a similar "Lego box" of data: hundreds of landing pages, thousands of keywords, millions of user interactions... Invaluable data, but difficult to grasp if we look at it one record at a time. Wouldn't it be great to have an automatic way to create these "piles", to discover groups of pages with similar performance, user segments with similar behaviors, or keywords that perform anomalously?

One option for creating these groups is to build manual groupings, defining by hand the rules that assign each record to one group or another: you can write a regex to classify URLs, or build a histogram that buckets a metric in steps of 100. But this requires a level of knowledge about the data, and a simplicity in the data itself, that we won't always have. Other times we simply want a handful of groups and have no idea what criteria should decide which group a record falls into.
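To make that manual approach concrete, here is a small illustrative SQL sketch. The table `project.dataset.pages` and its columns `page_path` and `sessions` are hypothetical names, and the regex rules are just an example of hand-written criteria, not a recommendation:

-- Hand-written grouping rules (illustrative table and column names)
SELECT
  page_path,
  CASE
    WHEN REGEXP_CONTAINS(page_path, r'^/blog/') THEN 'blog'
    WHEN REGEXP_CONTAINS(page_path, r'^/product/') THEN 'product'
    ELSE 'other'
  END AS page_group,
  -- Histogram-style bucket: sessions rounded down to the nearest 100
  FLOOR(sessions / 100) * 100 AS sessions_bucket
FROM `project.dataset.pages`;

It works, but every rule has to be invented and maintained by hand, which is exactly the limitation described above.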

This is where K-Means comes in, a powerful clustering technique. And best of all, thanks to BigQuery ML, you can apply it directly to your data using SQL queries, without needing to be a data scientist or leave your BigQuery environment.
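To give you an idea of how little is needed, here is a minimal sketch of training a K-Means model in BigQuery ML. The dataset, table, column, and model names are assumptions used only for illustration; the rest of the post goes into the details:

-- Minimal K-Means training sketch (hypothetical dataset, table, and column names)
CREATE OR REPLACE MODEL `project.dataset.pages_kmeans`
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 4   -- the 'K': how many groups we want
) AS
SELECT
  sessions,
  conversions,
  revenue
FROM `project.dataset.page_metrics`;

That single statement trains the model over every row returned by the SELECT, using the numeric columns as the features to compare.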

For example, we can use K-Means to extract key insights from GA4, Google Search Console, crawlers, and our clients' databases. With this technique, we have the potential to:

Automatically segment pages or products by their performance (traffic, conversion, revenue, etc.).
Detect users with specific purchasing or browsing patterns.
Identify keywords or campaigns with unusual results (for better or worse!).
Group content by semantic similarity (although we will cover this better in future posts about embeddings).
Create new dimensions or variables that enrich our analysis.

In this post, we'll guide you step-by-step through what K-Means is, how it works in BigQuery ML, and how you can start using it today to organize your data and make smarter decisions.

Let's get to it!

What are K-Means and autoclustering? (Breaking down the concepts)
Let's start with the basics. Clustering is simply the process of grouping a set of objects (in our case, data rows: pages, users, keywords, etc.) in such a way that the objects within the same group (called a cluster) are more similar to each other than to those in other groups. It's like building those Lego piles we mentioned.

K-Means is one of the most popular algorithms for doing this. Its main idea is quite intuitive: we give it our data and the number of clusters we want it sorted into. The process is so simple (compared to other ML techniques) that we can easily list every step involved in training a K-Means model:

1. Decide how many clusters ('K') you want to create (the total number of groups).
2. The algorithm randomly (or a little more intelligently) chooses 'K' initial points that act as the initial centers of the clusters. These centers are called centroids (remember this word, it's the most important one in K-Means). Think of them as the "heart" or most representative point of each group.
3. The system assigns each of your data points (each page, each user, etc.) to the closest centroid, based on the metrics you've specified (we'll see which ones later).
4. Once everything is assigned, K-Means recalculates the position of each centroid, moving it to the midpoint of all the points assigned to it. This way, its new location is more representative of the data.
5. Repeat steps 3 and 4: reassign the points to the new centroids and recalculate their positions. The algorithm loops through this process until the centroids barely move when recalculated. When their values are very similar, it assumes the clusters are stable and leaves them as they are.
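In BigQuery ML, all of that iteration happens for us when the model is created; once training finishes, we only need to assign rows to the resulting centroids and inspect them. A minimal sketch, reusing the hypothetical `pages_kmeans` model and `page_metrics` table from the earlier example (illustrative names, not real objects):

-- Assign each row to its nearest centroid with ML.PREDICT
SELECT
  page_path,
  CENTROID_ID AS cluster   -- the group K-Means assigned to this page
FROM ML.PREDICT(
  MODEL `project.dataset.pages_kmeans`,
  (
    SELECT page_path, sessions, conversions, revenue
    FROM `project.dataset.page_metrics`
  )
);

-- Inspect where each centroid ended up after training
SELECT * FROM ML.CENTROIDS(MODEL `project.dataset.pages_kmeans`);

ML.PREDICT returns the input columns plus the assigned centroid, and ML.CENTROIDS shows the feature values of each centroid, which is what we'll use to interpret what each group actually means.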