how to implement k nearest neighbors from scratch

Introduction to k nearest neighbors (knn algorithm)

Imagine you're in a new city and want to know if a neighborhood is safe. You'd probably ask your closest neighbors for their experiences, right?

KNN works exactly like that. It's an algorithm that classifies or predicts a new data point by looking at the k most similar (nearest) examples from its training data. No complex rules—just similarity-based reasoning.

New Data Point
Visualizing Euclidean Distance in 2D Space
k=1 k=3 k=10

Current Prediction:

Finding neighbors...
Majority vote among neighbors.

🏷️ Classification

Predicting categories (e.g., "Cat" or "Dog"). KNN takes a majority vote among the k nearest neighbors. If 3 neighbors are cats and 2 are dogs, the prediction is "Cat".

📈 Regression

Predicting numbers (e.g., House Prices). KNN takes the average value of the k nearest neighbors. If neighbors cost $300k, $310k, and $290k, the prediction is $300k.

You use the same core logic for both—just change the final aggregation step. It's simple, intuitive, and requires no assumptions about your data's underlying distribution.

⚠️ Common Misconception: "It just memorizes data"

It's true KNN stores all training data—but that doesn't mean it just memorizes. If it did, it would always pick the single closest neighbor (k=1), which easily overfits to noise.

By using k > 1 neighbors, KNN smooths out irregularities and generalizes. It's "lazy" (no explicit training phase), but during prediction it aggregates multiple points to form a robust decision—that's generalization, not memorization.

How the KNN algorithm makes predictions step by step

1. Store

Store all your labeled training data. (In our visualization above, these are the Blue and Red dots).

2. Compute Distance

For a new point (the Green Star), compute its distance to every single training example. Usually, we use Euclidean distance (straight line).

3. Sort & Select

Sort these distances and pick the k smallest—those are your nearest neighbors. (Watch the lines appear in the visualization as you slide k).

4. Aggregate

Classification: Return the most frequent class.
Regression: Return the average value.

That's the entire algorithm. All the "learning" happens at prediction time by measuring similarity. No parameters to tweak during training—just your data, a distance metric, and a choice of k.

Setting up the development environment for KNN Python implementation

Before writing any KNN logic, you need a clean workspace where you can focus on the algorithm itself—not on fighting your tools. Think of this like sharpening your pencil before drawing: it makes everything else smoother.

terminal
pip install numpy scikit-learn

You only need two core libraries for a from-scratch implementation:

  • NumPy: For efficient numerical operations, especially distance calculations between vectors.
  • scikit-learn (optional but helpful): We won't use its KNN class, but it provides nice datasets and utilities like train_test_split to quickly generate and partition data for testing.

Note: You don't need pandas or matplotlib for the core KNN code—keep dependencies minimal.

📦 Preparing Data Structures

KNN needs your training data in a simple, consistent format. Think of this as the "language" your algorithm speaks.

  • X_train A 2D NumPy array where each row is a sample and each column is a feature.
  • y_train A 1D NumPy array of labels (for classification) or values (for regression), aligned with X_train rows.
import numpy as np

# Example with two features (e.g., height and weight)
X_train = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [7.0, 3.2]
])
y_train = np.array([0, 0, 1])  # class labels

This structure lets you compute distances between a new point and every row in X_train using vectorized operations—fast and clean.

Common Pitfall: Feature Scaling

KNN's decisions are based entirely on distance. If one feature ranges from 0 to 100 (Age) and another from 0 to 100,000 (Income), the Income feature will dominate the distance calculation—even if Age is equally important.

Click the button to see how scaling normalizes the data spread.

🛠️ The Solution: Standardization

We subtract the mean and divide by the standard deviation. This forces every feature to have a mean of 0 and a variance of 1.

⚠️ Crucial Rule: No Data Leakage

Fit the scaler only on training data, then use it to transform both training and new data. Never fit on the whole dataset!

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# For a new point:
X_new_scaled = scaler.transform(X_new.reshape(1, -1))

✅ Verifying Environment Correctness

After setup, confirm everything works with a tiny sanity check. Write a function that computes Euclidean distance between two points—this will be the core of your KNN distance calculations.

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Test
point1 = np.array([1, 2])
point2 = np.array([4, 6])

print(euclidean_distance(point1, point2))
# Output should be 5.0
Success Check

If this prints 5.0, your NumPy is installed correctly and you're ready to build the rest of KNN on top of this simple, reliable building block.

Understanding the KNN Algorithm Core Concepts

The Core Idea: Instance-Based Learning

At its heart, KNN is a lazy learner. Unlike other algorithms that "study" the data to build a general model (like a teacher summarizing a textbook), KNN simply memorizes the entire dataset.

When you ask KNN to classify a new point, it doesn't run a complex formula. It literally looks at its "memory" and finds the k training points that are closest to your new point. It asks them: "What are you?" and adopts their answer.

Common Misconception: "Nearest" isn't always "Best"

KNN finds the mathematically nearest neighbors based on your chosen distance metric. But if your data isn't prepared correctly, the math can be misleading.

Observe how the "nearest" neighbor changes when we normalize the scales.

The Problem

Imagine predicting house prices using:
Bedrooms: range 1–5
Square Feet: range 500–5000

Without scaling, a difference of 1 bedroom is mathematically tiny compared to a difference of 100 sq ft. The algorithm will ignore the bedroom count entirely because the "distance" is dominated by the larger numbers.

The Solution

Feature Scaling (like Standardization) forces all features to have a similar range (e.g., mean 0, std 1). This ensures every feature contributes fairly to the "nearest" calculation.

📐 Formal Definition: Step-by-Step

Let's translate the concept into the precise mathematical steps the computer takes. Given a new point x_new and a training set X_train:

  1. Compute Distances: Calculate the distance (usually Euclidean) between x_new and every single row in X_train.
    distances = || x_new - x_i || for all i
  2. Identify Neighbors: Sort these distances from smallest to largest. Pick the indices of the top k smallest values.
    k_indices = argsort(distances)[:k]
  3. Aggregate Labels: Retrieve the labels for these k neighbors and vote.
    • Classification: Return mode(N_k) (most frequent class).
    • Regression: Return mean(N_k) (average value).
import numpy as np

# 1. Compute distances efficiently using NumPy broadcasting
# X_train shape: (n_samples, n_features)
distances = np.linalg.norm(X_train - x_new, axis=1)

# 2. Sort and select top k indices
k_nearest_indices = np.argsort(distances)[:k]

# 3. Get their labels
k_nearest_labels = y_train[k_nearest_indices]

# 4. Majority Vote (for classification)
from scipy.stats import mode
prediction = mode(k_nearest_labels).mode

Note: In the code above, np.linalg.norm(..., axis=1) calculates the distance from your new point to every training point simultaneously. This vectorization is what makes Python fast enough for KNN.

This formal process is exactly what you envisioned in the introduction—it's just the precise, vectorized implementation of those mental steps. The key insight: all learning is reduced to a distance calculation and a simple vote/average. There are no parameters to learn, only the choice of k and the distance metric to define what "nearest" means.

Implementing the KNN Algorithm from Scratch in Python

Now we turn the mental model into real code. You already know the steps: store training data, compute distances, find k nearest, then aggregate. The implementation mirrors that exactly—no hidden magic.

🐍 KNN Class Implementation

knn.py
import numpy as np

class KNN:
    def __init__(self, k=3, task='classification'):
        self.k = k
        self.task = task  # 'classification' or 'regression'
    
    def fit(self, X_train, y_train):
        # Simply store the data—no training happens here.
        self.X_train = X_train
        self.y_train = y_train
    
    def predict(self, X_test):
        predictions = []
        for x in X_test:
            # 1. Compute Euclidean distance to every training point
            distances = np.linalg.norm(self.X_train - x, axis=1)
            
            # 2. Find indices of the k smallest distances
            k_nearest_indices = np.argsort(distances)[:self.k]
            
            # 3. Get the labels of those neighbors
            k_nearest_labels = self.y_train[k_nearest_indices]
            
            # 4. Aggregate based on task
            if self.task == 'classification':
                # Majority vote
                labels, counts = np.unique(k_nearest_labels, return_counts=True)
                pred = labels[np.argmax(counts)]
            else:  # regression
                pred = np.mean(k_nearest_labels)
            
            predictions.append(pred)
        
        return np.array(predictions)

Why this works:

  • np.linalg.norm(..., axis=1) computes Euclidean distance from x to every row in X_train in one vectorized step.
  • np.argsort(distances) returns indices that would sort the distances from smallest to largest. Slicing [:self.k] gives exactly k indices of the nearest neighbors.
  • For classification, np.unique with return_counts=True counts occurrences of each class among the k neighbors. np.argmax(counts) picks the class with the highest count.
  • For regression, np.mean averages the neighbor values.

Common Pitfall: Off-by-one Errors in Slicing

The most frequent mistake is mis-slicing the sorted indices. Remember: Python uses zero-based indexing, and slicing [:k] includes indices 0 through k-1—that's exactly k neighbors.

k=1 k=3 k=5
Sorted Distances (Indices)
indices = np.argsort(distances)[:3]
Selected: [0, 1, 2]

✅ Correct: [:k]

Gets the first k entries (indices 0 to k-1). This is the standard way to grab the top k items.

❌ Wrong: [:k+1]

Gets k+1 neighbors. You asked for 3, but got 4. Your majority vote is now based on the wrong amount of data.

❌ Wrong: [k]

Gets only one neighbor (the k-th smallest). You asked for a neighborhood, but found a single house.

🔍 Detailed Breakdown: fit vs predict

📥 fit(X_train, y_train)

This method does nothing more than assign the training arrays to instance variables. There is no learning—KNN is "lazy." The only requirement is that X_train and y_train are NumPy arrays with matching first-dimension lengths.

🔮 predict(X_test)

The prediction loop is where the algorithm lives:

  1. For each test point x, compute distances to all stored training points.
  2. Use np.argsort to get indices of smallest distances.
  3. Index self.y_train with those indices to retrieve neighbor labels.
  4. Aggregate: vote for classification, average for regression.

⚠️ Performance Note

This implementation is not optimized for large datasets. It loops over every test point and computes distances to all training points each time (O(n_test × n_train) operations). That's fine for learning and small data. For bigger data, you'd use spatial data structures (like KD-trees), but that's an advanced topic.

You now have a working KNN from scratch. Test it on a small dataset (like Iris) to see it classify correctly. If predictions seem off, revisit the off‑by‑one slice and ensure your features are scaled—distance magnitude matters.

Advanced topics: Optimizations for KNN Python

You've built a working KNN. But what happens when your training set grows from 100 points to 100,000?

The brute-force approach—calculating the distance from your new point to every single training sample—becomes painfully slow. It's an O(n_train) operation per query. If you have millions of points, a single prediction could take seconds or even minutes.

To fix this, we need to stop looking at everything. Let's explore two optimization philosophies: Exact Speedups (finding the true nearest neighbors faster) and Approximate Methods (trading a tiny bit of accuracy for massive speed gains).

🌲 Intuition: Using KD-Trees to Speed Up Search

Exact Optimization
2D Feature Space Partitioning
Query Point
Split Line
Pruned (Skipped)

Instead of checking every point, a KD-Tree recursively splits the space (like slicing a cake) along alternating axes.

  • Organize Hierarchically

    Points are stored in a binary tree structure based on their coordinates.

  • Pruning

    If a region of space is too far from your query, you skip checking every single point inside it.

⚠️ Limitation: KD-Trees work best in low dimensions (typically < 20 features). In high dimensions, the "curse of dimensionality" makes almost all points equidistant, rendering the tree useless.

Common Misconception: "Exact" is always required

It's easy to assume that only the mathematically exact k nearest neighbors will give the correct prediction. But KNN's predictions are inherently robust.

If your k is moderate (say 5 or 10), swapping one neighbor for another that is almost as close rarely changes the majority vote. In real-world data, finding the true top-5 among 100,000 points might take forever, but finding 5 very similar points among 1,000 is instant—and often accurate enough.

🚀 Implementing Approximate KNN (Subsampling)

approx_knn.py
import numpy as np

class ApproximateKNN:
    def __init__(self, k=3, sample_size=1000):
        self.k = k
        self.sample_size = sample_size
    
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
        n = len(X_train)
        
        # 1. Pick a fixed random subset during fit
        if n > self.sample_size:
            self.sample_indices = np.random.choice(n, self.sample_size, replace=False)
        else:
            self.sample_indices = np.arange(n)
        
        # 2. Extract the subset once
        self.X_sample = X_train[self.sample_indices]
        self.y_sample = y_train[self.sample_indices]
    
    def predict(self, X_test):
        predictions = []
        for x in X_test:
            # 3. Search ONLY the subset
            distances = np.linalg.norm(self.X_sample - x, axis=1)
            k_nearest = np.argsort(distances)[:self.k]
            
            # 4. Vote (same as before)
            neighbors = self.y_sample[k_nearest]
            labels, counts = np.unique(neighbors, return_counts=True)
            predictions.append(labels[np.argmax(counts)])
        
        return np.array(predictions)

⚖️ Speed vs. Accuracy Trade-off

Adjusting sample_size changes the workload (calculations) vs. potential accuracy.

✅ Why this works

By pre-selecting a random subset (e.g., 1,000 points) during fit, you reduce the search space. Instead of comparing against 100,000 points, you compare against 1,000.

⚠️ When to use

Use this when your dataset is huge (millions of points) and exact KNN is too slow. It works best when your data has natural clusters that random sampling is likely to catch.

🔍 Next Steps

For even larger scale, explore Locality-Sensitive Hashing (LSH) or Random Projection Trees. The principle remains: trade exactness for speed by reducing distance computations.

Evaluating KNN Performance: Metrics and Validation

You've built your KNN. It looks good. But "looks good" isn't data. To truly trust your model, you need to measure how it fails, not just if it succeeds.

The Confusion Matrix: Your Truth Table

Imagine your KNN is a spam filter. It makes 4 types of decisions. Let's visualize them interactively.

Predicted: Spam
Predicted: Not Spam
Actual: Spam
Actual: Not Spam

Accuracy (The Basics)

Formula: (TP + TN) / Total
Meaning: How often is the model right overall?

Click the boxes above to see how they affect your metrics!

Why it matters for KNN: If you have imbalanced data (e.g., 99% Not Spam), a lazy KNN with a huge k might just predict "Not Spam" for everything. It gets 99% accuracy but catches 0 spam. Precision and Recall save you from this trap.

📈 The K-Value Trade-off: Finding the Sweet Spot

Adjust k to see how model complexity changes.

Current State: k = 3

Training Accuracy: High (Model remembers data well)
Validation Accuracy: Good (Generalizes well)
What's happening?

With k=3, we have a balance. The model is flexible enough to learn patterns but smooth enough to ignore noise.

k=1 (Overfit) Optimal k=50 (Underfit)

Common Pitfall: Data Leakage in Cross-Validation

Cross-validation (CV) is the gold standard for picking k. But if you do it wrong, you'll get fake results.

❌ The Wrong Way (Leakage)

  1. Scale ALL data first.
  2. Split into folds.
  3. Train on Fold 1, Test on Fold 2.

Why it fails: Your scaler "saw" the test data statistics before training. This is cheating (data leakage).

✅ The Correct Way (Isolation)

  1. Split into folds first.
  2. Fit scaler on Train Fold only.
  3. Transform Train AND Test folds using that scaler.

Why it works: The test fold remains "unseen" until the very end, just like real life.

🐍 Python Code: Correct CV Implementation

from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create a pipeline: Scale first, then KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNN(k=5))  # Your from-scratch or sklearn KNN
])

# K-Fold handles the splitting automatically
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# The pipeline ensures scaling happens INSIDE each fold
scores = cross_val_score(pipeline, X, y, cv=kf, scoring='accuracy')

print(f"Mean Accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")

By wrapping your scaler and KNN in a pipeline and using cross-validation, you ensure that every evaluation is fair. This is the professional way to validate your KNN model.

Hyperparameter tuning for k nearest neighbors

You already saw that k is the only true hyperparameter in KNN—it controls how many neighbors vote on each prediction. Changing k directly shapes the bias–variance trade-off. Let's visualize what happens inside that "neighborhood" as we adjust k.

Query Point
k=1 (Low Bias, High Variance) k=3 k=15 (High Bias, Low Variance)

Current Prediction:

Calculating...
Majority vote among neighbors.

Small k: Sensitive to noise (Overfitting).
Large k: Smoother, but might ignore local patterns (Underfitting).

⚠️ Common Misconception: "Larger k always improves performance"

It's tempting to think "more neighbors means more information." But beyond a certain point, larger k harms performance because you start including points that are too far away—points that belong to other regions or classes.

Imagine classifying a point near the boundary between two classes. With k=5, you get a correct majority vote. With k=50, you might include 25 from Class A and 25 from Class B → tie or random choice. The prediction becomes less locally faithful.

🔍 Finding the Sweet Spot: Grid Search

tuning.py
from sklearn.model_selection import StratifiedKFold
import numpy as np

# 1. Define range of k to test
k_values = list(range(1, 31, 2))  # Odd k to avoid ties
cv_scores = []

for k in k_values:
    # Run Cross-Validation for this k
    score = cross_val_score_knn(X, y, k=k)
    cv_scores.append(score)

# 2. Pick best k
best_k = k_values[np.argmax(cv_scores)]
print(f"Best k: {best_k}")

The peak of the curve indicates the optimal k.

Once you select the best k via CV, re-train your KNN on the full training set (with scaling fitted on all training data) and evaluate once on a held-out test set. This gives your final unbiased performance estimate. Never use the test set during hyperparameter tuning—that would invalidate your evaluation.

Practical considerations: When KNN works, and when it doesn't

You've built a working KNN. But what happens when your training set grows from 100 points to 100,000?

The brute-force approach—calculating the distance from your new point to every single training sample—becomes painfully slow. It's an O(n_train) operation per query. If you have millions of points, a single prediction could take seconds or even minutes.

📈 Scaling Reality: Prediction Time vs. Dataset Size

Observation: As the number of training samples increases, the time required to make a single prediction increases linearly.

What this means for you:

  • Small, low-dimensional data (e.g., <10,000 samples, <20 features): Brute-force KNN is simple and effective.
  • Large or high-dimensional data: You'll need optimizations (like KD-Trees) or you should consider a different algorithm altogether. KNN's simplicity comes at a scalability cost.

The "Curse of Dimensionality"

Even more critical than size is dimensionality—the number of features. In high-dimensional spaces (dozens or hundreds of features), the concept of "nearest" breaks down.

Why it happens

As dimensions increase, the volume of the space grows so fast that the nearest and farthest points from a query are almost the same distance away. Your distance metrics lose discriminative power.

Result

You'll hit a wall where adding more features doesn't improve—and often worsens—performance, while simultaneously making distance calculations more expensive.

🚫 Common Misconception: "KNN is a drop-in replacement for everything"

It's easy to think, "KNN is so intuitive—why not just use it for everything?" But KNN has fundamental limitations that more sophisticated models address.

Where KNN Struggles
  • No feature selection: Treats all features equally. Noise degrades performance.
  • Poor with categorical data: Requires numerical conversion which can create false ordinal relationships.
  • Memory intensive: Stores the entire training dataset.
When to Choose Alternatives
  • High Dimensionality: Use Linear Models (Logistic Regression) or Tree Ensembles (Random Forest).
  • Large Datasets (n > 50k): Use Gradient Boosting (XGBoost) or Neural Networks.
  • Mixed Data Types: Use Decision Trees or CatBoost (handle categoricals natively).
✅ Practical Workflow
  1. Start with KNN as a simple baseline (after scaling).
  2. If performance is lacking or data is large/high-dimensional, try a Random Forest (robust to feature scaling, handles mixed types).
  3. For very large datasets or high dimensions, move to gradient boosting or linear models with regularization.
  4. Only use KNN if it meets your performance, scalability, and maintenance constraints—it's rarely the best choice for production systems with real-world data volume and complexity.

Remember: No algorithm is universally best. The "no free lunch" theorem applies. Your choice should be driven by data size, feature types, performance needs, and computational constraints—not just familiarity. KNN is a valuable tool, but knowing when to put it down is part of becoming a skilled practitioner.

Real‑world considerations and when to avoid knn algorithm

You now have a working KNN implementation and understand how to tune k and evaluate performance. But in practice, KNN's simplicity comes with important constraints—ignoring them can lead to models that are slow, inaccurate, or both. Let's look at the two most common pitfalls and when you should reach for a different algorithm instead.

📐 The "Curse of Dimensionality"

High-Dimensional Pitfall

In low dimensions (2D, 3D), "nearest" means "close." But as you add features, something strange happens: all points become roughly equidistant.

Observation: As dimensions increase, the ratio of (Nearest Distance / Farthest Distance) approaches 1.0.

Why it breaks KNN

Imagine a 100-dimensional hypercube. Most of the volume is near the edges. A randomly chosen point will be far from the center and far from other points. The difference between the nearest and farthest neighbor shrinks relative to the absolute distances.

⚠️ The Result

  • Neighborhoods become unstable.
  • Distance metrics lose meaning.
  • KNN performance degrades significantly.
Rule of thumb: If you have more than ~20-30 features, KNN often fails unless you reduce dimensionality (e.g., with PCA) first.

Common Pitfall: Ignoring Feature Scaling

Distance is sensitive to numeric range. In real projects, it's easy to forget this step. The consequence is that features with larger scales dominate the distance calculation, effectively drowning out informative but smaller-scale features.

Observe how the "nearest" neighbor changes when we normalize the scales.

The House Price Problem

Imagine predicting prices using:
Size (sqft): 500–5,000
Bedrooms: 1–5
Age (years): 0–100

Without scaling, a difference of 1 bedroom is mathematically tiny compared to a difference of 100 sqft. The algorithm will ignore the bedroom count entirely because the "distance" is dominated by the larger numbers.

The Solution

Standardization forces all features to have a similar range (mean 0, std 1). This ensures every feature contributes fairly to the "nearest" calculation.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Always wrap scaling in a pipeline!
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNN(k=5))
])

🚫 When to Choose Alternatives

KNN is a useful baseline, but many real-world scenarios expose its weaknesses. Consider switching when:

Where KNN Struggles
  • Large Datasets (n > 50k): Prediction is O(n), making it too slow.
  • Mixed Data Types: KNN needs numeric distances. Categorical data requires complex encoding that often breaks distance logic.
  • High Dimensionality: As seen above, "nearest" loses meaning in 100+ dimensions.
  • Probabilistic Outputs: KNN's vote-based probabilities are often poorly calibrated.
When to Choose Alternatives
  • High Dimensionality: Use Linear Models (Logistic Regression) or Tree Ensembles (Random Forest).
  • Large Datasets: Use Gradient Boosting (XGBoost, LightGBM) or Neural Networks.
  • Mixed Data Types: Use Decision Trees or CatBoost (handle categoricals natively).
  • Interpretability: Use Decision Trees or Linear Models for clear rules/coefficients.
✅ Practical Workflow
  1. Start with KNN as a simple baseline (after scaling).
  2. If performance is lacking or data is large/high-dimensional, try a Random Forest (robust to feature scaling, handles mixed types).
  3. For very large datasets or high dimensions, move to gradient boosting or linear models with regularization.
  4. Only use KNN if it meets your performance, scalability, and maintenance constraints—it's rarely the best choice for production systems with real-world data volume and complexity.

Remember: No algorithm is universally best. The "no free lunch" theorem applies. Your choice should be driven by data size, feature types, performance needs, and computational constraints—not just familiarity. KNN is a valuable tool, but knowing when to put it down is part of becoming a skilled practitioner.

Frequently Asked Questions (FAQ)

Post a Comment

Previous Post Next Post