PatternX

Blogs about AI and data

Research log #2

Why You Should Test With Simpler Models First (and What They Reveal)

By Lidia on Mon Jun 09 2025

In my last post, I shared a modular deep learning model I built to generate entity-level embeddings from tabular data, but it ran into a wall — it overfit almost immediately.

Why?

With high-dimensional inputs (1,000+ dimensions) and a small dataset (~4,000 rows), the model simply memorized the training data rather than learning anything that generalizes. Before jumping into more tweaks, I took a step back to ask a more fundamental question.

💡 Today’s Key Question

How do I evaluate whether the embeddings are actually useful for my task?

A Proxy Task with a Baseline Model

My end goal is deduplication or similarity search, which means I want embeddings that capture meaningful structure. But I also want to avoid the laborious process of labeling duplicate pairs.

This led me to the idea of using a proxy task:

Use a simple classifier to predict an existing column, and treat that as an indirect signal of embedding quality.

Before adding a classifier to my deep learning model, I started with logistic regression and random forests as baseline models.

Sometimes, simple models are enough, and they're great for debugging.
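
For the curious, here's roughly what that baseline check looks like. This is a sketch with placeholder file names (embeddings.npy, category_labels.npy), not my actual pipeline code:

    # Baseline sketch: score two simple classifiers on the proxy task.
    # The file names (and the arrays they hold) are illustrative placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X = np.load("embeddings.npy")                            # shape: (n_rows, embedding_dim)
    y = np.load("category_labels.npy", allow_pickle=True)    # proxy target column

    for name, model in [
        ("logistic regression", LogisticRegression(max_iter=1000)),
        ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
    ]:
        scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
        print(f"{name}: {scores.mean():.2f} accuracy")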

Dataset Profile (It’s not great.)

Since the dataset is proprietary healthcare provider data, I can't share everything, but here’s the general shape:
  • Rows: ~4,000
  • Features: Name, email, specialty, org, category, etc.
  • Challenge: Most columns were sparse or noisy (e.g., 52% missing emails, 44% missing specialties)
  • Structure: Most rows only had name populated.
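
Those missing-value figures come from a quick profiling pass, which in pandas is only a few lines. The file name and schema below are placeholders, not the real (proprietary) data:

    # Quick profile sketch: how sparse is each column, and how full is the candidate label?
    # 'providers.csv' and the column names are illustrative placeholders.
    import pandas as pd

    df = pd.read_csv("providers.csv")

    # Percent missing per column, worst first (this is where the 52% / 44% figures come from)
    print((df.isna().mean() * 100).round(1).sort_values(ascending=False))

    # Fill rate of the candidate proxy label
    print(f"category filled: {df['category'].notna().mean():.0%}")
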
I chose the category column as the target label for this proxy task because:

✅ 91% filled

✅ 9 categories (e.g., physician, nurse, technician)

✅ Tied to other fields like org and specialty

Results (They look good, but they aren't.)

The results looked promising at first:
  • Logistic Regression: 92% accuracy
  • Random Forest: 94% accuracy
But then I looked closer, and it turned out that 91% of rows were labeled "physician."
The models weren’t learning anything meaningful — they were just learning to guess the dominant class.
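
The check I should have run first is a majority-class baseline: if always predicting the dominant class already scores around 91%, then 92–94% accuracy means very little. Roughly, with the same placeholder arrays as before:

    # Sanity-check sketch: would a "predict the dominant class" model score about the same?
    # X (embeddings) and y (labels) are the same placeholder arrays as in the earlier sketch.
    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X = np.load("embeddings.npy")
    y = np.load("category_labels.npy", allow_pickle=True)

    # Class distribution: if one label covers ~91% of rows, ~91% accuracy is free.
    labels, counts = np.unique(y, return_counts=True)
    print(dict(zip(labels, (counts / counts.sum()).round(2))))

    # Majority-class baseline vs. the "94% accuracy" random forest.
    print("majority baseline:", cross_val_score(
        DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean())

    # Balanced accuracy weights every class equally, so it exposes the gap.
    print("balanced accuracy:", cross_val_score(
        RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5,
        scoring="balanced_accuracy").mean())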

Lessons Learnt

Despite the model's sophisticated design and thoughtful preprocessing, this dataset just wasn't right for testing deep representation learning.
  • Too small
  • Too sparse
  • Too imbalanced
  • Not enough variation to learn entity similarity meaningfully
Yes, there are ways to handle the imbalance, and I could patch the missing values.
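
For example, class weighting is one standard mitigation: it reweights the loss so rare classes count as much as the dominant one. A minimal sketch, again with placeholder data:

    # Imbalance-handling sketch: class weighting is one standard mitigation.
    # X and y are the same placeholder arrays used in the earlier sketches.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X = np.load("embeddings.npy")
    y = np.load("category_labels.npy", allow_pickle=True)

    # class_weight="balanced" reweights each class inversely to its frequency,
    # so the model can no longer win by always predicting "physician".
    weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
    print(cross_val_score(weighted, X, y, cv=5, scoring="balanced_accuracy").mean())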

But sometimes, the deeper issue isn’t the model — it’s that I’m solving the wrong problem with the wrong tool.

What’s Next

It’s not all bad news. I now have a reusable pipeline, an evaluation method, and a clearer sense of what kind of dataset I actually need.

So next, I’ll be looking for a larger, messier, more representative dataset that reflects real-world deduplication problems: noisy inputs, overlapping records, structural patterns.

I’ll reuse the same pipeline, but this time, on a problem that actually needs it.
