Research log #2
Why You Should Test With Simpler Models First (and What They Reveal)
By Lidia on Mon Jun 09 2025
Why?
With high-dimensional inputs (1,000+ dimensions) and a small dataset (~4,000 rows), the model simply memorized rather than learned. Before jumping into more tweaks, I took a step back to ask a more fundamental question.

💡 Today’s Key Question
How do I evaluate whether the embeddings are actually useful for my task?
A Proxy Task with a Baseline Model
My end goal is deduplication or similarity search, which means I want embeddings that capture meaningful structure. But I also want to avoid the laborious process of labeling duplicate pairs.
This led me to the idea of using a proxy task:
Use a simple classifier to predict an existing column, and treat that as an indirect signal of embedding quality.
Before adding a classifier to my deep learning model, I started with logistic regression and random forests as baseline models.
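The proxy-task setup can be sketched in a few lines of scikit-learn. Everything here is a stand-in: the real features are learned embeddings and the real target is an existing column such as category, so I generate synthetic arrays of the same rough shape (~4,000 rows, 9 classes) just to show the evaluation loop.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 64))    # stand-in for row embeddings
y = rng.integers(0, 9, size=4000)  # stand-in for a 9-class target column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

scores = {}
for name, clf in [
    ("logistic_regression", LogisticRegression(max_iter=1000)),
    ("random_forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]:
    clf.fit(X_tr, y_tr)
    # held-out accuracy on the proxy column is the indirect quality signal
    scores[name] = clf.score(X_te, y_te)
print(scores)
```

The point of the loop is that both baselines share one split and one metric, so any gap between them (or between them and chance) says something about the embeddings rather than the harness.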
Sometimes, simple models are enough, and they’re great for debugging.

Dataset Profile (It’s not great.)
Since the dataset is proprietary healthcare provider data, I can't share everything, but here’s the general shape:

- Rows: ~4,000
- Features: Name, email, specialty, org, category, etc.
- Challenge: Most columns were sparse or noisy (e.g., 52% missing emails, 44% missing specialties)
- Structure: Most rows only had name populated.

The one healthy column, and my pick for the proxy-task target, was category:
✅ 91% filled
✅ 9 categories (e.g., physician, nurse, technician)
✅ Tied to other fields like org and specialty
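A profile like the one above is quick to produce with pandas. The tiny DataFrame below is made up to mirror the log's description (sparse email and specialty, a categorical target); the two one-liners are what I actually care about: per-column missingness and target class balance.

```python
import pandas as pd

# Synthetic miniature of the provider table (real data is proprietary)
df = pd.DataFrame({
    "name": ["Dr. A", "Dr. B", "Nurse C", "Tech D"],
    "email": [None, "b@x.org", None, None],
    "specialty": ["cardiology", None, None, "radiology"],
    "category": ["physician", "physician", "nurse", "technician"],
})

missing = df.isna().mean().sort_values(ascending=False)  # fraction missing per column
balance = df["category"].value_counts(normalize=True)    # class share in the target
print(missing)
print(balance)
```

Running the same two lines on the full table is what surfaced the 52% missing emails, 44% missing specialties, and the skew in category.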
Results (Looks good, but it isn’t.)
The results looked promising at first:

- Logistic Regression: 92% accuracy
- Random Forest: 94% accuracy
The models weren’t learning anything meaningful — they were just guessing the dominant class. With a heavily imbalanced target, a model that always predicts the majority class scores high accuracy without learning anything, so raw accuracy was a hollow number here.
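The sanity check I should have run first: compare against a majority-class baseline, and look at a metric that averages over classes. The labels below are synthetic and deliberately skewed (not the real class distribution) to show how an always-guess-the-majority model can post ~92% accuracy while its balanced accuracy collapses to chance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
# 9 classes, one of which covers 92% of rows (illustrative skew)
y = rng.choice(9, size=4000, p=[0.92] + [0.01] * 8)

# Baseline that ignores features and always predicts the most frequent class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(np.zeros((len(y), 1)), y)
pred = dummy.predict(np.zeros((len(y), 1)))

acc = (pred == y).mean()                       # looks great: ~0.92
bal_acc = balanced_accuracy_score(y, pred)     # reveals the truth: ~1/9
print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
```

If a real model's accuracy matches the dummy's, it has learned nothing the class prior didn't already give it.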
Lessons Learnt
Despite the sophisticated design and thoughtful preprocessing, this dataset just wasn’t right for testing deep representation learning:

- Too small
- Too sparse
- Too imbalanced
- Not enough variation to learn entity similarity meaningfully
But sometimes, the deeper issue isn’t the model — it’s that I’m solving the wrong problem with the wrong tool.
What’s Next
It’s not all bad news. I now have a reusable pipeline, an evaluation method, and a clearer sense of what kind of dataset I actually need.
So next, I’ll be looking for a larger, messier, more representative dataset that reflects real-world deduplication problems: noisy inputs, overlapping records, structural patterns.
I’ll reuse the same pipeline, but this time, on a problem that actually needs it.