Stop Tuning Hyperparameters. Start Tuning Your Problem.

towardsdatascience.com

. You’re three weeks into a churn prediction model, hunched over a laptop, watching a Bayesian optimization sweep crawl through its 200th trial. The validation AUC ticks from 0.847 to 0.849. You screenshot it. You post it in Slack. Your manager reacts with a thumbs-up.

You feel productive. You are not.

If you’ve ever spent days squeezing fractions of a percent out of a Machine Learning (ML) metric while a quiet voice in the back of your head whispered does any of this actually matter?, you already sense the problem. That voice is right. And silencing it with another grid search is one of the most expensive habits in the profession.

Here’s the uncomfortable math: more than 80% of Artificial Intelligence (AI) projects fail, according to RAND Corporation research published in 2024. The number one root cause isn’t bad models. It isn’t insufficient data. It’s misunderstanding (or miscommunicating) what problem needs to be solved. Not a modeling failure. A framing failure.

This article gives you a concrete protocol to catch that failure before you write a single line of training code. Five steps. Each one takes a conversation, not a GPU cluster.

“All that progress in algorithms means it’s actually time to spend more time on the data.” Andrew Ng didn’t say spend more time on the model. He said the opposite.

The Productive Procrastination Trap

Hyperparameter tuning feels like engineering. You have a search space. You have an objective function. You iterate, measure, improve. The feedback loop is tight (minutes to hours), the progress is visible (metrics go up), and the work is legible to your team (“I improved AUC by 2 points”).

Problem framing feels like stalling. You sit in a room with business stakeholders who use imprecise language. You ask questions that don’t have clean answers. There’s no metric ticking upward. No Slack screenshot to post. Your manager asks what you did today and you say, “I spent four hours figuring out whether we should predict churn or predict reactivation likelihood.” That answer doesn’t sound like progress.

But it is the only progress that matters.

Effort allocation vs. actual impact in ML projects. Sources: RAND (2024), Anaconda State of Data Science (2022).
Image by the author.

The reason is structural. Tuning operates within the problem as defined. If the problem is defined wrong, tuning optimizes a function that doesn’t map to business value. You get a beautiful model that solves the wrong thing. And no amount of Optuna sweeps can fix a target variable that shouldn’t exist.

Zillow Bet $500 Million on the Wrong Problem

In 2021, Zillow shut down its home-buying division, Zillow Offers, after losing over $500 million. The company had acquired roughly 7,000 homes across 25 metro areas, consistently overpaying because its pricing algorithm (the Zestimate) didn’t adjust to a cooling market.

The post-mortems focused on concept drift. The model trained on hot-market data couldn’t keep up as demand slowed. Contractor shortages during COVID delayed renovations. The feedback loop between purchase and resale was too slow to catch the error.

But the deeper failure happened before any model was trained.

Zillow framed the problem as: Given a home’s features, predict its market value. That framing assumed a stable relationship between features and price. It assumed Zillow could renovate and resell fast enough that the prediction window stayed short. It assumed the model’s error distribution was symmetric (overpaying and underpaying equally likely). None of those assumptions held.

Competitors Opendoor and Offerpad survived the same market shift. Their models detected the cooling and adjusted pricing. The difference wasn’t algorithmic sophistication. It was how each company framed what their model needed to do and how quickly they updated that frame.

Zillow didn’t lose $500 million because of a bad model. They lost it because they never questioned whether “predict home value” was the right problem to solve at their operational speed.

When the AI Learned to Detect Rulers Instead of Cancer

A research team built a neural network to classify skin lesions as benign or malignant. The model reached accuracy comparable to board-certified dermatologists. Impressive numbers. Clean validation curves.

Then someone looked at what the model actually learned.

It was detecting rulers. When dermatologists suspect a lesion might be malignant, they place a ruler next to it to measure its size. So in the training data, images containing rulers correlated with malignancy. The model found a shortcut: ruler present = probably cancer. Ruler absent = probably benign.

The accuracy was real. The learning was garbage. And no hyperparameter tuning could have caught this, because the model was performing exactly as instructed on the data exactly as provided. The failure was upstream: nobody asked, “What should the model be looking at to make this decision?” before measuring how well it made the decision.

This is a pattern called shortcut learning, and it shows up everywhere. Models learn to exploit correlations in your data that won’t hold in production. The only defense is a clear specification of what the model should and should not use as signal, and that specification comes from problem framing, not from tuning.

Why Framing Errors Survive So Long

If bad problem framing is this destructive, why do smart teams keep skipping it?

Three reinforcing dynamics make it persistent.

First, feedback asymmetry. When you tune a hyperparameter, you see the result in minutes. When you reframe a problem, the payoff is invisible for weeks. Human brains discount delayed rewards. So teams gravitate toward the fast feedback loop of tuning, even when the slow work of framing has 10x the return.

Second, legibility bias. “I improved accuracy from 84.7% to 84.9%” is a clean, defensible statement in a standup meeting. “I spent yesterday convincing the product team that we’re optimizing the wrong metric” sounds like you accomplished nothing. Organizations reward visible output. Framing produces no visible output until it prevents a disaster nobody knows was coming.

Third, identity. Data scientists are trained as model builders. The tools, the courses, the Kaggle leaderboards, the interview questions: they all center on modeling. Problem framing feels like someone else’s job (product, business, strategy). Claiming it means stepping outside your technical identity, and that’s uncomfortable.

The three reinforcing dynamics that keep ML teams optimizing the wrong thing. Image by the author.

Andrew Ng named this pattern when he introduced the concept of data-centric Artificial Intelligence (AI) in 2021. He defined it as “the discipline of systematically engineering the data needed to build a successful AI system.” His argument: the ML community had spent a decade obsessing over model architecture while treating data (and by extension, problem definition) as someone else’s job. The returns from better architectures had plateaued. The returns from better problem definition had barely been tapped.

The Steel-Man for Tuning
Before going further: hyperparameter tuning is not useless. There are situations where it’s exactly the right thing to do.

If you’ve already validated that your target variable maps directly to a business decision. If your data distribution in production matches training. If you’ve confirmed that your features capture the signal the business cares about (and only that signal). If all of this is true, then tuning the model’s capacity, regularization, and learning rate is legitimate optimization.

The claim isn’t “never tune.” The claim is: most teams start tuning before they’ve earned the right to tune. They skip the framing work that determines whether tuning will matter at all. And when tuning produces marginal gains on a misframed problem, those gains are illusory.

Data analytics research shows the pattern clearly: once you’ve achieved 95% of possible performance with basic configuration, spending days to extract another 0.5% rarely justifies the computational cost. That calculation gets worse when the 95% is measured against the wrong objective.

The 5-Step Problem Framing Protocol

This protocol runs before any modeling. It takes 2 to 5 days depending on stakeholder availability. Every step produces a written artifact that your team can reference and challenge. Skip a step, and you’re gambling that your assumptions are correct. Most aren’t.

Step 1: Name the Decision (Not the Prediction)

Who: Data science lead + the business stakeholder who will act on the model’s output.
When: First meeting. Before any data exploration.
How: Ask this question and write down the answer verbatim:

“When this model produces an output, what specific decision changes? Who makes that decision, and what do they do differently?”

Example (good): “The retention team calls the top 200 at-risk customers each week instead of emailing all 5,000. The model ranks customers by reactivation probability so the team knows who to call first.”

Example (bad): “We want to predict churn.” (No decision named. No actor identified. No action specified.)

Red flag: If the stakeholder can’t name a specific decision, the project doesn’t have a use case yet. Pause. Do not proceed to data exploration. A model without a decision is a report nobody reads.

Step 2: Define the Error Cost Asymmetry

Who: Data science lead + business stakeholder + finance (if available).
When: Same meeting or next day.
How: Ask:

“What’s worse: a false positive or a false negative? By how much?”

Example: For a fraud detection model, a false negative (missed fraud) costs the company an average of $4,200 per incident. A false positive (blocking a legitimate transaction) costs $12 in customer service time plus a 3% chance of losing the customer ($180 expected value). The ratio is roughly 23:1. This means the model should be tuned for recall, not precision, and the decision threshold should be set much lower than 0.5.

Why this matters: Default ML metrics (accuracy, F1) assume symmetric error costs. Real business problems almost never have symmetric error costs. If you optimize F1 when your actual cost ratio is 23:1, you will build a model that performs well on paper and poorly in production. Zillow’s Zestimate treated overestimates and underestimates as equally bad. They weren’t. Overpaying for a house you can’t resell for months is catastrophically worse than underbidding and losing a deal.

Step 3: Audit the Target Variable

Who: Data science lead + domain expert.
When: After Steps 1-2 are documented. Before any feature engineering.
How: Answer these four questions in writing:

Does this target variable actually measure what the business cares about? “Churn” might mean “cancelled subscription” in your data but “stopped using the product” in the stakeholder’s mind. These are different populations. Clarify which one maps to the decision in Step 1.
When is the target observed relative to when the model needs to act? If you’re predicting 30-day churn but the retention team needs 14 days to intervene, your prediction window is wrong. The model needs to predict churn at least 14 days before it happens.
Is the target contaminated by the intervention you’re trying to optimize? If past retention efforts already reduced churn for some customers, your training data underestimates their true churn risk. The model learns “these customers don’t churn” when the truth is “these customers don’t churn because we intervened.” This is the causal inference trap, and it’s invisible in standard train/test splits.
Can the model learn the right signal, or will it find shortcuts? The ruler-in-dermatology problem. List the features. For each one, ask: “Would a domain expert use this feature to make this decision?” If not, it might be a proxy that won’t generalize.

Step 4: Simulate the Deployment Decision

Who: Full project team (DS, engineering, product, business stakeholder).
When: After Steps 1-3 are documented. Before modeling begins.
How: Run a tabletop exercise. Present the team with 10 synthetic model outputs (a mix of correct predictions, false positives, and false negatives) and ask:

“Given this output, what action does the business take?”
“Is that action correct given the ground truth?”
“How much does each error type cost?”
“At what confidence threshold does the business stop trusting the model?”

This exercise surfaces misalignments that no metric can catch. You might discover that the business actually needs a ranking (not a binary classification). Or that the stakeholder won’t act on predictions below 90% confidence, which means half your model’s output is ignored. Or that the “action” requires information the model doesn’t provide (like why a customer is at risk).

Artifact: A one-page deployment spec listing: who uses the output, in what format, at what frequency, with what confidence threshold, and what happens when the model is wrong.

Step 5: Write the Anti-Target

Who: Data science lead.
When: After Steps 1-4. The last check before modeling begins.
How: Write one paragraph answering:

“If this project succeeds on every metric we’ve defined but still fails in production, what went wrong?”

Example 1: “The churn model hits 0.91 AUC on the test set, but the retention team ignores it because the predictions arrive 48 hours after their weekly planning meeting. The model is accurate but operationally useless because we didn’t align the prediction cadence with the decision cadence.”

Example 2: “The fraud model flags 15% of transactions, overwhelming the review team. They start rubber-stamping approvals to clear the queue. Technically the model catches fraud; practically the humans in the loop have learned to ignore it.”

The anti-target is an inversion: instead of defining success, define the most plausible failure. If you can write a vivid anti-target, you can often prevent it. If you can’t write one, you haven’t thought hard enough about deployment.

Run all 5 steps before writing training code. Each step produces a written artifact the team can reference. Image by the author.

Is This a Tuning Problem or a Framing Problem?

Not every stalled project needs reframing. Sometimes the problem is well-framed and you genuinely need better model performance. Use this diagnostic to tell the difference.

Image by the author.

What Changes When Teams Frame First

The shift from model-centric to problem-centric work isn’t just about avoiding failure. It changes what “senior” means in data science.

Junior data scientists are valued for modeling skill: can you train, tune, and deploy? Senior data scientists should be valued for framing skill: can you translate an ambiguous business situation into a well-posed prediction problem with the right target, the right features, and the right success criteria?

The industry is slowly catching up. Andrew Ng’s push toward data-centric AI is one signal. The RAND Corporation’s 2024 report on AI anti-patterns is another: their top recommendation is that leaders should ensure technical staff understand the purpose and context of a project before starting. QCon’s 2024 analysis of ML failures names “misaligned objectives” as the most common pitfall.

The pattern is clear. The bottleneck in ML isn’t algorithms. It’s alignment between the model’s objective and the business’s actual need. And that alignment is a human conversation, not a computational one.

The bottleneck in ML is not compute or algorithms. It’s the conversation between the person who builds the model and the person who uses the output.

For organizations, this means problem framing should be a first-class activity with its own time allocation, its own deliverables, and its own review process. Not a preamble to “the real work.” The real work.

For individual data scientists, it means the fastest way to increase your impact isn’t learning a new framework or mastering distributed training. It’s learning to ask better questions before you open a notebook.

It’s 11:14 PM on a Wednesday. You’re three weeks into a project. Your validation metric is climbing. You’re about to launch another sweep.

Stop.

Open a blank document. Write one sentence: “The decision that changes based on this model’s output is ___.” If you can’t fill in the blank without calling a stakeholder, you’ve just found the highest-ROI activity for tomorrow morning. It won’t feel like progress. It won’t produce a Slack-worthy screenshot. But it’s the only work that determines whether the next three weeks matter at all.

References

RAND Corporation, “The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed”, James Ryseff, Brandon De Bruhl, Sydne J. Newberry, 2024.
MIT Sloan, “Why It’s Time for ‘Data-Centric Artificial Intelligence’”, Sara Brown, June 2022.
insideAI News, “The $500mm+ Debacle at Zillow Offers: What Went Wrong with the AI Models?”, December 2021.
Stanford Graduate School of Business, “Flip Flop: Why Zillow’s Algorithmic Home Buying Venture Imploded”.
Diagnostics (MDPI), “Uncovering and Correcting Shortcut Learning in Machine Learning Models for Skin Cancer Diagnosis”, 2022.
VentureBeat, “When AI Flags the Ruler, Not the Tumor”.
InfoQ, “QCon SF 2024: Why ML Projects Fail to Reach Production”, November 2024.
Number Analytics, “8 Hyperparameter Tuning Insights Backed by Data Analytics”.

Feeds