tidb_feature_1800x600 (1)

I know too many people drowning in student debt.

A friend got her Master’s in Music from an expensive private school. She’s $120k in debt and works at Starbucks. Another guy I know got a business degree from a school that costs $50k a year. He’s doing the same job he could’ve gotten without the degree, except now he has loan payments until he’s 40.

These are smart, driven people. They just didn’t have the right information when they were 17 and picking schools. Nobody showed them the math.

Most college tools sell vibes: rankings, campus photos, acceptance rates. What they don’t show you is the only question that matters for most families:

“If I pick this school with this major, what does my financial life look like at 30…or 35?”

So I built College Picker, an AI college advisor that compares schools based on real financial outcomes. Not prestige. Not marketing. Money.

Key Takeaways

  • Rankings can be wildly disconnected from affordability and post-grad outcomes.
  • “College cost” isn’t one number. It depends on income, residency, aid, and transfer paths.
  • The tool models 20-year net worth projections per school + major + career path.
  • TiDB lets me do relational filtering + full-text search + vector similarity in one SQL query, without syncing multiple databases.

Beyond College Rankings: Solving the Student Debt Crisis with Data

College rankings don’t help. US News ranks schools by prestige metrics that have almost nothing to do with whether you’ll be able to pay rent after graduation. A school can be #15 in the country and still leave you with $200k in debt and a $45k starting salary.

I’m not saying “don’t pursue music or art.” I’m saying: if you want to study painting, maybe do it as a minor while you major in something that pays the bills. Keep making art. Do not take on six figures of debt for a degree that cannot realistically service it.

College Picker is built to answer the question every 17-year-old should be asking:

“Is this degree worth the debt?”

What College Picker Does

Given a student profile (income, state, intended major, and optionally a career path), College Picker:

  • Finds relevant schools using hybrid search (keywords + semantic search)
  • Pulls costs, debt, graduation rates, and earnings outcomes
  • Looks up career path requirements (grad school, residency years, etc.)
  • Generates 20-year projections (debt payoff, salary growth, savings/investing)
  • Produces a comparison verdict (Excellent/Good/Fair/Poor) with reasoning
  • Visualizes the tradeoffs as a decision tree

The AI Stack: Combining TiDB Cloud, Claude Opus, and GPT-4o mini

There’s no single model that does everything well, so I use the right tool for each job:

  • TiDB Cloud Starter: I didn’t want to run three databases. Postgres for user data, a vector DB for embeddings, Redis for caching. That’s a lot of infrastructure for a side project. TiDB handles relational queries and vector similarity in the same SQL statement. It’s MySQL-compatible so I didn’t have to learn anything new. And the Starter tier scales to zero, so I’m not paying when nobody’s using it.
  • Claude Opus 4.5: The conversational advisor. Students ask questions like “Should I go to Stanford or community college first?” and it responds with real data, not platitudes.
  • Claude Sonnet 4: Career data lookup. Given a career like “pediatric surgeon,” it returns structured salary data, residency years, and grad school costs. I use Sonnet because speed matters and it’s cheaper than Opus for structured extraction.
  • GPT-4o mini: College comparison verdicts. It weighs tradeoffs and outputs “Excellent/Good/Fair/Poor” ratings with reasoning.
  • OpenAI Embeddings: text-embedding-3-small generates vectors for everything, powering semantic search like “affordable engineering schools near California.”

Building the AI Workflow: Mapping Financial Outcomes from Search to Projection

Here’s the pipeline:

[Student inputs profile: income, state, intended major]
      ↓
[Search colleges via hybrid search (text + vector)]
      ↓
[Fetch college data from TiDB: costs, outcomes, earnings]
      ↓
[Claude looks up career salary data for intended path]
      ↓
[Generate 20-year financial projection per college]
      ↓
[Build decision tree visualization]
      ↓
[GPT-4o-mini generates verdict: which school is better ROI]
      ↓
[Render comparison with break-even ages]

Key insight: College cost isn’t a single number. It varies by family income, in-state vs out-of-state, financial aid, and whether you do a 2+2 community college transfer. The system models all of it.

Why Unified Databases Beat Split Architectures for AI Applications

Most AI apps run a split architecture:

  • Postgres for relational data
  • A vector database (Pinecone/Weaviate/etc.) for embeddings

It works, but you inherit a couple problems that get painful fast.

1. Eliminating the “Sync Tax” Between Relational and Vector Data

If college data lives in one database and embeddings live in another, every update becomes a mini two-phase commit you’re managing yourself:

  1. Update the relational row
  2. Regenerate the embedding
  3. Upsert to the vector database
  4. Hope nothing fails in between

When College Scorecard releases new earnings data, you’re running sync scripts and praying for consistency. You can end up with $85k earnings in Postgres but an embedding generated from old $72k data. Search results get wrong in ways that are hard to debug.

With TiDB, that problem disappears:

UPDATE colleges
SET earnings_10yr = 85000,
    embedding = ?
WHERE id = 12345;

One transaction. Both update or neither updates. ACID guarantees across relational and vector data.

2. Solving the Query Latency Problem: Combining Relational Filters and Vector Search in One SQL Statement

My search needed three things at once:

  • Filter by state and earnings
  • Text match for exact names like “UCLA”
  • Semantic search for queries like “good engineering schools”

With a split stack, that’s multiple round trips: query the vector DB, fetch IDs, query Postgres, do text search, merge and re-rank in application code.

With TiDB, it’s one query:

SELECT id, name, state, earnings_10yr,
       FTS_MATCH_WORD(name, 'UCLA') as text_score,
       VEC_COSINE_DISTANCE(embedding, ?) as vector_distance
FROM colleges
WHERE state = 'CA' AND earnings_10yr > 60000
ORDER BY
  CASE WHEN FTS_MATCH_WORD(name, 'UCLA') > 0 THEN 0 ELSE 1 END,
  vector_distance ASC
LIMIT 10;

Relational filter, text search, vector search, custom ranking. One query. One round trip. (In my case: ~47ms.)

AI Infrastructure Economics: Why ‘Scale to Zero’ Is Essential for Side Projects

Vector databases often charge by vector count and queries. At ~6,000 colleges with 1,536-dim embeddings, my rough math was ~$70/month minimum on many platforms, plus the relational database hosting, plus the engineering time to keep everything in sync.

For a side project with sporadic traffic, TiDB Cloud Starter scaling to zero is the difference between “I’ll keep this running” and “I’m shutting it down.”

Technical Implementation: SQL Vector Search and Hybrid Querying

Here’s what I store per college:

CREATE TABLE colleges (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  state VARCHAR(2),
  ownership ENUM('public', 'private_nonprofit', 'private_forprofit'),
  
  -- Costs (varies by income bracket)
  tuition_in_state INT,
  tuition_out_of_state INT,
  room_and_board INT,
  net_price_0_30k INT,      -- Family income $0-30k
  net_price_30_48k INT,     -- Family income $30k-48k
  net_price_48_75k INT,     -- Family income $48k-75k
  net_price_75_110k INT,    -- Family income $75k-110k
  net_price_110k_plus INT,  -- Family income $110k+
  
  -- Outcomes
  graduation_rate_4yr FLOAT,
  graduation_rate_6yr FLOAT,
  retention_rate FLOAT,
  
  -- Earnings
  earnings_6yr INT,         -- Median earnings 6 years after enrollment
  earnings_10yr INT,        -- Median earnings 10 years after enrollment
  
  -- Debt
  median_debt INT,
  monthly_payment INT,
  
  -- Vector embedding for semantic search
  embedding VECTOR(1536),
  
  INDEX idx_state (state),
  INDEX idx_earnings (earnings_10yr),
  VECTOR INDEX idx_embedding (embedding) USING HNSW
);

That VECTOR(1536) column sits next to tuition_in_state. Same table, same row, same transaction.

Also: notice the five income brackets for net price. A family making $40k pays a very different amount than one making $150k. Most college tools ignore this. They show sticker price like everyone pays the same. They don’t.

Implementing Hybrid Search: Merging Keyword Matching and Vector Similarity

Students search in messy ways:

  • “UCLA”
  • “University of California Los Angeles”
  • “good CS schools in California”
  • “affordable engineering near LA”

So I built a hybrid search route that combines keyword matching with semantic similarity.

export async function POST(request: NextRequest) {
  const { query, mode = "hybrid", limit = 10 } = await request.json();
  
  // Generate embedding for semantic search
  const queryEmbedding = await generateEmbedding(query);

  if (mode === "text" || mode === "hybrid") {
    // BM25 text search on college names
    const [textResults] = await pool.execute(`
      SELECT id, name, state, earnings_10yr,
             FTS_MATCH_WORD(name, ?) as relevance
      FROM colleges
      WHERE FTS_MATCH_WORD(name, ?)
      ORDER BY relevance DESC
      LIMIT ?
    `, [query, query, limit]);

    if (mode === "text" || textResults.length >= limit) {
      return NextResponse.json({ results: textResults, mode: "text" });
    }
  }

  if (mode === "vector" || mode === "hybrid") {
    // Semantic search via vector similarity
    const [vectorResults] = await pool.execute(`
      SELECT id, name, state, earnings_10yr,
             VEC_COSINE_DISTANCE(embedding, ?) as distance
      FROM colleges
      WHERE VEC_COSINE_DISTANCE(embedding, ?) < 0.5
      ORDER BY distance ASC
      LIMIT ?
    `, [queryEmbedding, queryEmbedding, limit]);

    // Merge and deduplicate results
    const merged = mergeResults(textResults, vectorResults);
    return NextResponse.json({ results: merged, mode: "hybrid" });
  }
}

FTS_MATCH_WORD() handles exact matches. VEC_COSINE_DISTANCE() handles semantic queries. Same database, same query pattern, both search paradigms.

The College Alias Problem: Improving Search Accuracy with Lookup Tables

This is where I wasted a full day.

My first version just did fuzzy search:

SELECT * FROM colleges WHERE LOWER(name) LIKE '%mit%' LIMIT 5

It was slow. And it returned garbage. “MIT” matched “Summit University” and “Smith College” before it matched Massachusetts Institute of Technology.

I tried weights based on string position. Levenshtein distance. Regex patterns. All fragile. All slow.

The fix was embarrassingly simple: maintain a lookup table.

CREATE TABLE college_aliases (
  alias VARCHAR(255) PRIMARY KEY,
  college_id INT,
  INDEX idx_college (college_id)
);

-- Examples
INSERT INTO college_aliases VALUES
  ('UCLA', 110662),
  ('University of California Los Angeles', 110662),
  ('UC Los Angeles', 110662),
  ('USC', 123961),
  ('University of Southern California', 123961),
  ('MIT', 166683),
  ('Massachusetts Institute of Technology', 166683);

Now the chat route extracts possible school names, checks aliases first, then falls back to fuzzy search:

async function findColleges(terms: string[]): Promise<College[]> {
  const colleges: College[] = [];
  
  for (const term of terms) {
    // Try alias lookup first (fast, exact)
    const [aliasRows] = await pool.execute(
      `SELECT c.* FROM colleges c
       JOIN college_aliases a ON c.id = a.college_id
       WHERE LOWER(a.alias) = LOWER(?)`,
      [term]
    );
    
    if (aliasRows.length > 0) {
      colleges.push(aliasRows[0]);
      continue;
    }
    
    // Fall back to fuzzy search
    const [fuzzyRows] = await pool.execute(
      `SELECT * FROM colleges
       WHERE LOWER(name) LIKE ?
       LIMIT 1`,
      [`%${term.toLowerCase()}%`]
    );
    
    if (fuzzyRows.length > 0) {
      colleges.push(fuzzyRows[0]);
    }
  }
  
  return colleges;
}

I also log misses so the system gets better over time:

CREATE TABLE college_lookup_misses (
  term VARCHAR(255),
  searched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Every week I check the missing table and add new aliases. That’s dramatically more reliable than trying to guess what abbreviations people use.

Calculating College ROI: Modeling 20-Year Net Worth and Break-Even Ages

Given a college, major, and career path, I project net worth over 20 years.

The model factors in:

  • College costs (income-adjusted net price)
  • Scholarships
  • Major-specific earnings multipliers (CS: 1.8x, Arts: 0.65x)
  • Career path requirements (med school adds 4 years + ~$240k debt)
  • Loan repayment (10-year standard plan)
  • Savings rate (15%) and investment returns (7%)
function generateLifePath(
  college: College,
  major: string,
  career: CareerData,
  familyIncome: number
): YearlySnapshot[] {
  const timeline: YearlySnapshot[] = [];
  
  // Get income-adjusted net price
  const yearlyCollegeCost = getNetPrice(college, familyIncome);
  const totalCollegeCost = yearlyCollegeCost * 4;
  
  // Major affects starting salary
  const earningsMultiplier = MAJOR_MULTIPLIERS[major] || 1.0;
  const baseSalary = college.earnings_10yr * earningsMultiplier;
  
  // Career might require grad school
  const gradSchoolYears = career.grad_school_years || 0;
  const gradSchoolCost = career.grad_school_cost || 0;
  const residencyYears = career.residency_years || 0;
  
  let totalDebt = totalCollegeCost + gradSchoolCost;
  let netWorth = -totalDebt;
  let currentSalary = 0;
  
  for (let year = 0; year <= 20; year++) {
    const age = 18 + year;
    let phase = "college";
    
    if (year < 4) {
      phase = "college";
      currentSalary = 0;
    } else if (year < 4 + gradSchoolYears) {
      phase = "grad_school";
      currentSalary = 0;
    } else if (year < 4 + gradSchoolYears + residencyYears) {
      phase = "residency";
      currentSalary = career.residency_salary || 60000;
    } else {
      phase = "career";
      const yearsWorking = year - 4 - gradSchoolYears - residencyYears;
      currentSalary = baseSalary * Math.pow(1.03, yearsWorking); // 3% annual raises
    }
    
    // Loan repayment during career years
    const loanPayment = phase === "career" ? (totalDebt / 10) : 0;
    
    // Savings and investment
    const savings = currentSalary * 0.15;
    netWorth = netWorth * 1.07 + savings - loanPayment;
    
    timeline.push({
      year,
      age,
      phase,
      salary: Math.round(currentSalary),
      debt: Math.round(Math.max(0, totalDebt)),
      netWorth: Math.round(netWorth)
    });
    
    totalDebt = Math.max(0, totalDebt - loanPayment);
  }
  
  return timeline;
}

The output is dead simple to understand. For medical paths, it shows the brutal truth: you often don’t break even until your mid-30s, but long-term earnings can compensate.

Output from a AI College ROI Advisor.

And for that friend with the Master’s in Music? The projection would’ve shown exactly what she was getting into. $120k debt, ~$35k starting salary, break-even age: never.

Extracting Real-Time Career Salary Data with Claude Sonnet

My first version hardcoded salary data. I had a giant JSON file with 200+ careers and their salaries.

It was wrong within six months. Salaries change. New careers emerge. I was constantly patching it.

Now I ask Claude Sonnet to look it up:

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 500,
  system: `You are a career data assistant. Given a career, return JSON with:
    - title: normalized job title
    - median_salary: annual median
    - salary_25th: 25th percentile
    - salary_75th: 75th percentile
    - growth_rate: projected job growth %
    - education_years: years of education/training required
    - grad_school_required: boolean
    - grad_school_years: if required
    - grad_school_cost: estimated total cost
    - residency_years: for medical careers
    - residency_salary: if applicable`,
  messages: [{ role: "user", content: `Career: ${careerInput}` }]
});

This handles edge cases like:

  • “pediatric surgeon” (med school + residency)
  • “patent attorney” (law school after undergrad)

Results are cached in-memory to avoid repeated API calls for common careers.

Visualizing Financial Paths with ReactFlow Decision Trees

I generate a decision tree showing outcomes at key stages:

function buildDecisionTree(colleges: College[], lifePaths: LifePath[]): TreeNode[] {
  const nodes: TreeNode[] = [];
  const edges: TreeEdge[] = [];
  
  // Root node: "Now"
  nodes.push({
    id: "start",
    data: { label: "Now (Age 18)", netWorth: 0 },
    position: { x: 0, y: 0 }
  });
  
  for (const [index, college] of colleges.entries()) {
    const path = lifePaths[index];
    
    // Year 1 node
    nodes.push({
      id: `${college.id}-y1`,
      data: {
        label: college.name,
        phase: "Freshman",
        netWorth: path[1].netWorth
      }
    });
    
    // Year 4 node (graduation)
    nodes.push({
      id: `${college.id}-y4`,
      data: {
        label: "Graduation",
        netWorth: path[4].netWorth,
        debt: path[4].debt
      }
    });
    
    // Year 10 node (mid-career)
    nodes.push({
      id: `${college.id}-y10`,
      data: {
        label: "Year 10",
        netWorth: path[10].netWorth,
        salary: path[10].salary,
        sentiment: path[10].netWorth > 0 ? "positive" : "negative"
      }
    });
    
    // Connect nodes
    edges.push({ source: "start", target: `${college.id}-y1` });
    edges.push({ source: `${college.id}-y1`, target: `${college.id}-y4` });
    edges.push({ source: `${college.id}-y4`, target: `${college.id}-y10` });
  }
  
  // Add "Skip College" path for comparison
  nodes.push({
    id: "skip-y10",
    data: {
      label: "No Degree - Year 10",
      netWorth: calculateNoDegreeNetWorth(10),
      salary: 35000
    }
  });
  
  return { nodes, edges };
}

It uses ReactFlow + Dagre for layout. Nodes are color-coded: Green for positive net worth, red for negative.

What I’m Still Iterating On

  • Major multipliers are rough. A CS degree from Stanford and one from a regional state school don’t produce the same outcomes, but right now I apply the same 1.8x multiplier to both.
  • School-specific multipliers are messy. I’m working on field-of-study data from College Scorecard. Lots of missing values and inconsistent categorization.
  • Career lookups aren’t perfectly consistent. Ask about “data scientist” twice and you might get slightly different salary ranges. I’m considering using a structured source like BLS for base numbers and keeping Claude for edge cases and career-path logic.

The Results

What works today:

  • Hybrid search across 6,000+ colleges
  • Income-adjusted cost comparisons across five brackets
  • 20-year projections (with career-specific paths)
  • Decision tree visualizations
  • AI verdicts comparing schools head-to-head
  • 2+2 community college alternatives
  • Voice chat if you don’t want to type

What’s next:

  • Scholarship database integration
  • Geographic cost-of-living adjustments
  • School-specific major earnings (not just national averages)
  • Multi-year historical trends

Try It Yourself

The college data comes from the Department of Education’s College Scorecard. The AI layers add the financial modeling and natural language interface.

If you want to build something similar, TiDB Cloud Starter gives you hybrid search out of the box: FTS_MATCH_WORD for keywords, VEC_COSINE_DISTANCE for semantics, same query, same database. No need to stitch together Postgres plus a vector DB plus Redis.

The main insight from building this: Rankings don’t matter. ROI matters.

A $200k degree that leads to a $50k job is a worse deal than a $40k degree that leads to the same job. The math isn’t complicated. The data just isn’t presented this way anywhere else.

If this tool had existed when my friend was picking schools, she might not be $120k in debt right now. That’s why I built it.

Check out my College Picker GitHub repo to see how I used TiDB Cloud Starter and hybrid search to build this financial-first advisor.


Experience modern data infrastructure firsthand.

Start for Free

Have questions? Let us know how we can help.

Contact Us

TiDB Cloud Dedicated

A fully-managed cloud DBaaS for predictable workloads

TiDB Cloud Starter

A fully-managed cloud DBaaS for auto-scaling workloads