Alumni Career Success Prediction

Nov 2023

Random Forest, Statistical Modeling, Python

The Business Problem

Universities track alumni career outcomes but struggle to extract insights from fragmented data. Alumni records sit across multiple spreadsheets with inconsistent formats, different currencies, and no standardization.

This creates problems for:

  • Academic programs: Can't identify which curriculum elements lead to career success
  • Career services: Don't know which industries or geographies offer the best opportunities
  • Student advising: Lack data to guide students toward high-growth career paths
  • Institutional planning: Can't demonstrate ROI to stakeholders or inform strategic decisions

The opportunity: Merge fragmented alumni data, predict career trajectories, and identify patterns that inform curriculum development and student guidance.

Scale & Approach

I analyzed 3,300+ alumni records from IIITDM Kancheepuram across 4 separate datasets with completely different structures. The challenge wasn't just prediction but data unification across inconsistent formats, currencies, and time periods.

The Data Challenge

Four Fragmented Sources

Alumni data existed in 4 separate Excel files with zero standardization:

  • Academics Brochure: Basic employment info with mixed formatting
  • Working Abroad Records: International alumni with salaries in USD
  • ECE Data Collection Form: Survey responses with free-text job titles
  • Split Database: Historical records with outdated notation

Data Cleaning Challenges

The biggest challenge was standardizing salaries across multiple currencies, time periods, and notations. I unified 3,300+ records into a consistent format for modeling.
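A minimal sketch of what the salary standardization step could look like. The function name, the argument names, and the exchange rate are assumptions made for illustration; the actual conversion rate and field layout used in the project are not stated here.

```python
# Assumed constants for illustration only; the project's actual rate is not stated.
USD_TO_INR = 83.0   # hypothetical exchange rate
LAKH = 100_000      # 1 lakh = 100,000 INR


def to_annual_inr(amount, currency="INR", unit="absolute", period="annual"):
    """Convert one salary figure to annual INR.

    unit:   "absolute" (plain number) or "lakh" (e.g. 12 means 12 lakhs)
    period: "annual" or "monthly"
    """
    value = float(amount)
    if unit == "lakh":
        value *= LAKH
    if period == "monthly":
        value *= 12
    if currency == "USD":
        value *= USD_TO_INR
    elif currency != "INR":
        raise ValueError(f"unsupported currency: {currency}")
    return value
```

For example, `to_annual_inr(12, unit="lakh")` and `to_annual_inr(100_000, period="monthly")` both return 1,200,000.0, so records written in either notation end up on the same scale.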

Specific cleaning tasks:

  • Salary normalization: Converted figures reported in USD or INR, in lakhs per annum, and as monthly or annual amounts into a single unit (annual INR)
  • Job title standardization: Used regex and string matching to unify "Software Engineer," "SDE," "Developer" into consistent categories
  • Geographic standardization: Cleaned country names (USA vs United States vs US)
  • Missing data handling: Imputed missing values using median by job category and location
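The title, country, and imputation steps above can be sketched roughly as follows. The regex rules, the alias table, and the column names (`job_category`, `country`, `salary_inr`) are hypothetical stand-ins, not the project's actual mapping.

```python
import re

import pandas as pd

# Hypothetical pattern-to-category rules; the real mapping is not given in this write-up.
TITLE_RULES = [
    (re.compile(r"\bdata\s+scientist\b", re.I), "Data Scientist"),
    (re.compile(r"\b(sde|software\s+engineer|developer)\b", re.I), "Software Engineer"),
]

# Hypothetical alias table for country names.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
}


def standardize_title(raw):
    """Map a free-text job title to the first matching category."""
    for pattern, category in TITLE_RULES:
        if pattern.search(raw):
            return category
    return "Other"


def standardize_country(raw):
    """Collapse spelling variants of a country name to one canonical form."""
    key = raw.strip().lower().replace(".", "")
    return COUNTRY_ALIASES.get(key, raw.strip())


def impute_salary(df):
    """Fill missing salaries with the median of the same job category and country."""
    group_median = df.groupby(["job_category", "country"])["salary_inr"].transform("median")
    return df.assign(salary_inr=df["salary_inr"].fillna(group_median))
```

A group with no observed salaries at all stays missing under this approach, so a fallback (for example, the median by job category alone) would still be needed for sparse combinations.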

Career Pattern Analysis

Key Findings

Data scientists and senior software engineers commanded the highest salaries (14-18 LPA median). Salaries varied significantly within roles by geography and company tier.

What the data revealed:

  • Geographic arbitrage matters more than job title: The same designation showed 3x salary variation by country. A data scientist in the US earned 3x more than a data scientist in India.
  • International mobility is the biggest lever: Alumni working abroad earned 2-3x more than those in domestic positions across all roles.
  • Career path clustering: Identified three distinct trajectories (industry tech, academia, entrepreneurship) with different salary curves.
  • Time-to-senior matters: Promotions to senior roles within 3 years correlated with 40% higher long-term earnings.

The Solution: Predictive Models

Built two complementary models to help advise students and evaluate program outcomes:

Model 1: Salary Prediction (Regression)

  • Random Forest regressor predicting annual salary as a point estimate from designation, country, and years since graduation
  • Used for career services to estimate compensation ranges for different paths
  • Helps students understand earning potential by role and geography
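A minimal sketch of how such a regressor could be wired up with scikit-learn. The rows below are synthetic and the column names are assumptions; this shows the shape of the approach rather than the project's actual code or data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic rows for illustration only; not real alumni data.
train = pd.DataFrame({
    "designation": ["Software Engineer", "Data Scientist", "Software Engineer", "Data Scientist"],
    "country": ["India", "India", "United States", "United States"],
    "years_since_graduation": [2, 4, 3, 5],
    "salary_inr": [900_000, 1_500_000, 3_000_000, 4_500_000],
})
features = ["designation", "country", "years_since_graduation"]

# One-hot encode the categorical columns; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [("categories", OneHotEncoder(handle_unknown="ignore"), ["designation", "country"])],
    remainder="passthrough",
)
salary_model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
])
salary_model.fit(train[features], train["salary_inr"])

# Estimate salary for a hypothetical profile.
query = pd.DataFrame([
    {"designation": "Data Scientist", "country": "India", "years_since_graduation": 3}
])
estimate = salary_model.predict(query)[0]
```

Because a Random Forest averages training targets, the estimate always falls within the range of salaries it has seen, which is worth keeping in mind when presenting it as a compensation range.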

Model 2: Career Success Classification (Binary)

  • Random Forest classifier predicting whether an alumnus's salary exceeds the median salary threshold
  • Used to identify high-growth career patterns and curriculum correlations
  • Helps academic programs understand which elements lead to above-median outcomes
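The binary target can be derived directly from the standardized salaries. The sketch below assumes a single overall median as the threshold and uses synthetic values and hypothetical feature names; `class_weight="balanced"` is one way to express the class weighting mentioned in the training approach, not necessarily the exact setting used.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def label_success(salaries):
    """Return 1 where salary is strictly above the median of the series, else 0."""
    return (salaries > salaries.median()).astype(int)


# Synthetic values for illustration only.
salaries = pd.Series([600_000, 900_000, 1_200_000, 1_500_000, 4_000_000])
y = label_success(salaries)

# Hypothetical numeric features; real features would be encoded designation, country, etc.
X = pd.DataFrame({
    "years_since_graduation": [1, 2, 3, 4, 6],
    "works_abroad": [0, 0, 0, 0, 1],
})

classifier = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
classifier.fit(X, y)
```

With the strict comparison, a record sitting exactly at the median is labeled 0, which is one reason the two classes are not always the same size.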

Technical Implementation

Built with Python, scikit-learn, and pandas. Used Random Forest for both tasks because it handles non-linear relationships between geography, designation, and salary better than linear models.

Geography and designation are the strongest predictors of salary outcomes. Years since graduation has moderate but non-linear impact.

Model training approach:

  • Hyperparameter optimization: GridSearchCV with 5-fold cross-validation to tune tree depth, features, and estimators
  • Class balancing: Used class weights to handle imbalanced success/non-success ratios
  • Feature engineering: Combined designation and geography into interaction features (e.g., "SDE-USA")
  • Log transformation: Applied to skewed salary distributions for better model performance
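As a rough sketch, the grid search, interaction feature, and log transformation could be combined for the regression model as shown below. The data is randomly generated, the grid values are placeholders, and the scoring metric is an assumption; only the overall structure (5-fold GridSearchCV over depth, features, and estimators) comes from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Randomly generated data for illustration only; not real alumni records.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "designation": rng.choice(["SDE", "Data Scientist"], size=n),
    "country": rng.choice(["India", "USA"], size=n),
    "years_since_graduation": rng.integers(0, 10, size=n),
})
# Interaction feature, e.g. "SDE-USA".
df["designation_country"] = df["designation"] + "-" + df["country"]
base = np.where(df["country"] == "USA", 3_000_000, 1_000_000)
df["salary_inr"] = base * (1 + 0.05 * df["years_since_graduation"]) * rng.lognormal(0, 0.1, size=n)

categorical = ["designation", "country", "designation_country"]
preprocess = ColumnTransformer(
    [("categories", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(random_state=0)),
])
# Fit on log1p(salary) and map predictions back with expm1.
model = TransformedTargetRegressor(regressor=pipeline, func=np.log1p, inverse_func=np.expm1)

# Placeholder grid; the values actually searched are not stated in this write-up.
param_grid = {
    "regressor__forest__n_estimators": [50, 100],
    "regressor__forest__max_depth": [None, 5],
    "regressor__forest__max_features": ["sqrt", 1.0],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="neg_mean_absolute_error")
X = df.drop(columns="salary_inr")
search.fit(X, df["salary_inr"])
best_params = search.best_params_
```

Wrapping the pipeline in `TransformedTargetRegressor` keeps the log transform inside cross-validation, so errors are scored in INR rather than in log units.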

Technologies: Python 3.8+, pandas, scikit-learn, joblib for model persistence, openpyxl for Excel integration.

Business Impact for Universities

For Career Services:

  • Data-driven guidance: Replace generic advice with specific salary predictions by role and geography
  • Opportunity identification: Show students which industries and geographies offer highest growth
  • Alumni network targeting: Identify successful alumni in target sectors for mentorship programs

For Academic Programs:

  • Curriculum validation: Correlate course selections with career success metrics
  • Program ROI demonstration: Show stakeholders concrete alumni outcomes by major and specialization
  • Resource allocation: Invest in programs and partnerships that lead to high-growth careers

For Student Advising:

  • Personalized recommendations: Predict which career paths suit student interests and market demand
  • Geographic planning: Help students understand international vs. domestic opportunity tradeoffs
  • Timeline benchmarking: Set realistic expectations for salary growth and promotion trajectories

What I Learned

Data quality matters more than model complexity. 80% of the work was cleaning and unifying fragmented datasets. The modeling itself was straightforward once the data was standardized. Universities need data infrastructure before advanced analytics.

Geography dominates job title. The same role in different countries shows massive salary variance. This matters for advising students on international opportunities and understanding real career outcomes beyond job titles.

Missing data reveals patterns. Alumni who didn't respond to surveys tended to be in lower-paying roles or unemployed. This non-response bias means reported outcomes overstate success rates. Honest analysis requires acknowledging these gaps.

Next Steps:

  • Longitudinal tracking: Track career progression over time, not just snapshots
  • Curriculum correlation: Link specific courses and projects to career outcomes
  • Automated pipeline: Build a system to ingest new alumni data and update predictions continuously
  • Interactive dashboard: Let career services explore predictions for different scenarios