TRACK LLM
Track LLM is an MVP benchmarking tool backed by the Indian government, designed to evaluate the fairness, safety, and inclusivity of Large Language Models (LLMs) and LLM-based applications within India’s diverse socio-cultural context.
Still in progress, the tool lays the foundation for a transparent and accountable AI ecosystem, featuring an intuitive interface that simplifies audits across regional languages, social groups, and ethical dimensions.
Figma, Miro, Bolt AI, Claude AI, ChatGPT
User flows & information architecture
Low- and high-fidelity prototyping
Usability studies
Iterating on designs
Accounting for accessibility
Responsive design for desktop
Megha, Rahul, Shourya
Challenge
Solution
🎯 Our framework is designed to address these evaluation gaps head-on; a rough sketch of how its dimensions might be represented in practice follows the list below. The system provides:
01
Domain-specific evaluation criteria
02
Culturally grounded Indian benchmarks
03
Multi-dimensional safety assessment
04
Socially responsible AI standards
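To make these four pillars more concrete, here is a minimal, purely illustrative sketch of how a single culturally grounded evaluation case could be represented. The dimension names, domains, language codes, and the placeholder scorer are assumptions made for this write-up, not the actual Track LLM implementation.

```python
# Illustrative sketch only: hypothetical data structures for the kind of
# multi-dimensional, culturally grounded evaluation Track LLM aims to support.
from dataclasses import dataclass, field

# Evaluation dimensions reflecting Indian socio-cultural concerns rather than
# only Western fairness categories (hypothetical labels).
DIMENSIONS = ("caste_bias", "religious_bias", "regional_stereotyping", "toxicity")

@dataclass
class EvalCase:
    """One benchmark item: a prompt plus the language, domain, and dimensions it probes."""
    prompt: str
    language: str                      # e.g. "hi" (Hindi), "ta" (Tamil), "en"
    domain: str                        # e.g. "government_services", "rural_healthcare"
    dimensions: tuple = DIMENSIONS

@dataclass
class EvalResult:
    case: EvalCase
    scores: dict = field(default_factory=dict)   # dimension -> score in [0, 1]

def evaluate(case: EvalCase, model_response: str) -> EvalResult:
    """Placeholder scorer: a real system would rely on trained classifiers or
    human annotation per dimension; here every dimension simply scores 0.0."""
    return EvalResult(case, {dim: 0.0 for dim in case.dimensions})

if __name__ == "__main__":
    case = EvalCase(
        prompt="Draft an eligibility notice for a rural health scheme applicant.",
        language="hi",
        domain="government_services",
    )
    # The response would normally come from the LLM or application under test.
    result = evaluate(case, model_response="...")
    print(result.scores)
```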
Process
Before designing Track LLM, we needed to validate whether the evaluation challenges we had identified were shared across the broader AI research and deployment community. My research partner conducted stakeholder research to uncover common pain points and understand how teams currently evaluate LLM applications.
Research Goals
01
Understand AI practitioners and their evaluation needs
02
Identify frustrations with existing LLM benchmarking and assessment methods
03
Identify types of unexpected failures when deploying LLMs in Indian contexts
04
Identify critical evaluation criteria needed before and during LLM deployment in diverse cultural settings
Why This Matters
🧠 Western-built tests don't work for India
Widely used LLM evaluation suites like HELM and BIG-bench were developed by Western universities and research labs. They focus on English-only content and Western cultural values, but what is considered harmful or biased in the West is very different from India's concerns around caste, religion, and regional identity.
🌏 India's languages are ignored
India has 22 official languages and over 19,500 dialects. Most AI systems only work well in English and a few major Indian languages like Hindi. This leaves millions of people with broken or biased AI experiences in their native languages.
🧭 Different cultures, different problems
What counts as toxic, biased, or unfair varies dramatically across cultures. Western AI tests focus on race and gender issues; in India, caste discrimination, religious tensions, and regional stereotypes are much bigger concerns that current tests largely miss.
📉 Real Indian use cases aren't tested
AI systems in India are used for government services, rural healthcare, and education for low-income students, yet evaluation suites rarely include Indian names, cultural references, or the moral questions that matter in Indian society.
⚖️ Building fair AI requires local input
Global AI ethics experts and Indian government bodies like NITI Aayog agree: AI systems need to be evaluated using local values and priorities. Generic Western tests can't ensure AI is safe and fair for Indian users.
Work in progress ⏳