TRACK LLM

Track LLM is an MVP benchmarking tool backed by the Indian government, designed to evaluate the fairness, safety, and inclusivity of Large Language Models (LLMs) and LLM-based applications within India’s diverse socio-cultural context.

Still in progress, the tool lays the foundation for a transparent and accountable AI ecosystem, featuring an intuitive interface that simplifies audits across regional languages, social groups, and ethical dimensions.

Software

Figma, Miro, Bolt AI, Claude AI, ChatGPT

Responsibilities

User flows & information architecture
Low and high fidelity prototyping
Usability studies
Iterating on designs
Accounting for accessibility
Responsive design for desktop

Team

Megha, Rahul, Shourya

Challenge

01 Uncertainty about contextual relevance

  • How do we know if Western-developed benchmarks apply to Indian users?

  • What cultural biases are embedded in current evaluation metrics?

  • Are we measuring what actually matters for Indian applications?

02 Frustration with one-size-fits-all approaches

  • The same toxicity standards applied to a therapist bot and a math tutor make no sense

  • Fairness metrics that work in the Western world completely miss caste-based discrimination

  • Universal benchmarks ignore the nuanced social hierarchies that shape Indian interactions

03 Missing linguistic and cultural complexity

  • Evaluations don't account for code-switching between Hindi, English, and regional languages

  • They miss the cultural context that determines whether something is respectful or offensive

  • There's no consideration for how gender, religion, and social status affect appropriate AI responses

04 Skepticism toward generic evaluation claims

  • The research says this model is "safe," but safe for whom, and in what context?

  • They show impressive benchmark scores, but will this actually work for Indian students?

  • How do we know these evaluations aren't just reflecting Western academic assumptions?

Solution

🎯 Our framework is designed to address these evaluation gaps head-on. The system provides the following, with a hypothetical configuration sketch after the list:

01 Domain-specific evaluation criteria

02 Culturally grounded Indian benchmarks

03 Multi-dimensional safety assessment

04 Socially responsible AI standards
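
To make these pillars concrete, here is a minimal, hypothetical sketch of how domain-specific criteria and culturally grounded safety dimensions could be expressed as configuration. Everything in it (the EvaluationProfile class, dimension names such as caste_religion_bias, the thresholds) is an illustrative assumption, not Track LLM's actual schema.

```python
# Hypothetical illustration only; Track LLM's real evaluation schema is not shown here.
from dataclasses import dataclass


@dataclass
class EvaluationProfile:
    domain: str                           # e.g. "math_tutor", "therapy_support"
    languages: list[str]                  # languages and code-switching pairs to test
    safety_dimensions: dict[str, float]   # dimension name -> minimum acceptable score (0 to 1)


# A math tutor is held to strict factuality, while a therapy bot needs
# stricter emotional-safety and toxicity thresholds in every language tested.
MATH_TUTOR = EvaluationProfile(
    domain="math_tutor",
    languages=["en", "hi", "hi-en"],      # Hindi-English code-switching included
    safety_dimensions={
        "factual_accuracy": 0.95,
        "caste_religion_bias": 0.99,      # India-specific fairness dimension
        "toxicity": 0.90,
    },
)

THERAPY_SUPPORT = EvaluationProfile(
    domain="therapy_support",
    languages=["en", "hi", "ta", "hi-en"],
    safety_dimensions={
        "emotional_safety": 0.99,
        "caste_religion_bias": 0.99,
        "toxicity": 0.99,
    },
)


def passes(profile: EvaluationProfile, scores: dict[str, float]) -> bool:
    """Return True only if every required dimension meets its threshold."""
    return all(scores.get(dim, 0.0) >= bar
               for dim, bar in profile.safety_dimensions.items())
```

Used this way, the same model can pass for one domain and fail for another, which is exactly the gap that one-size-fits-all benchmarks leave open.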

MOCKUPS

Process

Before designing Track LLM, we needed to validate whether the evaluation challenges we identified were experienced across the broader AI research and deployment community. My research partner conducted stakeholder research to uncover common pain points and understand how teams currently evaluate LLM applications.

Research Goals

01 Understand AI practitioners and their evaluation needs

02 Identify frustrations with existing LLM benchmarking and assessment methods

03 Identify types of unexpected failures when deploying LLMs in Indian contexts

04 Identify critical evaluation criteria needed before and during LLM deployment in diverse cultural settings

Why This Matters

🧠 Western-built tests don't work for India

Current AI evaluation suites like HELM and BIG-bench were built largely by Western universities and labs. They focus on predominantly English content and Western cultural values, but what's considered harmful or biased in the West is very different from India's concerns around caste, religion, and regional differences.

🌏 India's languages are ignored

India has 22 official languages and more than 19,500 languages and dialects spoken as mother tongues. Most AI systems work well only in English and a few major Indian languages such as Hindi. This leaves millions of people with broken or biased AI experiences in their native languages.

🧭 Different cultures, different problems

What counts as toxic, biased, or unfair varies dramatically. Western AI tests focus on race and gender issues. In India, caste discrimination, religious tensions, and regional stereotypes are much bigger concerns that current tests completely miss.

📉 Real Indian use cases aren't tested

AI systems in India are used for government services, rural healthcare, and education for low-income students. But evaluation tests don't include Indian names, cultural references, or the moral questions that matter in Indian society.

⚖️ Building fair AI requires local input

Global AI ethics experts and Indian government bodies like NITI Aayog agree: AI systems need to be evaluated using local values and priorities. Generic Western tests can't ensure AI is safe and fair for Indian users.

Work in progress ⏳

Like what you see?

Let's connect

E-mail: meghasai286@gmail.com
