Info Update #91

Merged
merged 1 commit into from
Oct 22, 2024
2 changes: 1 addition & 1 deletion frontend/components/Dynamic/Area.tsx
@@ -33,7 +33,7 @@ Additionally, we introduced the Bharat Parallel Corpus Collection (BPCC), which
description: `At AI4Bharat, our dedication to building language models and datasets for all 22 constitutionally
recognized Indian languages is central to our mission. We employ a multifaceted approach, leveraging
large-scale data crawling, synthetic data creation, and human annotation/crowd collections to create
- comprehensive datasets. Our efforts have resulted in an extensive pretraining corpus of 251 million
+ comprehensive datasets. Our efforts have resulted in an extensive pretraining corpus of 251 billion
tokens across 22 languages, complemented by 74.7 million prompt-response pairs in 20 Indian
languages. Tools like Setu play a crucial role in large-scale crawling and data cleaning, enabling
us to build state-of-the-art models such as Airavata, IndicBART, and IndicBERT. We also emphasize