facebookresearch/fairseq

Request for Guidance on Adding Punjabi Language Support to NLLB

Open

#5630 created on July 23, 2025

Labels: enhancement, help wanted, needs triage

Description

Hello NLLB/fairseq team,

I’m reaching out to explore how to fine-tune the NLLB model to better support Punjabi, a vibrant language spoken by over 100 million people worldwide, including a historic Sikh community in California that has thrived since 1909.

As part of efforts to preserve and promote Punjabi in digital spaces, I’d like to understand:

Requirements for fine-tuning NLLB for Punjabi – Are there specific considerations for its Gurmukhi script or dialectal variations (e.g., Eastern vs. Western Punjabi)?

Existing tutorials – Is there a guide for adding new languages, particularly those with rich literary traditions, such as Punjabi?

Data needs – What type/amount of parallel data (e.g., Punjabi-English) would be optimal? Could community-translated datasets (e.g., religious texts, literature, or news) supplement existing resources?

Leveraging seed datasets – Are there templates (such as the NLLB-Seed dataset) that we could adapt for Punjabi?

Punjabi is a culturally significant language with deep roots in California’s Sikh diaspora, and I’d love to contribute to its inclusion in NLLB. Any advice or resources you could share would be invaluable!

Thank you for your time and for working on multilingual AI.

Best regards, Manav
