Request for Guidance on Adding Punjabi Language Support to NLLB · facebookresearch/fairseq#5630

(0 comments) (0 reactions) (0 assignees) Python (6,224 forks)

enhancement, help wanted, needs triage

Repository metrics

Stars: 29,107
PR merge metrics: no PRs merged in the last 30 days

Description

Hello NLLB/fairseq team,

I’m reaching out to explore how to fine-tune the NLLB model to better support Punjabi, a vibrant language spoken by over 100 million people worldwide, including a historic Sikh community in California that has thrived since 1909.

As part of efforts to preserve and promote Punjabi in digital spaces, I’d like to understand:

Requirements for fine-tuning NLLB for Punjabi – Are there specific considerations for its Gurmukhi script or dialectal variations (e.g., Eastern vs. Western Punjabi)?

Existing tutorials – Is there a guide for adding new languages, particularly those with rich literary traditions, such as Punjabi?

Data needs – What type/amount of parallel data (e.g., Punjabi-English) would be optimal? Could community-translated datasets (e.g., religious texts, literature, or news) supplement existing resources?

Leveraging seed datasets – Are there templates (such as the NLLB-Seed dataset) that we could adapt for Punjabi?

Punjabi is a culturally significant language with deep roots in California’s Sikh diaspora, and I’d love to contribute to its inclusion in NLLB. Any advice or resources you could share would be invaluable!

Thank you for your time and for working on multilingual AI.

Best regards, Manav

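Contributor guide

Research direction: Review the NLLB fine-tuning documentation and examples. Understand the data requirements and the available seed datasets. Explore community resources for Punjabi parallel data, and adapt the fine-tuning pipeline.
Tech stack: Python, PyTorch
Domain: AI
Issue type: Research
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Newly open to contributors
Clarity: Clear
Prerequisites: Python, PyTorch, fairseq basics
Beginner friendliness: 70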
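
As a starting point for the data-exploration step, here is a minimal sketch of a script check for community-sourced Punjabi-English pairs. It assumes the goal is to keep only pairs whose Punjabi side is written in Gurmukhi (the Unicode block U+0A00 to U+0A7F), since Western Punjabi is usually written in Shahmukhi, which uses the Arabic block. The helper names `script_ratio` and `filter_gurmukhi_pairs` and the 0.8 threshold are hypothetical choices for illustration, not part of fairseq or NLLB.

```python
import unicodedata

# Unicode block ranges (inclusive). These come from the Unicode standard,
# not from fairseq or NLLB.
GURMUKHI = (0x0A00, 0x0A7F)  # script used for Eastern Punjabi
ARABIC = (0x0600, 0x06FF)    # block used by Shahmukhi (Western Punjabi)


def script_ratio(text, block):
    """Return the fraction of letters and combining marks in `text`
    whose code points fall inside `block`. Hypothetical helper."""
    lo, hi = block
    # Count only letters (L*) and marks (M*); skip digits, spaces, punctuation.
    chars = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not chars:
        return 0.0
    inside = sum(1 for ch in chars if lo <= ord(ch) <= hi)
    return inside / len(chars)


def filter_gurmukhi_pairs(pairs, min_ratio=0.8):
    """Keep (punjabi, english) pairs whose Punjabi side is mostly Gurmukhi.
    The 0.8 default is an arbitrary illustrative threshold."""
    return [
        (pa, en) for pa, en in pairs
        if script_ratio(pa, GURMUKHI) >= min_ratio
    ]


if __name__ == "__main__":
    sample = [
        ("ਸਤ ਸ੍ਰੀ ਅਕਾਲ", "Hello"),      # Gurmukhi
        ("سلام", "Hello"),              # Arabic-script text
        ("Hello there", "Hello there"),  # untranslated copy
    ]
    for pa, en in filter_gurmukhi_pairs(sample):
        print(pa, "->", en)
```

A filter like this only checks the script, not translation quality, so it would be one early cleaning pass before any alignment or deduplication. Which language code the cleaned data should be tagged with is worth verifying against the model's own language list.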
