

MASi is building a Natural Language Processing (NLP) model for low-resourced languages to deliver educational content. The model will span a range of NLP tasks, including translation, text generation, and language understanding, with the primary purpose of delivering high-quality educational content across a complete kindergarten-to-grade-12 curriculum.
According to the United Nations, 200 million children worldwide have no access to schooling of any kind, and another 600 million, though enrolled, lack basic literacy and numeracy skills. Barriers to access include a shortage of qualified teachers, costly infrastructure, and the need for sophisticated technology.
These barriers have been mitigated for children who speak rich-resourced languages (e.g., English, Spanish, French), who can access education through a range of edtech programs that deliver content digitally.
For children who speak low-resourced languages, however, such services do not exist, because the underlying language technology does not support their languages. Moreover, even many children who speak rich-resourced languages are shut out of edtech programs that demand sophisticated hardware. MASi is designed to run on a simple smartphone.
Existing NLP models (e.g., GPT, BERT) are built primarily for rich-resourced languages, which blocks the creation of services in low-resourced ones. Our project addresses this gap head-on by developing an NLP model specifically for low-resourced languages, so that educational content can reach every child, in every language, everywhere.
Choosing a Language
NLP Model Requirements
We're building a universal core NLP model with an adaptable framework designed to handle linguistic features common to low-resourced languages. This involves creating specialized layers or modules for features such as complex morphology, tonal variation, and distinctive syntactic structures. To ensure thorough testing and validation, however, we initially focused on a single language that represents a broad spectrum of linguistic diversity. This approach allows us to develop a flexible, scalable model that can be extended to more languages over time. We started with Quechua.
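As a rough illustration of that design, the PyTorch sketch below pairs one shared encoder with pluggable per-language modules. Every class name, layer size, and the Quechua stand-in module are placeholders of ours, not a finalized architecture:

```python
import torch
import torch.nn as nn

class SharedCore(nn.Module):
    """Language-agnostic transformer encoder shared across all languages."""
    def __init__(self, vocab_size=32000, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))

class LanguageAwareModel(nn.Module):
    """Shared core plus a pluggable per-language module (morphology, tone, ...)."""
    def __init__(self, core, language_modules):
        super().__init__()
        self.core = core
        self.language_modules = nn.ModuleDict(language_modules)

    def forward(self, token_ids, lang):
        return self.language_modules[lang](self.core(token_ids))

# Placeholder: a Quechua-specific module is just another nn.Module.
model = LanguageAwareModel(SharedCore(), {"quechua": nn.Linear(512, 512)})
out = model(torch.randint(0, 32000, (1, 8)), "quechua")  # shape (1, 8, 512)
```

Adding a new language then means training only its module, with the shared core held fixed.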
Linguistic Criteria
- Number of Speakers: Enough speakers to make a meaningful impact.
- Geographic Distribution: Languages spoken in diverse and underserved areas.
- Educational Barriers: Limited access to formal education.
- Digital Resource Scarcity: Minimal digital texts, annotated corpora, and linguistic databases.
- Linguistic Diversity: Languages with different linguistic features to ensure model versatility.
- Cultural Significance: Languages with strong cultural heritage and potential for preservation.
- Feasibility of Data Collection: Ability to gather data from native speakers.
Why Quechua
Based on these criteria, we selected Quechua, spoken across the Andean region of South America, as our pilot language:
- Rationale: Quechua is spoken by approximately 8-10 million people in the Andean regions of Peru, Bolivia, Ecuador, Colombia, and Argentina. The language is essential for cultural preservation, yet its speakers have limited access to educational materials.
- Linguistic Features: Quechua has highly agglutinative morphology and multiple dialects, providing a diverse linguistic landscape for testing the model's flexibility.
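To make that concrete: a single Quechua word can carry what English spreads over a phrase, e.g. wasikunapi = wasi (house) + -kuna (plural) + -pi (in), "in the houses". The toy Python sketch below greedily strips a few common suffixes; it is purely illustrative, the glosses are simplified, and real Quechua morphology needs a proper analyzer:

```python
# Toy illustration: greedy right-to-left stripping of a few common
# Quechua suffixes. Not a real morphological analyzer.
SUFFIXES = {
    "kuna": "PLURAL",
    "manta": "ABLATIVE (from)",
    "wan": "COMITATIVE (with)",
    "pi": "LOCATIVE (in/at)",
    "ta": "ACCUSATIVE",
}

def segment(word):
    """Strip known suffixes from the right, longest match first."""
    morphs = []
    changed = True
    while changed:
        changed = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            # Keep at least a short stem so we never strip a whole word.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                morphs.insert(0, (suffix, SUFFIXES[suffix]))
                word = word[: -len(suffix)]
                changed = True
                break
    return word, morphs

# wasikunapi -> ('wasi', [('kuna', 'PLURAL'), ('pi', 'LOCATIVE (in/at)')])
print(segment("wasikunapi"))
```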
The Technological Challenge
01 Data Scarcity and Quality
- Risk: Low-resourced languages often have limited annotated data available for training. The lack of extensive, high-quality datasets poses a significant risk to the model's ability to learn and generalize effectively.
- Impact: Without sufficient data, the models may not capture the linguistic nuances and cultural context required for accurate translations and language processing.
02 Handling Unique Linguistic Features
- Risk: Low-resourced languages frequently possess unique linguistic characteristics, such as complex morphology, tonal variations, and distinct syntactic structures. Adapting existing models to handle these features accurately is technically challenging.
- Impact: Failure to accurately model these features can result in poor translation quality, misinterpretations, and reduced effectiveness of the educational content delivered.
03 Bias and Representation
- Risk: Ensuring that the models are unbiased and fairly represent all dialects and cultural aspects of low-resourced languages is complex. Training data often contains inherent biases that can be perpetuated and amplified by the models.
- Impact: Biases in the models can lead to unfair treatment of certain language variants, cultural insensitivity, and lack of acceptance by the communities the project aims to serve.
04 Scalability and Accessibility
- Risk: Developing a deployment strategy that ensures the models are scalable and accessible in regions with limited internet connectivity and low-power devices is challenging.
- Impact: If the deployment is not optimized for these conditions, the technology may not reach the intended users, limiting the project's societal impact.
How We Address These Challenges
Augmentation and Synthesis
Using back-translation, paraphrasing, and synthetic data generation to expand the training datasets.
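For instance, back-translation lets monolingual Quechua text stand in for scarce parallel data: machine-translate it into a pivot language such as Spanish, then pair the noisy pivot output with the genuine Quechua. A minimal sketch with Hugging Face pipelines follows; the checkpoint name is a placeholder, not a model we are citing:

```python
from transformers import pipeline

# Placeholder checkpoint name -- substitute whichever qu->es model is in use.
qu_to_es = pipeline("translation", model="masi/translate-qu-es")

def back_translate(quechua_sentences):
    """Build synthetic (Spanish, Quechua) pairs from monolingual Quechua.
    The Spanish side is machine-generated and noisy, but the Quechua side
    is genuine, which is what matters when training a Spanish->Quechua model."""
    spanish = [r["translation_text"] for r in qu_to_es(quechua_sentences)]
    return list(zip(spanish, quechua_sentences))
```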
Community Engagement
Involving native speakers and linguistic experts in the data collection and validation process to ensure cultural and linguistic accuracy.
Data Bias Detection and Mitigation
Implementing algorithms and techniques to detect, measure, and mitigate biases in the training data and model outputs.
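On the measurement side, a minimal sketch, assuming a dialect-labeled held-out set: score each dialect separately and flag any dialect trailing the best-served one by more than a chosen margin. The dialect names, records, and margin below are illustrative:

```python
from collections import defaultdict

def per_dialect_accuracy(examples):
    """examples: iterable of (dialect, prediction, gold) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for dialect, pred, gold in examples:
        totals[dialect] += 1
        hits[dialect] += int(pred == gold)
    return {d: hits[d] / totals[d] for d in totals}

def flag_underserved(scores, margin=0.05):
    """Flag dialects trailing the best-served dialect by more than `margin`."""
    best = max(scores.values())
    return [d for d, acc in scores.items() if best - acc > margin]

# Illustrative records only.
scores = per_dialect_accuracy([
    ("cusco", "a", "a"), ("cusco", "b", "b"),
    ("ancash", "a", "b"), ("ancash", "b", "b"),
])
print(scores, flag_underserved(scores))  # cusco 1.0, ancash 0.5 -> ['ancash']
```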
Specialized Layers and Modules
Developing custom layers and modules tailored to handle the unique linguistic features of low-resourced languages.
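One shape such a module can take is a small bottleneck adapter added after each transformer layer, so language-specific behavior lives in a few extra parameters while the shared core stays frozen. The sketch below shows the generic adapter pattern under assumed sizes, not our final design:

```python
import torch
import torch.nn as nn

class MorphologyAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, plus a
    residual connection so an untrained adapter is close to the identity."""
    def __init__(self, hidden_size=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = MorphologyAdapter()
out = adapter(torch.randn(2, 16, 512))  # (batch, tokens, hidden) shape preserved
```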
Optimized Deployment
Using model compression, edge computing, and offline capabilities to ensure the models are accessible in low-connectivity regions.
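As one concrete piece of that toolbox, post-training dynamic quantization stores linear-layer weights as 8-bit integers, typically shrinking a transformer checkpoint by roughly 4x, which matters on low-power phones. Below is a minimal PyTorch sketch on a stand-in model; a real deployment would also re-check accuracy after quantization:

```python
import torch
import torch.nn as nn

# Stand-in for the trained NLP model.
model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)
)

# Convert Linear weights to int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Smaller artifact, suitable for shipping to on-device, offline use.
torch.save(quantized.state_dict(), "masi_quantized.pt")
```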