mBERT-Burmese-Register-Classifier
This model is a fine-tuned version of multilingual BERT (mBERT) designed to classify the linguistic registers of the Myanmar (Burmese) language. Specifically, it distinguishes between Formal (Written) and Colloquial (Spoken) styles.
Model Details
Model Description
The mBERT-Burmese-Register-Classifier was developed to fill a gap in Myanmar Natural Language Processing: automatic style and register detection. In Burmese, the written and spoken forms differ significantly in grammar, particles, and sentence endings. This model identifies these nuances to help downstream applications ensure stylistic consistency.
- Developed by: Khant Sint Heinn (Kalix Louis)
- Model type: Transformer-based Text Classification
- Language(s) (NLP): Burmese (my)
- License: Apache 2.0
- Finetuned from model: bert-base-multilingual-cased
Model Sources
- Repository: kalixlouiis/mBERT-Burmese-Register-Classifier
- Dataset: myanmar-written-spoken-text-pairs
Uses
Direct Use
- Detecting formality levels in Burmese text.
- Pre-processing data for Machine Translation or Style Transfer.
- Assisting grammar-checking tools for the Myanmar language.
Out-of-Scope Use
- The model is not intended for deep sentiment analysis or dialect detection.
- It should not be used as the sole authority for academic or legal document validation without human oversight.
Bias, Risks, and Limitations
While the model shows strong performance on the validation set, users should be aware of the following limitations:
- Data Sparsity: The model was trained on a relatively small dataset (approx. 3,286 augmented rows from 1,643 parallel pairs). Consequently, it may struggle with highly diverse or niche vocabularies.
- Short Texts: Very short sentences (e.g., 1-2 words) that lack distinctive grammatical markers (like "သည်" or "တယ်") may lead to inaccurate classifications.
- Neutral Registers: Phrases that are valid in both spoken and written contexts (Natural/Neutral text) may cause the model to provide low-confidence or fluctuating predictions.
- Hybrid Styles: In modern social media contexts where formal and spoken styles are often blended, the model might not accurately capture the intended register.
Recommendations
Users are encouraged to use this model as a diagnostic tool rather than an absolute classifier. For critical applications, please implement a confidence threshold check (e.g., only accepting predictions with >0.95 probability).
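A minimal sketch of such a check (the 0.95 threshold comes from the suggestion above; the classify_with_threshold helper is illustrative, not part of the model):

from transformers import pipeline

classifier = pipeline("text-classification", model="kalixlouiis/mBERT-Burmese-Register-Classifier")

CONFIDENCE_THRESHOLD = 0.95

def classify_with_threshold(text):
    # pipeline output is a list of dicts like [{'label': ..., 'score': ...}]
    prediction = classifier(text)[0]
    if prediction["score"] >= CONFIDENCE_THRESHOLD:
        return prediction["label"]
    return None  # low confidence: defer to human review

label = classify_with_threshold("ယနေ့ ရာသီဥတု သာယာလျက်ရှိပါသည်။")
print(label if label is not None else "low confidence - needs manual review")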
How to Get Started with the Model
You can easily use the model with the Hugging Face pipeline:
from transformers import pipeline
classifier = pipeline("text-classification", model="kalixlouiis/mBERT-Burmese-Register-Classifier")
# Example: a formal (written-style) sentence meaning "The weather is pleasant today."
text = "ယနေ့ ရာသီဥတု သာယာလျက်ရှိပါသည်။"
result = classifier(text)
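# result is a list of dicts: [{'label': <register label>, 'score': <confidence>}]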
print(result)
Training Details
Training Data
The model was trained on the Myanmar Written-Spoken Text Pairs dataset, consisting of 1,643 original parallel entries (3,286 total labeled rows after transformation). The data includes various sentence types, ranging from news-style formal reports to casual daily conversations.
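To inspect the data yourself, a short loading sketch; the full hub ID is an assumption (the card above only gives the short dataset name), so adjust it if your copy lives elsewhere:

from datasets import load_dataset

# Assumed hub ID: dataset hosted under the same account as the model
dataset = load_dataset("kalixlouiis/myanmar-written-spoken-text-pairs")
print(dataset)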
Training Procedure
Preprocessing
- Sub-word tokenization using the standard mBERT WordPiece tokenizer.
- Sentences were truncated to a maximum length of 128 tokens.
- Stratified Splitting: Data was split into 80% training and 20% validation sets while maintaining the ratio of written vs. spoken labels (see the sketch after this list).
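A minimal sketch of this preprocessing, with placeholder sentences and an illustrative label mapping standing in for the real dataset columns:

from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Placeholder data: 0 = written, 1 = spoken (mapping is illustrative)
texts = [
    "ယနေ့ ရာသီဥတု သာယာလျက်ရှိပါသည်။",  # written style
    "ဒီနေ့ ရာသီဥတု သာယာတယ်။",          # spoken style (illustrative)
] * 5
labels = [0, 1] * 5

# Stratified 80/20 split keeps the written/spoken ratio in both sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# WordPiece sub-word tokenization, truncated to 128 tokens
train_encodings = tokenizer(train_texts, truncation=True, padding="max_length", max_length=128)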
Training Hyperparameters
- Optimizer: AdamW
- Epochs: 3
- Batch Size: 16
- Learning Rate Schedule: Linear warmup
- Precision: fp32 (see the configuration sketch after this list)
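A sketch of how these settings map onto Hugging Face TrainingArguments; the learning rate and warmup ratio are assumed typical values, not figures reported in this card:

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

# Trainer uses AdamW and keeps fp32 by default (fp16 is simply left off)
args = TrainingArguments(
    output_dir="mbert-burmese-register",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,        # assumed; only "linear warmup" is reported
    learning_rate=5e-5,      # assumed default, not reported above
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()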
Evaluation
Testing Data, Factors & Metrics
Metrics
- Accuracy: Used to measure overall correctness.
- F1-Score: Used to ensure a balance between Precision and Recall, which is crucial for binary register classification (a short computation example follows).
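For reference, both metrics can be computed with scikit-learn; the label arrays below are illustrative, not the actual validation outputs:

from sklearn.metrics import accuracy_score, f1_score

# Illustrative validation labels and predictions (0 = written, 1 = spoken)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0]

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.8
print("f1:", f1_score(y_true, y_pred))              # 0.8 (= 2PR / (P + R))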
Results
- Validation Accuracy: ~100% (on the specific 20% validation split of the provided dataset).
- F1-Score: 1.0
Note: The perfect scores are reflective of the distinct grammatical markers present in the curated dataset. Real-world performance on noisy, unstructured data may vary.
Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Hours used: <1
- Cloud Provider: Google Colab
Technical Specifications
Model Architecture and Objective
The model uses the BertForSequenceClassification architecture with a custom classification head on top of the pre-trained bert-base-multilingual-cased backbone.
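A minimal sketch of exercising that architecture directly, without the pipeline wrapper (useful when you need the raw logits from the classification head):

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("kalixlouiis/mBERT-Burmese-Register-Classifier")
model = BertForSequenceClassification.from_pretrained("kalixlouiis/mBERT-Burmese-Register-Classifier")
model.eval()

inputs = tokenizer("ယနေ့ ရာသီဥတု သာယာလျက်ရှိပါသည်။", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2): one logit per register
predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])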
Model Card Contact
For questions or feedback regarding this model, please reach out via the Hugging Face Community tab.