mBERT-Burmese-Register-Classifier

This model is a fine-tuned version of multilingual BERT (mBERT) designed to classify the linguistic registers of the Myanmar (Burmese) language. Specifically, it distinguishes between Formal (Written) and Colloquial (Spoken) styles.

Model Details

Model Description

The mBERT-Burmese-Register-Classifier was developed to bridge the gap in Myanmar Natural Language Processing regarding style and register detection. In Burmese, the written and spoken forms differ significantly in grammar, particles, and sentence endings. This model identifies these nuances to help downstream applications ensure stylistic consistency.

  • Developed by: Khant Sint Heinn (Kalix Louis)
  • Model type: Transformer-based Text Classification
  • Language(s) (NLP): Burmese (my)
  • License: Apache 2.0
  • Finetuned from model: bert-base-multilingual-cased

Model Sources

Uses

Direct Use

  • Detecting formality levels in Burmese text.
  • Pre-processing data for Machine Translation or Style Transfer.
  • Assisting in grammar-checking tools for the Myanmar language.

Out-of-Scope Use

  • The model is not intended for deep sentiment analysis or dialect detection.
  • It should not be used as the sole authority for academic or legal document validation without human oversight.

Bias, Risks, and Limitations

While the model shows strong performance on the validation set, users should be aware of the following limitations:

  • Data Sparsity: The model was trained on a relatively small dataset (approx. 3,286 augmented rows from 1,643 parallel pairs). Consequently, it may struggle with highly diverse or niche vocabularies.
  • Short Texts: Very short sentences (e.g., 1-2 words) that lack distinctive grammatical markers (like "သည်" or "တယ်") may lead to inaccurate classifications.
  • Neutral Registers: Phrases that are valid in both spoken and written contexts (Natural/Neutral text) may cause the model to provide low-confidence or fluctuating predictions.
  • Hybrid Styles: In modern social media contexts where formal and spoken styles are often blended, the model might not accurately capture the intended register.

Recommendations

Users are encouraged to treat this model as a diagnostic tool rather than an absolute classifier. For critical applications, implement a confidence-threshold check (e.g., accept only predictions with probability > 0.95).
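One way to apply such a threshold is to gate the pipeline's output before trusting it. A minimal sketch (the label names and scores in the comments are illustrative, not taken from the model config):

```python
# Gate pipeline predictions behind a confidence threshold before trusting them.
CONFIDENCE_THRESHOLD = 0.95

def accept_prediction(result: dict, threshold: float = CONFIDENCE_THRESHOLD):
    """Return the label when its score clears the threshold, else None (defer to review)."""
    return result["label"] if result["score"] >= threshold else None

# The pipeline call shown in the quickstart returns a list like
# [{"label": "...", "score": 0.99}]; pass its first element here.
```

Predictions that return None can be routed to human review rather than consumed automatically.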

How to Get Started with the Model

You can easily use the model with the Hugging Face pipeline:

from transformers import pipeline

classifier = pipeline("text-classification", model="kalixlouiis/mBERT-Burmese-Register-Classifier")

# Example usage
text = "ယနေ့ ရာသီဥတု သာယာလျက်ရှိပါသည်။"
result = classifier(text)
print(result)

Training Details

Training Data

The model was trained on the Myanmar Written-Spoken Text Pairs dataset, consisting of 1,643 original parallel entries (3,286 total labeled rows after transformation). The data includes various sentence types, ranging from news-style formal reports to casual daily conversations.

Training Procedure

Preprocessing

  • Sub-word tokenization using the standard mBERT WordPiece tokenizer.
  • Sentences were truncated to a maximum length of 128 tokens.
  • Stratified Splitting: Data was split into 80% training and 20% validation sets while maintaining the ratio of written vs. spoken labels.
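The stratified split preserves the written/spoken label ratio in both partitions. In practice this is a one-liner with `sklearn.model_selection.train_test_split(..., stratify=labels)`; the underlying logic, sketched with only the standard library, looks roughly like this:

```python
import random
from collections import defaultdict

def stratified_split(rows, val_ratio=0.2, seed=42):
    """Split (text, label) rows into train/val sets, preserving per-label ratios."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)  # group rows by their label
    rng = random.Random(seed)
    train, val = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_val = round(len(group) * val_ratio)  # 20% of each label goes to validation
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Because each label group is split independently, the validation set keeps the same written-to-spoken proportion as the full dataset.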

Training Hyperparameters

  • Optimizer: AdamW
  • Epochs: 3
  • Batch Size: 16
  • Learning Rate Schedule: Linear warmup
  • Precision: fp32
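With the Hugging Face `Trainer` API, the hyperparameters above map onto a `TrainingArguments` configuration roughly as follows. Values the card does not state (`learning_rate`, `warmup_ratio`) are placeholders, not the actual training settings:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mbert-burmese-register",
    num_train_epochs=3,              # Epochs: 3
    per_device_train_batch_size=16,  # Batch Size: 16
    learning_rate=2e-5,              # assumed; not stated on the card
    lr_scheduler_type="linear",      # linear decay after warmup
    warmup_ratio=0.1,                # linear warmup (ratio assumed)
    fp16=False,                      # Precision: fp32
)
```

`Trainer` uses AdamW by default, matching the optimizer listed above.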

Evaluation

Testing Data, Factors & Metrics

Metrics

  • Accuracy: Used to measure overall correctness.
  • F1-Score: Used to ensure a balance between Precision and Recall, which is crucial for binary register classification.
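Both metrics are typically computed with `sklearn.metrics.accuracy_score` and `f1_score`; for reference, a self-contained sketch of the binary case (the positive label is whichever class you designate, e.g. "written"):

```python
def accuracy_and_f1(y_true, y_pred, positive_label=1):
    """Compute accuracy and binary F1 for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive_label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive_label and t != p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive_label and p != t)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1
```

Unlike accuracy alone, F1 penalizes a classifier that over-predicts one register, which is why both are reported.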

Results

  • Validation Accuracy: ~100% (on the specific 20% validation split of the provided dataset).
  • F1-Score: 1.0

Note: The perfect scores are reflective of the distinct grammatical markers present in the curated dataset. Real-world performance on noisy, unstructured data may vary.

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU
  • Hours used: < 1 hour
  • Cloud Provider: Google Colab

Technical Specifications

Model Architecture and Objective

The model uses the BertForSequenceClassification architecture, which places a linear classification head on top of the pooled output of the pre-trained bert-base-multilingual-cased backbone.

Model Card Contact

For questions or feedback regarding this model, please reach out via the Hugging Face Community tab.
