mBERT-Burmese-Register-Classifier

This model is a fine-tuned version of multilingual BERT (mBERT) designed to classify the linguistic registers of the Myanmar (Burmese) language. Specifically, it distinguishes between Formal (Written) and Colloquial (Spoken) styles.

Model Details

Model Description

The mBERT-Burmese-Register-Classifier was developed to bridge the gap in Myanmar Natural Language Processing regarding style and register detection. In Burmese, the written and spoken forms differ significantly in grammar, particles, and sentence endings. This model identifies these nuances to help downstream applications ensure stylistic consistency.

  • Developed by: Khant Sint Heinn (Kalix Louis)
  • Model type: Transformer-based Text Classification
  • Language(s) (NLP): Burmese (my)
  • License: Apache 2.0
  • Finetuned from model: bert-base-multilingual-cased

Model Sources

Uses

Direct Use

  • Detecting formality levels in Burmese text.
  • Pre-processing data for Machine Translation or Style Transfer.
  • Assisting in grammar-checking tools for the Myanmar language.

Out-of-Scope Use

  • The model is not intended for deep sentiment analysis or dialect detection.
  • It should not be used as the sole authority for academic or legal document validation without human oversight.

Bias, Risks, and Limitations

While the model shows strong performance on the validation set, users should be aware of the following limitations:

  • Data Sparsity: The model was trained on a relatively small dataset (approx. 3,286 augmented rows from 1,643 parallel pairs). Consequently, it may struggle with highly diverse or niche vocabularies.
  • Short Texts: Very short sentences (e.g., 1-2 words) that lack distinctive grammatical markers (like "သည်" or "တယ်") may lead to inaccurate classifications.
  • Neutral Registers: Phrases that are valid in both spoken and written contexts (Natural/Neutral text) may cause the model to provide low-confidence or fluctuating predictions.
  • Hybrid Styles: In modern social media contexts where formal and spoken styles are often blended, the model might not accurately capture the intended register.

Recommendations

Users are encouraged to treat this model as a diagnostic tool rather than an absolute classifier. For critical applications, implement a confidence-threshold check (e.g., accept only predictions with probability > 0.95).
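One way to apply such a threshold is to gate the pipeline's output before trusting it. A minimal sketch (the label names and scores in the comments are illustrative, not taken from the model config):

```python
# Gate pipeline predictions behind a confidence threshold before trusting them.
CONFIDENCE_THRESHOLD = 0.95

def accept_prediction(result: dict, threshold: float = CONFIDENCE_THRESHOLD):
    """Return the label when its score clears the threshold, else None (defer to review)."""
    return result["label"] if result["score"] >= threshold else None

# The pipeline call shown in the quickstart returns a list like
# [{"label": "...", "score": 0.99}]; pass its first element here.
```

Predictions that return None can be routed to human review rather than consumed automatically.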

How to Get Started with the Model

You can easily use the model with the Hugging Face pipeline:

from transformers import pipeline

classifier = pipeline("text-classification", model="kalixlouiis/mBERT-Burmese-Register-Classifier")

# Example usage
text = "ယနေ့ ရာသီဥတု သာယာလျက်ရှိပါသည်။"
result = classifier(text)
print(result)

Training Details

Training Data

The model was trained on the Myanmar Written-Spoken Text Pairs dataset, consisting of 1,643 original parallel entries (3,286 total labeled rows after transformation). The data includes various sentence types, ranging from news-style formal reports to casual daily conversations.

Training Procedure

Preprocessing

  • Sub-word tokenization using the standard mBERT WordPiece tokenizer.
  • Sentences were truncated to a maximum length of 128 tokens.
  • Stratified Splitting: Data was split into 80% training and 20% validation sets while maintaining the ratio of written vs. spoken labels.
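The stratified split preserves the written/spoken label ratio in both partitions. In practice this is a one-liner with `sklearn.model_selection.train_test_split(..., stratify=labels)`; the underlying logic, sketched with only the standard library, looks roughly like this:

```python
import random
from collections import defaultdict

def stratified_split(rows, val_ratio=0.2, seed=42):
    """Split (text, label) rows into train/val sets, preserving per-label ratios."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)  # group rows by their label
    rng = random.Random(seed)
    train, val = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_val = round(len(group) * val_ratio)  # 20% of each label goes to validation
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Because each label group is split independently, the validation set keeps the same written-to-spoken proportion as the full dataset.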

Training Hyperparameters

  • Optimizer: AdamW
  • Epochs: 3
  • Batch Size: 16
  • Learning Rate Schedule: Linear warmup
  • Precision: fp32
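With the Hugging Face `Trainer` API, the hyperparameters above map onto a `TrainingArguments` configuration roughly as follows. Values the card does not state (`learning_rate`, `warmup_ratio`) are placeholders, not the actual training settings:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mbert-burmese-register",
    num_train_epochs=3,              # Epochs: 3
    per_device_train_batch_size=16,  # Batch Size: 16
    learning_rate=2e-5,              # assumed; not stated on the card
    lr_scheduler_type="linear",      # linear decay after warmup
    warmup_ratio=0.1,                # linear warmup (ratio assumed)
    fp16=False,                      # Precision: fp32
)
```

`Trainer` uses AdamW by default, matching the optimizer listed above.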

Evaluation

Testing Data, Factors & Metrics

Metrics

  • Accuracy: Used to measure overall correctness.
  • F1-Score: Used to ensure a balance between Precision and Recall, which is crucial for binary register classification.
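Both metrics are typically computed with `sklearn.metrics.accuracy_score` and `f1_score`; for reference, a self-contained sketch of the binary case (the positive label is whichever class you designate, e.g. "written"):

```python
def accuracy_and_f1(y_true, y_pred, positive_label=1):
    """Compute accuracy and binary F1 for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive_label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive_label and t != p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive_label and p != t)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1
```

Unlike accuracy alone, F1 penalizes a classifier that over-predicts one register, which is why both are reported.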

Results

  • Validation Accuracy: ~100% (on the specific 20% validation split of the provided dataset).
  • F1-Score: 1.0

Note: The perfect scores are reflective of the distinct grammatical markers present in the curated dataset. Real-world performance on noisy, unstructured data may vary.

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU
  • Hours used: < 1 hour
  • Cloud Provider: Google Colab

Technical Specifications

Model Architecture and Objective

The model uses the BertForSequenceClassification architecture, which places a linear classification head on top of the pooled output of the pre-trained bert-base-multilingual-cased backbone.

Model Card Contact

For questions or feedback regarding this model, please reach out via the Hugging Face Community tab.
