BERT Based Model for Punctuation and Capitalization Restoration
Features:
Uses the Hugging Face Transformers library for the base transformer architecture.
PyTorch Lightning is used for training and checkpointing.
Config-based model description for easy experimentation and research.
Can be exported as a PyTorch quantized model for faster inference on CPU.
Includes helper functions for data preparation, text normalization, and offline sentence augmentation specific to punctuation and capitalization restoration.
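As a rough illustration of the data preparation step, training examples for this task can be derived from punctuated text by stripping punctuation and case from the words while recording them as labels. The label scheme below (a per-word punctuation tag and a per-word capitalization tag) is an assumption for illustration, not the repo's actual format:

```python
def make_example(sentence):
    """Turn a punctuated sentence into (plain_text, punct_labels, capit_labels).

    Assumed label scheme (illustrative only): the punctuation label is the
    mark that follows a word, or "O" for none; the capitalization label is
    "U" if the word starts uppercase, else "O".
    """
    words, punct_labels, capit_labels = [], [], []
    for token in sentence.split():
        # Record trailing punctuation (if any) as the word's punctuation label
        punct_labels.append(token[-1] if token[-1] in ".,?!" else "O")
        word = token.rstrip(".,?!")
        # Record capitalization before lowercasing the word
        capit_labels.append("U" if word[:1].isupper() else "O")
        words.append(word.lower())
    return " ".join(words), punct_labels, capit_labels

text, punct, capit = make_example("How are you?")
# text  -> "how are you"
# punct -> ["O", "O", "?"]
# capit -> ["U", "O", "O"]
```

The model then learns to predict these two label sequences from the stripped text.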
Quick guide:
# Install requirements:
pip install -r requirements.txt
# Download the raw English text corpus from Tatoeba
bash download_tatoeba_en_sent.sh
# Preprocess raw text data. Check config file for more details
python preprocess_raw_text_data.py --config="example_configs/preprocess_config_en.yaml"

# Merge multiple data files into one, apply sentence augmentation, and tokenize. Check config file for more details
python merge_and_tokenize_datasets.py --config="example_configs/model_config_en.yaml"

# Train the punctuation and capitalization model. Check config file for more details
python train_punct_and_capit_model.py --config="example_configs/model_config_en.yaml"
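A model config of this kind typically names the pretrained backbone, the dataset paths, and the training hyperparameters. The fragment below is an illustrative sketch only; the actual schema and keys are defined in example_configs/model_config_en.yaml:

```yaml
# Illustrative sketch -- not the repo's actual schema
pretrained_model: bert-base-uncased
datasets:
  train: data/train_tokenized.pkl
  val: data/val_tokenized.pkl
trainer:
  max_epochs: 3
  gpus: 1
```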
For inference:
from transformer_punct_and_capit.models import TransformerPunctAndCapitModel

model_path = "experiments/model.pcm"  # pcm checkpoint path
model = TransformerPunctAndCapitModel.restore_model(model_path, device='cuda')
model.predict("how are you")  # Single example
# Output: ["How are you?"]
model.predict_batch(["how are you"], batch_size=64, show_pbar=True)  # Batch example
# Output: ["How are you?"]
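Conceptually, restoration amounts to applying the model's per-word punctuation and capitalization predictions back onto the lowercase input. A minimal sketch of that final step, assuming the same illustrative label format as above (not the repo's internal API):

```python
def apply_labels(text, punct_labels, capit_labels):
    """Rebuild punctuated, capitalized text from per-word labels.

    Assumed label format (illustrative only): punctuation labels give the
    mark to append after a word ("O" for none); capitalization labels are
    "U" (capitalize the first letter) or "O" (leave as-is).
    """
    out = []
    for word, p, c in zip(text.split(), punct_labels, capit_labels):
        if c == "U":
            word = word.capitalize()
        if p != "O":
            word += p
        out.append(word)
    return " ".join(out)

apply_labels("how are you", ["O", "O", "?"], ["U", "O", "O"])
# -> "How are you?"
```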