BERT Based Model for Punctuation and Capitalization Restoration
Features:
Uses the Hugging Face Transformers library for the base transformer architecture.
PyTorch Lightning is used for training and checkpointing.
Config-based model description for easy experimentation and research.
Can be exported as a PyTorch quantized model for faster inference on CPU.
Includes helper functions for data preparation, text normalization, and offline sentence augmentation specific to punctuation and capitalization restoration.
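As a rough illustration of the data preparation step, training examples for this task can be derived from punctuated text by stripping punctuation and case from the words while recording them as labels. The label scheme below (a per-word punctuation tag and a per-word capitalization tag) is an assumption for illustration, not the repo's actual format:

```python
def make_example(sentence):
    """Turn a punctuated sentence into (plain_text, punct_labels, capit_labels).

    Assumed label scheme (illustrative only): the punctuation label is the
    mark that follows a word, or "O" for none; the capitalization label is
    "U" if the word starts uppercase, else "O".
    """
    words, punct_labels, capit_labels = [], [], []
    for token in sentence.split():
        # Record trailing punctuation (if any) as the word's punctuation label
        punct_labels.append(token[-1] if token[-1] in ".,?!" else "O")
        word = token.rstrip(".,?!")
        # Record capitalization before lowercasing the word
        capit_labels.append("U" if word[:1].isupper() else "O")
        words.append(word.lower())
    return " ".join(words), punct_labels, capit_labels

text, punct, capit = make_example("How are you?")
# text  -> "how are you"
# punct -> ["O", "O", "?"]
# capit -> ["U", "O", "O"]
```

The model then learns to predict these two label sequences from the stripped text.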
Quick guide:
# Install requirements:
pip install -r requirements.txt
# Download the raw English text corpus from Tatoeba
bash download_tatoeba_en_sent.sh
# Preprocess raw text data. Check config file for more details
python preprocess_raw_text_data.py --config="example_configs/preprocess_config_en.yaml"

# Merge multiple data files into one, apply sentence augmentation, and tokenize. Check config file for more details
python merge_and_tokenize_datasets.py --config="example_configs/model_config_en.yaml"

# Train the punctuation and capitalization model. Check config file for more details
python train_punct_and_capit_model.py --config="example_configs/model_config_en.yaml"
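A model config of this kind typically names the pretrained backbone, the dataset paths, and the training hyperparameters. The fragment below is an illustrative sketch only; the actual schema and keys are defined in example_configs/model_config_en.yaml:

```yaml
# Illustrative sketch -- not the repo's actual schema
pretrained_model: bert-base-uncased
datasets:
  train: data/train_tokenized.pkl
  val: data/val_tokenized.pkl
trainer:
  max_epochs: 3
  gpus: 1
```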
For inference:
from transformer_punct_and_capit.models import TransformerPunctAndCapitModel

model_path = "experiments/model.pcm"  # pcm checkpoint path
model = TransformerPunctAndCapitModel.restore_model(model_path, device='cuda')
model.predict("how are you")  # Single example
# Output: ["How are you?"]
model.predict_batch(["how are you"], batch_size=64, show_pbar=True)  # Batch example
# Output: ["How are you?"]
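Conceptually, restoration amounts to applying the model's per-word punctuation and capitalization predictions back onto the lowercase input. A minimal sketch of that final step, assuming the same illustrative label format as above (not the repo's internal API):

```python
def apply_labels(text, punct_labels, capit_labels):
    """Rebuild punctuated, capitalized text from per-word labels.

    Assumed label format (illustrative only): punctuation labels give the
    mark to append after a word ("O" for none); capitalization labels are
    "U" (capitalize the first letter) or "O" (leave as-is).
    """
    out = []
    for word, p, c in zip(text.split(), punct_labels, capit_labels):
        if c == "U":
            word = word.capitalize()
        if p != "O":
            word += p
        out.append(word)
    return " ".join(out)

apply_labels("how are you", ["O", "O", "?"], ["U", "O", "O"])
# -> "How are you?"
```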