Model Training
Prodigy provides tools to make it easy to train machine learning pipelines on top of your annotations. In particular, it ships many utilities for spaCy, but the convenient JSONL output also makes it straightforward to use any other machine learning tool.
Iterate faster with flexible training
With Prodigy, you can easily train spaCy models on your annotated data by using the train recipe. This command can even train a pipeline for multiple use cases at once. Need a pipeline that does both text classification and named entity recognition? Simply use the corresponding flags and point them to the annotations. You can even fetch data from multiple datasets, and Prodigy will ensure that spaCy is aware of all the labels.
Prodigy will automatically split your annotations into a training and an evaluation set, but you can also specify the split yourself. Rather than starting from scratch, Prodigy lets you re-use pre-existing spaCy pipelines by passing a base model as a starting point. Finally, you can also speed up training by running it on a GPU.
Example
prodigy train output_dir --ner fashion_train,eval:fashion_eval --textcat text_clf_data1,text_clf_data2 --base-model en_core_web_md --gpu-id 1

======================== Generating Prodigy config ========================
ℹ Auto-generating config with spaCy
✔ Generated training config
========================== Initializing pipeline ==========================
Components: ner, textcat_multilabel, spancat
Merging training and evaluation data for 2 components
- [ner] Training: 685 | Evaluation: 300 (from datasets)
- [textcat] Training: 552 | Evaluation: 138 (20% split)
============================ Training pipeline ============================
ℹ Pipeline: ['tok2vec', 'ner', 'textcat_multilabel']
ℹ Initial learn rate: 0.001
...
Example
prodigy train-curve --ner news_headlines --show-plot

Training 4 times with 25%, 50%, 75%, 100% of the data

   %      Score    ner
----     ------   ------
   0%     0.00     0.00
  25%     0.31 ▲   0.31 ▲
  50%     0.44 ▲   0.44 ▲
  75%     0.43 ▼   0.43 ▼
 100%     0.56 ▲   0.56 ▲

[ASCII plot of score against share of training data]

✔ Accuracy improved in the last sample
Learn from training curves
Prodigy is designed to make it easy to iterate on both your data and your model. Your trained models can be re-used inside Prodigy for active learning, but you can also re-use your data to understand the behavior of your model. To help with this, Prodigy provides utilities that tell you when it's worth adding more data.
If you want to understand the effect of adding more training data to your model, you can try the train-curve recipe. It trains a model on different portions of the training examples and prints the accuracy figures along with the improvement at each step, so you can see how the model improves as more data is added.
This recipe is great in the early stages of a project, when you're trying to determine the quality of the collected annotations and whether more training examples will improve the accuracy. As a rule of thumb, if accuracy improves in the last segment, collecting more annotations of the same type is likely to improve the model further.
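The idea behind the recipe can be sketched in a few lines of Python. This is not Prodigy's implementation, just an illustration of the cumulative-slicing strategy: train on growing portions of the data and compare the scores.

```python
def data_portions(examples, steps=(0.25, 0.5, 0.75, 1.0)):
    """Yield cumulative slices of the training examples, as train-curve does."""
    for fraction in steps:
        n = int(len(examples) * fraction)
        yield fraction, examples[:n]

# Usage with dummy data; in practice you would train a model on each
# slice and evaluate it against the same fixed held-out set.
examples = list(range(100))
for fraction, portion in data_portions(examples):
    print(f"{fraction:.0%}: {len(portion)} examples")
```

If the score keeps climbing between the 75% and 100% slices, the curve hasn't flattened yet and more annotations will likely help.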
Customize with spaCy
The train recipe is great for getting started, but once you're ready to fully customize your pipeline, there are also utilities for exporting to spaCy so you can fine-tune the model there.
The spacy-config recipe generates a configuration file for spaCy that can read the data straight out of Prodigy.
Alternatively, if you want even more control, the data-to-spacy recipe exports the annotations from Prodigy into binary files that spaCy can use to train models. It also generates an appropriate spaCy config file for you to adapt.
Example
prodigy data-to-spacy ./corpus --ner news_ner_person,news_ner_org,news_ner_product --textcat news_cats2018,news_cats2019 --eval-split 0.3

ℹ Using language 'en'
============================= Generating data =============================
✔ Saved 1223 training examples
./corpus/train.spacy
✔ Saved 530 evaluation examples
./corpus/dev.spacy
============================ Generating config ============================
ℹ Auto-generating config with spaCy
✔ Generated training config
============================ Finalizing export ============================
✔ Saved training config
./corpus/config.cfg

To use this data for training with spaCy, you can run:

python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
Example for NER
prodigy db-out ner_brands

{
  "text": "Apple updates its analytics service with new metrics",
  "spans": [{"start": 0, "end": 5, "label": "ORG"}]
}
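Because the spans are stored as character offsets into the original text, recovering the annotated substring takes only a slice:

```python
# One record as exported by db-out (shown above)
record = {
    "text": "Apple updates its analytics service with new metrics",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

# Character offsets index directly into the text
for span in record["spans"]:
    entity = record["text"][span["start"]:span["end"]]
    print(entity, span["label"])  # → Apple ORG
```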
Example for Computer Vision
prodigydb-outskateboard_images{ "x": 47.5, "y": 171.4, "width": 109.1, "height": 67.4, "points": [ [47.5, 171.4], [47.5, 238.8], [156.6, 238.8], [156.6, 171.4] ], "center": [102.05, 205.1], "label": "SKATEBOARD" }
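From these fields you can derive whatever box convention your detection library expects. A minimal sketch, using only the record shown above, converting Prodigy's (x, y, width, height) box to the (x_min, y_min, x_max, y_max) corner format many libraries use:

```python
# Bounding box fields from the db-out record above
record = {"x": 47.5, "y": 171.4, "width": 109.1, "height": 67.4,
          "label": "SKATEBOARD"}

# Convert top-left + size to corner coordinates
x_min = record["x"]
y_min = record["y"]
x_max = record["x"] + record["width"]
y_max = record["y"] + record["height"]
print([x_min, y_min, x_max, y_max], record["label"])
```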
Export to other machine learning libraries
Prodigy has great support for spaCy, but nothing is stopping you from using any other machine learning framework.
All data annotated in Prodigy can be exported via the db-out command. Prodigy provides annotations in JSON Lines (JSONL) format, which makes it easy to share nested annotations. JSONL is also easy to parse with Python, so you can plug the data into your favourite machine learning pipeline. Want to build something custom with scikit-learn? Go for it! Want to use a third-party API instead? Totally up to you!
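For example, a few lines of standard-library Python are enough to turn an exported JSONL file into (text, label) pairs for another library. This is a sketch: the file path is hypothetical, and the "label" and "answer" fields assume a simple binary text classification export.

```python
import json

def load_annotations(path):
    """Load accepted text classification annotations from a JSONL export."""
    texts, labels = [], []
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            # Keep only the examples the annotator accepted
            if eg.get("answer") == "accept":
                texts.append(eg["text"])
                labels.append(eg["label"])
    return texts, labels

# Usage (file name is hypothetical):
# texts, labels = load_annotations("annotations.jsonl")
# These lists can then be fed to scikit-learn, PyTorch, or an external API.
```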
You can also leverage the Python API to fetch annotations directly from the Prodigy database. For more information on this, check the documentation.
Read more