Train Box Metadata Extraction Scanner
This guide covers the process of training a Hydra Scanner to extract metadata from documents stored in Box.
Prerequisites: Create a Box metadata extraction Scanner following this guide.
1. Identify a set of training documents. Refer to Choosing your Training Data Set for more details.
2. Upload training documents to a Box folder monitored by your Box Skill for Hydra.
Each Hydra Scanner has a corresponding custom Box Skill, which monitors a list of Box folders and passes any documents added to those folders to Hydra for processing. Adding the training documents to a Box folder monitored by your Box Skill makes them available to Hydra for training.
3. Open the Hydra trainer UI
Go to the scanners list page in the Hydra web app, find your data extraction scanner, and click its TRAIN button.
Navigating the training user interface
- Labels: Each label represents a specific type of information you wish to extract.
- Tag All Similar Selections: A checkbox that, when enabled, applies the label to all words or phrases that match the highlighted selection.
- Rules editor: A link to the Rules Editor UI where you can define specific patterns associated with your labels. See here for more details.
- Content canvas: A scrollable content area to review and label the training document.
- Pagination: Each page represents a document you uploaded to the Box folder.
- Activate Scanner button: Activates the scanner once you have finished training it.
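To illustrate what the Tag All Similar Selections option does conceptually, the following Python sketch finds every occurrence of a highlighted selection in a document's text. It assumes exact, case-sensitive matching. How Hydra actually matches selections is not described in this guide, and the function name `find_all_matches` is hypothetical.

```python
import re


def find_all_matches(text, selection):
    """Return (start, end) character spans of every occurrence of `selection`.

    Assumption: matching is exact and case-sensitive. This is an
    illustration only, not Hydra's documented behavior.
    """
    if not selection:
        return []
    # re.escape treats the selection as literal text, not as a pattern.
    pattern = re.escape(selection)
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


document_text = "Invoice 123 and Invoice 456"
spans = find_all_matches(document_text, "Invoice")
print(spans)
```

Each returned span is a candidate that would receive the same label as the original selection, which is why it is worth reviewing the tagged results before moving on to the next document.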
4. Train the scanner for data extraction
The purpose of training is to teach your scanner how to recognize the specific information you intend to extract. You do so by repeatedly showing the scanner the information you’re looking for in the training documents. With enough training documents, the scanner can recognize the patterns and automatically extract the information from new documents.
Use your cursor to select/highlight the specific information you’re looking to extract.
Note: Highlight only the information you wish to extract. Inconsistent highlighting introduces unnecessary variation that can degrade the scanner's performance.
Highlighted information in the training document
Click on the corresponding label to tag the information with the label.
Repeat the process for all labels (if applicable). Double-click on any label to remove it.
Example Document with two tagged labels
Repeat the training for all training documents.
Each page is one document
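The note about consistent highlighting in step 4 can be illustrated with a small Python sketch. It assumes you keep your own list of (label, selected text) pairs while labeling, and it flags selections that start or end with whitespace or common separator punctuation, one typical source of inconsistency. The function name, the data layout, and the choice of characters are all hypothetical. Hydra does not expose this check, and values that legitimately end in such characters would be flagged too.

```python
import string

# Assumption: stray whitespace and these separators at the edges of a
# selection usually indicate an imprecise highlight.
EDGE_CHARS = string.whitespace + ",.;:"


def find_inconsistent_selections(selections):
    """Return the (label, text) pairs whose text has stray edge characters."""
    flagged = []
    for label, text in selections:
        if text != text.strip(EDGE_CHARS):
            flagged.append((label, text))
    return flagged


selections = [
    ("Invoice Number", "INV-1001"),
    ("Invoice Number", "INV-1002 "),  # trailing space
    ("Total", "1,250.00,"),           # trailing comma
]
for label, text in find_inconsistent_selections(selections):
    print(f"Check the highlight for {label!r}: {text!r}")
```

A check like this only catches edge characters. It cannot tell whether two highlights for the same label cover the same kind of information, so a manual review of each document is still needed.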
5. Activate the Scanner
Click the ACTIVATE SCANNER button at the bottom right of the screen. Once you activate the scanner, Hydra creates a dedicated A.I. model for your data extraction. This process takes time. The scanner goes live once the model has been created.
Once the Scanner is live, you should receive an email from email@example.com with the subject line “Scanner is live”.