How To Choose A Training Data Set For A Document Scanner
Your set of training documents should be a fair representation of the documents you plan to apply the Skill to. The training documents should reflect the same diverse variations that you come across in the real business case.
How many training documents do you need?
For a Skill task with minimal data variation and document complexity, we recommend a minimum of 12 training documents per label (insight/metadata). If the Skill has multiple labels, you need to increase the number of training documents to make sure each label is adequately represented. As you increase the data variations and document complexity, you also need to increase the number of training documents.
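One way to check the per-label minimum before training is to count how many documents carry each label. The following is a minimal sketch, not part of the product: the function name, the `doc_labels` mapping (file name to list of labels) and the label names are hypothetical, and it assumes you keep your own record of which labels appear in which training document.

```python
from collections import Counter

# Minimum recommended above for tasks with little variation; raise it for complex data.
MIN_DOCS_PER_LABEL = 12

def underrepresented_labels(doc_labels, expected_labels, minimum=MIN_DOCS_PER_LABEL):
    """Return {label: shortfall} for labels found in fewer than `minimum` documents."""
    counts = Counter()
    for labels in doc_labels.values():
        counts.update(set(labels))  # count each label at most once per document
    return {
        label: minimum - counts[label]
        for label in expected_labels
        if counts[label] < minimum
    }

# Hypothetical training set: 12 documents, all with a date, only one with a vendor name.
docs = {f"doc_{i}.pdf": ["invoice_date"] for i in range(12)}
docs["doc_0.pdf"].append("vendor_name")
shortfall = underrepresented_labels(docs, ["invoice_date", "vendor_name"])
print(shortfall)
```

Here `invoice_date` meets the minimum, while `vendor_name` is reported as needing 11 more documents.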
Note: If possible, randomize your document selection process to avoid introducing human bias into the Skill model.
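A simple way to randomize the selection is to draw the training documents from the full pool with a random sampler instead of picking them by hand. The sketch below uses Python's standard `random` module; the function name and file names are hypothetical, and the optional seed is only there so that a selection can be reproduced later.

```python
import random

def pick_training_documents(candidates, k, seed=None):
    """Draw k documents at random, without replacement, from the candidate pool."""
    rng = random.Random(seed)
    # Sorting first makes the result depend only on the seed, not on input order.
    return rng.sample(sorted(candidates), k)

# Hypothetical pool of 200 candidate files, from which 24 are drawn.
pool = [f"policy_{i:03d}.pdf" for i in range(200)]
selected = pick_training_documents(pool, k=24, seed=7)
```

If the pool contains distinct document types, drawing a separate sample from each type may help keep every type represented.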
Identifying variations and complexity for data extraction
Be aware that the scanner may perceive variations differently from a human reader.
The scanner MAY consider the following as variations:
Date formats
- 4 / 12 / 12
- April 12, 2012
- Apr 12th, 12
- 12th Apr 2012
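To a human reader, the four strings above are plausibly the same date, yet each one has a different surface form. The sketch below illustrates this by normalizing them with Python's standard `datetime.strptime`. It is only an illustration, not how the scanner works: the format list and function name are hypothetical, month-first order is assumed for "4 / 12 / 12", and English month names are assumed for the locale.

```python
import re
from datetime import datetime

# Hypothetical formats to try, in order. "%m / %d / %y" assumes month-first dates.
FORMATS = ["%m / %d / %y", "%B %d, %Y", "%b %d, %y", "%d %b %Y"]

def normalize_date(text):
    """Return a datetime.date for a known format, or None if nothing matches."""
    # Drop ordinal suffixes such as "12th" so that %d can parse the day.
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None

variants = ["4 / 12 / 12", "April 12, 2012", "Apr 12th, 12", "12th Apr 2012"]
parsed = {normalize_date(v) for v in variants}
```

Under these assumptions, all four variants collapse to a single date, which is why each format should appear in the training set if it occurs in your real documents.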
Number of words and the spaces in between
- Eric Schmidt
- Eric E. Schmidt
- Eric Emerson Schmidt
- Important !
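To see which word-count shapes your training values already cover, you can profile them before training. This is a minimal sketch with a hypothetical helper name; it splits on whitespace only, so "Important !" counts as two tokens while "Important!" would count as one.

```python
from collections import Counter

def word_count_profile(values):
    """Count how many values have each number of whitespace-separated tokens."""
    return Counter(len(value.split()) for value in values)

values = ["Eric Schmidt", "Eric E. Schmidt", "Eric Emerson Schmidt", "Important !"]
profile = word_count_profile(values)
```

A shape that occurs in real documents but is missing from the profile suggests that more training documents are needed for that variation.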
Identifying variations and complexity for insight discovery
This refers to the number of different ways and patterns in which an insight can be signaled.
A complex insight can be expressed in many different ways and patterns. The key is to train the scanner to look out for a variety of phrase patterns signaling the insight. This requires a thorough understanding of the insight and business use case.
For example, let's break down the key indicators for the COPPA inapplicable insight as follows:
- Indicator 01: Service is not directed to children under 13
- Indicator 02: Do not knowingly collect personal information from children under 13
- Indicator 03: Delete data from children under 13 once the company is made aware of it
Note: Look for the shortest possible phrase(s) signaling the insight. Avoid irrelevant phrases between two insights.
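Before training, it can help to check roughly how often each indicator appears across the candidate documents. The sketch below does this with simple regular expressions. It is only a pre-check and not how the scanner detects insights; the patterns, names and sample sentences are hypothetical illustrations and would need tuning against real privacy policies.

```python
import re
from collections import Counter

# Hypothetical patterns, one per indicator listed above.
INDICATOR_PATTERNS = {
    "indicator_01": re.compile(r"not\s+(?:directed|intended)\s+(?:to|at|for)\s+children", re.I),
    "indicator_02": re.compile(r"not\s+knowingly\s+collect", re.I),
    "indicator_03": re.compile(r"\bdelete\s+(?:such|that|the|this)\s+(?:information|data)", re.I),
}

def indicators_found(text):
    """Return the set of indicator names whose pattern occurs in the text."""
    return {name for name, pattern in INDICATOR_PATTERNS.items() if pattern.search(text)}

def indicator_coverage(documents):
    """Count, per indicator, how many documents contain it (zero counts are kept)."""
    coverage = Counter({name: 0 for name in INDICATOR_PATTERNS})
    for text in documents:
        coverage.update(indicators_found(text))
    return coverage

# Made-up sample sentences, for illustration only.
sample_a = ("Our service is not directed to children under 13. "
            "We do not knowingly collect personal information from children under 13.")
sample_b = ("If we learn that a child under 13 has provided personal information, "
            "we will delete such information.")
coverage = indicator_coverage([sample_a, sample_b])
```

An indicator with a low count in the coverage result is a hint that the training set may need more documents expressing that indicator.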
Based on the insight indicators above, we may find the following insight variations in our training documents (privacy policies):
Complex Insight Example 01: COPPA inapplicable insight
Complex Insight Example 02: COPPA inapplicable insight
Complex Insight Example 03: COPPA inapplicable insight
Refer to Train your Box Skill Scanner to Extract Info from Documents to find out how to train your scanner to extract data.
Refer to Train your Box Skill Scanner to Detect Insights from Documents to find out how to train your scanner to discover insights.