How To Choose A Training Data Set For A Document Scanner

Your training documents should fairly represent the documents you plan to apply the Skill to. They should show the same range of variation that you encounter in the real business case.

How many training documents do you need?

For a Skill task with minimal data variation and document complexity, we recommend a minimum of 12 training documents per label (insight/metadata).  If the Skill has multiple labels, you need to increase the number of training documents to make sure each label is adequately represented. As you increase the data variations and document complexity, you also need to increase the number of training documents.
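As a rough way to check the per-label recommendation above before you start training, the sketch below counts how many documents contain each label and reports the labels that fall short. The file names, label names, and the `underrepresented_labels` helper are hypothetical and only illustrate the bookkeeping; they are not part of the product.

```python
from collections import Counter

MIN_DOCS_PER_LABEL = 12  # the minimum recommended above for simple tasks


def underrepresented_labels(doc_labels, minimum=MIN_DOCS_PER_LABEL):
    """Return {label: count} for labels found in fewer than `minimum` documents.

    `doc_labels` maps a document name to the set of labels it contains.
    """
    counts = Counter()
    for labels in doc_labels.values():
        counts.update(set(labels))  # count each label once per document
    return {label: n for label, n in counts.items() if n < minimum}


# Hypothetical training set: 12 invoices with a date, only 5 with a PO number.
docs = {f"invoice_{i:02d}.pdf": {"invoice_date"} for i in range(12)}
for i in range(5):
    docs[f"invoice_{i:02d}.pdf"].add("po_number")

print(underrepresented_labels(docs))  # {'po_number': 5}
```

Here the training set meets the minimum for `invoice_date` but needs at least seven more documents that contain `po_number`.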

Note: If possible, randomize your document selection process to avoid introducing human bias into the Skill model.
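One simple way to randomize the selection is to sample uniformly from the full pool of candidate documents instead of picking files by hand. The sketch below assumes you have a list of candidate file names; the pool, the sample size, and the `pick_training_documents` helper are hypothetical.

```python
import random


def pick_training_documents(candidates, k, seed=None):
    """Return k documents chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(sorted(candidates), k)


# Hypothetical pool of 200 candidate files.
pool = [f"contract_{i:03d}.pdf" for i in range(200)]
training_set = pick_training_documents(pool, k=24, seed=7)

print(len(training_set), len(set(training_set)))  # 24 24
```

Recording the seed lets you reproduce the same selection later, for example when you need to explain how the training set was chosen.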

Identifying variations and complexity for data extraction

Be aware that the scanner may perceive variations differently than a human being would.

The scanner MAY consider the following as variation:

Date format

  • 4-12-12
  • 4/12/12
  • 4 / 12 / 12
  • April 12, 2012
  • Apr 12th, 12
  • 12th Apr 2012
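To see why these count as separate variations, it can help to reduce each value to its "shape". The sketch below is only an auditing aid and makes no claim about how the scanner works internally: the hypothetical `shape` helper collapses every run of digits to `9` and every run of letters to `A`, leaving spaces and punctuation untouched.

```python
import re


def shape(text):
    """Collapse digit runs to '9' and letter runs to 'A'; keep spaces and punctuation."""
    text = re.sub(r"\d+", "9", text)
    return re.sub(r"[A-Za-z]+", "A", text)


dates = [
    "4-12-12",
    "4/12/12",
    "4 / 12 / 12",
    "April 12, 2012",
    "Apr 12th, 12",
    "12th Apr 2012",
]

for d in dates:
    print(f"{d!r:18} -> {shape(d)!r}")

print(len({shape(d) for d in dates}))  # 6
```

All six dates produce a different shape, so a training set that contains only one of them leaves the other five formats unrepresented.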

Numbers

  • 60
  • 60.0
  • #60
  • (60)
  • Sixty
  • Sixty(60)
  • Sixty-one
  • Sixty-one(61)
  • (650)121-2121
  • 650-121-2121
  • 6501212121
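The three phone numbers at the end of this list show the gap between human and machine reading. A person sees one number written three ways, while a plain string comparison sees three different values. The sketch below illustrates this with a hypothetical `digits_only` helper; it is an illustration of the idea, not a description of the scanner.

```python
import re


def digits_only(text):
    """Strip every non-digit character from the text."""
    return re.sub(r"\D", "", text)


phones = ["(650)121-2121", "650-121-2121", "6501212121"]

print(len(set(phones)))                   # 3 distinct raw strings
print({digits_only(p) for p in phones})   # {'6501212121'}
```

Because the raw strings differ, each written form you expect in production should appear in your training documents.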

Number of words and the spaces between them

  • Eric Schmidt
  • Eric E. Schmidt
  • Eric Emerson Schmidt

Punctuation

Identifying variations and complexity for insight discovery

This refers to how many different ways and patterns can signal the insight.

Simple Insight: 

A simple insight can be expressed in as little as one word. For example, when a company is in compliance with GDPR, the word “GDPR” becomes a strong signal for compliance. The scanner can analyze the context around the word “GDPR” to understand whether a privacy policy claims to be in compliance with GDPR. In this scenario, consider using the “Tag All Similar Selections” feature in the training screen to label every occurrence of “GDPR” in the document.
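If you want to preview how often a single-word signal appears in a candidate document, and in what context, before you label it, a short script can list each occurrence. The sketch below is independent of the training screen and does not reproduce the “Tag All Similar Selections” feature; the `keyword_contexts` helper and the sample policy text are hypothetical.

```python
import re


def keyword_contexts(text, keyword, width=40):
    """Return each whole-word occurrence of `keyword` with surrounding context."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b")
    hits = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        hits.append(text[start:end])
    return hits


# Hypothetical excerpt from a privacy policy.
policy = (
    "We comply with the GDPR. Under GDPR, you may request "
    "deletion of your personal data at any time."
)

for snippet in keyword_contexts(policy, "GDPR"):
    print(snippet)
```

Documents in which the keyword appears in several different contexts are usually more useful for training than documents that repeat one stock sentence.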

Complex Insight:

A complex insight can be expressed in many different ways and patterns. The key is to train the scanner to recognize the variety of phrase patterns that signal the insight. This requires a thorough understanding of the insight and the business use case.

For example, let’s break down the key indicators for the “COPPA inapplicable” insight as follows:

  • Indicator 01: Service is not directed to children under 13
  • Indicator 02: Do not knowingly collect personal information from children under 13
  • Indicator 03: Delete data collected from children under 13 once the company becomes aware of it

Note: Look for the shortest possible phrase(s) signaling the insight. Avoid including irrelevant phrases that fall between two insight phrases.

Based on the insight indicators above, we may find the following insight variations in our training documents (privacy policies):

Complex Insight Example 01: COPPA inapplicable insight

Complex Insight Example 02: COPPA inapplicable insight

Complex Insight Example 03: COPPA inapplicable insight
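Before training, you may want a rough check that your candidate documents cover all three indicators listed above. The sketch below pre-screens policy text with a few regular expressions. The patterns, the `indicators_found` helper, and the two sample policies are hypothetical and deliberately narrow; real privacy policies phrase these indicators in many more ways, and this is not how the scanner itself detects insights.

```python
import re

# Illustrative patterns only, one per indicator.
INDICATOR_PATTERNS = {
    "indicator_01": re.compile(
        r"not (?:directed|intended|aimed) (?:to|at|for) children", re.I
    ),
    "indicator_02": re.compile(r"(?:do|does) not knowingly collect", re.I),
    # "delete" and "aware" in the same sentence, in either order
    "indicator_03": re.compile(
        r"(?:delete|remove)[^.]*aware|aware[^.]*(?:delete|remove)", re.I
    ),
}


def indicators_found(text):
    """Return the sorted names of the indicators whose pattern matches the text."""
    return sorted(
        name for name, pattern in INDICATOR_PATTERNS.items() if pattern.search(text)
    )


# Hypothetical policy excerpts.
policy_a = (
    "Our Service is not directed to children under 13. We do not "
    "knowingly collect personal information from children under 13."
)
policy_b = (
    "If we become aware that a child under 13 has provided us with "
    "personal information, we will delete it."
)

print(indicators_found(policy_a))  # ['indicator_01', 'indicator_02']
print(indicators_found(policy_b))  # ['indicator_03']
```

A document that matches none of the patterns is not necessarily irrelevant; it may simply use wording the patterns do not anticipate, which is exactly the kind of variation worth adding to the training set.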

Next Steps

Refer to Train your Box Skill Scanner to Extract Info from Documents to find out how to train your scanner to extract data.

Refer to Train your Box Skill Scanner to Detect Insights from Documents to find out how to train your scanner to discover insights. 
