
Training Data Selection

This page describes how training data is selected and filtered for the Next Best Message, Search, and Autocomplete models. The selection logic is implemented as JsonLogic rules, which are evaluated against both platform-specific metadata and core message attributes.

Conversion of JsonLogic to SQL

When loading data for training, JsonLogic rules are converted to SQL 'WHERE' clauses to efficiently filter data in BigQuery. For example, profile selection rules (tags) and ignore message rules are translated directly into SQL conditions:

  • Tags: Used to select data belonging to a profile. The rule is converted to a SQL clause that matches metadata in the tags column.
  • Ignore message: Used to exclude messages from training data. The rule is converted to a negative SQL clause, so matching messages are left out of the result set.

Example of an ignore message rule:

{"==": [{"var": "author_name"}, "survey"]}

becomes:

AND NOT (author_name = 'survey')

The code for converting JsonLogic to SQL can be found here.
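
As a rough illustration of the idea (not the linked implementation), the conversion can be sketched as a small recursive function. Everything here is an assumption made for the example: the function names `jsonlogic_to_sql` and `ignore_message_clause`, the supported operator subset, and the string escaping. Production code would more likely pass values as query parameters than inline them.

```python
from typing import Any

# Hypothetical operator tables; only a small subset of JsonLogic is covered.
_COMPARISON_OPS = {"==": "=", "!=": "!=", "<": "<", "<=": "<=", ">": ">", ">=": ">="}
_BOOLEAN_OPS = {"and": "AND", "or": "OR"}


def _operand_to_sql(operand: Any) -> str:
    """Render a {"var": ...} reference or a literal as a SQL fragment."""
    if isinstance(operand, dict) and set(operand) == {"var"}:
        name = operand["var"]
        # Accept plain identifiers only, so a rule cannot inject SQL via a name.
        if not isinstance(name, str) or not name.replace("_", "").isalnum():
            raise ValueError(f"unsupported variable name: {name!r}")
        return name
    if isinstance(operand, bool):  # checked before int: bool is a subclass of int
        return "TRUE" if operand else "FALSE"
    if isinstance(operand, (int, float)):
        return str(operand)
    if isinstance(operand, str):
        # Backslash-escape backslashes and single quotes inside the literal.
        escaped = operand.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    raise ValueError(f"unsupported operand: {operand!r}")


def jsonlogic_to_sql(rule: Any) -> str:
    """Convert a JsonLogic rule into a parenthesised SQL boolean expression."""
    if not isinstance(rule, dict) or len(rule) != 1:
        raise ValueError(f"expected a single-operator rule, got: {rule!r}")
    op, args = next(iter(rule.items()))
    if op in _COMPARISON_OPS:
        left, right = args
        return f"({_operand_to_sql(left)} {_COMPARISON_OPS[op]} {_operand_to_sql(right)})"
    if op in _BOOLEAN_OPS:
        joiner = f" {_BOOLEAN_OPS[op]} "
        return "(" + joiner.join(jsonlogic_to_sql(arg) for arg in args) + ")"
    if op == "!":
        inner = args[0] if isinstance(args, list) else args
        return f"(NOT {jsonlogic_to_sql(inner)})"
    raise ValueError(f"unsupported operator: {op!r}")


def ignore_message_clause(rule: Any) -> str:
    """Negate the rule so that matching messages are left out of the result."""
    return "AND NOT " + jsonlogic_to_sql(rule)


print(ignore_message_clause({"==": [{"var": "author_name"}, "survey"]}))
```

In this sketch a tags rule would use the output of `jsonlogic_to_sql` as a positive condition, while an ignore message rule goes through `ignore_message_clause` to produce the negated form.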

Legacy Schema for Exclude as Suggestion

The 'exclude as suggestion' rules are not applied in the SQL query, but in post-query data processing. After retrieving data from BigQuery, messages matching the exclusion logic are tagged with TAG_EXCLUDE_AS_SUGGESTION. These tags are then used in subsequent steps to prevent these messages from being recommended.

Warning

For legacy reasons, exclusion rules on non-metadata attributes must use agent_id rather than author_id, because Engine still uses an older schema. There is a Linear ticket to address this technical debt. Other field names that can be used include agent_name, visitor_name, origin, and text.

Example exclusion rule:

{"==": [{"var": "agent_id"}, "18829"]}
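
The post-query tagging step might look roughly like the sketch below. It is an assumption-laden illustration, not the Engine code: the `evaluate` helper is a minimal stand-in for a real JsonLogic evaluator, the `processing_tags` field and the string value of `TAG_EXCLUDE_AS_SUGGESTION` are hypothetical, and messages are modelled as plain dictionaries using the legacy field names.

```python
from typing import Any

# The constant name comes from this page; its string value is assumed here.
TAG_EXCLUDE_AS_SUGGESTION = "TAG_EXCLUDE_AS_SUGGESTION"


def evaluate(rule: Any, row: dict) -> Any:
    """Evaluate a tiny subset of JsonLogic against one message row.

    Note: "==" uses Python's strict equality, so "18829" != 18829, whereas
    reference JsonLogic implementations compare loosely.
    """
    if isinstance(rule, dict) and len(rule) == 1:
        op, args = next(iter(rule.items()))
        if op == "var":
            name = args[0] if isinstance(args, list) else args
            return row.get(name)
        values = [evaluate(a, row) for a in (args if isinstance(args, list) else [args])]
        if op == "==":
            return values[0] == values[1]
        if op == "!=":
            return values[0] != values[1]
        if op == "and":
            return all(values)
        if op == "or":
            return any(values)
        if op == "!":
            return not values[0]
        raise ValueError(f"unsupported operator: {op!r}")
    return rule  # anything else is treated as a literal


def tag_excluded(messages: list, rule: Any) -> list:
    """Return copies of the messages, tagging those that match the rule."""
    tagged = []
    for message in messages:
        tags = list(message.get("processing_tags", []))  # hypothetical field
        if evaluate(rule, message) and TAG_EXCLUDE_AS_SUGGESTION not in tags:
            tags.append(TAG_EXCLUDE_AS_SUGGESTION)
        tagged.append({**message, "processing_tags": tags})
    return tagged


# Rows as they might come back from the query, using the legacy field names.
rows = [
    {"agent_id": "18829", "text": "Thanks for contacting us!"},
    {"agent_id": "42", "text": "Your order has shipped."},
]
rule = {"==": [{"var": "agent_id"}, "18829"]}
for message in tag_excluded(rows, rule):
    print(message["agent_id"], message["processing_tags"])
```

Tagging the messages instead of dropping them keeps them in the retrieved data, and later steps can check for the tag before offering a message as a suggestion.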