Study design and setting
The study comprised two phases. The first phase involved building FAIIR to identify the issue(s) that a young person might be experiencing from their textual conversations with trained CRs. Identification was performed using a list of 19 predefined issue tags. This phase laid the foundation for our work, in which we developed, fine-tuned, and evaluated FAIIR's capacity as an NLP tool to understand and predict issues. In the second phase, we validated the model's efficacy and accuracy through testing with domain experts and silent testing. This phase confirmed the practical applicability and real-world utility of FAIIR for CRs. Both phases utilized conversations related to crisis support services at KHP.
Curation of study dataset
The primary conversational dataset used for building and evaluating the NLP models of FAIIR comprised 703,975 unique, scrubbed, multi-turn dialog instances between service users and CRs via SMS from January 2018 until February 2023. An additional batch of 84,832 conversations from February to September 2023 was used for silent testing. It is important to note that some of these dialogs may originate from service users who engage in multiple interactions with CRs; however, encounters from the same individual are not linked. In total, the training data represented conversations with 340,512 individual service users and 7937 CRs. The silent testing data represented 57,031 unique service users and 2038 CRs, with expected overlap between the individuals represented in both datasets.
At the end of each conversation, service users are asked to fill out an optional demographic survey. The survey captures information including the helpfulness of the conversation to the user and demographics such as their age range, ethnicity or cultural group, identification with any of ten identity groups (e.g., newcomer, refugee, deaf, blind, people with disabilities), and current living setting (i.e., city, rural area, or First Nations Reserve). Approximately 17% of service users typically complete this survey, most of whom identify as female, heterosexual, and of European ancestry. Most conversations are flagged as medium-risk; Fig. 1h shows the distribution of priority labels across the main conversational dataset, assigned according to the priority flagging methods described below.
A total of 19 pre-defined issue tags currently serve to describe the range of topics raised by a user during a conversation, including topics such as Depressed, Anxiety/Stress, and Gender/Sexual Identity. Upon conversation conclusion, a CR manually assigns at least one of the available issue tags to the conversation. Metrics related to tags are used in aggregate for insight generation: to follow trends in youth issues, support CRs, and report to funders and other agencies. It is important to note that this labeling process is carried out by CRs at their own discretion and according to their training. Due to limited resources and large volumes of service user inquiries, issue tags typically do not undergo additional review.
Data were anonymized by scrubbing identifying information, such as names and locations, which was automatically replaced with the placeholder [scrubbed]. In many instances, complete phrases and sentences were scrubbed by the anonymization process. This process therefore introduced some noise due to the unintentional removal of harmless words, such as "turkey".
Priority flag pre-processing
At the start of each conversation, the system generates a priority flag based on the service user's first few words. Service users are triaged into categories of high risk, medium risk, low risk, or "no ground truth" via an algorithm owned by Crisis Text Line. Medium risk is assigned when a user expresses suicidal thoughts or self-harm, and high risk is assigned when an individual is deemed to be an "imminent risk", defined as having a combination of suicidal thoughts, a plan, access to means, and a 0–48 h timeline to end their life. The presence of any of 56 English or 73 French words in a user's initial message leads to their automatic triage to a higher priority level. According to the distribution in Fig. 1h (main text), the vast majority of conversations (87%) were flagged as medium risk, with about 13% flagged as high risk. Almost no conversations (0.0001%) were flagged as low risk.
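For illustration only, keyword-triggered escalation of this kind might resemble the following sketch; the actual Crisis Text Line algorithm and its word lists are proprietary, and the words and rules below are hypothetical placeholders.

```python
# Illustrative sketch only: the real triage algorithm and word lists are
# proprietary to Crisis Text Line, and the words below are hypothetical.
HIGH_PRIORITY_WORDS = {"overdose", "pills"}      # hypothetical examples
MEDIUM_PRIORITY_WORDS = {"suicide", "cutting"}   # hypothetical examples

def triage_initial_message(message: str) -> str:
    """Assign a coarse priority flag from the service user's first words."""
    tokens = {tok.strip(".,!?").lower() for tok in message.split()}
    if tokens & HIGH_PRIORITY_WORDS:
        return "high"
    if tokens & MEDIUM_PRIORITY_WORDS:
        return "medium"
    return "low"
```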
To assess how the FAIIR model performs across the different assigned priority levels, similar to Table 1, we collected fine-grained performance of the main FAIIR tool across our main metrics on the retrospective test set (n = 140,795). We divided the conversations in the test set into the two main priority levels ("Medium" and "High") and report the results across the two thresholds (0.25 and 0.5) in Supplementary Table 4. We observe little variation in performance across the two risk categories, indicating minimal bias with respect to conversation priority and that the model is able to handle these different levels accordingly.
Development of the FAIIR tool
We framed the issue tag identification task as a multi-label classification problem, where multiple labels can be assigned to a single instance. The development of the classifier followed two distinct stages. In the first stage, we compared and evaluated various pre-trained transformer-based language models, fine-tuning them on a randomly selected subset of the data for classification. Pre-training involves training models on large-scale text corpora to learn general language patterns, while fine-tuning adapts these models to a specific task using a smaller, task-specific dataset. The second stage involved domain adaptation, refining the model to better capture the nuances of youth mental health conversations. This was achieved through additional pre-training and fine-tuning on the full baseline training dataset, ensuring the classifier effectively recognized context-specific language patterns and issues relevant to the domain.
Development step 1: model comparisons
We explored four primary variants of transformer models for processing lengthy documents and task-oriented conversational data. These models fall into two categories: "encoder-only" models, designed primarily for classification tasks, and "encoder-decoder" models, which process input text using an encoder and generate output using a decoder24. The models evaluated included Longformer9, an encoder-only model with 149M parameters; Conversational BERT, an encoder-only model with 110M parameters8,25; DialogLED, an encoder-decoder model with 139M parameters26; and MVP (Multi-task superVised Pre-training), an encoder-decoder model with 406M parameters27. During fine-tuning of these models, we incorporated a classification head, a single-layer neural network that converts the model's output into probability scores for each class label. In encoder-only models, this layer was applied to the [CLS] token, which represents the entire input. For encoder-decoder models, classification was based on the [EOS] token (DialogLED) or the first token in the sequence (MVP), following established conventions28.
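As a concrete illustration of this setup, the sketch below (Python, Hugging Face transformers) shows one way a single-layer classification head could be attached to an encoder-only model such as Longformer; the class name and structure are our own assumptions rather than the exact FAIIR implementation.

```python
# Minimal sketch (assumed implementation, not the exact FAIIR code) of a
# multi-label classification head on an encoder-only model such as Longformer.
import torch.nn as nn
from transformers import LongformerModel

class MultiLabelLongformer(nn.Module):
    def __init__(self, num_labels: int = 19,
                 checkpoint: str = "allenai/longformer-base-4096"):
        super().__init__()
        self.encoder = LongformerModel.from_pretrained(checkpoint)
        # Single-layer head mapping the [CLS]-position embedding to 19 logits.
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls_embedding = hidden[:, 0]  # first token summarizes the whole input
        return self.classifier(cls_embedding)  # raw logits; apply sigmoid at inference
```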
All four models were fine-tuned on 50,000 conversations randomly sampled from the dataset. We built a 60/20/20 stratified training/validation/test split. Fine-tuning on the full dataset took approximately 12 h per epoch on four NVIDIA A10 GPUs (24 GB VRAM) with 16 CPU cores, with an effective batch size of 16. Learning rates were tuned within the range of 1e−5 to 3e−5. Maximum token lengths were applied for BERT, DialogLED, and MVP, while Longformer was capped at 2048 tokens for efficiency. This limits the length of input text that can be processed but provides faster processing. The optimal training durations for each model were determined through basic hyperparameter tuning (used to find optimal parameters such as learning rate and batch size), resulting in two epochs (training cycles) for BERT, three epochs for DialogLED, five epochs for Longformer, and two epochs for MVP. Threshold selection was a key consideration in determining how labels were assigned. Since this is a multi-label classification task, where each conversation can have multiple assigned tags, we experimented with different threshold values to optimize the balance between precision and recall. We systematically evaluated thresholds ranging from 0.25 to 0.5 on a validation set, measuring their impact on classification performance. Our analysis indicated that a threshold of 0.25 yielded the best trade-off between precision and recall, particularly for underrepresented labels.
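The threshold sweep could be carried out along the lines of the following sketch, where y_true and y_prob are assumed to be binary label and predicted probability arrays of shape (n_conversations, 19); the helper name and the exact threshold grid are assumptions.

```python
# Sketch of the validation-set threshold sweep; y_true and y_prob are assumed
# binary label and sigmoid probability arrays of shape (n_conversations, 19).
import numpy as np
from sklearn.metrics import f1_score

def sweep_thresholds(y_true: np.ndarray, y_prob: np.ndarray,
                     thresholds=(0.25, 0.3, 0.35, 0.4, 0.45, 0.5)):
    results = {}
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        # Sample-averaged F1 balances precision and recall per conversation.
        results[t] = f1_score(y_true, y_pred, average="samples", zero_division=0)
    return results
```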
Process of selecting model architecture
The utilization of transformer-based models, such as Longformer, for classifying clinical or conversational data has been extensively explored in the literature. Studies such as that of Li et al.23 have consistently demonstrated that Longformer models outperform shorter-sequence transformers like ClinicalBERT29 in various downstream tasks, including clinical document classification. Similarly, another study by Dai et al.30, which evaluated different approaches for classifying long documents using transformer architectures such as Longformer, concluded that employing transformer-based models designed for longer sequences is more effective and efficient than using shorter-sequence models like BERT. This finding is particularly relevant because BERT is constrained by a 512-token limit that prevents the processing of longer inputs, whereas Longformer's capacity to handle sequences up to eight times longer proves advantageous for tasks requiring broader context, such as analyzing lengthy conversational data. Other work, such as that of Wang et al.31, Zhong et al.26, and Ji et al.32, highlights the significance of additional pre-training methods, such as masked language modeling and next-turn prediction, especially in the context of dialog data. These studies emphasize the differences between general-domain language and dialog, indicating that pre-training on a large corpus of domain-specific dialogs can significantly improve performance on downstream dialog tasks. In particular, Zhong et al.26 demonstrate the benefits of pre-training Longformer using dialog-specific window-based denoising on lengthy dialogs, resulting in substantial improvements on tasks such as long dialog understanding. Lastly, Ji et al.32 pre-train RoBERTa33, Longformer, and XLNet34 on mental healthcare domain data for the task of mental health classification, achieving results superior to the base models in most cases and demonstrating the effectiveness of extensive pre-training on mental healthcare domain data for related downstream tasks.
Supplementary Table 2 compares the performance of the four preliminary models, Longformer, Conversational BERT, DialogLED, and MVP, all chosen for their suitability for handling either conversational data or long documents. We used five metrics to evaluate model performance on the test data. The first metric is the standard classification "accuracy", which considers the percentage of the 19 tags predicted correctly by the model for each instance across the full dataset. To attain full accuracy for a given conversation, the model must predict every tag in the correct set assigned to the conversation and must not predict any tags outside that set.
Because assigned tags are sparse, with conversations tending to be tagged with only a few of the 19 tags, a classifier can attain high accuracy by not predicting any tags at all. We therefore use a second metric, referred to as "exact accuracy", which assesses correctness based on the percentage of conversations for which all predicted issue tags are correct. As such, a single misidentified tag means the conversation is classified as an incorrect prediction. In addition to accuracy, we use three other metrics, which we call "sample average precision", "sample average recall", and "sample average F1-score".
In the context of multi-label classification, the sample/example-based average calculates the three scores for each sample and then averages the scores across all samples. For each sample, the entire set of predicted tags is compared with the full set of true labels when calculating the three scores, without isolating individual tag types. This is unlike micro-averaging, where the scores are calculated globally across the total true positive, false negative, and false positive counts, or macro-averaging, where the scores are calculated from the true positive, false positive, and false negative counts for each specific tag before taking the unweighted mean across all tags. This method of averaging provides a representative result for the entire distribution after assessing scores for each sample individually. The metrics displayed in Supplementary Table 2 are averages across all samples.
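A minimal sketch of the sample-averaged metrics is shown below, assuming binary indicator arrays of shape (n_samples, 19); scikit-learn's average="samples" option provides equivalent functionality.

```python
# Sketch of example-based (sample-averaged) precision, recall, and F1;
# y_true and y_pred are binary integer arrays of shape (n_samples, 19).
import numpy as np

def sample_average_scores(y_true: np.ndarray, y_pred: np.ndarray):
    tp = (y_true & y_pred).sum(axis=1)            # true positives per sample
    pred_counts = y_pred.sum(axis=1)
    true_counts = y_true.sum(axis=1)
    precision = np.where(pred_counts > 0, tp / np.maximum(pred_counts, 1), 0.0)
    recall = np.where(true_counts > 0, tp / np.maximum(true_counts, 1), 0.0)
    f1 = np.where(precision + recall > 0,
                  2 * precision * recall / np.maximum(precision + recall, 1e-12), 0.0)
    # Average each score across samples rather than across tags.
    return precision.mean(), recall.mean(), f1.mean()
```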
Both Longformer and Conversational BERT exhibit comparably high performance (Accuracy: 0.94 and sample average F1-score: 0.56). Conversational BERT offers the advantage of being pre-trained on an extensive corpus of conversation data, while Longformer excels in capturing longer sequences. We therefore selected Longformer given the nature of our conversations (long sequences), with the intention of performing domain adaptation akin to Conversational BERT to improve its performance. The remaining two models, based on encoder-decoder architectures, underperformed (sample average F1-score <0.35), primarily because they were not originally designed for this particular multi-label classification task. Furthermore, both encoder-decoder models encountered significant practicality issues related to exceptionally long training times and resource limitations, necessitating small batch sizes and incurring long inference times. As a result, they were deemed sub-optimal choices for this task.
Development step 2: final model development and optimization
For our final model, we employed an ensemble approach combining three Longformer models, each with slightly different initialization and fine-tuning settings. The choice of Longformer as our primary model was based on its superior performance and its capacity to effectively capture long conversations. Each Longformer underwent initial pre-training on the full baseline training dataset using the same approach: masked language modeling, in which a portion of the words in each conversation was masked and the model learned to predict them. We applied masking to 15% of tokens per conversation and pre-trained the models for one epoch with a maximum sequence length of 1500 tokens. AdamW35 was used as the optimizer, and a linear scheduler with 500 warm-up steps was applied to improve training stability. Gradient accumulation was used to maintain an effective batch size of 64, ensuring efficient use of GPU resources. This pre-training step required approximately 24 h.
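A sketch of this masked language modeling step, assuming a Hugging Face Trainer workflow and a hypothetical pre-tokenized conversation_dataset (truncated to 1500 tokens), is shown below; settings not stated above (e.g., per-device batch size, output directory) are assumptions.

```python
# Sketch of domain-adaptive masked language modeling with Hugging Face's
# Trainer; `conversation_dataset` is a hypothetical dataset already tokenized
# to a maximum length of 1500 tokens.
from transformers import (DataCollatorForLanguageModeling, LongformerForMaskedLM,
                          LongformerTokenizerFast, Trainer, TrainingArguments)

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# Mask 15% of tokens per conversation, as described above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="faiir-mlm",              # hypothetical output directory
    num_train_epochs=1,                  # one pre-training epoch
    per_device_train_batch_size=4,       # assumed per-device size
    gradient_accumulation_steps=16,      # effective batch size of 64 on one GPU
    warmup_steps=500,
    lr_scheduler_type="linear",          # Trainer uses AdamW by default
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=conversation_dataset)  # hypothetical dataset
trainer.train()
```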
Following the pre-training task, the Longformer models were fine-tuned on a label-balanced training/validation/test data split (60/20/20). Per the recommendation of our domain experts, we incorporated additional context information related to the conversation's priority: the first sentence of each input included the statement "This conversation is of <<X>> priority", with X representing one of the three priority levels assigned to each conversation. The process of generating these levels is discussed in the "Priority flag pre-processing" section above. Each Longformer model was fine-tuned for a maximum of three epochs using a batch size of 16, managed through gradient accumulation, with a learning rate of 2e−5. Standard binary cross entropy loss was applied during fine-tuning, with oversampling of conversations with less common issue tags implemented on two of the ensemble models to address class imbalance. We used AdamW as the optimizer and implemented a linear scheduler with warm-up over the initial 20% of training steps.
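The fine-tuning inputs and loss could look like the following sketch, which reuses the hypothetical MultiLabelLongformer from the earlier sketch (returning raw logits); the priority prefix mirrors the statement described above, while the helper names are assumptions.

```python
# Sketch of the fine-tuning inputs and loss, reusing the hypothetical
# MultiLabelLongformer from the earlier sketch (which returns raw logits).
import torch
import torch.nn as nn

def build_input_text(conversation: str, priority: str) -> str:
    # Prepend the priority context recommended by domain experts.
    return f"This conversation is of <<{priority}>> priority. {conversation}"

loss_fn = nn.BCEWithLogitsLoss()  # standard binary cross entropy over 19 tags

def training_step(model, tokenizer, batch_texts, batch_labels, device="cuda"):
    # batch_texts are assumed to have been built with build_input_text.
    enc = tokenizer(batch_texts, truncation=True, max_length=1500,
                    padding=True, return_tensors="pt").to(device)
    logits = model(**enc)                                  # shape: (batch, 19)
    labels = torch.as_tensor(batch_labels, dtype=torch.float, device=device)
    return loss_fn(logits, labels)
```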
Evaluation of FAIIR predictions
Upon completion of the development of the FAIIR tool, we conducted two independent experiments to evaluate its efficacy and performance in generalization. The experiments included both expert assessment and silent testing of the tool and its predictions.
Expert assessment and evaluation
Expert assessment of FAIIR involved an evaluation survey completed by CRs. We invited 12 trained CRs to review 40 challenging conversations. The conversation selection criteria were diverse, focusing on conversations with more than four issue tags and including ambiguous cases where FAIIR's predictions were confident but incorrect relative to our ground-truth labels. Our hypothesis was that these edge cases require a deep and nuanced understanding for FAIIR to perform well. Thus, our goal was to assess the model's ability to identify all relevant issue tags and navigate language nuances.
Twenty of the 40 conversations annotated were picked at random from the test set to obtain a representative sample of the data. All 20 conversations were originally labeled with four or more different issue tags. These conversations were selected because they cover a wide range of issues: the potential for annotators to pick different sets of issue tags from one another is high, and their perspectives on which issue tags truly apply to the conversation were of utmost importance. This was also important in evaluating whether the model was able to grasp all nuanced issue tags that may apply less directly to a given conversation.
The remaining 20 of the 40 conversations were mostly originally tagged with three or fewer issue tags, in an effort to balance conversations with many tags against those where only a small number may apply. Of these 20, subsets of conversations were selected according to several differing criteria. A number were selected manually to cover all 19 issue tags, in an effort to build consensus on the identification of every tag for the model to reference. A small sample of conversations was also selected as purposefully ambiguous cases: mainly long conversations that were annotated with only one or two issue tags. Although the issue tags assigned at baseline were typically correct, these conversations were an opportunity to gain consensus on a spectrum of more nuanced tags for the purposes of model fine-tuning. The last few conversations were handpicked because they were perceived to be mislabeled in some way: the original issue tags assigned appeared incorrect or incomplete, in that a key issue tag was missing. Consensus building is important for these examples in order to improve the original tags assigned, where incorrect. These can also be complex cases for the model to navigate, and thus served as a helpful way to evaluate the tool's performance.
Each conversation was independently reviewed by six CRs, divided into two groups. In the “open review”, three CRs reviewed conversations with FAIIR’s predicted issue tags explicitly provided. This approach aimed to evaluate whether the model’s predictions were helpful, misleading, or partially correct in identifying the core issues within each conversation. CRs could either agree or disagree with the predicted tags and suggest corrections or refinements where necessary.
In the “blind review”, the remaining three CRs reviewed the same conversations without any prior exposure to FAIIR’s predicted tags. Instead, they independently identified issue tags based solely on the conversation content. Furthermore, they categorized the identified tags into primary issue tags (representing the most pressing concerns) and secondary issue tags (minor but relevant concerns). This approach established a baseline for comparison against FAIIR’s predictions, ensuring that human assessments were made without any influence from the model.
The following five criteria were established to develop a consensus measure for comparison in the blind review setting, which is more challenging than the open review setting. Since human annotations categorize issue tags as primary (most pressing concerns) and secondary (minor but relevant concerns), we evaluated agreement with FAIIR's outputs based on these distinctions. Notably, FAIIR does not explicitly differentiate between primary and secondary issue tags; all predicted tags are treated equally. Therefore, for the purpose of comparison, we assessed agreement by mapping FAIIR's predicted tags to human annotations and measuring alignment using the following criteria (a minimal sketch of this comparison follows the list):
Full agreement on primary issue tags (FA: 1°)—all primary issue tags identified by human annotators are also predicted by FAIIR.
Partial agreement on primary issue tags via majority vote (PA: 1° Maj.)—the majority of human annotators agree on a set of primary issue tags, and these tags overlap with FAIIR’s predictions.
Partial agreement on primary and secondary issue tags via majority vote (PA: 1° + 2° Maj.)—the majority of human annotators agree on a set of both primary and secondary issue tags, and these overlap with FAIIR’s predictions.
Full agreement on primary issue tags via at least one vote (FA: 1° ≥ 1)—at least one human annotator identified a primary issue tag that is also predicted by FAIIR.
Full agreement on primary and secondary issue tags via at least one vote (FA: 1° + 2° ≥ 1)—at least one human annotator identified a primary or secondary issue tag that is also predicted by FAIIR.
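A minimal sketch of how these five criteria could be computed for a single conversation is shown below; the data structures (per-annotator sets of primary and secondary tags, and FAIIR's predicted tag set) and the majority rule are our own assumptions about the comparison.

```python
# Minimal sketch (assumed data structures) of the five agreement criteria for
# one conversation: `primary` and `secondary` are lists of per-annotator tag
# sets from the blind review, and `faiir` is the set of tags predicted by FAIIR.
from collections import Counter

def majority_set(annotations, n_annotators=3):
    counts = Counter(tag for tags in annotations for tag in tags)
    return {tag for tag, c in counts.items() if c > n_annotators / 2}

def agreement_criteria(primary, secondary, faiir):
    all_primary = set().union(*primary)
    all_both = all_primary | set().union(*secondary)
    maj_primary = majority_set(primary)
    maj_both = majority_set([p | s for p, s in zip(primary, secondary)])
    return {
        "FA: 1°": all_primary <= faiir,             # every primary tag predicted
        "PA: 1° Maj.": bool(maj_primary & faiir),   # majority primary tags overlap
        "PA: 1° + 2° Maj.": bool(maj_both & faiir),
        "FA: 1° ≥ 1": bool(all_primary & faiir),    # any annotator's primary tag predicted
        "FA: 1° + 2° ≥ 1": bool(all_both & faiir),
    }
```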
Model refinement—modifying the decision boundary
In our evaluation experiments, model refinement involved adjusting the decision boundary (threshold cutoff) to strike a balance between recall and precision. In most experiments, the FAIIR tool's predictions showed lower precision than recall, so we raised the threshold for the most common tags to reduce how frequently they were output, while lowering the threshold for rare tags. For example, we set the threshold to 0.4 for the three most frequent classes: Anxiety/Stress, Depressed, and Relationship. For the next two most frequent classes, Suicide and Isolated, we set the threshold to 0.3. These five classes encompass the majority of the model's predicted issue tags, hence we targeted them for increased thresholds. The remaining tags were set at a lower threshold of 0.2 to enhance the model's ability to capture them effectively.
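A sketch of these per-tag decision boundaries is shown below, assuming sigmoid probabilities of shape (n_conversations, n_tags); only the five named tags and their thresholds come from the description above, and the helper names are hypothetical.

```python
# Sketch of the per-tag decision boundaries described above; `probabilities`
# holds sigmoid outputs of shape (n_conversations, n_tags).
import numpy as np

def build_thresholds(tags):
    thresholds = np.full(len(tags), 0.2)                      # default for rarer tags
    for i, tag in enumerate(tags):
        if tag in {"Anxiety/Stress", "Depressed", "Relationship"}:
            thresholds[i] = 0.4                               # three most frequent tags
        elif tag in {"Suicide", "Isolated"}:
            thresholds[i] = 0.3                               # next two most frequent tags
    return thresholds

def predict_tags(probabilities: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    return (probabilities >= thresholds).astype(int)
```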
Silent testing
We conducted silent testing to assess the FAIIR tool's generalization performance on new batches of conversations received over 8 months. Testing on 84,832 conversations that occurred between February and September 2023 served as a valuable representation of how the model adapts to and handles the ever-changing landscape of real-world dialog. In addition to model evaluation, we implemented refinements, as explained earlier, to strike a balance between precision and recall.
Outcome interpretation and visualization
We leveraged layer-integrated gradients36, a technique for understanding which tokens in a given conversation are most relevant to the predicted issue tags. By doing so, we can not only identify the most pertinent words associated with the primary conversation topics but also gain insight into the model's decision-making process, thereby enhancing its overall explainability. To achieve this, we computed an attribution score, analogous to an importance score, for each token in a conversation using the final model and tokenizer with respect to the set of predefined issue tags. Tokens whose attribution along the axis of an issue tag surpasses a predefined threshold are selected as the most relevant words. We refer to these as "natural keywords". The term "natural" is used because these keywords are not predefined; they are dynamic and align with the nuances of natural language.
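One way to compute such attribution scores is sketched below using Captum's LayerIntegratedGradients over the encoder's embedding layer, reusing the hypothetical MultiLabelLongformer class from the earlier sketch; this is an assumed implementation rather than FAIIR's exact pipeline.

```python
# Sketch of token-level attribution with Captum's layer-integrated gradients
# over the encoder's embedding layer; `tag_index` selects which of the 19
# issue tags to explain.
import torch
from captum.attr import LayerIntegratedGradients

def token_attributions(model, tokenizer, text, tag_index, device="cuda"):
    enc = tokenizer(text, truncation=True, max_length=1500,
                    return_tensors="pt").to(device)

    def forward_fn(input_ids, attention_mask):
        return torch.sigmoid(model(input_ids, attention_mask))[:, tag_index]

    lig = LayerIntegratedGradients(forward_fn, model.encoder.embeddings)
    baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)
    attributions = lig.attribute(inputs=enc["input_ids"], baselines=baseline,
                                 additional_forward_args=(enc["attention_mask"],))
    # Collapse the embedding dimension to one importance score per token.
    scores = attributions.sum(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
    return list(zip(tokens, scores.tolist()))
```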
To create a cleaner set of natural keywords from the initial set obtained using integrated gradients, we ran a series of filters to remove words and symbols that are irrelevant or do not add meaning or additional insight. We automatically filtered stop words, punctuation, and special tokens for this reason. In addition, we devised a predefined list of natural keywords that, while not categorized as stop words, occurred very frequently across virtually every issue tag (e.g., User, Hello, Connect); any keywords on this list were also filtered out. Finally, we filtered keywords according to their part-of-speech tags, removing any that fell within a defined set of categories. We filtered conjunctions, determiners, prepositions, modal auxiliary words, and the majority of verbs, along with many other categories, yielding keywords that are primarily nouns and adjectives.
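The filtering pass could be implemented along these lines, assuming spaCy's English pipeline for stop-word and part-of-speech tagging; the frequent-word list entries and kept categories shown are illustrative.

```python
# Sketch of the keyword-filtering pass, assuming spaCy's English pipeline and
# single-word keyword candidates; the frequent-word list entries are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")
FREQUENT_NON_INFORMATIVE = {"user", "hello", "connect"}  # curated domain list
KEEP_POS = {"NOUN", "PROPN", "ADJ"}                      # keep mainly nouns and adjectives

def filter_keywords(candidate_keywords):
    kept = []
    for doc in nlp.pipe(candidate_keywords):
        token = doc[0]
        if token.is_stop or token.is_punct or not token.is_alpha:
            continue  # stop words, punctuation, and special tokens
        if token.lower_ in FREQUENT_NON_INFORMATIVE:
            continue  # filtered by the predefined frequent-word list
        if token.pos_ not in KEEP_POS:
            continue  # conjunctions, determiners, prepositions, most verbs, etc.
        kept.append(token.lower_)
    return kept
```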
To streamline the processing of natural keywords and facilitate communication with knowledge experts, we conducted word embedding visualization and bi-gram analysis at the aggregated level to demonstrate semantic relationships and word proximity in a specific context.
Visualization of natural key words to support explainability
In addition to the core 19 issue tags, we built an explainability pipeline to enable the extraction of keywords, referred to as "natural keywords", from each conversation. Keywords are dynamic and context-specific tokens associated with the main issue tags being discussed in a conversation. By extracting these, we can derive further insight into more fine-grained "sub-topics" that help to better characterize issues of importance. Supplementary Fig. 3 illustrates the occurrence frequency of the top 100 keywords across 10,000 randomly selected conversations in the test set labeled with the Suicide issue tag. In total, 124,578 keywords were generated for the Suicide issue tag, yielding an average of 12.5 keywords per conversation. The top keywords generally represent ideas and concepts associated with the given issue tag, offering additional insights. For example, in the case of suicide, frequent topics are highlighted by common keywords, such as emotion-related keywords (e.g., "happy", "sad", "mood", "anxiety", "scared", and "pain"), demonstrating the distress being experienced. The presence of "plan" in the top 10 may indicate that many people texting have serious plans of suicide. Location-based words like "home", "school", and "friend" also rank high, as CRs are likely trying to determine the location of the individual in order to give the most appropriate support. These insights allow for many observations at the macro level. Similarly, Supplementary Fig. 5 shows the distribution of the top 25 keywords across the three abuse issue tags (Abuse, Physical; Abuse, Sexual; and Abuse, Emotional), illustrating keywords that are similar across categories (like "friend") versus those that are far more frequent for certain tags (such as "assault" for the Abuse, Sexual tag), and how "mom" and "dad" are relatively much more frequent for the Abuse, Physical and Abuse, Emotional tags compared with the Suicide and Abuse, Sexual tags.
The explainability pipeline in the FAIIR tool offers visualization features that provide valuable insights into the semantic relationships and proximity of keywords through bi-gram analysis and word embeddings. For bi-gram analysis, FAIIR provides a graph-based visualization in which nodes represent keywords and the edges between nodes illustrate the strength of their relationships (their co-occurrence frequency in conversations). As an example, Supplementary Fig. 4 illustrates the outcome of bi-gram analysis performed on the natural keywords related to the Suicide issue tag (the bi-gram analysis for the Abuse, Physical issue tag is shown in Supplementary Fig. 6). Based on the figure, certain keywords like "thought", "suicidal", "home", and "harm" have many strong connections, showing that they co-occur more frequently than other pairs. Connections can reveal potential insights about behaviors or where common issues lie, as seen with the connections between "family" and "pain", "problem", "talk", and "situation", which potentially reveal that family troubles are frequent sub-issues tied to suicidality.
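The co-occurrence graph underlying this visualization could be built as in the following sketch, assuming keywords_per_conversation is a list of keyword lists and using networkx; the minimum-count cutoff is an arbitrary illustrative choice.

```python
# Sketch of the keyword co-occurrence graph behind the bi-gram visualization;
# `keywords_per_conversation` is a hypothetical list of keyword lists.
from itertools import combinations
import networkx as nx

def build_cooccurrence_graph(keywords_per_conversation, min_count=5):
    graph = nx.Graph()
    for keywords in keywords_per_conversation:
        for a, b in combinations(sorted(set(keywords)), 2):
            # Edge weight accumulates co-occurrence frequency across conversations.
            weight = graph[a][b]["weight"] + 1 if graph.has_edge(a, b) else 1
            graph.add_edge(a, b, weight=weight)
    # Keep only edges observed at least `min_count` times.
    weak = [(a, b) for a, b, w in graph.edges(data="weight") if w < min_count]
    graph.remove_edges_from(weak)
    graph.remove_nodes_from(list(nx.isolates(graph)))
    return graph
```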
Further insights on extracted key words
Supplementary Fig. 5 illustrates the occurrence frequency of the top 25 keywords across all conversations in the test set labeled with the three abuse issue tags (Abuse, Physical; Abuse, Sexual; and Abuse, Emotional). In total, 1673, 2598, and 3367 conversations were used to generate 26,985, 40,522, and 52,635 keywords, respectively, for the three aforementioned issue tags, yielding averages of 16.2, 15.6, and 15.6 keywords per conversation. The top keywords generally represent ideas and concepts associated with the given issue tag, offering additional insights. For example, in the case of sexual abuse, both "home" and "school" are frequent keywords, with home being relatively more frequent than school. This may reflect places where service users are more likely to have experienced abuse, thereby aiding CRs in strategy and planning when managing conversations. We also observe that similar keywords, such as "friend", "mom", and "dad", are common across these issue tags, revealing major overarching topics and pressure points.
In addition to bi-gram analysis, in Supplementary Fig. 7 we demonstrate the word embeddings of the top 100 most frequent keywords within all conversations originally labeled with the issue tag Abuse, Emotional, projected in a three-dimensional space using principal component analysis (PCA). This enables enhanced visualization and easier exploration of keywords that are most similar to a searched keyword using cosine distance between the embeddings. For example, for the keyword “fight”, keywords like “stop”, “help” and “hard” are the closest in embedding space.
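A sketch of the projection and nearest-keyword lookup is shown below, assuming keyword_vectors maps each keyword to an embedding vector (for example, averaged contextual embeddings from the fine-tuned encoder); function and variable names are hypothetical.

```python
# Sketch of the embedding projection and nearest-keyword lookup;
# `keyword_vectors` is a hypothetical dict mapping keywords to embedding vectors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_distances

def project_and_query(keyword_vectors: dict, query: str, top_k: int = 3):
    words = list(keyword_vectors)
    matrix = np.stack([keyword_vectors[w] for w in words])

    # Three-dimensional PCA projection for visualization.
    coords_3d = PCA(n_components=3).fit_transform(matrix)

    # Nearest keywords to the query by cosine distance in the original space.
    distances = cosine_distances(matrix[words.index(query)][None, :], matrix)[0]
    nearest = [words[i] for i in distances.argsort() if words[i] != query][:top_k]
    return coords_3d, nearest
```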
Ethics statement
This publication was the result of a quality improvement initiative at Kids Help Phone, and as a result, no REB approval was sought or obtained.
Kids Help Phone (KHP) is deeply committed to the ethical and responsible use of data to enhance our services for youth, recognizing the importance of ethical principles in maintaining the trust of those we serve, especially the most vulnerable. This paper is aimed exclusively at applied research to improve service delivery and accessibility, with a special focus on the ethical application of Artificial Intelligence (AI) to benefit our service network and frontline staff. Through this collaboration, we are dedicated to developing technological tools that provide a personalized and user-friendly experience for those seeking help. Upholding the privacy and confidentiality of our service users is paramount; we adhere to an ethical statement aligned with KHP’s privacy policy (https://kidshelpphone.ca/privacy-policy/), including consent notice for research and rigorous data minimization. Our processes are transparent and accountable, compliant with Canadian privacy regulations. We meticulously remove all direct identifiers from research data, adhering to industry standards for data anonymization, and securely store all research data within KHP’s infrastructure. This reflects our commitment to the highest standards of data security, confidentiality, and ethical practice. By prioritizing ethical data use, KHP can leverage research to improve our services and deliver the best possible support for youth across Canada, embodying our commitment to integrity, respect, and responsibility in every action we take.