Leveraging Prompts in LLMs to Overcome Imbalances in Complex Educational Text Data

Jeanne McClure, Machi Shimmei, Noboru Matsuda, Shiyan Jiang
North Carolina State University,
Raleigh, NC 27695
{jmmcclu3, mshimme, nmatsud, sjiang24}@ncsu.edu
Abstract

Background: The study addresses the challenge of imbalances in educational datasets, which is prominent in the education sector due to the varied cognitive engagement levels among students in their open responses. Traditional machine learning (ML) models often struggle with the complexity and nuanced nature of this data, leading to inadequate analyses, especially for minority data representations (Karimah and Hasegawa, 2022; Radwan and Cataltepe, 2017; Yun et al., 2011). Understanding students’ cognitive engagement is vital as it reflects their mental investment in learning activities, which is closely linked to academic success (Fredricks et al., 2004; Blumenfeld et al., 2006; Corno and Mandinach, 1983; Pintrich, 2000; Schunk et al., 2014).

Objective: The objective of this paper is to investigate the efficacy of Large Language Models (LLMs) enhanced with assertions in tackling the complexities of imbalanced educational datasets, with a special focus on the precise classification of cognitive engagement levels from student texts. This exploration is underpinned by two critical research questions. The first seeks to evaluate how LLMs equipped with Prompt Engineering fare in comparison to conventional ML algorithms when dealing with the inherent challenges of imbalanced educational data. The second question delves into the specific contributions of integrating assertions into LLMs, examining how such augmentations can improve the models’ effectiveness in handling the nuanced difficulties presented by imbalanced textual educational datasets. Through this inquiry, the study aims to shed light on the potential of LLMs and assertions in enhancing the accuracy and reliability of cognitive engagement classification, thereby addressing a significant gap in educational data analysis.

Methods: The study employed an ’Iterative - ICL PE Design Process’ to compare traditional ML models against LLMs augmented with assertions (N=135). A sensitivity analysis on a subset (n=27) examined variance in model performance concerning classification metrics and cognitive engagement levels. This process involved the utilization of assertion-based prompt engineering, comparing the performance of traditional ML models to LLMs with assertions in classifying cognitive engagement from student texts in an educational setting (Shahriar et al., 2023; Brown et al., 2020; Wei et al., 2022a).

Findings: LLMs with assertions significantly outperformed traditional ML models, especially in recognizing cognitive engagement levels with minority representation, showing up to a 32% increase in F1-score. Incorporating targeted assertions into the LLM on the subset enhanced its performance by 11.94%, primarily addressing errors from limitations in understanding context and resolving lexical ambiguities in student responses.

Implications: The study demonstrates the superior capability of LLMs, particularly when augmented with assertions, in addressing the nuanced challenges of imbalanced educational datasets. This advancement not only improves the accuracy of classifying cognitive engagement levels but also opens new avenues for data-driven educational research and practice. The findings suggest a potential paradigm shift towards employing advanced LLM techniques in educational settings to achieve a more nuanced and accurate analysis of student engagement, thereby enhancing learning outcomes. Future research should further explore the capabilities of LLMs across broader educational contexts and investigate additional methods to refine and expand their application in analyzing complex educational data (Shahriar et al., 2023; Zeng et al., 2023).

Keywords Machine Learning  \cdot Text Classification  \cdot Prompt Engineering  \cdot Imbalanced dataset  \cdot LLMs

1 Introduction

Understanding students’ cognitive engagement (CE) at both the school and task levels is crucial, as it offers deep insights into their commitment to learning (Fredricks et al., 2004). This form of engagement, characterized by a student’s deliberate and intentional approach to schoolwork and their willingness to invest the necessary effort in comprehending complex concepts and mastering challenging skills, serves as a key indicator of academic success (Fredricks et al., 2004; Blumenfeld et al., 2006). CE encompasses the psychological investment and effort driven by student motivation and strategies, alongside their dedication to learning (Corno and Mandinach, 1983; Fredricks et al., 2004; Pintrich, 2000; Schunk et al., 2014).

While analyzing students’ CE is crucial for enhancing learning experiences, a significant challenge arises from imbalanced datasets (Radwan and Cataltepe, 2017). These datasets often feature unevenly distributed categories and are typically small, not fitting the ’big data’ criteria usually required for effective Machine Learning (ML) training. This size limitation, along with the disproportionate representation of majority and minority data, further complicates the training process in traditional analyses (Yun et al., 2011). Traditional ML methods, commonly employed to classify CE, often struggle to adequately address these imbalances, raising concerns about the accuracy and reliability of their results. This issue presents a major hurdle in accurately assessing and interpreting CE, as the uneven representation of data can lead to skewed insights and potentially overlook critical aspects of student engagement (Karimah and Hasegawa, 2022). This imbalance in datasets not only complicates the analysis but also raises concerns about the reliability and generalizability of the findings in diverse educational settings (Radwan and Cataltepe, 2017).

The exploration of LLMs provides a promising solution to the limitations of traditional ML approaches. Recent studies, including (Wu, 2021), have highlighted the potential of prompt engineering in reducing the need for extensive training of case labeling which is imperative for imbalance data. LLMs employ techniques like In-context Learning (ICL) (Brown et al., 2020) and Chain-of-Thought (COT) prompting (Wei et al., 2022b), enabling more nuanced and context-aware responses. ICL trains models using examples in specific contexts, improving with scaled model and corpus sizes, as seen in N-shot prompting (Brown et al., 2020). This is illustrated by Brown et al. (2020)’s few-shot learning, where LLMs process input-output pairs in-context, leading to better test-time predictions. Similarly, COT, by Wei et al. (2022b), involves logical, step-by-step natural language reasoning. Furthering this, Shahriar et al. (2023) developed Assertion Enhanced Few-Shot Learning, incorporating domain-specific assertions in prompts to enhance accuracy and reduce errors. These innovations significantly boost LLMs’ task-specific efficiency, surpassing traditional methods.

While LLMs have shown potential in educational research, their application has predominantly been refined to solve logical reasoning or arithmetic problems (Lee et al., 2024), with limited exploration in addressing imbalanced datasets of education. Our study breaks new ground by applying LLMs with Prompt Engineering (PE) to this specific challenge. We hypothesize that LLMs, renowned for their nuanced language understanding, will surpass traditional ML algorithms in classifying cognitive engagement levels from student texts. Our exploration is guided by two research questions: RQ1 addresses the comparative efficacy of LLMs against traditional ML algorithms, and RQ2 investigates the role of assertions in overcoming contextual and lexical challenges within imbalanced datasets. Specifically:

  1. 1.

    How do the results obtained from LLMs with PE compare to traditional Machine Learning algorithms in handling imbalanced educational data?

  2. 2.

    In what ways does the integration of assertions enhance the efficacy of models when addressing the challenges associated with imbalanced textual educational datasets?

This paper examines how AEFL mitigates issues in imbalanced educational data analysis, revealing how these technologies can effectively address the challenges posed by uneven dataset distributions. By applying this cutting-edge technique, we uncover new possibilities for analyzing and interpreting complex educational data. Our findings demonstrate the advantage of AEFL in educational settings, especially where traditional ML methods fall short, opening new avenues for data-driven educational research and practice.

The rest of the paper is set up as follows: Section 2 delves into the background, highlighting the emergence of LLMs as a promising solution in education. Section 3 outlines our methodology, including the Iterative - ICL PE Design Process, and the experimental setup. The results and discussions are presented in Section 4, where we compare the performance of LLMs augmented with assertions against traditional ML models and discuss the impact of assertions on model efficacy and limitations. Finally, Section 5 concludes with our findings and future directions.

2 Background

The exploration of CE within educational research has significantly evolved, transitioning from a simplistic focus on student participation to a complex understanding of mental investment in learning activities. This shift is paramount for fully capturing the essence of engagement, as initially highlighted by Craik and Lockhart (1972) through their distinction between shallow and deep processing. Subsequent work by Appleton et al. (2006) and Fredricks et al. (2004) expanded the concept to encompass behavioral, emotional, and cognitive dimensions, underscoring engagement’s multifaceted nature across various educational contexts. A pivotal insight from this exploration is the strong positive correlation between student learning and cognitive engagement, evidenced by Chi and Wylie (2014), which underscores the significant educational outcomes associated with deep cognitive processes.

CE distinguishes itself within the broader spectrum of educational engagement by focusing on the intensity of students’ mental investment in learning. This stands in contrast to behavioral engagement’s emphasis on participation and emotional engagement’s concern with feelings towards learning Blumenfeld et al. (2006). Such a distinction is crucial for educators and researchers dedicated to enhancing learning outcomes through targeted interventions.

Central to understanding and enhancing CE are theoretical frameworks and models like Bloom’s taxonomy, Corno and Mandinach’s model, and the ICAP model, as well as Wang et al.’s framework for connectivist learning contexts. These models provide comprehensive insights into the various dimensions and components of cognitive engagement, aiding researchers in designing effective studies, developing targeted interventions, and evaluating educational outcomes (Anderson and Krathwohl, 2001; Bloom et al., 1956; Corno and Mandinach, 1983; Chi and Wylie, 2014; Chase et al., 2019; Hsiao et al., 2022; Wang et al., 2016).

Measuring CE, however, presents inherent challenges due to its complex and internal nature. As a latent construct, CE’s assessment relies on inferences from behavioral indicators or through self-report measures (Chi and Wylie, 2014; Fredricks et al., 2004; McCoach et al., 2013). Traditional methods, including self-report questionnaires, surveys, and observational techniques, often inadequately capture the nuanced cognitive processes involved in learning. A variety of measures have been employed in past studies to gauge CE, such as self-reported scales, classroom observations, interviews, teacher ratings, experience sampling, eyetracking, physiological sensors, trace analysis, and content analysis (Greene et al., 2004; Smiley and Anderson, 2011; Lee and Anderson, 1993; Helme and Clarke, 2001; Wigfield et al., 2008; Xie et al., 2019; D’Mello et al., 2017; Bernacki et al., 2012; Ireland and Henderson, 2014). Nonetheless, the complexity of accurately assessing CE through these measures necessitates innovative approaches that more precisely reflect students’ cognitive investment in their educational activities (Fredricks et al., 2004).

In educational research, traditional ML methods have extensively analyzed student data patterns but face limitations when addressing nuanced aspects like cognitive engagement. The problem is exacerbated by imbalanced datasets, leading to skewed insights and overlooking crucial engagement aspects, thus affecting the findings’ accuracy, reliability, and generalizability across diverse educational contexts (Lee and Kinzie, 2012; Fredricks et al., 2004). This issue with imbalanced datasets, characterized by unevenly distributed categories and small sample sizes, highlights the need for specialized techniques to improve model performance and accuracy, ensuring a comprehensive understanding of CE across educational contexts (Chawla, 2010; Fernández et al., 2018; Kulkarni et al., 2020; Japkowicz and Stephen, 2002; Bruce et al., 2020; LemaÃŽtre et al., 2017).

The advent of LLMs presents a promising solution to the issues posed by imbalanced datasets in educational research. Recent breakthroughs in LLMs, particularly with ICL, COT and AEFL prompting techniques, have demonstrated their potential to generate nuanced, context-aware responses beyond the capabilities of traditional ML methods (Brown et al., 2020; Wei et al., 2022b; Shahriar et al., 2023). For example, Savelka et al. (2023) showcased how GPT-3.5 & 4 could effectively classify student help requests in programming courses, illustrating the superior ability of LLMs to handle nuanced educational data. Zeng et al. (2023) delved into the cognitive and reasoning abilities of LLMs, highlighting the necessity for task-specific tuning to address complex reasoning challenges. Cui et al. (2023) introduced the Divide-Conquer-Reasoning (DCR) framework to enhance the consistency and reliability of LLM-generated texts, vital for creating educational content. These examples reveal the capacity of LLMs to offer more accurate classification and analysis of CE, surpassing traditional ML methods in dealing with the intricacies of educational datasets. Additionally, Lee et al. (2024) explored LLMs’ use with CoT prompting to improve automatic scoring systems in science education, further indicating LLMs’ potential to enhance the quality and reliability of educational content analysis.

By harnessing the intrinsic capacity of LLMs to interpret and utilize language within specific contexts, researchers can navigate the challenges posed by imbalanced datasets, facilitating a deeper understanding of student CE.

3 Methodology

3.1 Context and Participants

This study performs a secondary analysis on a dataset originally gathered to assess CE from student responses in a High School English Language Arts course’s AI curriculum. The StoryQ curriculum (Chao et al., 2022), spanned three weeks with daily 45-minute classes, incorporated Machine Learning Practices through open-ended questions in eight modules but our analysis only evaluated three: “Sentiment Analysis,” “Features and Models,” and “All Words.” The initial study’s diverse participant group of 28 students included 17 females, 7 males, and 4 non-specified gender individuals, spanning various grades and racial backgrounds. The racial composition was 43% Black/African American, 17% Hispanic/Latinx, 18% White/Caucasian, with others choosing not to disclose. Students’ CE was evaluated using a modified Interactive-Constructive-Active- Passive (ICAP) framework by Chi and Wylie (2014), focusing on Constructive, Active, and Passive levels. Their open-ended responses (N = 840) were analyzed using the CE coding scheme, see Table 1, yielding a Cohen’s kappa inter-rater reliability of 0.84.

Table 1: ICAP: Cognitive Engagement Label Coding Scheme
Score ICAP Level Description Indicator Example
2 Constructive New information is integrated with activated prior knowledge, and new knowledge is inferred. Deep reasoning, synthesis of new ideas, or forming hypotheses. “I think that the model learned a large positive weight for the feature because if you came to an establishment then that would indicate that you did like it because you chose to come in the first place."
1 Active Behaviors that cause-focused attention while manipulating. Apply, Analyze, or Manipulating. “I think this gained a large amount of weight because it is a commonly used word."
0 Passive Overt activities that are carried out mindlessly. Recalling or Restating. “Amazing, clean, selection, try, regular, seating"

3.2 Prompt Engineering Design

Our prompt development process, grounded in the ICL Prompt Engineering Design (see Figure 1), begins with drafting an initial few-shot ICL format prompt. This prompt, inputting student responses and outputting CE classifications, undergoes validation testing on a subset (n=27). If benchmarks are met, it progresses to full dataset testing; otherwise, we diagnose misclassifications, realigning LLM outputs with our coding standards through domain-specific CE knowledge integration. Adjustments may involve refining COT processes, FewSHOT learning, or embedding conceptual knowledge assertions. After subset retesting and validation, the optimized prompt is applied to the full dataset (n=135), with iterative refinement ensuring optimal performance. See Appendix B for additional LLM-specific prompt details.

Our engineering approach encompasses three components: General COT, FewShot with Reasoning Sequence, and assertions Prompting. General COT, embeds sequential instructions with “think time” to initiate the model’s reasoning on given tasks (Fulford and Ng, 2023). Our General COT prompt follows a seven-step sequence to guide the LLM’s task reasoning (see Figure 1). Initially, the model attentively reads the provided <<Question, Response>> (Step 1), laying the foundation for accurate comprehension and subsequent cognitive engagement analysis. Step 2 involves feeding the model CE domain-specific definitions for Passive, Active, and Constructive levels, requiring it to discern the appropriate engagement level based on the initial input. Progressing to Step 3, the model assesses the rationale behind the assigned cognitive engagement label, ensuring it reflects the response’s depth and nature. In Step 4, the LLM reevaluates the response to prevent misclassification and assesses if a different CE level is more aligned. Steps 5 and 6 prompt the model to consider ways to enhance the CE level, crucial in the validation and diagnostic phases, particularly when integrating assertions. The final step (Step 7) circles back to the initial input, where the LLM reexamines the cognitive engagement level to verify the accuracy and consistency of its prediction. This structured approach is key in sharpening the model’s evaluative and analytical capabilities.

FewShot with Reasoning, guided by gold standard examples (Wang et al., 2023; Shahriar et al., 2023), includes a four-element structure: <<Question, Response, Label, and Reasoning>>. This method enhances LLM’s task-specific learning, incorporating reasoning sequences in the examples. Finally adding assertion Prompts, is crucial for knowledge-building explanations, that are domain-specific insights defined from General COT’s outputs on misclassified predictions (Shahriar et al., 2023).

Refer to caption
Figure 1: ICL Prompt Engineering Design Process to optimize the accuracy of LLMs in classifying educational data with the use of ICL, COT and AEFL.

3.3 Experiment Design

To analyze traditional ML methods (SVM, RF, DT, and ADABoost), we divided our data into training (n=432) and testing sets (n=135), applying default hyperparameters from the Scikit-Learn package (Pedregosa et al., 2011). See Appendix A for hyperparameters. The dataset comprised two majority classes and one minority class (see Table 2. During data preprocessing, we executed text cleaning steps: removing non-alphanumeric/special characters (except periods), new lines, isolated "n" characters, excess spaces, double quotes, and backslashes; converting to lowercase; eliminating stop words; and correcting spelling errors. We transformed the tokenized text using TF-IDF vectorization for ML algorithm suitability. These traditional ML methods served as benchmarks for comparing with LLM prompt results.

Table 2: Dataset numbers for Training, Testing and subsets by cognitive level.
ICAP Level Training Testing Subset
C 202 62 10
A 203 66 10
P 27 7 7

In analyzing LLM, we employed GPT-4 through the Colab Python OpenAI API, setting hyperparameters to temperature = 0 and top p= 0.01 for optimal automatic scoring (Wang et al., 2023). The data preprocessing mirrored the traditional ML approach but without tokenization or vectorization. We maintained the integrity of student sentences, ensuring capitalized start and appropriate punctuation, mainly periods. The final prompt See Appendix B underwent testing with the same dataset (n=135) used in traditional ML.

In our final experiment, we adopted a subset-based iterative modification approach (n=27) as per the ICL Prompt Design Process 3.2. This involved a sensitivity analysis for precise influence measurement of assertions on LLM performance. Each iteration entailed scrutinizing misclassified data, focusing on informal language nuances in text inputs. This qualitative analysis was pivotal for understanding the impact on model accuracy and response. This systematic approach enriched our comprehension of LLM’s interaction with varied prompts and offered insights for enhancing LLM’s performance in processing and interpreting informal language, a significant challenge in educational datasets.

3.4 Analysis

In our multiclass dataset analysis, we utilize Precision, Recall, and F1 Score to evaluate the performance of LLMs with assertions versus traditional ML models. These metrics are integral for assessing model efficacy in a multiclass environment. Precision gauges the model’s accuracy in predicting each class, indicating the reliability of its positive predictions. Recall measures the model’s capacity to correctly identify all instances of each class, vital for ensuring comprehensive representation in a multiclass context. The F1 Score, as the harmonic mean of Precision and Recall, offers a balanced evaluation of the model’s overall performance, particularly important in our study to address potential class imbalance. Following Pennebaker et al. (2015), we emphasize both precision and recall to minimize false positives and negatives, crucial in multiclass datasets. Additionally, we assess the percentage change in F1 score performance to quantify the impact of assertions, using the following formula:

Percent Increase=(F1 score of LLMF1 score of traditional MLF1 score of traditional ML)×100%Percent IncreaseF1 score of LLMF1 score of traditional MLF1 score of traditional MLpercent100\text{Percent Increase}=\left(\frac{\text{F1 score of LLM}-\text{F1 score of % traditional ML}}{\text{F1 score of traditional ML}}\right)\times 100\%Percent Increase = ( divide start_ARG F1 score of LLM - F1 score of traditional ML end_ARG start_ARG F1 score of traditional ML end_ARG ) × 100 %

To further this analysis we examined F1 scores. To differentiate between models, we developed a custom metric, inspired by Cohen’s D (Cohen, 2013). However, unlike the traditional Cohen’s D, which uses standardized effect sizes (small at 0.2, medium at 0.5, large at 0.8) based on pooled standard deviation, our metric directly compares raw F1 score differences. This modification suits our data, where standard deviation calculations aren’t feasible due to single observations per model. We categorized differences in F1 scores as small (up to 10 points), medium (10 to 30 points), and large (over 30 points). We defined a function for calculating pairwise differences in scores misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , mjsubscript𝑚𝑗m_{j}italic_m start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT M represent any two models, and sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT are their respective scores. The function:

f(mi,mj)=sisj𝑓subscript𝑚𝑖subscript𝑚𝑗subscript𝑠𝑖subscript𝑠𝑗f(m_{i},m_{j})=s_{i}-s_{j}italic_f ( italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_m start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is defined as the difference between sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.

It computes the difference in performance scores between each pair of models. For each combination of models (misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , mjsubscript𝑚𝑗m_{j}italic_m start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT), the score of model mjsubscript𝑚𝑗m_{j}italic_m start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is subtracted from that of model misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. This function calculates the performance difference between each model pair. We then generate a matrix showcasing these differences, allowing for a thorough pairwise comparison of model performances.

To answer RQ 2 and evaluate the ways that the integration of assertions enhance the efficacy of models when addressing the challenges associated with imbalanced textual educational datasets we chose to test on a subset (N=27, P = 10, A = 10, C= 7) as is common in the research to “increase the depth of our analysis, reduce run-time, and decrease cost” (Rodriguez et al., 2023, p. 2). We chose a sensitivity analysis (Akinwande et al., 2023) to critically assess the impact or influence of the assertions. We did this qualitatively by adding two steps (Step 5 & 6 of General COT into the <<General COT>> and interpreting for the <<model outcome>> for recurring themes. Our examination extended to a comparative analysis of the experiments, employing class-wise analysis to measure each experiment against a baseline prompt that did not incorporate assertions.

Results and Discussion

3.5 RQ1: How do the results obtained from LLMs with Prompt Engineering compare to traditional Machine Learning algorithms in handling imbalanced educational data?

3.5.1 Performance Metrics

The summary results in Table 3 indicated a varied performance across classes. In the Passive class, the LLM significantly outperformed traditional models, showing a 14.9% increase over SVM, 6.25% over RF, 18.0% over DT, and a notable 23.2% increase over AdaBoost. Conversely, in the Active class, traditional models (SVM, RF, and DT) surpassed LLM by 11.1%, while AdaBoost and LLM performances were comparable. The most striking contrast was observed in the Constructive class, where traditional models (SVM, RF, DT, and AdaBoost) failed to effectively identify instances. In contrast, the LLM demonstrated a remarkable improvement with an F1 score of 32, showcasing its superior capability in recognizing elements of the minority class.

Table 3: Summary of Performance Metrics by Cognitive Engagement Level
Passive (62) Active (66) Constructive (7)
Model P R F1 P R F1 P R F1
SVM 75 73 74 68 77 72 0 0 0
RF 70 92 80 80 65 72 0 0 0
DT 71 74 72 71 73 72 0 0 0
ADABoost 64 74 69 65 62 64 0 0 0
LLM 83 87 85 78 54 64 21 71 32

These results suggest that while traditional machine learning models like SVM, RF, DT, and AdaBoost may perform comparably or better in majority classes, the LLM exhibits superior capability in dealing with minority class instances, particularly in complex classification tasks like the Constructive class in our dataset (see Figure 2). The versatility and adaptability of LLMs in handling imbalanced class distributions highlight their potential in enhancing classification tasks, especially in scenarios where minority classes hold substantial importance. These findings affirm our hypothesis that LLMs, especially when augmented with assertions, offer superior capabilities in classifying cognitive engagement levels from student texts, addressing the core of RQ1.

Refer to caption
Figure 2: Performance Metrics Summary by Cognitive Engagement Class showing results for each cognitive class.

3.5.2 Relative Performance

We see similar results in our custom metric inspired by Cohen’s D due to the unique nature of our data, where standard deviation calculations were not applicable, and produced interesting results (see Figure 5). The LLM with assertions for the passive class demonstrated noteworthy advantages over traditional models in various comparisons which resonate with the work of researchers (Shahriar et al., 2023), who demonstrated the enhanced effectiveness of LLMs in educational settings.

Refer to caption
Figure 3: Relative Performance Heatmap by Cognitive Engagement Class

Against SVM, the LLM had a significant edge, showing an 11-point advantage in the F1 score, categorized as a ’medium’ difference according to our threshold range. This indicates a considerably better performance of the LLM over SVM. When compared to DT, the LLM with assertions again showed a ’medium’ difference, outperforming DT by 12 points, underscoring its effectiveness in handling complex classification tasks. In a more striking contrast, the LLM outperformed ADA Boost by 16 points, falling into the ’medium’ range and highlighting a substantial performance gap where the LLM was far superior.

In the Active class, the LLM with assertions exhibited a mixed performance. It showed a close competition with SVM, trailing by just 2 points, which falls into the ’small’ difference category, implying a nearly equivalent performance between the two models. However, the LLM outperformed ADABoost by a margin of 8 points, a ’small’ difference that nonetheless underscores its relative effectiveness. This suggests that while LLMs offer substantial advantages in many areas, their performance can vary depending on the specific classification task, echoing the findings of Lee et al. (2024), who explored the use of LLMs in automatic scoring systems. Against RF and DT, the LLM had a slight disadvantage, trailing by 7 and 3 points respectively, suggesting that in certain scenarios, traditional models may have a slight edge over the LLM.

The Constructive class results were particularly striking. The LLM with assertions demonstrated a pronounced superiority in this category. It dramatically outperformed all traditional models (SVM, RF, DT, and ADABoost), each of which failed to identify instances within the Constructive class effectively, as indicated by their zero scores. The LLM achieved an F1 score of 32, which not only establishes a ’large’ difference according to our threshold but also highlights the LLMs exceptional capability in handling minority classes or complex classification tasks where traditional models fall short. It points to the LLMs’ superior ability to handle imbalanced datasets, a common challenge in educational data analysis, as illustrated by the work of researchers like Zeng et al. (2023), who evaluated the cognitive and reasoning abilities of LLMs.

3.6 RQ2: In what ways does the integration of assertions enhance the efficacy of models when addressing the challenges associated with imbalanced textual educational datasets?

Our analysis aimed to augment Active class metrics and foster a more equitable model across cognitive classes. Throughout the course of ten experiments, including the baseline, the implementation of assertions, particularly those delineated in <<General COT>> (Steps 5 & 6, see Appendix B), was pivotal in surfacing two primary themes post the initial experiment: textual ambiguity and contextual comprehension challenges.

For text ambiguity, the baseline experiment revealed the model’s propensity to misconstrue the depth of student engagement. Instances where contributions appeared analytical but merely constituted a superficial application of known concepts underscored this issue. By systematically applying the assertions detailed in the Methodology, we observed significant improvements in model performance, particularly within the Active and Constructive classes.

With regard to Unusual language, the model’s interpretation of speculative language (e.g., "I think," "possibly," "I believe") as indicative of reflective or analytical thought. Such expressions, particularly when conveying opinions that superficially suggested deeper analysis, were erroneously classified as constructive engagement.

Initially, our approach to integrating assertions was exploratory but became more systematic by the third experiment. For example, between experiments two through four, certain responses intended as "Constructive" were incorrectly classified as "Active":

Misclassified Example 1:

Question: Why do you think the model learned a large negative weight for this feature? Student response: “I think the model learned a negative weight for this feature because the model categorized the reviews as negative and categorized the surprisingly negative features as negative too since that was the whole sentiment of the review.”

Misclassified Example 2:

Question: Why do you think the model learned a large positive weight for this feature? Student response: “I feel like it had to do with the words and how much they were used whenever there was a positive review it would contain more than one good word to go along with it”

By incorporating the assertion <<Do label the statement as Constructive when they form a hypothesis about why the model learned a weight for a certain feature>>, these responses were accurately predicted as constructive, enhancing the Constructive class with precision and recall metrics—specifically, a recall increase of 6.33% and an F1-score improvement of 4.30%.

Moreover, addressing the misuse of speculative language through the assertion <<Avoid labeling a statement as Active or Constructive based solely on speculative language like ’I think’ or ’possibly’>> (see Figure 4) led to an increase in precision for the Active class by 15.96% and an F1-score increase by 6.08%. This adjustment resulted in the most balanced model performance observed, despite a slight decrease in recall for the Active class by 2.34%. Further attempts to amplify Active class metrics by refining definitions in <<General COT>> and enhancing reasoning in <<FewShot with reasoning>> revealed that, while assertions impacted model performance, their effect varied across classes and metrics.

Refer to caption
Figure 4: Left image does not include a targeted assertion while the one on the right does and improves the model output to correctly predict the students cognitive level of their text.

Notably, Experiment 6.1 (see Figure 5) emerged as particularly effective, showcasing the significance of tailored assertions in reducing misclassifications linked to textual ambiguity and unusual language use, thereby contributing to a more balanced and accurate model.

These findings highlight the nuanced role of assertions in enhancing model efficacy against the backdrop of imbalanced educational datasets. By meticulously integrating assertions to counter specific challenges—textual ambiguity and unusual language—the experiments demonstrated a discernible improvement in model precision and balance, particularly within the Active and Constructive classes. This strategic approach underscores the potential of assertions to mitigate inherent dataset imbalances, ultimately contributing to the development of more nuanced and effective educational models.

Refer to caption
Figure 5: Percentage Change in Metrics for Each Class Across Experiments

To further understand our model we compared accuracy of models to the baseline where Experiment 5 marked an 8.96% improvement but Experiment 6.1 stood out with the highest increase in accuracy, at 11.94% from the baseline. This improvement primarily addresses the challenges identified in RQ2, demonstrating the significant role of assertions in resolving errors related to context understanding and lexical ambiguities. The Active and Constructive classes, associated with focused attention and deeper reasoning, respectively, pose classification challenges due to their subtleties and contextual dependencies (Chi and Wylie, 2014). These classes often require inferring cognitive engagement levels from implicit cues and context, making their distinctions less explicit within student responses.

4 Limitations

While our study sheds light on the potential of LLMs and AEFL in addressing imbalanced datasets, it also highlights the need for caution in interpreting these findings without consideration of the broader methodological and technological landscape. Firstly, our reliance on specific LLM techniques and AEFL might not capture the full spectrum of potential solutions available within the rapidly evolving field of machine learning. The specific parameters and configurations employed in our LLM applications (Shahriar et al., 2023; Wei et al., 2022b; Zeng et al., 2023), while effective in this context, might not be universally applicable or optimal across different datasets or learning tasks. While our study provides valuable insights, it echoes the concerns raised by Radwan and Cataltepe (2017) and Yun et al. (2011) regarding the challenges of imbalanced datasets in education and the limitations of traditional ML approaches.

Furthermore, our study’s focus on a AI High School ELA course dataset (Zeng et al., 2023), while providing a rich source of cognitive engagement data, also presents a limitation in terms of diversity and representativeness. The linguistic and cognitive patterns inherent in this specific educational setting may not fully encapsulate the variety of cognitive engagement manifestations across different age groups, subjects, or educational methodologies. This limitation underscores the importance of extending research efforts to encompass a wider range of educational contexts, to ensure the findings’ applicability and robustness, as indicated by Fredricks et al. (2004) and Blumenfeld et al. (2006).

Additionally, while LLMs and AEFL present innovative approaches to overcoming the challenges of imbalanced datasets, they also introduce new complexities and considerations (Shahriar et al., 2023; Wei et al., 2022b). The computational demands and resource requirements of these technologies, coupled with the need for specialized expertise to implement and interpret their outputs, may pose barriers to widespread adoption and application in educational research and practice. The dynamic nature of LLM development also means that the models and techniques used today may rapidly evolve, necessitating continuous updates and adaptations to maintain their effectiveness and relevance.

Lastly, the ethical implications of applying LLMs in educational settings, particularly concerning data privacy, security, and the potential for bias in model training and outcomes, warrant careful consideration (Zeng et al., 2023). As LLMs become more integrated into educational research and practice, it is crucial to develop and adhere to ethical guidelines that prioritize the well-being and rights of students and educators.

These limitations highlight the need for ongoing research and dialogue within the educational and machine learning communities. By addressing these challenges and exploring the vast potential of LLMs and AEFL, we can advance our understanding of cognitive engagement and enhance educational outcomes in diverse and inclusive ways.

5 Conclusion and Future Studies

Our study makes significant contributions to the evolving landscape of cognitive engagement (CE) research, building upon the foundational work of seminal researchers like Craik and Lockhart (1972), Appleton et al. (2006), and Fredricks et al. (2004). We leveraged the capabilities of Large Language Models (LLMs) and Assertion Enhanced Few-Shot Learning (AEFL), marking a notable advancement in the domain of CE. This approach pays homage to the pioneering efforts that have shaped our understanding of CE while extending these concepts through the integration of cutting-edge LLM technologies.

By adeptly navigating the challenges posed by imbalanced datasets and accurately classifying cognitive engagement levels, this study underscores the potential of LLMs to refine our measurement and analysis of CE, setting a new benchmark for educational research. The integration of AEFL enhances contextual comprehension, improving model accuracy and balance, as highlighted by Shahriar et al. (2023). Experiment 6.1 further illustrates the value of tailored assertions in reducing misclassifications linked to textual ambiguities, offering novel insights into AEFL’s effectiveness in managing class-imbalanced data.

The promising outcomes of this research suggest that LLMs hold significant potential for future educational studies, particularly in complex data analysis tasks. These findings encourage the exploration of LLMs’ full capabilities in educational settings, advocating for a paradigm shift towards more sophisticated and nuanced approaches to data analysis. Moreover, the integration of AEFL points to a nuanced method of enhancing model performance, especially in the context of imbalanced textual educational datasets.

Given the multifaceted nature of cognitive engagement and the challenges associated with its measurement, there is a compelling need for further research. Future studies should aim to refine and expand the application of LLMs and AEFL across a broader spectrum of educational contexts. Additionally, exploring additional theoretical frameworks and models could yield deeper insights into cognitive engagement, thereby contributing to the enhancement of educational outcomes. This call for further research not only reflects the complex landscape of CE but also highlights the endless possibilities that LLM technologies and innovative methodologies like AEFL present for advancing our understanding and practices within the educational domain.

References

  • Karimah and Hasegawa [2022] Shofiyati Nur Karimah and Shinobu Hasegawa. Automatic engagement estimation in smart education/learning settings: a systematic review of engagement definitions, datasets, and methods. Smart Learning Environments, 9(1):31, 2022.
  • Radwan and Cataltepe [2017] Akram M Radwan and Zehra Cataltepe. Improving performance prediction on education data with noise and class imbalance. Intelligent Automation & Soft Computing, pages 1–8, 2017.
  • Yun et al. [2011] ZHAI Yun, Ma Nan, Ruan Da, and AN Bing. An effective over-sampling method for imbalanced data sets classification. Chinese Journal of Electronics, 20(3):489–494, 2011.
  • Fredricks et al. [2004] Jennifer A Fredricks, Phyllis C Blumenfeld, and Alison H Paris. School engagement: Potential of the concept, state of the evidence. Review of educational research, 74(1):59–109, 2004.
  • Blumenfeld et al. [2006] Phyllis C Blumenfeld, Toni M Kempler, and Joseph S Krajcik. Motivation and cognitive engagement in learning environments. na, 2006.
  • Corno and Mandinach [1983] Lyn Corno and Ellen B Mandinach. The role of cognitive engagement in classroom learning and motivation. Educational psychologist, 18(2):88–108, 1983.
  • Pintrich [2000] Paul R Pintrich. The role of goal orientation in self-regulated learning. In Handbook of self-regulation, pages 451–502. Elsevier, 2000.
  • Schunk et al. [2014] Dale H Schunk, Paul R Pintrich, and Judith L Meece. Motivation in education: Theory, research, and applications. (No Title), 2014.
  • Shahriar et al. [2023] Tasmia Shahriar, Noboru Matsuda, and Kelly Ramos. Assertion enhanced few-shot learning: Instructive technique for large language models to generate educational explanations. arXiv preprint arXiv:2312.03122, 2023.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Wei et al. [2022a] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
  • Zeng et al. [2023] Zhongshen Zeng, Pengguang Chen, Haiyun Jiang, and Jiaya Jia. Challenge llms to reason about reasoning: A benchmark to unveil cognitive depth in llms. arXiv preprint arXiv:2312.17080, 2023.
  • Wu [2021] Jiun-Yu Wu. Learning analytics on structured and unstructured heterogeneous data sources: Perspectives from procrastination, help-seeking, and machine-learning defined cognitive engagement. Computers & Education, 163:104066, 2021.
  • Wei et al. [2022b] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022b.
  • Lee et al. [2024] Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, and Xiaoming Zhai. Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, page 100213, 2024.
  • Craik and Lockhart [1972] Fergus IM Craik and Robert S Lockhart. Levels of processing: A framework for memory research. Journal of verbal learning and verbal behavior, 11(6):671–684, 1972.
  • Appleton et al. [2006] James J Appleton, Sandra L Christenson, Dongjin Kim, and Amy L Reschly. Measuring cognitive and psychological engagement: Validation of the student engagement instrument. Journal of school psychology, 44(5):427–445, 2006.
  • Chi and Wylie [2014] Michelene TH Chi and Ruth Wylie. The icap framework: Linking cognitive engagement to active learning outcomes. Educational psychologist, 49(4):219–243, 2014.
  • Anderson and Krathwohl [2001] Lorin W Anderson and David R Krathwohl. A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives: complete edition. Addison Wesley Longman, Inc., 2001.
  • Bloom et al. [1956] Benjamin S Bloom et al. Taxonomy of educational objectives: The classification of educational goals, by a committee of college and university examiners. Handbook 1: Cognitive domain, 1956.
  • Chase et al. [2019] Catherine C Chase, Jenna Marks, Laura J Malkiewich, and Helena Connolly. How teacher talk guidance during invention activities shapes students’ cognitive engagement and transfer. International Journal of STEM Education, 6:1–22, 2019.
  • Hsiao et al. [2022] Jo-Chi Hsiao, Ssu-Kuang Chen, Wei Chen, and Sunny SJ Lin. Developing a plugged-in class observation protocol in high-school blended stem classes: Student engagement, teacher behaviors and student-teacher interaction patterns. Computers & Education, 178:104403, 2022.
  • Wang et al. [2016] Xu Wang, Miaomiao Wen, and Carolyn P Rosé. Towards triggering higher-order thinking behaviors in moocs. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge, pages 398–407, 2016.
  • McCoach et al. [2013] D Betsy McCoach, Robert K Gable, John P Madura, D Betsy McCoach, Robert K Gable, and John P Madura. Defining, measuring, and scaling affective constructs. Instrument development in the affective domain: School and corporate applications, pages 33–90, 2013.
  • Greene et al. [2004] Barbara A Greene, Raymond B Miller, H Michael Crowson, Bryan L Duke, and Kristine L Akey. Predicting high school students’ cognitive engagement and achievement: Contributions of classroom perceptions and motivation. Contemporary educational psychology, 29(4):462–482, 2004.
  • Smiley and Anderson [2011] Whitney Smiley and Robin Anderson. Measuring students’ cognitive engagement on assessment tests: A confirmatory factor analysis of the short form of the cognitive engagement scale. Research & Practice in Assessment, 6:17–28, 2011.
  • Lee and Anderson [1993] Okhee Lee and Charles W Anderson. Task engagement and conceptual change in middle school science classrooms. American educational research journal, 30(3):585–610, 1993.
  • Helme and Clarke [2001] Sue Helme and David Clarke. Identifying cognitive engagement in the mathematics classroom. Mathematics Education Research Journal, 13(2):133–153, 2001.
  • Wigfield et al. [2008] Allan Wigfield, John T Guthrie, Kathleen C Perencevich, Ana Taboada, Susan Lutz Klauda, Angela McRae, and Pedro Barbosa. Role of reading engagement in mediating effects of reading comprehension instruction on reading outcomes. Psychology in the Schools, 45(5):432–445, 2008.
  • Xie et al. [2019] Kui Xie, Benjamin C Heddy, and Barbara A Greene. Affordances of using mobile technology to support experience-sampling method in examining college students’ engagement. Computers & Education, 128:183–198, 2019.
  • D’Mello et al. [2017] Sidney D’Mello, Ed Dieterle, and Angela Duckworth. Advanced, analytic, automated (aaa) measurement of engagement during learning. Educational psychologist, 52(2):104–123, 2017.
  • Bernacki et al. [2012] Matthew L Bernacki, James P Byrnes, and Jennifer G Cromley. The effects of achievement goals and self-regulated learning behaviors on reading comprehension in technology-enhanced learning environments. Contemporary Educational Psychology, 37(2):148–161, 2012.
  • Ireland and Henderson [2014] Molly E Ireland and Marlone D Henderson. Language style matching, engagement, and impasse in negotiations. Negotiation and conflict management research, 7(1):1–16, 2014.
  • Lee and Kinzie [2012] Youngju Lee and Mable B Kinzie. Teacher question and student response with regard to cognition and language use. Instructional science, 40:857–874, 2012.
  • Chawla [2010] Nitesh V Chawla. Data mining for imbalanced datasets: An overview. Data mining and knowledge discovery handbook, pages 875–886, 2010.
  • Fernández et al. [2018] Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C Prati, Bartosz Krawczyk, and Francisco Herrera. Learning from imbalanced data sets, volume 10. Springer, 2018.
  • Kulkarni et al. [2020] Ajay Kulkarni, Deri Chong, and Feras A Batarseh. Foundations of data imbalance and solutions for a data democracy. In Data democracy, pages 83–106. Elsevier, 2020.
  • Japkowicz and Stephen [2002] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429–449, 2002.
  • Bruce et al. [2020] Peter Bruce, Andrew Bruce, and Peter Gedeck. Practical statistics for data scientists: 50+ essential concepts using R and Python. O’Reilly Media, 2020.
  • LemaÃŽtre et al. [2017] Guillaume LemaÃŽtre, Fernando Nogueira, and Christos K Aridas. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of machine learning research, 18(17):1–5, 2017.
  • Savelka et al. [2023] Jaromir Savelka, Paul Denny, Mark Liffiton, and Brad Sheese. Efficient classification of student help requests in programming courses using large language models. arXiv preprint arXiv:2310.20105, 2023.
  • Cui et al. [2023] Wendi Cui, Jiaxin Zhang, Zhuohang Li, Damien Lopez, Kamalika Das, Bradley Malin, and Sricharan Kumar. A divide-conquer-reasoning approach to consistency evaluation and improvement in blackbox large language models. In Socially Responsible Language Modelling Research, 2023.
  • Chao et al. [2022] Jie Chao, Bill Finzer, Carolyn P Rosé, Shiyan Jiang, Michael Yoder, James Fiacco, Chas Murray, Cansu Tatar, and Kenia Wiedemann. Storyq: a web-based machine learning and text mining tool for k-12 students. In Proceedings of the 53rd ACM Technical Symposium on Computer Science Education V. 2, pages 1178–1178, 2022.
  • Fulford and Ng [2023] Andrew Ng Isa Fulford and A Ng. Chatgpt prompt engineering for developers. deeplearning. ai, 2023.
  • Wang et al. [2023] Xindi Wang, Yufei Wang, Can Xu, Xiubo Geng, Bowen Zhang, Chongyang Tao, Frank Rudzicz, Robert E Mercer, and Daxin Jiang. Investigating the learning behaviour of in-context learning: a comparison with supervised learning. arXiv preprint arXiv:2307.15411, 2023.
  • Pennebaker et al. [2015] James W Pennebaker, Ryan L Boyd, Kayla Jordan, and Kate Blackburn. The development and psychometric properties of liwc2015. 2015.
  • Cohen [2013] Jacob Cohen. Statistical power analysis for the behavioral sciences. Routledge, 2013.
  • Rodriguez et al. [2023] Alberto D Rodriguez, Katherine R Dearstyne, and Jane Cleland-Huang. Prompts matter: Insights and strategies for prompt engineering in automated software traceability. In 2023 IEEE 31st International Requirements Engineering Conference Workshops (REW), pages 455–464. IEEE, 2023.
  • Akinwande et al. [2023] Victor Akinwande, Yiding Jiang, Dylan Sam, and J Zico Kolter. Understanding prompt engineering may not require rethinking generalization. arXiv preprint arXiv:2310.03957, 2023.

6 Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. DRL-1949110. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Appendix A Appendix A

(

Table 4: Support Vector Machine (SVM), Random Forest (RF), Decision Trees (DT), ADABoost
ML Classifier Parameter Values
SVM Kernel ‘linear’, ‘sigmoid’, ‘poly’, ‘rbf’
C 0.1, 1, 10, 100, 1000
Gamma 1, 0.1, 0.01, 0.001, 0.0001
RF Number of Estimators 100
Criterion ‘Gini’
Max Depth none
Min Samples Leaf 1
Min Samples Split 2
Max Features ‘Auto’
Bootstrap True
DT Criterion ‘Gini’
Max Depth none
Min Samples Leaf 1
Min Samples Split 2
ADABoost Number of Estimators 50
Learning Rate 1.0
Algorithm ‘SAMME.R’
Base Estimator DecisionTreeClassifier(max_depth=1)

)

Appendix B Appendix B

A. Few Shot with reasoning + General COT (step by step) < Black, Green, Red, Blue>

B. Few Shot with reasoning + General COT (step by step) + Assertion (do and don’t)<Black, Green, Red, Blue, Orange>

-----------Prompt Starts Here------------------------------------

Your task is to identify the label of the statement delimited by triple backticks

Read the instructions below:

Step 1: Read the question and statement attentively to understand the context and the nature of the statement provided.

Step 2: Determine the initial cognitive engagement level of the statement using the definitions of the provided cognitive engagement labels - passive, active, and constructive.

1. Passive engagement: a statement is classified as "Passive" when the individual is only receiving information without interacting with it or adding anything to it. Passive engagement typically involves listening, reading, or receiving information without actively processing, manipulating, or reflecting upon it.

2. Active engagement: a statement is classified as "Active" when the response involves applying knowledge, analyzing information, or manipulating information but not generating new ideas or concepts.

3. Constructive engagement: a statement is classified as "Constructive" if it reflects reasoning, justification, or thoughtful consideration based on prior knowledge.

Step 3: Assess why it corresponds to the label you placed it in. Consider the extent to which it demonstrates recall of basic information (passive), application of learned knowledge to slightly different contexts (active), or a deeper level of analysis and synthesis of various concepts (constructive).

Step 4: Critically evaluate whether the statement could potentially belong to other labels. Examine the nuances of the statement to see if there are elements that might indicate a higher or lower level of cognitive engagement.

Step 5: To upgrade the statement to a higher engagement level, propose alterations that would make it align with the criteria for the "Active" category. This could involve adding details that show the application of learned knowledge to familiar yet slightly different contexts, or demonstrating problem-solving based on previous experiences.

Step 6: Explore how the statement can be restructured to meet the criteria of the "Constructive" engagement category. Consider adding elements that showcase deeper analysis, critical evaluation, or synthesis of multiple concepts to create a more nuanced and thoughtful response. Step 7: Finally, revisit the question and statement to evaluate the original cognitive engagement level making sure the prediction of cognitive engagement is accurate.

Based on your understanding of cognitive engagement and the labeled examples provided, determine the level of engagement for the unlabeled text provided.

    ‘‘‘

Question: Why do people write reviews?

Statement: People write reviews to express their feelings on a certain thing to condemn a praise a business, franchise, movie, or book.

Label: <Generate label>

Chain-of-thought: <Generate the chain-of-thought>

    ‘‘‘

Use the following examples delimited by triple quotes to understand which label the statement belongs to.

    ’’’

Question: What features do you think are indicators of positive reviews?

Statement: Words like love, excellent, greatest, amazing, enjoy, awesome, best.

Label: Passive

Reasoning: because it is a direct response that involves recalling or listing words without further analysis or interaction.

Question: What is one strategy you (as a human) can use to determine if a review is positive or negative?

Statement: I can tell if the person liked something or not. Label: Passive Reasoning: because it does not specify any strategies or reflection to distinguish between positive and negative sentiments.

Question: When you click on the row, the feature in this review will be highlighted in the feature graph (like the one you have seen in the Light On Light Off activity).Which feature do you think is it?

Statement: Because it’s associated with positivity.

Label: Passive

Reasoning: because it is simple information without reflection without delving into specific details, analysis, or reflection.

Question: What is one strategy you can use to determine what features someone has used to build a classification model?

Statement: I can use major words that people say in reviews first. Words like ’love,’ ’hate,’ ’bad,’ ’delicious,’ and more.

Label: Passive

Reasoning: because it only has recall words and delve into any analysis, reflection, or application.

Question: What is one strategy you can use to determine what features someone has used to build a classification model?

Statement: You can look at the data set and find words that really stand out to you or words that have a strong emotional connotation. You can also check the graphand the probability in terms of the features being used or how strongly they correlate with the result.

Label: Active

Reasoning: because it summarizes and organizes the information in a broad manner

Question: What is one strategy you (as a human) can use to determine if a review is positive or negative? Statement: One strategy that you can use to determine if a review is positive or negative is looking at diction, which is word choice, and how the words are being used.

Label: Active

Reasoning: because it details a method of analyzing the word choice in reviews, demonstrating the application of acquired knowledge to assess sentiments.

Question: When you click on the row, the feature in this review will be highlighted in the feature graph (like the one you have seen in the Light On Light Off activity). Which feature do you think is it?

Statement: Love is the most defining word in this review, if it were changed to ’hate’ it would have a completely different meaning

Label: Active

Reasoning: because it demonstrates the application and analysis of knowledge in a familiar context but does not generate new ideas.

Question: What is one strategy you can use to determine what features someone has used to build a classification model?

Statement: You can test multiple reviews with words that you think may be the features to determine if they are actually features.

Label: Active

Reasoning: because it demonstrates the application and manipulation of knowledge in a familiar context without generating new insights.

Question: Why do people write reviews?

Statement: To share their experience of a certain product or service so that they can either warn or recommend it to people. Sharing experiences is important so that way others who have not experienced it can know what they are getting in to.

Label: Constructive

Reasoning: because it provides an understanding and reasoning of the broader context and implications why sharing experience is important.

Question: If none of the 10 features are present in your review, try again with another review. If some of the 10 features are in your review, examine both your review and the feature graph. What do you think these features are?

Statement: I think these features are key words and numbers. Like the example used the word ’love’ which implies a positive reply. The numbers also because if you say 1 out of 10 that’s bad but if you say 10 out of 10 that’s good.

Label: Constructive Reasoning: because it provides interpretation and application to generate insights about the potential features in reviews.

Question: What is one strategy you (as a human) can use to determine if a review is positive or negative?

Statement: If I am having a conversation with somebody it will be easy to detect if the review is good or bad by word choice and their tone. If they wrote it, I will be able to see key words that point in either a positive or negative direction.

Label: Constructive

Reasoning: because it demonstrates a depth of reasoning and reflection of how to determine if a review is positive or negative.

Question: What kinds of reviews can make our world a better place?

Statement: Some reviews that can make the world a better place is if it’s a review about a foreign country then it can give some insight into what is happening within that country. Or even here in the United States, it can share what’s happening within their state and let the rest of the world know.

Label: Constructive

Reasoning: because it provides reflection, thoughtful consideration and reasoning about the societal value and potential impact of reviews in fostering global understanding and awareness.

    ’’’

A few facts about identifying the cognitive engagement level that you must assert while determining the level of engagement for the unlabeled text provided:

- Do label the statement as Constructive if they are forming an opinion about its usefulness, and providing reasoning for their opinion.

- Do label the statement as Constructive when the statement provides their interpretation and reasoning to the question.

- Do label the statement as Constructive when they form a hypothesis about why the model learned a weight for a certain feature.

- Do label the statement as Constructive when the statement shows active engagement with the information.

- Avoid labeling a statement as Active or Constructive based solely on speculative language like ’I think’ or ’possibly’.