Prepared by Agrim Jain - 311124, Ajeet Mathew - 311125, Mitanshu Tyagi - 311143 (BDDA - 2, FORE School of Management)
This model classifies each Reddit user's most recent post and past 10 posts as depression-related or non-depression-related. It combines both results to show how the user's depression state changes between the past posts and the most recent post. The posts are classified using the Tree Ensemble Learner and Tree Ensemble Predictor nodes.
The steps below are carried out separately for both the recent post and the past 10 posts of the users:
Data Collection: The data is first read with the CSV Reader, and the target column is converted from number to string. So that the model functions properly, 1000 rows are then selected at random using stratified sampling on the target column, which keeps the class proportions of the full data set.
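The sampling step can be sketched outside KNIME as follows. This is only an illustration of stratified sampling, not the workflow's actual implementation; the function name, the toy rows, and the "text" and "target" keys are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_key, n, seed=0):
    """Draw about n rows while keeping each class's share of the data."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    total = len(rows)
    sample = []
    for members in groups.values():
        # Proportional allocation, rounded to the nearest whole row.
        k = min(len(members), round(n * len(members) / total))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical toy data: 30 depression-related ("1") and 70 other ("0") posts.
# The target is stored as a string, mirroring the number-to-string conversion.
rows = [{"text": f"post {i}", "target": str(int(i < 30))} for i in range(100)]
subset = stratified_sample(rows, "target", n=10)
```

With these toy rows the sample keeps the 30/70 class split of the full data. Because each class is rounded separately, other class ratios can return a row more or fewer than n.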
Data Preprocessing: In the data preprocessing phase, we applied a series of transformations to make the text suitable for analysis and classification. The “Strings to Document” node converted the text into document format. The “Case Converter” converted the text to lowercase so that case is consistent. The “Number Filter” node removed numeric values, which are often irrelevant in text classification. The “Punctuation Eraser” removed punctuation marks to reduce noise. The “POS Tagger” assigned a part of speech to each word, which can be useful for linguistic analysis and feature engineering. The “Stop Word Filter” removed stop words, which occur frequently but carry little meaning. The “Porter Stemmer” reduced words to their root form, so that variants of the same word count as one feature. Finally, the “N Chars Filter” excluded very short words, which are rarely informative. Together, these steps leave the text clean and consistent for classification.
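A rough plain-Python approximation of part of this chain is sketched below. It covers only case conversion, number filtering, punctuation removal, stop word filtering, and a minimum-length filter; POS tagging and Porter stemming are left out because they need an NLP library. The stop word list and the minimum length of 3 are assumptions for illustration, not the settings used in the workflow.

```python
import re
import string

# Tiny illustrative stop word list (an assumption, not KNIME's built-in list).
STOP_WORDS = {"i", "a", "the", "and", "to", "of", "is", "in", "it", "my", "for"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text, min_chars=3):
    text = text.lower()                    # case conversion
    text = re.sub(r"\d+", " ", text)       # number filtering
    text = text.translate(PUNCT_TO_SPACE)  # punctuation removal
    tokens = text.split()
    # Stop word filtering and minimum-length filtering.
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_chars]

tokens = preprocess("I slept 3 hours, and I feel EMPTY for days!")
# tokens == ["slept", "hours", "feel", "empty", "days"]
```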
Calculation of TF-IDF:
In our project, the TF-IDF (term frequency-inverse document frequency) metric is used to weight the terms in each Reddit post. A term scores highly when it is frequent within a post but rare across the other posts, which makes TF-IDF a useful feature for text classification tasks such as identifying depression-related content. The TF-IDF values of the terms in our data set serve as the features for classifying posts as depression-related or not.
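As a minimal sketch of the idea, the snippet below computes one common TF-IDF variant: relative term frequency multiplied by the natural log of (number of documents / number of documents containing the term). The exact formula used by the KNIME nodes may differ, and the token lists are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical preprocessed posts.
docs = [
    ["feel", "empty", "tired"],
    ["feel", "happy", "today"],
    ["feel", "tired", "alone"],
]
weights = tf_idf(docs)
```

Here "feel" appears in every post, so its weight is 0, while "empty" appears in only one post and receives the highest weight in the first post.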
Training and Testing of Model: We first partitioned the data into training and testing sets so that the model is evaluated on posts it has not seen during training. We then used the “Tree Ensemble Learner” node to train an ensemble of decision trees, an approach similar to a random forest that suits text classification tasks. Once the model was trained, the “Tree Ensemble Predictor” node applied it to the test data to produce predictions. Lastly, the “Scorer” node compared the predictions with the actual class labels and reported evaluation metrics such as accuracy, precision, and recall, which show how well the model distinguishes depression-related from non-depression-related posts.
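The metrics reported at the scoring step can be illustrated with a short sketch. The labels below are invented, and "1" is assumed to be the depression-related class; this shows the standard definitions of accuracy, precision, and recall rather than the node's internals.

```python
def score(actual, predicted, positive="1"):
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical test-set labels and predictions.
actual = ["1", "1", "0", "0", "1", "0"]
predicted = ["1", "0", "0", "1", "1", "0"]
metrics = score(actual, predicted)
```

For these six posts, four predictions are correct, two of the three predicted positives are true positives, and two of the three actual positives are found, so all three metrics equal 2/3.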
Specific to the classification of the recent post: “Counter Generator” is used to generate a unique identifier for each user's post in addition to the row ID. “Joiner” is used to join the counter column with the predicted data, so that the counter column becomes the unique identifier of the user.
Specific to the classification of the past 10 posts of each user: “Joiner” is used to join the “person” column with the predicted data; the “person” column is the unique identifier of the user. “Category to Number” is used to convert the predicted document class from a string into a number. “Group by” is used to group the predictions by the “person” column, and the mean of each user's predictions over the past posts is taken.
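The conversion and grouping can be sketched in plain Python as follows. The rows are hypothetical, and the predicted classes are assumed to be the strings "0" and "1", so converting them to integers stands in for the category-to-number step.

```python
from collections import defaultdict

def mean_prediction_per_person(rows):
    """Average the numeric predictions of each person's posts."""
    totals = defaultdict(lambda: [0, 0])  # person -> [sum of predictions, post count]
    for row in rows:
        entry = totals[row["person"]]
        entry[0] += int(row["prediction"])  # string class to number
        entry[1] += 1
    return {person: total / count for person, (total, count) in totals.items()}

# Hypothetical predictions for the past posts of two users.
rows = [
    {"person": 1, "prediction": "1"},
    {"person": 1, "prediction": "0"},
    {"person": 1, "prediction": "1"},
    {"person": 1, "prediction": "1"},
    {"person": 2, "prediction": "0"},
    {"person": 2, "prediction": "0"},
]
past_means = mean_prediction_per_person(rows)
```

A mean close to 1 indicates that most of a user's past posts were classified as depression-related, and a mean close to 0 indicates that few were.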
Final Result: “Joiner” is used to combine the results of both models by matching the “Counter” column of the recent-post results with the “Person” column of the past-post results.
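A small sketch of this final combination is given below. It assumes an inner join, that counters start at 1, and that they line up with the person identifiers; the data and output field names are hypothetical.

```python
def join_results(recent_predictions, past_means):
    """Match each recent-post prediction with the same user's past-post mean."""
    joined = []
    # The counter is a running number over the recent posts, one per user.
    for counter, prediction in enumerate(recent_predictions, start=1):
        if counter in past_means:  # inner join: keep only matching users
            joined.append({
                "person": counter,
                "recent_prediction": int(prediction),
                "past_mean": past_means[counter],
            })
    return joined

# Hypothetical inputs: recent-post classes for three users, past means for two.
recent_predictions = ["1", "0", "1"]
past_means = {1: 0.8, 2: 0.1}
result = join_results(recent_predictions, past_means)
```

Each output row places the recent prediction beside the past mean, so a user whose recent post is classified as depression-related but whose past mean is low stands out as a possible change in state.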