Building a COVID-19 Vaccine Sentiment Analysis App Using Pre-trained Huggingface Models | by Alberta Cofie | Jun, 2023


In the wake of the COVID-19 pandemic, our world has witnessed unprecedented challenges and changes in numerous facets of life. Alongside the rapid spread of the virus, the internet and social media platforms have become essential sources of information and communication for people across the globe. These platforms have not only provided an avenue for disseminating vital updates but have also become a space for expressing emotions, opinions, and sentiments related to the pandemic.

The development and distribution of COVID-19 vaccines have been monumental milestones in our fight against the global pandemic. These vaccines have provided hope and a pathway to recovery, offering protection against the severe effects of the virus. As vaccination efforts continue to progress worldwide, it becomes increasingly important to gauge public sentiment and understand the prevailing attitudes towards COVID-19 vaccines.

Sentiment analysis, a powerful technique in natural language processing (NLP), allows us to extract insights from text data and uncover the emotions expressed within it. By using sentiment analysis, we can analyze vast amounts of online text, such as social media posts, news articles, and public opinions, to discern the prevailing attitudes, opinions, and emotions related to COVID-19 vaccines.

In this article, we will explore the process of building a COVID-19 vaccine sentiment analysis app using pre-trained models from Hugging Face. Hugging Face has established itself as a prominent provider of NLP technologies and frameworks, offering pre-trained models that have learned to recognize patterns and nuances in language with high accuracy. By harnessing these models, we can develop an app that automatically analyzes text and provides valuable insights into the sentiments expressed by individuals regarding COVID-19 vaccines.

So, join us on this journey as we leverage the power of pre-trained Hugging Face models to build a COVID-19 vaccine sentiment analysis app and unlock valuable insights that can shape the discourse surrounding COVID-19 vaccination efforts.

To access the datasets used, check out this link to my GitHub:

1. Install packages

To install the required packages for building a COVID-19 vaccine sentiment analysis app, you can use the following commands:

!pip install datasets
!pip install transformers
!pip install sentencepiece

These commands will install the required packages: datasets, transformers, and sentencepiece.

The datasets package provides a collection of popular NLP datasets, the transformers package offers pre-trained models and tools for NLP tasks, and sentencepiece is a library for tokenization.
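Before moving on, it can help to confirm that the installation succeeded. The following is a minimal sketch using only the Python standard library; the helper name missing_packages is hypothetical and not part of the original notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names taken from the pip commands above.
required = ["datasets", "transformers", "sentencepiece"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported as missing, rerunning the corresponding pip command (and restarting the notebook runtime if needed) is usually enough.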

2. Import Libraries & Load Dataset

The import statements below set up the required dependencies and libraries for subsequent steps in building the sentiment analysis model.

import os
import pandas as pd
url = ""
df = pd.read_csv(url)
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import numpy as np
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
# from scipy.special import softmax
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
# from transformers import HubManager
from sklearn.metrics import mean_squared_error

import os: This line imports the os module, which provides a way to interact with the operating system and perform operations such as reading or writing files.

import pandas as pd: This line imports the pandas library and assigns it the alias pd. pandas is a powerful data manipulation library that lets you work efficiently with structured data, such as CSV files.

url = "": This line assigns a URL to the variable url. It points to a CSV file containing the training data for the sentiment analysis model. The file is hosted on GitHub.

from datasets import load_dataset: This line imports the load_dataset function from the datasets library. This function is used to load and access datasets, including pre-defined datasets and custom datasets.

from sklearn.model_selection import train_test_split: This line imports the train_test_split function from the sklearn.model_selection module. This function is commonly used to split data into training and validation sets for model training and evaluation.

import numpy as np: This line imports the numpy library and assigns it the alias np. numpy is a fundamental package for scientific computing in Python and is commonly used for numerical operations on multi-dimensional arrays.

from transformers import AutoModelForSequenceClassification: This line imports the AutoModelForSequenceClassification class from the transformers library. This class loads a pre-trained model with a sequence classification head, suited to tasks such as sentiment analysis.

from transformers import TFAutoModelForSequenceClassification: This line imports the TFAutoModelForSequenceClassification class from the transformers library. This class is the TensorFlow counterpart of AutoModelForSequenceClassification and can be used for training and inference with TensorFlow.

from transformers import AutoTokenizer, AutoConfig: This line imports the AutoTokenizer and AutoConfig classes from the transformers library. These classes are used for automatically selecting and loading the appropriate tokenizer and model configuration based on the chosen pre-trained model.

from transformers import TrainingArguments, Trainer, DataCollatorWithPadding: This line imports the TrainingArguments, Trainer, and DataCollatorWithPadding classes from the transformers library. These classes provide the tools and configuration needed to train a model: the training arguments, the training loop, and data collation.

from sklearn.metrics import mean_squared_error: This line imports the mean_squared_error function from the sklearn.metrics module. This function is used to evaluate the performance of a regression model by calculating the mean squared error between predicted and true values.

Next, the line of code df = df[~df.isna().any(axis=1)] is used to remove rows from a DataFrame (df) that contain any missing values (NaN).
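To see what this filter does, here is a small sketch on a toy DataFrame. The column names safe_text and label match those used later in the article, but the rows themselves are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the three rows contain a missing value.
toy = pd.DataFrame({
    "safe_text": ["vaccines work", None, "not sure about this"],
    "label": [1.0, 0.0, np.nan],
})

# True for every row that has at least one missing value.
mask = toy.isna().any(axis=1)

# Keep only the rows without missing values.
clean = toy[~mask]

print(mask.tolist())
print(clean)
```

The same result can be obtained with toy.dropna(); the boolean-mask form simply makes the row selection explicit.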

3. The Data

Split a DataFrame (df) into training and evaluation sets.

train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

The line of code above splits the DataFrame df into training and evaluation sets, with 80% of the data assigned to the training set (train) and 20% assigned to the evaluation set (eval). The random seed ensures reproducibility, and the stratify parameter preserves the class distribution of the original dataset during the split.
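The effect of stratify can be checked on a toy example. This is a sketch under stated assumptions: the label values (-1, 0, 1) and the 20 rows are hypothetical, and the names train_df and eval_df are used instead of train and eval to avoid shadowing Python's built-in eval.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50% label 0, 25% label 1, 25% label -1.
toy = pd.DataFrame({
    "safe_text": [f"tweet {i}" for i in range(20)],
    "label": [0] * 10 + [1] * 5 + [-1] * 5,
})

train_df, eval_df = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# With stratification, both splits keep the original class proportions.
print(train_df["label"].value_counts(normalize=True).sort_index())
print(eval_df["label"].value_counts(normalize=True).sort_index())
```

Without stratify, a small evaluation set could easily end up with too few examples of a minority class, which would make the evaluation metrics less reliable.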

Let's get a snapshot of the first 5 rows of the data.

Train Dataset

Snapshot of the train dataset

Evaluation Dataset


Next, we use the code eval.label.unique() to retrieve the unique values in the 'label' column of the evaluation set (eval). This tells us which distinct classes or categories are present in the 'label' column.

We then create a DatasetDict object using the Dataset class from the datasets library. The code also removes a redundant column, '__index_level_0__', which Dataset.from_pandas adds when the DataFrame index is not a default range index (as is the case after the split).

from datasets import DatasetDict, Dataset
train_dataset = Dataset.from_pandas(train[['tweet_id', 'safe_text', 'label', 'agreement']])
eval_dataset = Dataset.from_pandas(eval[['tweet_id', 'safe_text', 'label', 'agreement']])

dataset = DatasetDict({'train': train_dataset, 'eval': eval_dataset})
dataset = dataset.remove_columns('__index_level_0__')

4. Text Preprocessing & Tokenizer Initialization

def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len
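The snippet above is cut off in the source. As a minimal sketch of how such a function might be completed, the version below assumes it follows the common pattern for tweet preprocessing: user mentions are replaced with '@user' and links with 'http'. The length check (len(t) > 1), the link handling, and the return value are assumptions, not confirmed by the original text.

```python
def preprocess(text):
    """Replace user mentions and links with generic placeholder tokens."""
    new_text = []
    for t in text.split(" "):
        # Assumption: a mention is "@" followed by at least one character.
        t = "@user" if t.startswith("@") and len(t) > 1 else t
        # Assumption: any token starting with "http" is treated as a link.
        t = "http" if t.startswith("http") else t
        new_text.append(t)
    return " ".join(new_text)


print(preprocess("@WHO says vaccines are safe https://example.org/info"))
```

Masking mentions and links this way keeps the tokenizer from spending capacity on usernames and URLs, which rarely carry sentiment on their own.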
