License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: January 09, 2024
Last Modified: March 02, 2024
Protocol Integer ID: 93148
Disclaimer
The protocol content here is for informational purposes only and does not constitute legal, medical, clinical, or safety advice, or otherwise; content added to protocols.io is not peer reviewed and may not have undergone a formal approval of any kind. Information presented in this protocol should not substitute for independent professional judgment, advice, diagnosis, or treatment. Any action you take or refrain from taking using or relying upon the information presented here is strictly at your own risk. You agree that neither the Company nor any of the authors, contributors, administrators, or anyone else associated with protocols.io, can be held responsible for your use of the information contained in or linked to this protocol or any of our Sites/Apps and Services.
Abstract
Our code has THREE main parts:
Data Processing: For the textual data in the database, data cleaning removed special characters, punctuation, links, and unnecessary words from the comment texts. This study converted all text to lowercase and removed stopwords. The NLTK library provides the English stopword list; it consists of common English words that carry little semantic or informational value and are typically filtered out in natural language processing to improve the efficiency and accuracy of text analysis, so removing them improves the quality of the retained text features. Because the data sources overlap slightly, this study also filtered out duplicate texts. In the end, 23,773 entries were retained, forming the database Processed_GPT_total.json.
Sentiment Analysis: A total of 23,773 entries were analyzed. The two sentiment analysis models, VADER and TextBlob, are assigned weights of 0.6 and 0.4, respectively, for sentiment classification. The sentiment analysis categorizes the emotional tone of the entries into three classes: positive, negative, and neutral. This study produced histograms and emoji word clouds of the different sentiment classes for analysis.
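Concretely, each post's combined score in the code below is computed as weighted_score = 0.6 × VADER compound score + 0.4 × TextBlob polarity.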
Based on the GPT-3.5 model, ChatGPT was launched by OpenAI on November 30, 2022, and has gained a growing user base. On March 15, 2023, OpenAI unveiled its new large-scale multimodal model, GPT-4, available for purchase. This study examines daily sentiment trends by comparing the number of positive and negative sentiment posts from January to August 2023 (N=23,773), aiming to ascertain whether version updates and evolution have influenced sentiment towards ChatGPT.
Topic Modeling: The topic modeling addresses the research question: what are the emerging topics related to ChatGPT? LDA is used to determine the most suitable number of topics for classification: topic models are trained over a range of topic numbers, the perplexity-versus-topic-number curve is plotted, and the results for each number of topics are then examined to identify the optimal number of topics based on the highest topic coherence.
Import Packages & Data Preparation
Import Packages & Data Preparation
import json
import re
import nltk
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image, ImageOps
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.models import CoherenceModel
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from gensim import corpora, models
from pprint import pprint
nltk.download('stopwords')
nltk.download('vader_lexicon')
The four files below are the collected datasets for the keywords GPT3.0, GPT3.5, and GPT4.0, together with the full sample files.
with open('G3-11730.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Total number of samples and target retention number
total_samples = len(data)
desired_samples = 10000

# Perform random sampling
if total_samples <= desired_samples:
    selected_samples = data
else:
    selected_samples = random.sample(data, desired_samples)

# Write the sampled results to a new JSON file
with open('G3.json', 'w', encoding='utf-8') as f:
    json.dump(selected_samples, f, ensure_ascii=False, indent=4)

print(f"Random sampling completed, retained {len(selected_samples)} samples.")
Data Merging (GPT3.0, GPT3.5, GPT4.0)
# Read data from each JSON file
file_names = ["G3.json", "G35.json", "G4.json"]
all_data = []
for file_name in file_names:
    with open(file_name, 'r', encoding='utf-8') as f:
        data = json.load(f)
    all_data.extend(data)

# Write the combined data to a new JSON file
output_file_name = "GPT_total.json"
with open(output_file_name, 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=4)

print(f"The file merge has been completed and the name of the merged file is {output_file_name}, containing {len(all_data)} samples.")
Data Cleaning
def has_missing_body(sample):
    return 'body' not in sample

def clean_text(text):
    # Data cleaning: remove special symbols and links
    text = re.sub(r'http\S+', '', text)  # Remove links
    text = re.sub(r'\W+', ' ', text)  # Remove special symbols
    return text

def preprocess_text(text):
    # Text preprocessing: convert to lowercase, remove punctuation and numbers
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

def tokenize_text(text):
    # Text tokenization: split the text by whitespace
    return text.split()

def remove_stopwords(words_list):
    # Stopwords handling: remove stopwords
    stop_words = set(stopwords.words('english'))
    return [word for word in words_list if word.lower() not in stop_words]

def remove_short_words(words_list):
    # Remove words with length less than or equal to 2
    return [word for word in words_list if len(word) > 2]

def process_samples(input_file, output_file):
    with open(input_file, 'r') as file:
        samples = json.load(file)

    processed_samples = []
    unique_samples = set()  # Used to record the unique processed sample content

    for sample in samples:
        if has_missing_body(sample):
            continue
        body_text = sample['body']
        # Data cleaning
        body_text = clean_text(body_text)
        # Text preprocessing
        body_text = preprocess_text(body_text)
        # Text tokenization
        words_list = tokenize_text(body_text)
        # Stopwords handling and removing short words
        words_list = remove_stopwords(words_list)
        words_list = remove_short_words(words_list)
        # Reassemble the text
        processed_text = ' '.join(words_list)
        # Check if the same sample content already exists
        if processed_text in unique_samples:
            continue
        # Add to the unique sample set
        unique_samples.add(processed_text)
        # Update the 'body' field of the sample
        sample['body'] = processed_text
        processed_samples.append(sample)

    with open(output_file, 'w') as file:
        json.dump(processed_samples, file, indent=4)

    print(f"Sample size after preprocessing: {len(processed_samples)}")
# Replace with your file paths
input_file_path = 'G35.json'
output_file_path = 'Processed_GPT_35.json'
process_samples(input_file_path, output_file_path)
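The call above processes only the GPT3.5 file; the merged database described in the abstract, Processed_GPT_total.json, is presumably produced by running the same routine on the merged file GPT_total.json. A minimal sketch of that assumed step:

# Assumed step (not shown in the original): clean the merged dataset as well
process_samples('GPT_total.json', 'Processed_GPT_total.json')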
Word Frequency
Word Frequency
Word frequency bar graph display for Processed_GPT_total.json
# Replace with the path to the processed file
processed_file_path = 'Processed_GPT_total.json'
with open(processed_file_path, 'r') as file:
    processed_samples = json.load(file)

# Extract the text content from samples and create a large text string
all_text = ' '.join([sample['body'] for sample in processed_samples])

# Remove specific words (e.g., "chatgpt" and "gpt")
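The word-frequency code is truncated at this point in the original; a minimal sketch of one way to finish the step, assuming that the keyword tokens "chatgpt" and "gpt" are simply dropped and that the 20 most common remaining words are plotted (the exact exclusion list and plot styling used in the study are not shown):

# Drop the keyword tokens so they do not dominate the counts (assumed exclusion list)
excluded_words = {'chatgpt', 'gpt'}
words = [word for word in all_text.split() if word not in excluded_words]

# Count word frequencies and keep the most common ones
word_counts = Counter(words)
top_words = word_counts.most_common(20)

# Plot the word-frequency bar graph
labels, counts = zip(*top_words)
plt.figure(figsize=(12, 6))
sns.barplot(x=list(labels), y=list(counts))
plt.title('Top 20 Word Frequencies in Processed_GPT_total.json')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()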
Changes in Positive and Negative Emotions over Time
# Load the preprocessed file
with open('Processed_GPT_total.json', 'r') as file:
    data = json.load(file)

# Extract date and text data
dates = []
texts = []
for item in data:
    if item['createdAt'][:4] >= '2023':  # Only keep posts from 2023 and later
        dates.append(item['createdAt'][:10])
        texts.append(item['body'])  # Choose the required field based on the actual situation

# Calculate sentiment scores using TextBlob and VADER
textblob_weight = 0.4
vader_weight = 0.6
textblob_scores = []
vader_scores = []
analyzer = SentimentIntensityAnalyzer()
for text in texts:
    blob = TextBlob(text)
    textblob_scores.append(blob.sentiment.polarity)
    scores = analyzer.polarity_scores(text)
    vader_scores.append(scores['compound'])

# Weight the results
weighted_scores = [(textblob_weight * tb + vader_weight * vd) for tb, vd in zip(textblob_scores, vader_scores)]

# Construct the data frame
df = pd.DataFrame({'Date': dates, 'Weighted Score': weighted_scores})

# Convert the Date column to a datetime type
df['Date'] = pd.to_datetime(df['Date'])

# Label each post as positive or negative (scores of exactly 0 are counted as negative here)
df['Sentiment'] = df['Weighted Score'].apply(lambda score: 'Positive' if score > 0 else 'Negative')

# Group by date and sentiment, then calculate daily sentiment counts
daily_sentiment_counts = df.groupby(['Date', 'Sentiment']).size().unstack(fill_value=0)

# Plot the daily positive and negative sentiment counts over time
plt.figure(figsize=(10, 6))
plt.plot(daily_sentiment_counts.index, daily_sentiment_counts['Positive'], label='Positive', color='green')
plt.plot(daily_sentiment_counts.index, daily_sentiment_counts['Negative'], label='Negative', color='red')
plt.title('Daily Positive and Negative Sentiment Counts')
plt.xlabel('Date')
plt.ylabel('Number of Posts')
plt.legend(title='Sentiment')
plt.ylim(-30, 450)
plt.grid()
plt.tight_layout()
plt.show()
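The abstract mentions emoji word clouds for the different sentiment classes, and WordCloud and PIL are imported above, but that code does not appear in this section. A minimal sketch of plain (unmasked) word clouds per sentiment class, built from the weighted scores computed above (the mask image and styling used in the study are not shown and are assumptions here):

# Build one word cloud per sentiment class from the weighted scores computed above
positive_text = ' '.join(text for text, score in zip(texts, weighted_scores) if score > 0)
negative_text = ' '.join(text for text, score in zip(texts, weighted_scores) if score < 0)

for label, class_text in [('Positive', positive_text), ('Negative', negative_text)]:
    # Generate and display an unmasked word cloud; a PIL mask image could be supplied via the mask argument
    wc = WordCloud(width=800, height=400, background_color='white').generate(class_text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'{label} Word Cloud')
    plt.show()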
Topic Modeling
Topic Modeling
LDA topic modeling and plotting perplexity-topic curves
Using the sampling routine set up in Part II, take 10,000 samples from the original dataset "Processed_GPT_total.json" with random.seed(61) as a subset and name it "subsets_GPT_total.json".
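The code that builds the corpus, trains the LDA models, and plots the perplexity and coherence curves is not included in the original text, although the snippet below references lda_models_2 and num_topics_7. A minimal sketch of that missing step, assuming subsets_GPT_total.json has the same structure as the processed files; the topic range, pass count, and variable names here are illustrative assumptions:

# Load the sampled subset and tokenize each cleaned 'body' field
with open('subsets_GPT_total.json', 'r', encoding='utf-8') as f:
    subset = json.load(f)
tokenized_texts = [sample['body'].split() for sample in subset]

# Build the dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(tokenized_texts)
corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

# Train one LDA model per candidate topic number and record perplexity and coherence
topic_range = range(1, 21)  # illustrative range of topic numbers
lda_models_2 = []
perplexities = []
coherences = []
for num_topics in topic_range:
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=61, passes=10)
    lda_models_2.append(lda)
    perplexities.append(lda.log_perplexity(corpus))
    cm = CoherenceModel(model=lda, texts=tokenized_texts, dictionary=dictionary, coherence='c_v')
    coherences.append(cm.get_coherence())

# Plot the perplexity-topic number curve
plt.figure(figsize=(8, 5))
plt.plot(list(topic_range), perplexities, marker='o')
plt.xlabel('Number of Topics')
plt.ylabel('Log Perplexity')
plt.title('Perplexity vs. Number of Topics')
plt.show()

# Choose the topic number with the highest coherence
num_topics_7 = list(topic_range)[int(np.argmax(coherences))]
print(f"Selected number of topics: {num_topics_7}")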
lda_model_2 = lda_models_2[num_topics_7 - 1]
top_words_2 = lda_model_2.show_topics(num_topics=num_topics_7, num_words=10, formatted=False)
print("\nTop 10 words for dataset 2 with {} topics:".format(num_topics_7))
for topic_id, topic_words in top_words_2:
    print("Topic {}: {}".format(topic_id, [word[0] for word in topic_words]))