“Data is the new oil.” Since The Economist’s famous story, this phrase has become a common refrain. Setting aside all the discussion and criticism about this metaphor, the importance of data as a resource in the digital era is widely acknowledged. Recognizing this also means acknowledging both sides of the coin. Data can be as useful as it is dangerous, and in the wrong hands, information and its attendant distortions can become a weapon.
In this project, we’ll develop a Machine Learning model to identify fake news. We’ll use Kaggle’s REAL and FAKE news dataset, which contains a small collection of news articles labeled as REAL or FAKE. This project’s goal is to provide a basic understanding of how to process real text data and use it to address important current problems.
Natural Language Processing (NLP) combines several technologies to enable machine “understanding” of human language on both objective and subjective levels.
I chose NLTK as our main NLP framework. Quoting the framework’s documentation site:
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
The following section provides a basic understanding of the NLP concepts that we’ll be using in this project.
This refers to the body of the text, which can be in one or more languages. A collection of such texts is called corpora.
Tokenization is the process of breaking text down into individual tokens. Depending on the use case, there are several types of tokens, such as the ones below (a short tokenization sketch follows the list):
- Word Tokenization: The text is broken into words.
- Sentence Tokenization: The text is broken into individual sentences.
- Subword Tokenization: Words in the text are broken down into smaller units, or subwords.
- Character Tokenization: The text is broken into individual characters.
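As a quick illustration, here is a minimal sketch of word and sentence tokenization with NLTK (it assumes the punkt tokenizer data, which we download later in this project, is available; the sample sentence is my own):
from nltk.tokenize import sent_tokenize, word_tokenize

sample = "Fake news spreads fast. Detecting it early matters."
print(sent_tokenize(sample))  # ['Fake news spreads fast.', 'Detecting it early matters.']
print(word_tokenize(sample))  # ['Fake', 'news', 'spreads', 'fast', '.', 'Detecting', 'it', 'early', 'matters', '.']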
Normalization is the process of transforming text into a single canonical form, reducing its randomness and bringing it closer to a predefined standard.
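As a tiny, self-contained sketch (not part of this project’s pipeline), normalization can be as simple as lowercasing and collapsing whitespace:
import re

raw = "  BREAKING   News!!  "
normalized = re.sub(r"\s+", " ", raw).strip().lower()
print(normalized)  # "breaking news!!"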
Stemming is the process of reducing words to their root form. For example:
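A minimal sketch with NLTK’s PorterStemmer (the example words are my own, chosen just to show the behavior):
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
for word in ("studies", "studying", "cries"):
    print(word, "->", porter.stem(word))
# studies -> studi, studying -> studi, cries -> cri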
Note that the words have been reduced to something that isn’t an actual word. This is because stemming is a heuristic process that achieves normalization by simply stripping the ends of words. It’s fast and generally effective, but it can introduce errors, such as reducing words more than necessary (over-stemming) or less than necessary (under-stemming).
Lemmatization reduces words to their base form, properly handling inflected words to ensure that the root word is an actual word. For example:
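A minimal sketch with NLTK’s WordNetLemmatizer (it requires the wordnet corpus, which this project doesn’t otherwise download; the example words are illustrative):
from nltk.stem import WordNetLemmatizer  # requires the "wordnet" corpus

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))        # "study" (the default POS is noun)
print(lemmatizer.lemmatize("studying", "v"))  # "study" when tagged as a verb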
It’s generally a more refined process than stemming, and it can be further improved if word contexts are provided. This can be achieved through a process called part-of-speech (POS) tagging.
According to the Cambridge Dictionary, a lexicon is a list of all the words used in a particular language or subject (a.k.a. a dictionary). In the context of NLP, a lexicon is a group of words associated with specific features in a certain fashion (e.g., part-of-speech segmentation, sentiment). It’s used as a source for interpreting human language by providing specific information, meanings, and grammatical properties of those words.
These are commonly used words in a language that, without context (i.e., by themselves), carry little or no useful information, such as punctuation, articles, and pronouns. Depending on the use case, they can be useless and removed from the analysis entirely, or so important that their presence is required. This project shows examples of both.
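A small sketch of stop word removal with NLTK (assuming the stopwords and punkt data that we download later in this project; the sentence is my own):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The news is not real")
print([t for t in tokens if t.lower() not in stop_words])
# ['news', 'real'] -- note that "not", which matters for sentiment, is also dropped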
Bag of Words (BoW) is one of the most well-known text feature extraction techniques. Its output is a table with counts of word occurrences in a set of documents.
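A minimal sketch using scikit-learn’s CountVectorizer, one common BoW implementation (the two toy documents are my own; get_feature_names_out assumes a recent scikit-learn version):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the news is fake", "the news is real news"]
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # ['fake' 'is' 'news' 'real' 'the']
print(counts.toarray())             # [[1 1 1 0 1]
                                    #  [0 1 2 1 1]]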
Term Frequency – Inverse Document Frequency (TF-IDF) is a text feature extraction technique similar to BoW, but instead of word occurrences, it produces a table of word importances. Word importance is a measure of a word’s frequency in a particular document with respect to the word’s inverse frequency across all documents. This can be mathematically defined as follows:
tf-idfₓ,ᵧ = tfₓ,ᵧ × log(N / dfₓ)
where:
- x is a particular word;
- y is a particular document;
- tfₓ,ᵧ is the frequency of x in y;
- dfₓ is the number of documents containing x;
- N is the total number of documents.
Basically, a word is important for a text document if it occurs frequently in that specific document but rarely in all the others.
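Running the same toy documents through scikit-learn’s TfidfVectorizer, the class we’ll use later, gives a feel for this (note that scikit-learn applies smoothing and L2 normalization on top of the raw formula above):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the news is fake", "the news is real news"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
# In the first document, "fake" (unique to it) gets a higher weight than
# "the", "is", and "news", which appear in both documents.
print(weights.toarray().round(2))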
This project will use some of the most well-known Python libraries for Scientific Computing, Data Visualization, Natural Language Processing, and Machine Learning:
import os
import re
import time
import joblib
import lightgbm as lgb
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import seaborn as sns
import xgboost as xgb
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from wordcloud import WordCloud
Let’s begin by loading the data using Pandas. For simplicity’s sake, the code below executes the following operations:
- Reads the data from the .csv file located in data/news.csv.
- Shuffles the DataFrame, a required step for the predictive model that we’ll build.
- Resets the index of the newly shuffled data.
- Drops the unwanted Unnamed: 0 column.
- Prints the DataFrame dimensions.
- Displays the first five rows of data.
df = (
    pd.read_csv("data/news.csv")
    .sample(frac=1)
    .reset_index(drop=True)
    .drop("Unnamed: 0", axis="columns")
)

print(f"Dataset shape: {df.shape}")
df.head()
The dataset itself contains 6,335 entries and three feature columns:
- title: The news article’s title.
- text: The body text of the news article.
- label: A binary feature indicating whether the corresponding news article is REAL or FAKE.
All text is in English. Unfortunately, the dataset doesn’t explicitly provide dates, but it seems that most of the news articles relate to the context of the US presidential elections.
Now, before continuing with the analysis, we’ll download and load the data that contains the definitions of punctuation, stop words, and the VADER lexicon that we’ll use. Thanks to the NLTK package, this process is very simple.
nltk.obtain("punkt", download_dir="information") # punctuation
nltk.obtain("stopwords", download_dir="information") # cease phrases
nltk.obtain("vader_lexicon", download_dir="information") # VADER lexicon# NLTK appear to not unzip the Vader lexicon mechanically =[
if not os.path.isfile("data/sentiment/vader_lexicon/vader_lexicon.txt"):
!unzip data/sentiment/vader_lexicon.zip -d data/sentiment
nltk.data.path.append("data")
Our analysis will be driven by the following base questions:
- How balanced is the data with respect to the news type (fake and real)?
- How does fake news data differ from real news data?
- News text size
- Most common words
- Overall sentiment
First of all, let’s check the volumetric proportion of fake and real news. A count plot of the type occurrences can answer that.
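A minimal sketch of that count plot (the styling choices are arbitrary):
plt.figure(figsize=(6, 4))
sns.countplot(x="label", data=df)
plt.title("Count of REAL and FAKE news articles")
plt.show()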
Real and fake news have almost the same proportions. Good. We don’t have to worry about data imbalance when we reach the Machine Learning model development.
Now let’s check the characteristics of the news data.
The news data itself only has two features: the title and the text body of each news item. For the analysis of text size and word occurrence, the presence of stop words can introduce too much noise in the measurements, so we’ll strip them off. Also, we’ll use the Porter Stemmer to stem the words and give them a standard form.
stop_words = set(stopwords.words("english"))  # explicitly load the stop words for the English language
porter = PorterStemmer()


def preprocess_text(text):
    # Apply a regex substitution to keep alphabetical tokens only, removing any other kind.
    text = re.sub("[^a-zA-Z]", " ", text).lower()
    # The default NLTK word tokenizer breaks sentences down by whitespace and punctuation.
    word_tokens = word_tokenize(text)
    filtered_sequence = [
        porter.stem(word) for word in word_tokens if not word.lower() in stop_words
    ]
    # Join the remaining tokens together again.
    return " ".join(filtered_sequence)
# We'll apply our analysis to both the title and the text body of the news.
df["title_clean"] = df["title"].apply(preprocess_text)
df["text_clean"] = df["text"].apply(preprocess_text)
df.head()
As we can see, we store the cleaned titles and text bodies in additional feature columns, so we don’t lose the originals. We’ll need the originals to apply the sentiment analysis a bit later.
The size analysis consists simply of checking the distributions of text lengths for each news type.
df["title_length"] = df["title_clean"].str.len()
df["text_length"] = df["text_clean"].str.len()plt.determine(figsize=(12, 7))
plt.subplot(2, 2, 1)
sns.boxplot(y="title_length", information=df, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Boxplot of title size by REAL or FAKE titles")
plt.subplot(2, 2, 3)
sns.boxplot(y="text_length", information=df, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Boxplot of textual content size by REAL or FAKE information")
plt.subplot(2, 2, 2)
sns.histplot(x="title_length", information=df, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Histogram of title size by REAL or FAKE titles")
plt.subplot(2, 2, 4)
sns.histplot(x="text_length", information=df, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Histogram of textual content size by REAL or FAKE information")
plt.tight_layout()
plt.present()
- The presence of outliers in the boxplots indicates that the dataset contains articles with unusually long titles and texts for both real and fake news.
- The boxplot shows that real news titles tend to be shorter than fake news titles. The histogram of title lengths further supports this conclusion, showing that the distribution for real news peaks at a lower character count and falls off more sharply than the distribution for fake news titles.
- In contrast, real news articles tend to have a wider range of text lengths, although the median text length of both news types is fairly similar. Again, the histogram supports what the boxplot shows: fake news articles have shorter texts, clustering towards the left side of the plot, in contrast to the longer right tail displayed by the distribution of real news articles.
Let’s plot the word clouds for the titles and texts of both types of news.
plt.figure(figsize=(12, 11))

for idx, news_type in enumerate(("FAKE", "REAL")):
    titles = " ".join(df[df["label"] == news_type]["title_clean"])
    news_wordcloud = WordCloud(
        width=800, height=800, background_color="white"
    ).generate(titles)
    plt.subplot(2, 2, idx + 1)
    plt.imshow(news_wordcloud, interpolation="bilinear")
    plt.title(f"Word Cloud for {news_type} News Titles")
    plt.axis("off")

for idx, news_type in enumerate(("FAKE", "REAL")):
    texts = " ".join(df[df["label"] == news_type]["text_clean"])
    news_wordcloud = WordCloud(
        width=800, height=800, background_color="white"
    ).generate(texts)
    plt.subplot(2, 2, idx + 3)
    plt.imshow(news_wordcloud, interpolation="bilinear")
    plt.title(f"Word Cloud for {news_type} News Texts")
    plt.axis("off")

plt.tight_layout()
plt.show()
- Certain names and words such as “Trump,” “Clinton,” and “Hillary” are prominently featured in both fake and real news, suggesting that political figures are common subjects in news articles regardless of their veracity.
- Fake titles and texts seem to feature sensational or emotionally charged words (e.g., lie, attack, war, power), as well as controversial topics (e.g., video, email, report), indicating a tendency to use more provocative language that presents speculation or opinion as fact.
- The real news, in contrast, while still featuring some of the same political names, contains a notable presence of more diverse and policy-oriented words (e.g., government, house, president, country, debate), which might reflect broader coverage of topics and a focus on governance and national issues.
In addition to the above, the choice of words can also reflect the sentiment and tone of the articles, as we’ll see next.
For the sentiment analysis, we’ll use the raw (unfiltered) text. Not only the choice of words, but also punctuation and even letter case might indicate the tone and sentiment of the expression.
For the sentiment classification itself, we’ll use NLTK’s SentimentIntensityAnalyzer, which in turn uses the VADER lexicon to provide a floating-point score representing the sentiment polarity of a given text. A score closer to -1.0 represents a strongly negative sentiment, while a score closer to 1.0 represents a strongly positive sentiment.
For our analysis, a float score won’t be necessary. Instead, based on this score, we’ll classify the sentiment polarity as Strongly Negative, Negative, Neutral, Positive, or Strongly Positive, in this increasing order.
sia = SentimentIntensityAnalyzer(
    lexicon_file="data/sentiment/vader_lexicon/vader_lexicon.txt"
)


def get_sentiment_score(text):
    polarity_score = sia.polarity_scores(text)["compound"]
    if polarity_score <= -0.6:
        return "Strongly Negative"
    elif polarity_score <= -0.2:
        return "Negative"
    elif polarity_score <= 0.2:
        return "Neutral"
    elif polarity_score <= 0.6:
        return "Positive"
    else:
        return "Strongly Positive"


df["title_sentiment"] = df["title"].apply(get_sentiment_score)
df["text_sentiment"] = df["text"].apply(get_sentiment_score)
df.head()
With this, we can count the occurrences of each sentiment polarity class in article titles and texts.
order = ["Strongly Negative", "Negative", "Neutral", "Positive", "Strongly Positive"]plt.determine(figsize=(12, 5))
plt.subplot(1, 2, 1)
ax1 = sns.countplot(x="title_sentiment", order=order, information=df, hue="label")
ax1.set_xticks(ax1.get_xticks())
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=-45, ha="left")
plt.title("VADER Title Sentiment Distribution by Information Kind")
plt.subplot(1, 2, 2)
ax2 = sns.countplot(x="text_sentiment", order=order, information=df, hue="label")
ax2.set_xticks(ax2.get_xticks())
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=-45, ha="left")
plt.title("VADER Textual content Sentiment Distribution by Information Kind")
plt.tight_layout()
plt.present();
- Real news titles have a higher count of Neutral sentiment compared to fake news titles, suggesting that real news tends to use a more measured and objective tone in its titles.
- Fake news titles show a broader distribution across the different sentiments, with notably larger counts in the Negative and Strongly Negative classes, indicating a tendency towards more emotionally charged language.
- Real news text bodies lean significantly towards Neutral, reinforcing the idea that the language used is more factual.
- Fake news has a much larger presence in the Strongly Negative class than real news, which aligns with the idea that fake news may use language meant to incite strong emotional reactions, urgency, or controversy.
- Both real and fake news have smaller counts for Positive and Strongly Positive sentiments in their titles and text bodies, suggesting that positive sentiment is less commonly used in news regardless of its veracity.
For the Machine Learning model, we’ll go with a simple approach. We’ll extract TF-IDF vectors from the news articles’ texts and use them as features. Throughout my tests, I tried other features, such as TF-IDF vectors from the titles and sentiment polarity from both texts and titles, but I didn’t get any substantial improvement in the model’s accuracy.
We’ll choose the model itself from a list of predefined models and parameters. For each one, we’ll run a 5-fold cross-validation using the train set. The models’ evaluation will be based on cross-validation average accuracy, accuracy on the test set, inference time, and model size.
20% of the data will be set aside as the test set, while the remaining data will be the train set. The data will be stratified: the proportions of REAL and FAKE news will be maintained in both sets.
# We'll use the already cleaned "text" column.
X = df["text_clean"]
y = df["label"]

# We'll binarize the target feature right away, before the splitting, as this step neither depends on nor is a dependency of the models' training.
y_encoder = LabelEncoder()
y = y_encoder.fit_transform(y)

# The "stratify" parameter tells which feature to use as the reference for maintaining the data proportions ("label" in our case).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
We’ll use Scikit-Learn’s TfidfVectorizer class to extract the TF-IDF vectors.
data_transformer = TfidfVectorizer(stop_words="english")
We’ll evaluate the following list of models:
- Logistic Regression
- Support Vector Machines
- Random Forest Classifier
- Multinomial Naive Bayes
- XGBoost Classifier
- LightGBM Classifier
Most of the parameter configurations used are the defaults.
models = {
    # max_iter is set to avoid early termination of the algorithm.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # The linear kernel helps avoid overfitting.
    "SVM": SVC(kernel="linear", probability=True),
    "Random Forest": RandomForestClassifier(),
    "MultinomialNB": MultinomialNB(),
    # use_label_encoder was deprecated, so we set it to False to avoid warnings.
    "XGBoost": xgb.XGBClassifier(use_label_encoder=False, eval_metric="logloss"),
    "LightGBM": lgb.LGBMClassifier(),
}

results_entries = []
models_dir = os.path.join("data", "models")
if not os.path.isdir(models_dir):
    os.makedirs(models_dir)

for name, model in models.items():
    print(f"Evaluating model {name}")
    pipeline = make_pipeline(data_transformer, model)
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")
    start_train_time = time.time()
    pipeline.fit(X_train, y_train)
    train_time = time.time() - start_train_time
    start_inference_time = time.time()
    y_pred = pipeline.predict(X_test)
    inference_time = time.time() - start_inference_time
    test_accuracy = accuracy_score(y_test, y_pred)
    model_file = f'{name.replace(" ", "_").lower()}_model.joblib'
    model_path = os.path.join(models_dir, model_file)
    joblib.dump(pipeline, model_path)
    model_size = os.path.getsize(model_path)
    results_entries.append(
        {
            "Model": name,
            "Avg Accuracy": scores.mean(),
            "Avg Std": scores.std(),
            "Test Accuracy": test_accuracy,
            "Train Time": train_time,
            "Inference Time": inference_time,
            "Model Size KB": model_size / 1024,
        }
    )

results = pd.DataFrame(results_entries)
results.sort_values(["Avg Accuracy", "Test Accuracy"], ascending=False)
- SVM is likely the best performer in terms of cross-validation average accuracy and test accuracy, but this result comes at the cost of considerably longer training and inference times and a larger model size compared to the other models.
- LightGBM and XGBoost offer a good compromise between high accuracy and efficiency in terms of both training/inference time and model size.
- In the contest between LightGBM and XGBoost, the former has slightly better performance and efficiency, in addition to more stable performance, as indicated by the lower standard deviation.
With the points above in mind, and considering the trade-off between accuracy and efficiency, we’ll proceed with the LGBMClassifier for the rest of our analysis.
Now that we have our chosen model, it’s time to train it using the train set and analyze its overall performance on the test set.
model = make_pipeline(data_transformer, lgb.LGBMClassifier(random_state=42))
model.fit(X_train, y_train)
We’ll create a dataframe containing the actual and predicted labels for each news article. We’ll also include the clean article texts.
inference_df = pd.DataFrame(
    data={
        "news_text": X_test,
        "label": y_encoder.inverse_transform(y_test),
        "prediction": y_encoder.inverse_transform(model.predict(X_test)),
    }
)
inference_df
With everything set, our analysis will be driven by the following questions:
- How much and in what ways do our model’s predictions fail?
- Are the text size characteristics of the articles maintained with respect to their corresponding news types?
- Are the common words of the articles maintained with respect to their corresponding news types?
- Are the text sentiment characteristics of the articles maintained with respect to their corresponding news types?
Let’s plot the model’s confusion matrix.
conf_matrix = confusion_matrix(inference_df["label"], inference_df["prediction"])
plt.figure(figsize=(7, 7))
ax = sns.heatmap(conf_matrix, annot=True, cbar=False, cmap="plasma", fmt="g")
ax.set_xlabel("prediction")
ax.set_xticklabels(["FAKE", "REAL"])
ax.set_ylabel("label")
ax.set_yticklabels(["FAKE", "REAL"])
plt.title(f"Confusion matrix for the {model.steps[-1][-1].__class__.__name__} model");
Let’s also check the model’s precision, recall, and F1 score.
print(classification_report(inference_df["label"], inference_df["prediction"]))
- These metrics indicate that the model performs well at detecting real news, with a strong balance between precision and recall.
- The high precision suggests that there are relatively few false alarms, and the high recall means that the model catches most of the real news.
- The F1 score, being close to the precision and recall values, shows that the model doesn’t significantly favor one measure over the other, which is generally desirable in a balanced classification task.
Now we’ll check the general characteristics of the errors, so we’ll separate them into a new dataframe.
errors = inference_df[inference_df["label"] != inference_df["prediction"]].copy()
We’ll also retrieve the sentiment polarity of each article.
errors["text_clean"] = errors["news_text"].apply(preprocess_text)
errors["text_sentiment"] = errors["news_text"].apply(get_sentiment_score)
errors.head()
Let’s plot the boxplot and histogram of the text length distribution for the prediction errors.
errors["text_length"] = errors["text_clean"].str.len()plt.determine(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.boxplot(y="text_length", information=errors, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Boxplot of textual content size by REAL or FAKE information (Errors)")
plt.subplot(1, 2, 2)
sns.histplot(x="text_length", information=errors, hue="label", hue_order=["REAL", "FAKE"])
plt.title("Histogram of textual content size by REAL or FAKE information (Errors)")
plt.tight_layout()
plt.present()
- Indeed, one of our first observations about the difference between REAL and FAKE news doesn’t hold here. In these samples, FAKE news articles tend to have longer texts than REAL news.
- Both distributions still contain outliers, with fake news having a more noticeable one.
Let’s plot the word clouds for the errors.
plt.figure(figsize=(13, 11))

for idx, news_type in enumerate(("FAKE", "REAL")):
    texts = " ".join(errors[errors["label"] == news_type]["text_clean"])
    news_wordcloud = WordCloud(
        width=800, height=800, background_color="white"
    ).generate(texts)
    plt.subplot(2, 2, idx + 3)
    plt.imshow(news_wordcloud, interpolation="bilinear")
    plt.title(f"Word Cloud for {news_type} News Texts (Errors)")
    plt.axis("off")

plt.tight_layout()
plt.show()
The word clouds for both fake and real news errors seem to share some common words with the overall news clouds. However, some words appear more or less prominently compared to the overall dataset.
Finally, let’s plot the sentiment counts for the errors.
order = ["Strongly Negative", "Negative", "Neutral", "Positive", "Strongly Positive"]plt.determine(figsize=(7, 5))
ax = sns.countplot(x="text_sentiment", order=order, information=errors, hue="label")
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), rotation=-45, ha="left")
plt.title("VADER Textual content Sentiment Distribution by Information Kind (Errors)")
plt.tight_layout()
plt.present();
- There is a significant number of errors in the “Strongly Positive” sentiment class, especially for real news. This suggests that the model struggles to correctly classify real news articles that carry strongly positive sentiment.
- The counts of errors across sentiment classes are relatively small compared to the overall distribution, which indicates that the model performs well in general. However, the errors that do occur fall disproportionately in the extreme sentiment classes.
In this project, we addressed the problem of detecting fake news.
- We first gave a brief review of the most common text data concepts and processing techniques.
- Then we analyzed the main features of the text data, such as text size, common words, and sentiment polarity. We also applied some text processing techniques to extract these features.
- We applied the TF-IDF vector extraction technique to create the feature matrix and trained a list of machine learning models on it to detect fake news.
- We also analyzed the prediction errors of the final model, comparing them with the analysis of the whole dataset.
Most of the concepts and figures presented here were inspired by Aysel Aydin’s posts about Natural Language Processing.
The code for this project and others that I’ve published can be found in this GitHub repository:
Other useful links: