Classification problem with Fake and Real news

Hatem Hassan

In this assignment, I build a text classification to define whether or not a certain article is fake news or real news. Using Natural Language Processing methodologies in Python and Classification Theory, I reached an accuracy of 0.945455 for classifying news as fake.

In [3]:
## This file has all imports and helper functions used throughout the notebook
%run python_helper.py
%matplotlib inline 
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/tommy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!

Clean & Save Data

Inspecting the data files, we noticed several issues for processing the traing dataset correctly. Using Regular Expression, we convert all commas between quotations to a pipe, so the CSV parsing works correctly with all values in their correct columns.

In [38]:
input_str = open("fake_or_real_news_training.csv")

# Remove all new lines
noNewLines = re.sub("\n", "", input_str.read())
  
# re-add new line at end of each row
noNewLines = re.sub("X1,X2", "X1,X2\n", noNewLines)
  

noNewLines = re.sub(",FAKE[,]+", ",FAKE,,\n", noNewLines)
# noNewLines = re.sub(",FAKE,(?!,)",",FAKE,,\n",noNewLines)
# noNewLines = re.sub(",FAKE,,(?!,)",",FAKE,,\n",noNewLines)
  
noNewLines = re.sub(",REAL[,]+", ",REAL,,\n", noNewLines)
# noNewLines = re.sub(",REAL,(?!,)",",REAL,,\n",noNewLines)
# noNewLines = re.sub(",REAL,,(?!,)",",REAL,,\n",noNewLines)
  

# Replace any commas between two quotes with |
lines = noNewLines.split('\n')

def removeComma(g):
      t = g.groups()
      t = [t[0], t[1].replace(',', ' |'), t[2], t[3]]
      return "".join(t)

betweenQuotes = lambda line: re.sub(r'(.*,")(.*)(",)(.*)', lambda x: removeComma(x), line)

secondCol = lambda line: re.sub(r'^([0-9]+,)(.*,.*)(,\")(.*)$', lambda x: removeComma(x), line, 1)


lines = [betweenQuotes(l) for l in lines]
lines = [secondCol(l) for l in lines]

finalString = '\n'.join(lines)

Save cleaned file

In [39]:
file = open('fake_or_real_news_training_CLEANED.csv', 'w')
file.write(finalString)
file.close()

Data Preparation

In [2]:
train = pd.read_csv("fake_or_real_news_training_CLEANED.csv")
test = pd.read_csv("fake_or_real_news_test.csv")
In [3]:
len(train)
Out[3]:
3997
In [4]:
len(test)
Out[4]:
2321
In [5]:
train.head()
Out[5]:
ID title text label X1 X2
0 8476 You Can Smell Hillary’s Fear Daniel Greenfield | a Shillman Journalism Fell... FAKE NaN NaN
1 10294 Watch The Exact Moment Paul Ryan Committed Pol... Google Pinterest Digg Linkedin Reddit Stumbleu... FAKE NaN NaN
2 3608 Kerry to go to Paris in gesture of sympathy U.S. Secretary of State John F. Kerry said Mon... REAL NaN NaN
3 10142 Bernie supporters on Twitter erupt in anger ag... — Kaydee King (@KaydeeKing) November 9 | 2016 ... FAKE NaN NaN
4 875 The Battle of New York: Why This Primary Matte... Cruz promised his supporters. ""We're beating... REAL NaN NaN
In [6]:
train = train.drop(['X1', 'X2'], axis=1)

We study if the dataset is unbalanced. From the plot we see this is not the case, as there is a similar amount of Fake and Real news articles. No further actions have to be taken.

In [45]:
from collections import Counter
ax = sns.countplot(train.label, order=[x for x, count in sorted(Counter(train.label).items(), key=lambda x: -x[1])])


for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/len(train)*100),
            ha="center") 
ax.set_title("Test dataset target")
show()
In [7]:
test.head()
Out[7]:
ID title text
0 10498 September New Homes Sales Rise——-Back To 1992 ... September New Homes Sales Rise Back To 1992 Le...
1 2439 Why The Obamacare Doomsday Cult Can't Admit It... But when Congress debated and passed the Patie...
2 864 Sanders, Cruz resist pressure after NY losses,... The Bernie Sanders and Ted Cruz campaigns vowe...
3 4128 Surviving escaped prisoner likely fatigued and... Police searching for the second of two escaped...
4 662 Clinton and Sanders neck and neck in Californi... No matter who wins California's 475 delegates ...

In order to not do double work by doing operations on our train and testset and to analyze general distributions of our data, we stack train and test in df.

In [47]:
test['label'] = None  # empty label for test

df = pd.concat([train, test])
In [48]:
len(df)
Out[48]:
6318
In [49]:
df.tail()
Out[49]:
ID title text label
2316 4490 State Department says it can't find emails fro... The State Department told the Republican Natio... None
2317 8062 The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... None
2318 8622 Anti-Trump Protesters Are Tools of the Oligarc... Anti-Trump Protesters Are Tools of the Oligar... None
2319 4021 In Ethiopia, Obama seeks progress on peace, se... ADDIS ABABA, Ethiopia —President Obama convene... None
2320 4330 Jeb Bush Is Suddenly Attacking Trump. Here's W... Jeb Bush Is Suddenly Attacking Trump. Here's W... None

Data Preprocessing

In this part we will be cleaning the articles with the help of different NLP techniques, of which we will first explain the concept and its importance.

In order to take into account the title in our accuracy prediction, we created an extra column that combines text and title. We will not do seperate predictions on the title, since these might classify as e.g. Fake news, weather the actual text with more explanation tells a Real story.

In [50]:
df['title_and_text'] = df['title'] +' '+ df['text']
df.tail()
Out[50]:
ID title text label title_and_text
2316 4490 State Department says it can't find emails fro... The State Department told the Republican Natio... None State Department says it can't find emails fro...
2317 8062 The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... None The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...
2318 8622 Anti-Trump Protesters Are Tools of the Oligarc... Anti-Trump Protesters Are Tools of the Oligar... None Anti-Trump Protesters Are Tools of the Oligarc...
2319 4021 In Ethiopia, Obama seeks progress on peace, se... ADDIS ABABA, Ethiopia —President Obama convene... None In Ethiopia, Obama seeks progress on peace, se...
2320 4330 Jeb Bush Is Suddenly Attacking Trump. Here's W... Jeb Bush Is Suddenly Attacking Trump. Here's W... None Jeb Bush Is Suddenly Attacking Trump. Here's W...

preprocess() can be found in python_helper.py Here you can read the explanations of the preprocess steps we took

  1. lowercase the text

This preprocessing step is done so words van later be cross checked with the stopword and pos_tag dictionaries. For future analysis purposes, it could have been benefitial to analyze text with a lot of words in capital letters, by adding a flag variable.

  1. remove the words counting just one letter

Idem step one.

  1. remove the words that contain numbers

Idem step one.

  1. tokenize the text and remove punctuation

We performed tokenization with the base python .string function, to split sentences into words (tokens).

  1. remove all stop words

Relevant analysis of the text depends on the most recurring words. Stopwords including words as "the", "as" and "and" appear a lot in a text, but do not give relevant explanation. For this reason they are removed.

  1. remove tokens that are empty

After tokenization, we have to make sure all tokens taken into account contribute to the label prediction.

  1. pos tag the text

We use the pos_tag function included in the ntlk library. This classifies our tokenized words as a noun, verb, adjective or adverb and adds to the understaning of the articles.

  1. lemmatize the text

In order to normalize the text, we apply lemmatization. In this way, words with the same root are processed equally e.g. when took or taken are read in the text, they are lemmatized to take, infinitive of the two verbs.

In [51]:
df['preprocessed_text'] = df['title_and_text'].apply(lambda x: preprocess(x))
In [52]:
## Save preprocessed df
df.to_csv("fake_or_real_news_train_PREPROCESSED.csv", index=False)
In [8]:
df = pd.read_csv("fake_or_real_news_train_PREPROCESSED.csv")
df = df.astype(object).replace(np.nan, 'None')
In [9]:
df.tail()
Out[9]:
ID title text label title_and_text preprocessed_text
6313 4490 State Department says it can't find emails fro... The State Department told the Republican Natio... None State Department says it can't find emails fro... state department say can't find emails clinton...
6314 8062 The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... None The ‘P’ in PBS Should Stand for ‘Plutocratic’ ... ‘p’ pb stand ‘plutocratic’ ‘pentagon’ ‘p’ pb s...
6315 8622 Anti-Trump Protesters Are Tools of the Oligarc... Anti-Trump Protesters Are Tools of the Oligar... None Anti-Trump Protesters Are Tools of the Oligarc... anti-trump protester tool oligarchy informatio...
6316 4021 In Ethiopia, Obama seeks progress on peace, se... ADDIS ABABA, Ethiopia —President Obama convene... None In Ethiopia, Obama seeks progress on peace, se... ethiopia obama seek progress peace security ea...
6317 4330 Jeb Bush Is Suddenly Attacking Trump. Here's W... Jeb Bush Is Suddenly Attacking Trump. Here's W... None Jeb Bush Is Suddenly Attacking Trump. Here's W... jeb bush suddenly attack trump here's matter j...

Split Train and Test again after pre-processing is done

In [10]:
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df)
Train dataset (Full)
(3997, 7)
Train dataset cols
['ID', 'title', 'text', 'label', 'title_and_text', 'preprocessed_text', 'encoded_label']

Train CV dataset (subset)
(2677, 7)
Train Holdout dataset (subset)
(1320, 7)

Test dataset
(2321, 6)
Test dataset cols
['ID', 'title', 'text', 'label', 'title_and_text', 'preprocessed_text']
In [11]:
encoder
Out[11]:
LabelEncoder()

Baseline Modelling

First, we create a dataframe called models to keep track of different models and their scores.

In [56]:
models = pd.DataFrame(columns=['model_name', 'model_object', 'score'])

Vectorizing dataset

For any text to be fed to a model, the text has to be transformed into numerical values. This process is called vectorizing and will be redone everytime a new feature is added.

In [57]:
count_vect = CountVectorizer(analyzer = "word")

count_vectorizer = count_vect.fit(df.preprocessed_text)

train_cv_vector = count_vectorizer.transform(train_cv.preprocessed_text)
train_holdout_vector = count_vectorizer.transform(train_holdout.preprocessed_text)
test_vector = count_vectorizer.transform(test.preprocessed_text)
In [58]:
count_vect.get_feature_names()[:10]
Out[58]:
['___',
 '_blank',
 '_derosa',
 '_rt_op_edge',
 '_strauss',
 'aa',
 'aaa',
 'aaahhh',
 'aab',
 'aachen']

Baseline Model 1: SVC

We create a baseline classification model with a support vector machine, a good model to handle complex classifications.

In [59]:
SVC_classifier = runModel(encoder,
               train_cv_vector,
               train_cv_label,
               train_holdout_vector,
               train_holdout.label,
               "svc",
               "Baseline Model 1: SVC")
models.loc[len(models)] = SVC
Baseline Model 1: SVC
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'C': [1, 10, 50, 100], 'kernel': ['linear']}, {'C': [10, 100, 500, 1000], 'gamma': [0.0001], 'kernel': ['rbf']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.909 (+/-0.022) for params: {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.902 (+/-0.030) for params: {'C': 500, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.893 (+/-0.030) for params: {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.892 (+/-0.036) for params: {'C': 1, 'kernel': 'linear'}
Accuracy: 0.892 (+/-0.036) for params: {'C': 10, 'kernel': 'linear'}
Accuracy: 0.892 (+/-0.036) for params: {'C': 50, 'kernel': 'linear'}
Accuracy: 0.892 (+/-0.036) for params: {'C': 100, 'kernel': 'linear'}
Accuracy: 0.886 (+/-0.011) for params: {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}


Best Estimator Params
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Predictions:
['REAL' 'REAL' 'FAKE' ... 'REAL' 'FAKE' 'REAL']

Accuracy:
0.9136363636363637

Baseline Model 2: Naïve Bayes

from IPython.display import Image Image("/Users/Gerald/Personal Drive/IE MBD/Term III/Natural Language Processing/Ass 1/NLP Fake News Predection | Gerald | Hatem/real_vs_fake.png")

With this hand drawn example (text: rude hell worth), we explain why the Naïve Bayes model is helpful for our classification. The labels Real and Fake text are hidden, but every word, based on our training data, has a certain probability to belong to one of the two categories. The final score is calculated, multiplying all probabilities of the words (0.006 for real, 0.288 for fake). The algo thus does not take into account the order of the words in the multiplication. rude hell worth will be classified as fake.

In [61]:
NB = runModel(encoder,
              train_cv_vector,
              train_cv_label,
              train_holdout_vector,
              train_holdout.label,
              "nb",
              "Baseline Model 2: Naiive Bayes")
models.loc[len(models)] = NB
Baseline Model 2: Naiive Bayes
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid='warn', n_jobs=None, param_grid={},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.887 (+/-0.021) for params: {}


Best Estimator Params
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Predictions:
['REAL' 'REAL' 'REAL' ... 'REAL' 'REAL' 'REAL']

Accuracy:
0.8946969696969697

Baseline Model 3: MaxEnt Classifier

In [ ]:
 
In [62]:
maxEnt = runModel(encoder,
              train_cv_vector,
              train_cv_label,
              train_holdout_vector,
              train_holdout.label,
              "maxEnt",
              "Baseline Model 3: MaxEnt Classifier")
models.loc[len(models)] = maxEnt
Baseline Model 3: MaxEnt Classifier
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.932 (+/-0.013) for params: {'C': 0.1, 'penalty': 'l2'}
Accuracy: 0.926 (+/-0.014) for params: {'C': 0.01, 'penalty': 'l2'}
Accuracy: 0.924 (+/-0.018) for params: {'C': 1, 'penalty': 'l2'}
Accuracy: 0.917 (+/-0.023) for params: {'C': 10, 'penalty': 'l2'}
Accuracy: 0.915 (+/-0.027) for params: {'C': 1, 'penalty': 'l1'}
Accuracy: 0.911 (+/-0.016) for params: {'C': 10, 'penalty': 'l1'}
Accuracy: 0.909 (+/-0.017) for params: {'C': 0.001, 'penalty': 'l2'}
Accuracy: 0.908 (+/-0.028) for params: {'C': 100, 'penalty': 'l1'}
Accuracy: 0.903 (+/-0.009) for params: {'C': 0.1, 'penalty': 'l1'}
Accuracy: 0.898 (+/-0.034) for params: {'C': 100, 'penalty': 'l2'}
Accuracy: 0.891 (+/-0.018) for params: {'C': 1000, 'penalty': 'l1'}
Accuracy: 0.882 (+/-0.023) for params: {'C': 1000, 'penalty': 'l2'}
Accuracy: 0.843 (+/-0.028) for params: {'C': 0.01, 'penalty': 'l1'}
Accuracy: 0.610 (+/-0.019) for params: {'C': 0.001, 'penalty': 'l1'}


Best Estimator Params
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Predictions:
['REAL' 'REAL' 'FAKE' ... 'REAL' 'FAKE' 'REAL']

Accuracy:
0.9234848484848485

Baseline Models Summary

In [63]:
models
Out[63]:
model_name model_object score
0 <class 'sklearn.svm.classes.SVC'> <class 'sklearn.svm.classes.SVC'> <class 'sklearn.svm.classes.SVC'>
1 Baseline Model 2: Naiive Bayes GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.894697
2 Baseline Model 3: MaxEnt Classifier GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.923485

Feature Engineeering

  • Explicit POS tagging
  • TF-IDF weighting
  • Bigram Count Vectorizer

==> Select Final Model and predict on test

1. POS Tagging

Adding a prefix to each word with its type (Noun, Verb, Adjective,...). e.g: I went to school => PRP-I VBD-went TO-to NN-school

Also, after lemmatization it will be 'VB-go NN-school', which indicates the semantics and distingueshes the purpose of the sentence.

This will help the classifier differentiate between different types of sentences.

In [64]:
df['pos_tagged_text'] = df['preprocessed_text'].apply(lambda x: pos_tag_words(x))
In [65]:
df.head()
Out[65]:
ID title text label title_and_text preprocessed_text pos_tagged_text
0 8476 You Can Smell Hillary’s Fear Daniel Greenfield | a Shillman Journalism Fell... FAKE You Can Smell Hillary’s Fear Daniel Greenfield... smell hillary’s fear daniel greenfield shillma... NN-smell JJ-hillary NNP-’ NN-s NN-fear JJ-dani...
1 10294 Watch The Exact Moment Paul Ryan Committed Pol... Google Pinterest Digg Linkedin Reddit Stumbleu... FAKE Watch The Exact Moment Paul Ryan Committed Pol... watch exact moment paul ryan commit political ... NN-watch JJ-exact NN-moment NN-paul JJ-ryan NN...
2 3608 Kerry to go to Paris in gesture of sympathy U.S. Secretary of State John F. Kerry said Mon... REAL Kerry to go to Paris in gesture of sympathy U.... kerry go paris gesture sympathy u.s secretary ... NN-kerry VBP-go JJ-paris NN-gesture JJ-sympath...
3 10142 Bernie supporters on Twitter erupt in anger ag... — Kaydee King (@KaydeeKing) November 9 | 2016 ... FAKE Bernie supporters on Twitter erupt in anger ag... bernie supporter twitter erupt anger dnc try w... NN-bernie NN-supporter NN-twitter JJ-erupt NN-...
4 875 The Battle of New York: Why This Primary Matte... Cruz promised his supporters. ""We're beating... REAL The Battle of New York: Why This Primary Matte... battle new york primary matter primary day new... NN-battle JJ-new NN-york JJ-primary NN-matter ...

Rerun Models on pos-tagged text (FE1)

In [66]:
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)

count_vect = CountVectorizer(analyzer = "word")

count_vectorizer = count_vect.fit(df.preprocessed_text)

train_cv_vector = count_vectorizer.transform(train_cv.pos_tagged_text)
train_holdout_vector = count_vectorizer.transform(train_holdout.pos_tagged_text)
test_vector = count_vectorizer.transform(test.pos_tagged_text)

a. SVC with FE1

In [67]:
SVC_pos_tag = runModel(encoder,
               train_cv_vector,
               train_cv_label,
               train_holdout_vector,
               train_holdout.label,
               "svc",
               "SVC on pos-tagged text")
models.loc[len(models)] = SVC_pos_tag
SVC on pos-tagged text
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'C': [1, 10, 50, 100], 'kernel': ['linear']}, {'C': [10, 100, 500, 1000], 'gamma': [0.0001], 'kernel': ['rbf']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.921 (+/-0.026) for params: {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.914 (+/-0.020) for params: {'C': 500, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.912 (+/-0.028) for params: {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.912 (+/-0.022) for params: {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.904 (+/-0.018) for params: {'C': 1, 'kernel': 'linear'}
Accuracy: 0.904 (+/-0.018) for params: {'C': 10, 'kernel': 'linear'}
Accuracy: 0.904 (+/-0.018) for params: {'C': 50, 'kernel': 'linear'}
Accuracy: 0.904 (+/-0.018) for params: {'C': 100, 'kernel': 'linear'}


Best Estimator Params
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Predictions:
['REAL' 'REAL' 'FAKE' ... 'REAL' 'REAL' 'FAKE']

Accuracy:
0.9159090909090909

b. NB_pos_tag with FE1

In [68]:
NB_pos_tag = runModel(encoder,
              train_cv_vector,
              train_cv_label,
              train_holdout_vector,
              train_holdout.label,
              "nb",
              "Naiive Bayes on pos-tagged text")
models.loc[len(models)] = NB_pos_tag
Naiive Bayes on pos-tagged text
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid='warn', n_jobs=None, param_grid={},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.889 (+/-0.017) for params: {}


Best Estimator Params
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Predictions:
['REAL' 'REAL' 'REAL' ... 'REAL' 'REAL' 'REAL']

Accuracy:
0.8977272727272727

c. maxEnt with FE1

In [69]:
maxEnt_pos_tag = runModel(encoder,
              train_cv_vector,
              train_cv_label,
              train_holdout_vector,
              train_holdout.label,
              "maxEnt",
              "MaxEnt Classifier on pos-tagged text")
models.loc[len(models)] = maxEnt_pos_tag
MaxEnt Classifier on pos-tagged text
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.930 (+/-0.026) for params: {'C': 0.1, 'penalty': 'l2'}
Accuracy: 0.929 (+/-0.030) for params: {'C': 1, 'penalty': 'l2'}
Accuracy: 0.927 (+/-0.018) for params: {'C': 0.01, 'penalty': 'l2'}
Accuracy: 0.924 (+/-0.028) for params: {'C': 1, 'penalty': 'l1'}
Accuracy: 0.922 (+/-0.024) for params: {'C': 10, 'penalty': 'l2'}
Accuracy: 0.920 (+/-0.022) for params: {'C': 100, 'penalty': 'l2'}
Accuracy: 0.919 (+/-0.021) for params: {'C': 10, 'penalty': 'l1'}
Accuracy: 0.919 (+/-0.020) for params: {'C': 1000, 'penalty': 'l2'}
Accuracy: 0.911 (+/-0.020) for params: {'C': 0.1, 'penalty': 'l1'}
Accuracy: 0.911 (+/-0.016) for params: {'C': 100, 'penalty': 'l1'}
Accuracy: 0.908 (+/-0.011) for params: {'C': 0.001, 'penalty': 'l2'}
Accuracy: 0.896 (+/-0.020) for params: {'C': 1000, 'penalty': 'l1'}
Accuracy: 0.848 (+/-0.026) for params: {'C': 0.01, 'penalty': 'l1'}
Accuracy: 0.662 (+/-0.021) for params: {'C': 0.001, 'penalty': 'l1'}


Best Estimator Params
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Predictions:
['REAL' 'REAL' 'FAKE' ... 'REAL' 'FAKE' 'REAL']

Accuracy:
0.9295454545454546
In [70]:
models
Out[70]:
model_name model_object score
0 <class 'sklearn.svm.classes.SVC'> <class 'sklearn.svm.classes.SVC'> <class 'sklearn.svm.classes.SVC'>
1 Baseline Model 2: Naiive Bayes GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.894697
2 Baseline Model 3: MaxEnt Classifier GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.923485
3 SVC on pos-tagged text GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.915909
4 Naiive Bayes on pos-tagged text GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.897727
5 MaxEnt Classifier on pos-tagged text GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.929545

There seems to be a slight increase in Accuracy after pos-tagging.

2. TF-IDF weighting

Try to add weight to each word using TF-IDF

We are going to calculate the TFIDF score of each term in a piece of text. The text will be tokenized into sentences and each sentence is then considered a text item.

We will also apply those on the cleaned text and the concatinated POS_tagged text.

In [71]:
df["clean_and_pos_tagged_text"] = df['preprocessed_text'] + ' ' + df['pos_tagged_text']
In [72]:
df.head(1)
Out[72]:
ID title text label title_and_text preprocessed_text pos_tagged_text clean_and_pos_tagged_text
0 8476 You Can Smell Hillary’s Fear Daniel Greenfield | a Shillman Journalism Fell... FAKE You Can Smell Hillary’s Fear Daniel Greenfield... smell hillary’s fear daniel greenfield shillma... NN-smell JJ-hillary NNP-’ NN-s NN-fear JJ-dani... smell hillary’s fear daniel greenfield shillma...
In [73]:
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)

count_vect = CountVectorizer(analyzer = "word")

count_vectorizer = count_vect.fit(df.clean_and_pos_tagged_text)

train_cv_vector = count_vectorizer.transform(train_cv.clean_and_pos_tagged_text)
train_holdout_vector = count_vectorizer.transform(train_holdout.clean_and_pos_tagged_text)
test_vector = count_vectorizer.transform(test.clean_and_pos_tagged_text)


tf_idf = TfidfTransformer(norm="l2")
train_cv_tf_idf = tf_idf.fit_transform(train_cv_vector)
train_holdout_tf_idf = tf_idf.fit_transform(train_holdout_vector)
test_tf_idf = tf_idf.fit_transform(test_vector)  

Rerun Models on preprocessed + pos-tagged (FE1) + TF-IDF weighted text (FE2)

a. SVC with FE1 and FE2

In [74]:
SVC_tf_idf = runModel(encoder,
               train_cv_tf_idf,
               train_cv_label,
               train_holdout_tf_idf,
               train_holdout.label,
               "svc",
               "SVC on preprocessed+pos-tagged TF-IDF weighted text")
models.loc[len(models)] = SVC_tf_idf
SVC on preprocessed+pos-tagged TF-IDF weighted text
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'C': [1, 10, 50, 100], 'kernel': ['linear']}, {'C': [10, 100, 500, 1000], 'gamma': [0.0001], 'kernel': ['rbf']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.937 (+/-0.026) for params: {'C': 50, 'kernel': 'linear'}
Accuracy: 0.937 (+/-0.026) for params: {'C': 100, 'kernel': 'linear'}
Accuracy: 0.937 (+/-0.024) for params: {'C': 10, 'kernel': 'linear'}
Accuracy: 0.935 (+/-0.020) for params: {'C': 1, 'kernel': 'linear'}
Accuracy: 0.895 (+/-0.008) for params: {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.859 (+/-0.016) for params: {'C': 500, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.590 (+/-0.139) for params: {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.481 (+/-0.027) for params: {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}


Best Estimator Params
SVC(C=50, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Predictions:
['REAL' 'REAL' 'FAKE' ... 'REAL' 'FAKE' 'REAL']

Accuracy:
0.9454545454545454

b. NB with FE1 and FE2

In [75]:
NB_tf_idf = runModel(encoder,
               train_cv_tf_idf,
               train_cv_label,
               train_holdout_tf_idf,
               train_holdout.label,
              "nb",
              "Naiive Bayes on preprocessed+pos-tagged TF-IDF weighted text")
models.loc[len(models)] = NB_tf_idf
Naiive Bayes on preprocessed+pos-tagged TF-IDF weighted text
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid='warn', n_jobs=None, param_grid={},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.781 (+/-0.039) for params: {}


Best Estimator Params
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Predictions:
['REAL' 'REAL' 'REAL' ... 'REAL' 'REAL' 'REAL']

Accuracy:
0.8166666666666667

c. maxEnt with FE1 and FE2

In [76]:
maxEnt_tf_idf = runModel(encoder,
               train_cv_tf_idf,
               train_cv_label,
               train_holdout_tf_idf,
               train_holdout.label,
              "maxEnt",
              "MaxEnt on preprocessed+pos-tagged TF-IDF weighted text")
models.loc[len(models)] = maxEnt_tf_idf
MaxEnt on preprocessed+pos-tagged TF-IDF weighted text
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.943 (+/-0.030) for params: {'C': 1000, 'penalty': 'l2'}
Accuracy: 0.940 (+/-0.023) for params: {'C': 1000, 'penalty': 'l1'}
Accuracy: 0.939 (+/-0.018) for params: {'C': 10, 'penalty': 'l1'}
Accuracy: 0.939 (+/-0.026) for params: {'C': 100, 'penalty': 'l2'}
Accuracy: 0.938 (+/-0.023) for params: {'C': 100, 'penalty': 'l1'}
Accuracy: 0.932 (+/-0.021) for params: {'C': 10, 'penalty': 'l2'}
Accuracy: 0.909 (+/-0.007) for params: {'C': 1, 'penalty': 'l2'}
Accuracy: 0.889 (+/-0.017) for params: {'C': 1, 'penalty': 'l1'}
Accuracy: 0.842 (+/-0.026) for params: {'C': 0.1, 'penalty': 'l2'}
Accuracy: 0.724 (+/-0.101) for params: {'C': 0.01, 'penalty': 'l2'}
Accuracy: 0.662 (+/-0.041) for params: {'C': 0.1, 'penalty': 'l1'}
Accuracy: 0.550 (+/-0.244) for params: {'C': 0.001, 'penalty': 'l2'}
Accuracy: 0.514 (+/-0.036) for params: {'C': 0.001, 'penalty': 'l1'}
Accuracy: 0.514 (+/-0.036) for params: {'C': 0.01, 'penalty': 'l1'}


Best Estimator Params
LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Predictions:
['REAL' 'REAL' 'FAKE' ... 'REAL' 'FAKE' 'REAL']

Accuracy:
0.9431818181818182
In [77]:
models
Out[77]:
model_name model_object score
0 <class 'sklearn.svm.classes.SVC'> <class 'sklearn.svm.classes.SVC'> <class 'sklearn.svm.classes.SVC'>
1 Baseline Model 2: Naiive Bayes GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.894697
2 Baseline Model 3: MaxEnt Classifier GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.923485
3 SVC on pos-tagged text GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.915909
4 Naiive Bayes on pos-tagged text GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.897727
5 MaxEnt Classifier on pos-tagged text GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.929545
6 SVC on preprocessed+pos-tagged TF-IDF weighted... GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.945455
7 Naiive Bayes on preprocessed+pos-tagged TF-IDF... GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.816667
8 MaxEnt on preprocessed+pos-tagged TF-IDF weigh... GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.943182

Using TF-IDF increased the score to ~94.5% with SVC and Max-Ent models.

Naive-Bayes rather decreased the score. Therefore we drop it from the pipeline.

3. Use Bigram Vectorizer instead of regular vectorizer

For FE3, we use the Trigram vectorizer, which vectorizes triplets of words rather than each word separately. In this short example sentence, the trigrams are "In this short", "this short example" and "short example sentence".

In [78]:
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)

trigram_vect = CountVectorizer(analyzer = "word", ngram_range=(1,2))

trigram_vect = count_vect.fit(df.clean_and_pos_tagged_text)

train_cv_vector = trigram_vect.transform(train_cv.clean_and_pos_tagged_text)
train_holdout_vector = trigram_vect.transform(train_holdout.clean_and_pos_tagged_text)
test_vector = trigram_vect.transform(test.clean_and_pos_tagged_text)
In [79]:
tf_idf = TfidfTransformer(norm="l2")
train_cv_bigram_tf_idf = tf_idf.fit_transform(train_cv_vector)
train_holdout_bigram_tf_idf = tf_idf.fit_transform(train_holdout_vector)
test_bigram_tf_idf = tf_idf.fit_transform(test_vector)

Rerun Models on preprocessed + pos-tagged (FE1) + TF-IDF weighted (FE2) + Trigram vectorized text (FE3)

a. SVC with FE1, FE2 and FE3

In [80]:
SVC_trigram_tf_idf = runModel(encoder,
               train_cv_bigram_tf_idf,
               train_cv_label,
               train_holdout_bigram_tf_idf,
               train_holdout.label,
               "svc",
               "SVC on bigram vect.+ TF-IDF")
models.loc[len(models)] = SVC_trigram_tf_idf
SVC on bigram vect.+ TF-IDF
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'C': [1, 10, 50, 100], 'kernel': ['linear']}, {'C': [10, 100, 500, 1000], 'gamma': [0.0001], 'kernel': ['rbf']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.937 (+/-0.026) for params: {'C': 50, 'kernel': 'linear'}
Accuracy: 0.937 (+/-0.026) for params: {'C': 100, 'kernel': 'linear'}
Accuracy: 0.937 (+/-0.024) for params: {'C': 10, 'kernel': 'linear'}
Accuracy: 0.935 (+/-0.020) for params: {'C': 1, 'kernel': 'linear'}
Accuracy: 0.895 (+/-0.008) for params: {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.859 (+/-0.016) for params: {'C': 500, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.590 (+/-0.139) for params: {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.481 (+/-0.027) for params: {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}


Best Estimator Params
SVC(C=50, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Predictions:
['REAL' 'REAL' 'FAKE' ... 'REAL' 'FAKE' 'REAL']

Accuracy:
0.9454545454545454

b. maxEnt with FE1, FE2 and FE3

In [83]:
encoder, train, test, train_cv, train_holdout, train_cv_label, train_holdout_label = split_train_holdout_test(encoder, df, False)

trigram_vect = CountVectorizer(analyzer = "word", ngram_range=(1,3))

trigram_vect = count_vect.fit(df.clean_and_pos_tagged_text)

train_cv_vector = trigram_vect.transform(train_cv.clean_and_pos_tagged_text)
train_holdout_vector = trigram_vect.transform(train_holdout.clean_and_pos_tagged_text)
In [84]:
tf_idf = TfidfTransformer(norm="l2")
train_cv_trigram_tf_idf = tf_idf.fit_transform(train_cv_vector)
train_holdout_trigram_tf_idf = tf_idf.fit_transform(train_holdout_vector)
In [86]:
maxEnt_tf_idf = runModel(encoder,
               train_cv_trigram_tf_idf,
               train_cv_label,
               train_holdout_trigram_tf_idf,
               train_holdout.label,
              "maxEnt",
              "MaxEnt on trigram vect.+ TF-IDF")
models.loc[len(models)] = maxEnt_tf_idf
MaxEnt on trigram vect.+ TF-IDF
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=12345, test_size=0.2, train_size=None),
       error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

CV-scores
Accuracy: 0.943 (+/-0.029) for params: {'C': 1000, 'penalty': 'l2'}
Accuracy: 0.939 (+/-0.018) for params: {'C': 10, 'penalty': 'l1'}
Accuracy: 0.939 (+/-0.026) for params: {'C': 100, 'penalty': 'l2'}
Accuracy: 0.938 (+/-0.023) for params: {'C': 1000, 'penalty': 'l1'}
Accuracy: 0.938 (+/-0.023) for params: {'C': 100, 'penalty': 'l1'}
Accuracy: 0.932 (+/-0.021) for params: {'C': 10, 'penalty': 'l2'}
Accuracy: 0.909 (+/-0.007) for params: {'C': 1, 'penalty': 'l2'}
Accuracy: 0.889 (+/-0.017) for params: {'C': 1, 'penalty': 'l1'}
Accuracy: 0.842 (+/-0.026) for params: {'C': 0.1, 'penalty': 'l2'}
Accuracy: 0.724 (+/-0.101) for params: {'C': 0.01, 'penalty': 'l2'}
Accuracy: 0.662 (+/-0.041) for params: {'C': 0.1, 'penalty': 'l1'}
Accuracy: 0.550 (+/-0.244) for params: {'C': 0.001, 'penalty': 'l2'}
Accuracy: 0.514 (+/-0.036) for params: {'C': 0.001, 'penalty': 'l1'}
Accuracy: 0.514 (+/-0.036) for params: {'C': 0.01, 'penalty': 'l1'}


Best Estimator Params
LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Predictions:
['REAL' 'REAL' 'FAKE' ... 'REAL' 'FAKE' 'REAL']

Accuracy:
0.9431818181818182
In [82]:
models
Out[82]:
model_name model_object score
0 <class 'sklearn.svm.classes.SVC'> <class 'sklearn.svm.classes.SVC'> <class 'sklearn.svm.classes.SVC'>
1 Baseline Model 2: Naiive Bayes GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.894697
2 Baseline Model 3: MaxEnt Classifier GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.923485
3 SVC on pos-tagged text GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.915909
4 Naiive Bayes on pos-tagged text GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.897727
5 MaxEnt Classifier on pos-tagged text GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.929545
6 SVC on preprocessed+pos-tagged TF-IDF weighted... GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.945455
7 Naiive Bayes on preprocessed+pos-tagged TF-IDF... GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.816667
8 MaxEnt on preprocessed+pos-tagged TF-IDF weigh... GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.943182
9 SVC on bigram vect.+ TF-IDF GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.945455
10 MaxEnt on trigram vect.+ TF-IDF GridSearchCV(cv=ShuffleSplit(n_splits=5, rando... 0.943182

It looks like the "MaxEnt on trigram vect.+ TF-IDF" is the best model with the highest score. We will use it to predict and classify the testset.

Predicting on test dataset

1. Train on whole data and predict on test

PREPROCESSED data

In [4]:
test = pd.read_csv("fake_or_real_news_test.csv")
train = pd.read_csv("fake_or_real_news_training_CLEANED.csv")
In [ ]:
train['title_and_text'] = train['title'] +' '+ train['text']
train['preprocessed_text'] = train['title_and_text'].apply(lambda x: preprocess(x))
In [ ]:
test['title_and_text'] = test['title'] +' '+ test['text']
test['preprocessed_text'] = test['title_and_text'].apply(lambda x: preprocess(x))
In [ ]:
test.head()
In [66]:
## Save preprocessed df
train.to_csv("fake_or_real_news_train_PREPROCESSED.csv", index=False)
In [67]:
# Save preprocessed df
test.to_csv("fake_or_real_news_test_PREPROCESSED.csv", index=False)
In [6]:
train = pd.read_csv("fake_or_real_news_train_PREPROCESSED.csv")
train = train.astype(object).replace(np.nan, 'None')

test = pd.read_csv("fake_or_real_news_test_PREPROCESSED.csv")
test = test.astype(object).replace(np.nan, 'None')
In [7]:
test = test.astype(object).replace(np.nan, 'None')
In [9]:
test.head()
Out[9]:
ID title text title_and_text preprocessed_text
0 10498 September New Homes Sales Rise——-Back To 1992 ... September New Homes Sales Rise Back To 1992 Le... September New Homes Sales Rise——-Back To 1992 ... september new home sale rise——-back level sept...
1 2439 Why The Obamacare Doomsday Cult Can't Admit It... But when Congress debated and passed the Patie... Why The Obamacare Doomsday Cult Can't Admit It... obamacare doomsday cult can't admit wrong cong...
2 864 Sanders, Cruz resist pressure after NY losses,... The Bernie Sanders and Ted Cruz campaigns vowe... Sanders, Cruz resist pressure after NY losses,... sander cruz resist pressure ny loss vow fight ...
3 4128 Surviving escaped prisoner likely fatigued and... Police searching for the second of two escaped... Surviving escaped prisoner likely fatigued and... survive escape prisoner likely fatigue prone m...
4 662 Clinton and Sanders neck and neck in Californi... No matter who wins California's 475 delegates ... Clinton and Sanders neck and neck in Californi... clinton sander neck neck california primary ma...
In [10]:
train.head()
Out[10]:
ID title text label X1 X2 title_and_text preprocessed_text
0 8476 You Can Smell Hillary’s Fear Daniel Greenfield | a Shillman Journalism Fell... FAKE None None You Can Smell Hillary’s Fear Daniel Greenfield... smell hillary’s fear daniel greenfield shillma...
1 10294 Watch The Exact Moment Paul Ryan Committed Pol... Google Pinterest Digg Linkedin Reddit Stumbleu... FAKE None None Watch The Exact Moment Paul Ryan Committed Pol... watch exact moment paul ryan commit political ...
2 3608 Kerry to go to Paris in gesture of sympathy U.S. Secretary of State John F. Kerry said Mon... REAL None None Kerry to go to Paris in gesture of sympathy U.... kerry go paris gesture sympathy u.s secretary ...
3 10142 Bernie supporters on Twitter erupt in anger ag... — Kaydee King (@KaydeeKing) November 9 | 2016 ... FAKE None None Bernie supporters on Twitter erupt in anger ag... bernie supporter twitter erupt anger dnc try w...
4 875 The Battle of New York: Why This Primary Matte... Cruz promised his supporters. ""We're beating... REAL None None The Battle of New York: Why This Primary Matte... battle new york primary matter primary day new...

POS Tagging

In [11]:
train['pos_tagged_text'] = train['preprocessed_text'].apply(lambda x: pos_tag_words(x))
test['pos_tagged_text'] = test['preprocessed_text'].apply(lambda x: pos_tag_words(x))

merge clean and pos tagged

In [12]:
train["clean_and_pos_tagged_text"] = train['preprocessed_text'] + ' ' + train['pos_tagged_text']
test["clean_and_pos_tagged_text"] = test['preprocessed_text'] + ' ' + train['pos_tagged_text']
In [13]:
train.head(1)
Out[13]:
ID title text label X1 X2 title_and_text preprocessed_text pos_tagged_text clean_and_pos_tagged_text
0 8476 You Can Smell Hillary’s Fear Daniel Greenfield | a Shillman Journalism Fell... FAKE None None You Can Smell Hillary’s Fear Daniel Greenfield... smell hillary’s fear daniel greenfield shillma... NN-smell JJ-hillary NNP-’ NN-s NN-fear JJ-dani... smell hillary’s fear daniel greenfield shillma...
In [14]:
test.head(1)
Out[14]:
ID title text title_and_text preprocessed_text pos_tagged_text clean_and_pos_tagged_text
0 10498 September New Homes Sales Rise——-Back To 1992 ... September New Homes Sales Rise Back To 1992 Le... September New Homes Sales Rise——-Back To 1992 ... september new home sale rise——-back level sept... VB-september JJ-new NN-home NN-sale JJ-rise——-... september new home sale rise——-back level sept...

Modelling using MaxEnt on trigram vect.+ TF-IDF Grid Search Best params

Trigram + Tfdif + classifier pipeline

In [15]:
from sklearn.pipeline import Pipeline
trigram_vectorizer = CountVectorizer(analyzer = "word", ngram_range=(1,3))
tf_idf = TfidfTransformer(norm="l2")
classifier = LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

pipeline = Pipeline([
     ('trigram_vectorizer', trigram_vectorizer),
     ('tfidf', tf_idf),
     ('clf', classifier),
 ])
In [16]:
pipeline.fit(train.clean_and_pos_tagged_text, encoder.fit_transform(train.label.values))
Out[16]:
Pipeline(memory=None,
     steps=[('trigram_vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])
In [17]:
import pickle
pickle.dump( pipeline, open( "pipeline.pkl", "wb" ) )

2. Predicting on test

In [18]:
print(colored("Predicting on test", 'blue'))
test_predictions = test_predictions = pipeline.predict(test.clean_and_pos_tagged_text)
Predicting on test
In [19]:
test_predictions
Out[19]:
array([0, 0, 1, ..., 1, 1, 1])
In [20]:
test_predictions_decoded = encoder.inverse_transform( test_predictions )
In [21]:
predictions = test
predictions["label"] = test_predictions_decoded
In [22]:
predictions.shape
Out[22]:
(2321, 8)
In [23]:
predictions.head()
Out[23]:
ID title text title_and_text preprocessed_text pos_tagged_text clean_and_pos_tagged_text label
0 10498 September New Homes Sales Rise——-Back To 1992 ... September New Homes Sales Rise Back To 1992 Le... September New Homes Sales Rise——-Back To 1992 ... september new home sale rise——-back level sept... VB-september JJ-new NN-home NN-sale JJ-rise——-... september new home sale rise——-back level sept... FAKE
1 2439 Why The Obamacare Doomsday Cult Can't Admit It... But when Congress debated and passed the Patie... Why The Obamacare Doomsday Cult Can't Admit It... obamacare doomsday cult can't admit wrong cong... NN-obamacare NN-doomsday NN-cult MD-ca RB-n't ... obamacare doomsday cult can't admit wrong cong... FAKE
2 864 Sanders, Cruz resist pressure after NY losses,... The Bernie Sanders and Ted Cruz campaigns vowe... Sanders, Cruz resist pressure after NY losses,... sander cruz resist pressure ny loss vow fight ... NN-sander NNS-cruz VBP-resist NN-pressure JJ-n... sander cruz resist pressure ny loss vow fight ... REAL
3 4128 Surviving escaped prisoner likely fatigued and... Police searching for the second of two escaped... Surviving escaped prisoner likely fatigued and... survive escape prisoner likely fatigue prone m... JJ-survive NN-escape NN-prisoner JJ-likely NN-... survive escape prisoner likely fatigue prone m... REAL
4 662 Clinton and Sanders neck and neck in Californi... No matter who wins California's 475 delegates ... Clinton and Sanders neck and neck in Californi... clinton sander neck neck california primary ma... NN-clinton NN-sander NN-neck NN-neck NN-califo... clinton sander neck neck california primary ma... REAL
In [24]:
predictions.label.describe()
Out[24]:
count     2321
unique       2
top       REAL
freq      1320
Name: label, dtype: object
In [25]:
import collections
ax = sns.countplot(predictions.label,
                order=[x for x, count in sorted(collections.Counter(predictions.label).items(),
                key=lambda x: -x[1])])


for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/len(predictions)*100),
            ha="center") 
ax.set_title("Test dataset target")
show()
In [26]:
predictions.drop(columns=["title","text","title_and_text","preprocessed_text","pos_tagged_text","clean_and_pos_tagged_text"]).head()
Out[26]:
ID label
0 10498 FAKE
1 2439 FAKE
2 864 REAL
3 4128 REAL
4 662 REAL
In [88]:
predictions.to_csv("TEST_PREDICTIONS.csv", index=False)
In [ ]: