# Sentiment_Analysis

Sentiment analysis of COVID-19 newspaper data.

## List of Papers Scraped

| Newspaper      | Language  |
|----------------|-----------|
| TheGuardian    | English   |
| TheTelegraph   | English   |
| Yahoonews      | English   |
| MetroNews      | English   |
| MailOnline     | English   |
| DailyMail      | English   |
| SkyNews        | English   |
| Buzzfeed       | English   |
| BirminghamLive | English   |
| Economist      | English   |
| Deutschewelle  | English   |
| googlenews     | English   |
| SurreyNews     | English   |
| DailyMirror    | English   |
| TheSun         | English   |
| BelfastLive    | English   |
| Herald         | English   |
| ALJazeerah     | English   |
| EDP            | English   |
| Express        | English   |
| BingNews       | German    |
| konkret        | German    |
| ZDFOnline      | German    |
| Fokus          | German    |
| TAZ            | German    |
| DieWelt        | German    |
| dasErste       | German    |
| FrankfurterAllgemeine | German |
| telepolis             | German |
| SuddeutscheZeitung    | German |
| Cicero                | German |
| BerlinerMorgenpost    | German |
| RTL                   | German |
| Berlinde              | German |


## Fields to be Scraped 

| Field Name | Comments |
|---|---|
| paper_name | The name of the newspaper scraped |
| date       | The publication or crawl date of the article |
| heading    | The headline of the news article |
| data       | The article text under the heading |
| link       | The link to the newspaper article |
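
For illustration, a single scraped record might look like this (all values below are invented for the example):

```python
# Hypothetical scraped record; every value is illustrative only.
record = {
    "paper_name": "TheGuardian",
    "date": "2020-04-15",
    "heading": "Covid-19 cases continue to rise",
    "data": "Full text of the article goes here.",
    "link": "https://www.theguardian.com/example-article",
}
```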

## Steps to be Undertaken


The important steps involved in this task are as follows:

- [ ] <b>Data preparation</b>:

 Our data has two fields: <b>data</b> and <b>heading</b>. We are clear about <b>data</b> and what to do with it; however, incorporating <b>heading</b> may be a roadblock.

     - [ ] Dividing the data into training and testing subsets.
     
          This, I believe, is a simple step of splitting the data in a 70/30 ratio (see the sketch after the clean-up checklist below). However, points of thought here are:
          
          - [ ] Do we need to clean the test data as well? Does our algorithm perform well on uncleaned data?
          
     - [ ] Data clean-up.
     
          - [ ] What exactly is data clean-up in our context?
          
          Steps for data clean-up:
          
          - [ ] Text normalization? Find a Python library for English text normalization and understand it.
          
          - [ ] Do we need to find word stems? If yes, how and with which library? If no, why not?
          
          - [ ] Implement a spell checker so that misspelled variants do not produce too many distinct word embeddings? What exactly is a word embedding?
          
          - [ ] Further cleaning (already in repo) and using gensim's Phrases module?
          
          - [ ] Remove proper nouns and pronouns.
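
Below is a minimal sketch of the split and clean-up steps, assuming scikit-learn, NLTK, and gensim are available; the `articles` list and the `clean_text` helper are illustrative placeholders, not code from this repo.

```python
import re

import nltk
from gensim.models.phrases import Phrases, Phraser
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """Lower-case, strip non-letters, drop stopwords, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [lemmatizer.lemmatize(t) for t in tokens]

# Toy stand-ins for the scraped `data` fields.
articles = [
    "Covid-19 cases rise again in UK hospitals.",
    "Hospital admissions increase as Covid-19 spreads.",
    "Government extends the lockdown measures.",
    "New lockdown rules announced by the government.",
]

# 70/30 split of the raw documents; cleaning is applied afterwards,
# so the question of whether to clean the test set stays explicit.
train_docs, test_docs = train_test_split(articles, test_size=0.3, random_state=42)
train_tokens = [clean_text(doc) for doc in train_docs]

# gensim's Phrases merges frequent token pairs into single tokens;
# the thresholds here are arbitrary values for this toy data.
bigram = Phraser(Phrases(train_tokens, min_count=1, threshold=1))
train_tokens = [bigram[tokens] for tokens in train_tokens]
```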
          
        <b>Our goal is to map our input to a consumable output and to have clarity about what this output represents.</b>
          
       For this, the main tasks are:
       
          - [ ] Understand the word2vec tutorial: https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial
          
          - [ ] Gensim and Levenshtein distance?
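
Following the tutorial linked above, a gensim Word2Vec model can then be trained on the cleaned token lists. A rough sketch; the hyperparameters and toy tokens here are placeholder values, not tuned choices:

```python
from gensim.models import Word2Vec

# Toy token lists standing in for the cleaned articles.
train_tokens = [
    ["covid", "case", "rise", "hospital"],
    ["hospital", "admission", "increase", "covid"],
    ["government", "extend", "lockdown"],
    ["lockdown", "rule", "announce", "government"],
]

model = Word2Vec(
    sentences=train_tokens,
    vector_size=100,  # dimensionality of each word vector
    window=5,         # context window around each target word
    min_count=1,      # keep rare words; this toy corpus is tiny
    workers=4,
)

# Words closest to "covid" in the learned embedding space.
print(model.wv.most_similar("covid", topn=3))
```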


- [ ] <b>Training an unsupervised algorithm</b>:

          - [ ] K-Means clustering.
          
          - [ ] TF-IDF: why TF-IDF?
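
As a sketch of how the two fit together (the cluster count and documents below are arbitrary): TF-IDF turns each article into a vector that weights words frequent in that article but rare across the corpus, and K-Means then groups similar vectors.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for cleaned article texts.
docs = [
    "covid cases rise in hospitals",
    "hospital admissions increase covid",
    "football match postponed",
    "league fixtures delayed football",
]

# TF-IDF down-weights words common to every document.
X = TfidfVectorizer().fit_transform(docs)

# Group the TF-IDF vectors into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
print(kmeans.fit_predict(X))  # e.g. [0 0 1 1]
```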


- [ ] <b>Testing the algorithm</b>:

          - [ ] Testing using the <b>F-score</b> during training? How do we test an unsupervised algorithm?
          
          - [ ] Do we use the testing data? Why or why not?
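
One note on the F-score question: F-score needs ground-truth labels, which an unsupervised setup does not have. An internal measure such as the silhouette score is a common alternative; a minimal sketch on synthetic data standing in for the TF-IDF vectors:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic vectors standing in for the articles' TF-IDF features.
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Ranges from -1 to 1; higher means tighter, better-separated
# clusters, and no ground-truth labels are required.
print(silhouette_score(X, labels))
```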