# Sentiment Analysis of COVID-19 Newspaper Data
## List of Newspapers Scraped
| Newspaper | Language |
|----------------|-----------|
| TheGuardian | English |
| TheTelegraph | English |
| Yahoonews | English |
| MetroNews | English |
| MailOnline | English |
| DailyMail | English |
| SkyNews | English |
| Buzzfeed | English |
| BirminghamLive | English |
| Economist | English |
| Deutschewelle | English |
| googlenews | English |
| SurreyNews | English |
| DailyMirror | English |
| TheSun | English |
| BelfastLive | English |
| Herald | English |
| ALJazeerah | English |
| EDP | English |
| Express | English |
| BingNews | German |
| konkret | German |
| ZDFOnline | German |
| Fokus | German |
| TAZ | German |
| DieWelt | German |
| dasErste | German |
| FrankfurterAllgemeine | German |
| telepolis | German |
| SuddeutscheZeitung | German |
| Cicero | German |
| BerlinerMorgenpost | German |
| RTL | German |
| Berlinde | German |
## Fields to be Scraped
| Field Name | Comments |
|---|---|
| paper_name | The name of the paper scraped |
| date | The date the article was published or crawled |
| heading | The headline of the news article |
| data | The body text of the article under this heading |
| link | The link to the newspaper article |
## Steps to Be Undertaken
The important steps involved in this task are as follows:
- [ ] <b>Data preparation</b>:
Our data have two fields: <b>data</b> and <b>heading</b>. We are clear about <b>data</b> and what to do with it; however, incorporating <b>heading</b> may be a roadblock.
- [ ] Dividing the data into training and testing subsets.
This, I believe, is a simple step of splitting the data in a 70%-30% ratio. However, points to consider here are:
- [ ] Do we need to clean the test data as well? Does our algorithm perform well on uncleaned data?
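The 70%-30% split above can be sketched in plain Python. The records here are hypothetical stand-ins for the scraped article rows:

```python
# Minimal sketch (pure Python, hypothetical records): a 70%-30% train/test split.
import random

# Each dict stands in for one scraped article row (heading + data fields).
records = [{"heading": f"headline {i}", "data": f"body {i}"} for i in range(100)]

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(records)  # shuffle before splitting to avoid ordering bias

cut = int(len(records) * 0.7)
train, test = records[:cut], records[cut:]
print(len(train), len(test))  # 70 30
```

In practice a library helper such as scikit-learn's `train_test_split` does the same thing with stratification options.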
- [ ] Data clean-up.
- [ ] What exactly is data clean-up in our context?
Steps for data clean-up:
- [ ] Text normalization? Find a Python library for English text normalization and understand it.
- [ ] Do we need to find word stems? If yes, how and with which library? If no, why not?
- [ ] Implement a spell checker so that misspelled variants of a word do not each get their own embedding. What exactly is a word embedding?
- [ ] Further cleaning (already in the repo) and using gensim's Phrases module?
- [ ] Remove proper nouns and pronouns.
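A basic form of the normalization step above can be sketched in pure Python; the stop-word list here is a tiny illustrative one, not a real library's:

```python
# Minimal sketch (pure Python, illustrative stop-word list): lowercase the text,
# keep alphabetic tokens only, and drop stop words.
import re

STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is", "and"}  # tiny example list

def clean(text):
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)           # alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

print(clean("The number of COVID-19 cases in the UK is rising."))
# ['number', 'covid', 'cases', 'uk', 'rising']
```

Libraries such as NLTK or spaCy provide full stop-word lists, stemmers/lemmatizers, and part-of-speech tags (useful for the proper-noun removal step).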
<b>Our goal is to map our input to a consumable output and to be clear about what this output represents.</b>
For this, the main tasks are:
- [ ] Understand the word2vec tutorial: https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial
- [ ] Gensim and Levenshtein distance?
- [ ] <b>Training an unsupervised algorithm</b>:
- [ ] K-Means clustering:
- [ ] TF-IDF: why TF-IDF?
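The TF-IDF + K-Means combination above can be sketched with scikit-learn; the documents here are made-up headlines, not real scraped data:

```python
# Minimal sketch (assumes scikit-learn): vectorize toy headlines with TF-IDF,
# then cluster them with K-Means. TF-IDF down-weights words that appear in
# every document, so clusters form around distinctive vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "covid cases rise in london",
    "new covid variant found in england",
    "football match postponed this weekend",
    "premier league football results",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse matrix, one row per document
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                           # cluster id per document
```

`km.labels_` assigns each article to a cluster; interpreting what each cluster represents (e.g. by its top TF-IDF terms) is what gives the output the "clarity" asked for above.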
- [ ] <b>Testing the algorithm</b>:
- [ ] Testing using the <b>F-score</b> during training? How do we evaluate an unsupervised algorithm?
- [ ] Do we use testing data? Why or why not?
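On evaluating an unsupervised algorithm: the F-score needs ground-truth labels, which clustering does not have. One common label-free alternative is the silhouette score. A sketch under the same assumptions as above (scikit-learn, made-up headlines):

```python
# Minimal sketch (assumes scikit-learn): evaluate a clustering without labels
# using the silhouette score (range [-1, 1]; higher means tighter, better
# separated clusters).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = [
    "covid cases rise in london",
    "new covid variant found in england",
    "football match postponed this weekend",
    "premier league football results",
]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

score = silhouette_score(X, labels)
print(score)
```

Since no labels are fitted, held-out test data is less central than in supervised learning, though it can still check whether cluster assignments generalize to unseen articles.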