Sponsor Background and Project Goals for Misinformation Analysis
1. 1. (Jerry) sponsor’s background
2. (Jerry) sponsor’s business operations, information management, user profiles, etc.
3. (Jerry) Articulate the goals the sponsor requested
4. (Jerry) Articulate the Capstone course objectives based on your project
5. (List, Chris) What are the research topics? (e.g., list all potential (Jerry) informatics issues: HCI research, (Jerry) security issues, (Jerry) platform strategies, etc.)
6. (Everyone) What problems have you observed in your project, and how significant are they?
7. (Everyone) How many hours have you all spent collecting data (e.g., interviewing sponsors, email communications, field observations)?
8. (Everyone) What are the interesting preliminary findings?
9. (Everyone) Reflect on how the curriculum helped you conduct this capstone project, what lessons you have learned so far, and what other skills you have gained.
10. Moving ahead, what’s the plan? What further results are expected (e.g., prototype design, generated report)?
3. Our Sponsor
Mikhail Oet, PhD
Professor in the Commerce and Economic Development program
Northeastern University
4. Our Sponsors
Department of State
Missions:
● To engage the American people in the work of the State Department
● To broaden the Department’s research base in response to a proliferation of complex global challenges
Mission:
● To get the right information to the right people at the right time
5. Course Objectives
To Learn:
(1) A systematic way to research (qualitatively and quantitatively)
(2) Data collection, cleaning, & analysis
(3) Communication
(4) Teamwork
Sponsor’s Goals
1. Collection and organization of online data related to mis-/disinformation campaigns (English, Russian, Mandarin)
2. Analysis of online data related to mis-/disinformation campaigns
3. Identification of latent attributes of mis-/disinformation campaigns
4. Visualization of the latent attributes of mis-/disinformation campaigns
5. Detection of latent attributes of mis-/disinformation campaigns
6. Our Research Topics
Stage One: Data Collection
1. Platform Research
● Data Scraping Availability
● Existing Dataset Availability
2. Data Repositories Research
● GitHub
● Kaggle
3. Data Scraping Tools Research
● Octoparse
● BrightData
● Web Scraping
4. U.S., China, and Russia Research
End: Integrate data and collect useful data
Stage Two: Data Analysis
1. Data Project Research
● Birdwatch
● Twitter Transparency
● Chinese COVID-19 Fake News Dataset
2. Data Cleaning Research
● SBS-ready format
● All the datasets (CED, INF, ALY)
3. Data Sentiment Analysis
● Azure Machine Learning
4. Data Analysis Tools Research
● SBS
End: Data visualization and building predictive models
7. Week 1-Week 3 Research
1. Platform Research: U.S., China, Russia
a. 17 different social platforms and news outlets
2. Data Repositories Research: GitHub, Kaggle
a. Project and Datasets
Research Result
————————————————————————
3. Data Scraping Research
a. What programming skills does building a crawler require?
b. Exploration of anti-crawling mechanisms
8. Week 4-Week 5 Research
1. Russian Focus Research
a. Russian Platform Research
b. Russian Politics and History Research
2. Data Repositories & Project Research
a. Birdwatch
b. Twitter Transparency Project
c. GitHub Data
————————————————————————
3. Technical Exploration
a. Machine Learning
b. Natural Language Processing
c. Dashboard (Power BI)
9. Week 6 Research
1. Data Cleaning Research
a. DIPLAB 3 assignment
b. SBS-ready format
——————————————————
2. Data Sentiment Analysis
a. Azure Machine Learning (tool)
b. English text dataset (results)
c. Chinese text dataset (unusual results)
10. Jerry’s Findings
Issues for Data Scraping
1. Intellectual property rights
2. Anti-crawling mechanisms
Solutions:
1. Research each website before scraping it
2. Improve the scraping script’s algorithm and combine it with data scraping tools such as BrightData
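Part of the "research each website" step can be automated by reading the site's robots.txt rules before any scraping starts. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `capstone-bot` user agent, and the example.com URLs are hypothetical placeholders. A robots.txt check does not answer intellectual property questions, so the site's terms of service still need a manual read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real script would instead point the
# parser at https://<site>/robots.txt with set_url() and call read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

USER_AGENT = "capstone-bot"  # hypothetical crawler name
parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether our crawler may fetch each URL under the parsed rules.
allowed_news = parser.can_fetch(USER_AGENT, "https://example.com/news/1")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests (None if unset).
delay = parser.crawl_delay(USER_AGENT)

print(allowed_news, allowed_private, delay)
```

Honoring the requested crawl delay is also a simple way to avoid triggering some anti-crawling mechanisms.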
Russian Research
- Russian citizens may be punished for posting fake news
Data Cleaning
- Python is powerful; e.g., it can handle a 30 GB CSV file
- Excel’s maximum is 1,048,576 rows per worksheet
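Python can work with files far beyond Excel's row limit largely because it can stream a file row by row instead of loading it whole. The following is a minimal sketch with the standard `csv` module; the `language` column and the small in-memory sample are hypothetical stand-ins for a real multi-gigabyte file.

```python
import csv
import io
from collections import Counter

def count_by_column(lines, column):
    """Stream rows one at a time and count the values of one column.

    `lines` is any iterable of text lines (for example an open file), so
    only the current row and the running counts are held in memory.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts

# Hypothetical sample data standing in for a very large CSV file.
sample = io.StringIO(
    "id,language,text\n"
    "1,en,hello\n"
    "2,ru,privet\n"
    "3,en,news\n"
)

counts = count_by_column(sample, "language")
print(counts)
```

For a file on disk, the same function could be called inside `with open(path, newline="", encoding="utf-8") as f:`, where `path` is whatever file the team is cleaning.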
11. Problems Related to the Project
Problems:
1. The complex work deviation and sponsor relationship
2. The final deliverable and goal are unclear
3. Lack of relevant skills
Solutions:
1. Ask for more communication & meetings
2.
a. Create a prototype to show our sponsor what we think the deliverable looks like
b. Accept that an unclear goal is normal in a project
c. Studying and gaining experience are our goals
3.
a. Ask Professor & Sponsor for learning resources
b. Self-study
12. Average Time Spent (hours / week)
Reading provided materials: 4
Qualitative Research: 3
Web Scraping - Python: 8
Data Cleaning - Python: 10
Data Analysis - SBS: 8
Email & Communication & Meetings: 6
Report: 1
Plans for Next Stage
● Use SBS to do data analysis
1. Get SBS working
2. Generate findings
● Web Scraping
1. Learn the Python BeautifulSoup library
2. Scrape websites that have useful data for analysis or model training
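The scraping step in this plan comes down to taking a page's HTML and pulling out the elements of interest. The sketch below uses Python's standard `html.parser` as a stand-in rather than BeautifulSoup, so it runs without extra installs; the sample HTML and the choice of `<h2>` as the headline tag are assumptions for illustration only.

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_headline = True
            self.headlines.append("")  # start a new headline

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) is appended as well.
        if self.in_headline:
            self.headlines[-1] += data

# Hypothetical page content; a real script would download this first.
SAMPLE_HTML = """
<html><body>
  <h2>First headline</h2><p>Body text.</p>
  <h2>Second <em>headline</em></h2>
</body></html>
"""

parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
headlines = [h.strip() for h in parser.headlines]
print(headlines)
```

With BeautifulSoup installed, roughly the same result should come from `BeautifulSoup(html, "html.parser").find_all("h2")` followed by `get_text()` on each match.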
13. Zixun’s Reflection
Problems
1. How to get started?
2. What should I do with the dataset I found?
Interesting Preliminary Findings
1. The data repository has a lot of datasets.
2. The relationship between text analysis and information authenticity.
3. Preparing training data is what I will ultimately work toward.
Time Cost
Interviewing Sponsors: every Wednesday, 1pm-4pm
Team Meeting and Discussion: every Friday, 1pm-2pm
Reading Materials and Learning: 4-6 hours every week
Summarizing Research Findings and Preparing Presentations: 1-2 hours every week
14. Zixun’s Reflection
Expected Results
1. Integrated datasets that are available for use
2. An understanding of the relationship between the results of text sentiment analysis and the authenticity of information
3. Prepared training data and predictive models
What have I learned?
1. The ability to explore and summarize
2. A programming language for web scraping
15. Lingyu’s reflection
Problems:
● Garbled data records
● Data crawling
Hours spent:
● Reading materials: 3 hr
● Exploring tools and datasets: 3 hr
● Team discussion and preparing the weekly report: 3-4 hr
16. Lingyu’s reflection
Interesting findings:
● People from different cultures have different opinions on the same information
● Mis-/disinformation campaigns are possibly operated by bots
What I have learned:
1. Teamwork and communication
2. Dealing with garbled datasets
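Garbled records often come from text being decoded with the wrong character encoding, which matters for Russian and Mandarin data in particular. The following is a minimal sketch of one common repair, assuming (this is not confirmed for the project's datasets) that UTF-8 bytes were mistakenly decoded as Latin-1; the function name `repair_mojibake` is hypothetical.

```python
def repair_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a wrong decode: re-encode with the wrong codec, then decode
    with the right one. If the round trip fails, the text was probably
    not garbled this way, so it is returned unchanged."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

original = "虚假新闻 / фейк"

# Simulate the damage: UTF-8 bytes read as if they were Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

repaired = repair_mojibake(garbled)
print(repaired == original)
```

Text that was garbled some other way, or that lost bytes when it was saved, cannot be recovered by this round trip, so results should be spot-checked by someone who reads the language.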
Plan for next stage:
● Work on analysis and data visualization using tools
● Find potential similarities among mis-/disinformation datasets
17. Ziyan’s Reflection
Problems:
1. Not knowing where to start when searching for data.
2. Understanding my work and writing a good, brief report.
Time Spent:
1. Data collection - 3 hrs per week
2. Tool exploration - 1.5 hrs per week
3. Project meetings - 3 hrs per week
4. Learning from materials - 2 hrs per week
18. Ziyan’s Reflection
Interesting Findings:
1. There is no absolute right or wrong in many things, and different positions lead to different answers to the same question.
2. Once its characteristics are identified, disinformation is easier to find.
What I Learned:
1. Exploring solutions to problems in unfamiliar areas.
2. Using Tableau for data analysis.
3. Do's and don'ts when presenting to a sponsor.
19. Next Step
● Use Tableau to analyze the current datasets (waiting for approval)
● Explore sentiment analysis
● Use SBS to do data analysis
● Learn web scraping
20. Live Data
● Social Media
● News Websites
A Dashboard
● Power BI
● A website similar to Hamilton 2.0
Data Collection & Cleaning
● Python BeautifulSoup
● Python DataCleaning
● Bright Data
Model Training
● Machine Learning
Data Analysis
● SBS
● Sentiment Analysis
● Tableau
Datasets
● e.g. Twitter Transparency
● e.g. Weibo Datasets
Fact-Checking Websites
● Human Verified
○ e.g. PolitiFact
● Automatically Verified
○ e.g. Duke Reporters' Lab
Data Collection & Cleaning
● Python BeautifulSoup
● Python DataCleaning
● Bright Data
Trained Model
Trained Model
Data Analysis
● SBS
● Sentiment Analysis
● Tableau
Reports
Result (real/fake)
Graphical Reports
Finished
Unfinished