mirror of
https://github.com/bellingcat/tiktok-hashtag-analysis.git
synced 2026-06-11 12:58:30 +03:00
updated command-line arguments in documentation, added gitignore, changed screenshots to text in README where possible, removed redundant imports and functions
22
.gitignore
vendored
Normal file
@@ -0,0 +1,22 @@
|
||||
# Data directory
|
||||
data/
|
||||
|
||||
# Miscellaneous files
|
||||
**/.DS_Store
|
||||
*.pyc
|
||||
*.ipynb
|
||||
*.db
|
||||
.env
|
||||
*.session
|
||||
*.session-journal
|
||||
service_account.json
|
||||
.vscode/
|
||||
*.log
|
||||
*.lock
|
||||
|
||||
# Unit test / coverage reports
|
||||
reports
|
||||
.coverage*
|
||||
.cache
|
||||
.pytest_cache/
|
||||
cover/
|
||||
121
README.md
@@ -1,64 +1,125 @@
|
||||
# TikTok hashtag analysis toolset
|
||||
The tool helps to download posts and videos from tiktok for a given set of hashtags. It uses the tiktok-scraper (https://github.com/drawrowfly/tiktok-scraper) to download the posts and videos.
|
||||
This tool downloads posts and videos from TikTok for a given set of hashtags, using the [tiktok-scraper](https://github.com/drawrowfly/tiktok-scraper) Node package.
|
||||
|
||||
## Pre-requisites
|
||||
1. Make sure you have python 3.6 or a later version installed.
|
||||
2. Download and install TikTok scraper: https://github.com/drawrowfly/tiktok-scraper
|
||||
1. Make sure you have Python 3.6 or a later version installed
|
||||
2. Install the [Pipenv](https://pipenv.pypa.io/en/latest/) Python package using the command:
|
||||
|
||||
`pip3 install pipenv`
|
||||
|
||||
3. Install the dependencies of this tool using the command:
|
||||
|
||||
`pipenv install`
|
||||
|
||||
4. Download and install [TikTok scraper](https://github.com/drawrowfly/tiktok-scraper)
|
||||
|
||||
### Options for running run_downloader.py
|
||||
|
||||
<code> python3 run_downloader.py -h </code>
|
||||
|
||||
|
||||
<img width="686" alt="Screenshot 2022-02-25 at 19 04 26" src="https://user-images.githubusercontent.com/72805812/155765360-47f0956c-220a-4098-8d52-1304a9f11e69.png">
|
||||
    $ python run_downloader.py -h
    usage: run_downloader.py [-h] [-t [T [T ...]]] [-f F] [-p] [-v]

    Download the tiktoks for the requested hashtags

    optional arguments:
      -h, --help      show this help message and exit
      -t [T [T ...]]  List of hashtags
      -f F            File name with the list of hashtags
      -p              Download posts
      -v              Download videos
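
For readers who want to see how such a command line could be declared, here is a minimal sketch using `argparse`. The real body of `create_parser()` is not shown in this commit, so the definition below is an assumption inferred only from the help output above (for instance, `nargs="*"` is a guess based on the `[-t [T [T ...]]]` usage string).

```python
import argparse


def create_parser():
    # Hypothetical reconstruction: options and help strings are copied from
    # the help output above; the actual implementation may differ.
    parser = argparse.ArgumentParser(
        prog="run_downloader.py",
        description="Download the tiktoks for the requested hashtags",
    )
    # Zero or more hashtags given directly on the command line.
    parser.add_argument("-t", nargs="*", help="List of hashtags")
    # Path of a file that lists the hashtags.
    parser.add_argument("-f", help="File name with the list of hashtags")
    # Boolean switches selecting what to download.
    parser.add_argument("-p", action="store_true", help="Download posts")
    parser.add_argument("-v", action="store_true", help="Download videos")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = create_parser().parse_args(["-t", "london", "paris", "-p"])
print(args.t, args.p, args.v)
```

Passing an explicit list to `parse_args` keeps the example independent of the shell; in the script itself the arguments would come from the command line.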
|
||||
|
||||
### Data organization
|
||||
|
||||
<code> tree ../data </code>
|
||||
    $ tree ../data
    ../data
    ├── ids
    │   └── post_ids.json
    ├── log
    │   └── log.json
    ├── london
    │   └── posts
    │       └── data.json
    ├── newyork
    │   └── posts
    │       └── data.json
    └── paris
        └── posts
            └── data.json
|
||||
|
||||
<img width="488" alt="Screenshot 2022-02-25 at 19 21 44" src="https://user-images.githubusercontent.com/72805812/155767522-94bd3774-60eb-45fc-8129-b2abc59c6089.png">
|
||||
|
||||
<code>data</code> folder contains all the downloaded data as shown in the picture above.
|
||||
1. the <code>log</code> folder contains log.json which records the total number of downloaded posts and videos for the hashtags against the time stamp of when the script is run.
|
||||
2. the <code>ids</code> folder contains two files <code>post_ids.json</code> and <code>video_ids.json</code> that records the ids of the downloaded posts and videos for each hashtag.
|
||||
3. Each hashtag has a folder with two subfolders <code>posts</code> and <code>videos</code> that store posts and videos respectively. The posts are stored in the <code>data.json</code> file in the <code>posts</code> folder, and videos are stored as the <code>.mp4</code> files in the <code>videos</code> folder.
|
||||
The `data` folder contains all the downloaded data as shown in the directory tree above.
|
||||
1. The `log` folder contains `log.json`, which records the total number of posts and videos downloaded for the hashtags, together with the timestamp of each run of the script.
|
||||
2. The `ids` folder contains two files, `post_ids.json` and `video_ids.json`, that record the ids of the downloaded posts and videos for each hashtag.
|
||||
3. Each hashtag has a folder with two subfolders, `posts` and `videos`, that store posts and videos respectively. The posts are stored in the `data.json` file in the `posts` folder, and videos are stored as `.mp4` files in the `videos` folder.
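
The layout described above can be summarised as a small path-building helper. This is only an illustrative sketch: the function name `expected_paths` is hypothetical and not part of the tool, and the paths simply mirror the folder description in this README.

```python
from pathlib import Path


def expected_paths(data_dir, hashtag):
    """Return where the data for one hashtag should live (hypothetical helper)."""
    root = Path(data_dir)
    return {
        # Shared bookkeeping files.
        "log": root / "log" / "log.json",
        "post_ids": root / "ids" / "post_ids.json",
        "video_ids": root / "ids" / "video_ids.json",
        # Per-hashtag content.
        "posts": root / hashtag / "posts" / "data.json",
        "videos": root / hashtag / "videos",
    }


paths = expected_paths("../data", "london")
print(paths["posts"].as_posix())
```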
|
||||
|
||||
|
||||
|
||||
### Post download
|
||||
Run the run_downloader.py with the following option:
|
||||
|
||||
<code> python3 run_downloader.py --h london paris newyork -p </code>
|
||||
python3 run_downloader.py -t london paris newyork -p
|
||||
|
||||
<img width="1301" alt="Screenshot 2022-02-25 at 19 14 06" src="https://user-images.githubusercontent.com/72805812/155766542-7de77313-6389-4ea2-aca5-b5f39fd70160.png">
|
||||
which will produce an output similar to the following log:
|
||||
|
||||
1. The --h option allows to type in hashtag list in the commandline.
|
||||
2. -p option specifies the download posts option.
|
||||
    $ python3 run_downloader.py -t london paris newyork -p
    ['london', 'paris', 'newyork']
    SUCCESS - 962 entries added to ../data/london/posts/data.json!!!
    SUCCESS - 962 entries added to ../data/ids/post_ids.json!!!
    Successfully deleted /Users/work/Documents/development_projects/Tiktok/tiktok/data/london/posts/london_1651533070680.json!!!
    Total posts for the hashtag london are: 962
    SUCCESS - 961 entries added to ../data/paris/posts/data.json!!!
    SUCCESS - 961 entries added to ../data/ids/post_ids.json!!!
    Successfully deleted /Users/work/Documents/development_projects/Tiktok/tiktok/data/paris/posts/paris_1651533102789.json!!!
    Total posts for the hashtag paris are: 961
    SUCCESS - 941 entries added to ../data/newyork/posts/data.json!!!
    SUCCESS - 941 entries added to ../data/ids/post_ids.json!!!
    Successfully deleted /Users/work/Documents/development_projects/Tiktok/tiktok/data/newyork/posts/newyork_1651533125549.json!!!
    Total posts for the hashtag newyork are: 941
    Successfully logged 2864 entries!!!!
|
||||
|
||||
1. The `-t` option takes a space-separated list of hashtags as a command-line argument.
|
||||
2. The `-p` option downloads posts.
|
||||
|
||||
|
||||
### Video download
|
||||
<code> python3 run_downloader.py --h london -v</code>
|
||||
`python3 run_downloader.py -t london -v`
|
||||
|
||||
1. --h option allows to type in the list of hashtags as command line argument.
|
||||
2. -v option is for downloading the videos
|
||||
The above code download all the trending videos for the hashtag london. Note that video downloading is a time and data rate consuming task, as a result we strongly recommend to use one hashtag at a time so as to avoid complications.
|
||||
1. The `-t` option takes a space-separated list of hashtags as a command-line argument.
|
||||
2. The `-v` option downloads videos.
|
||||
The above command downloads all the trending videos for the hashtag london. Note that downloading videos takes considerable time and bandwidth, so we strongly recommend using one hashtag at a time to avoid complications.
|
||||
|
||||
|
||||
### Top n hashtag occurrences
|
||||
In the analytics folder, the file <code>hashtag_frequencies.py</code> will plot the frequencies of top occurring hashtags in a given set of posts.
|
||||
In the analytics folder, the file `hashtag_frequencies.py` will plot the frequencies of top occurring hashtags in a given set of posts.
|
||||
Assume we want to plot the graph of top 20 occurring hashtags in the downloaded posts of the hashtag london.
|
||||
|
||||
1. Plotting the saving the image as a png file: <code> python3 hashtag_frequencies.py -p ../data/london/posts/data.json 20 -v</code>
|
||||
1. Plotting and saving the image as a png file: `python3 hashtag_frequencies.py -p ../data/london/posts/data.json 20`
|
||||
|
||||
<img width="1390" alt="Screenshot 2022-02-25 at 19 45 40" src="https://user-images.githubusercontent.com/72805812/155770710-0d167bbb-4c44-44d2-ba1c-fa57026afea8.png">
|
||||
|
||||
The figure above shows the top 20 occurring hashtags among all the posts downloaded for the hashtag london. Clearly, the highest occurrence will be of the hashtag london as the file <code>data/london/posts/data.json</code> contain all the posts with hashtag london.
|
||||
|
||||
2. Printing the result in the shell: <code> python3 hashtag_frequencies.py -d ../data/london/posts/data.json 20 -v</code>
|
||||
<img width="807" alt="Screenshot 2022-02-25 at 19 54 09" src="https://user-images.githubusercontent.com/72805812/155771757-e71b2858-cd9c-4496-8cc5-76146e8a8d32.png">
|
||||
|
||||
The same result of 1 is printed in the shell. The last column shows the ratio of the occurrence to the total posts.
|
||||
The figure above shows the top 20 occurring hashtags among all the posts downloaded for the hashtag london. As expected, london itself has the highest occurrence, because the file `data/london/posts/data.json` contains only posts with the hashtag london.
|
||||
|
||||
2. Printing the result in the shell: `python3 hashtag_frequencies.py -d ../data/london/posts/data.json 20`
|
||||
```
Rank  Hashtag     Occurrences  Frequency (Occurrences/Total-Posts(total_posts))
0     london      962          1.0
1     fyp         493          0.5124740124740125
2     uk          238          0.24740124740124741
3     foryou      223          0.23180873180873182
4     foryoupage  186          0.19334719334719336
5     viral       177          0.183991683991684
6     fypシ        85           0.08835758835758836
7     funny       55           0.057172557172557176
8     xyzbca      52           0.05405405405405406
9     england     45           0.04677754677754678
10    british     44           0.04573804573804574
11    trending    39           0.04054054054054054
12    fy          33           0.034303534303534305
13    comedy      32           0.033264033264033266
14    roadman     28           0.029106029106029108
15    4u          27           0.028066528066528068
16    usa         26           0.02702702702702703
17    tiktok      26           0.02702702702702703
18    travel      21           0.02182952182952183
19    america     20           0.02079002079002079
```
|
||||
|
||||
The same result as in step 1 is printed in the shell. The last column shows the ratio of each hashtag's occurrences to the total number of posts.
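
To make the meaning of the occurrence and frequency columns concrete, here is a minimal sketch of how such a ranking could be computed. It is not the code of `hashtag_frequencies.py`: the function name `top_hashtags` is hypothetical, and the post structure (a `hashtags` list of objects with a `name` field) is an assumption about the layout of `data.json`.

```python
from collections import Counter


def top_hashtags(posts, n):
    """Rank hashtags by the number of posts they appear in (illustrative only)."""
    total_posts = len(posts)
    counts = Counter()
    for post in posts:
        # Assumed structure: post["hashtags"] is a list of {"name": ...} dicts.
        # A set counts each hashtag at most once per post, so the frequency
        # (occurrences / total posts) never exceeds 1.0.
        names = {tag["name"] for tag in post.get("hashtags", [])}
        counts.update(names)
    return [
        (rank, name, occurrences, occurrences / total_posts)
        for rank, (name, occurrences) in enumerate(counts.most_common(n))
    ]


# Tiny made-up sample, not real TikTok data.
sample_posts = [
    {"hashtags": [{"name": "london"}, {"name": "fyp"}]},
    {"hashtags": [{"name": "london"}]},
]
rows = top_hashtags(sample_posts, 2)
for row in rows:
    print(*row)
```

Counting each hashtag once per post is a design choice of this sketch; the actual script may count repeated hashtags differently.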
|
||||
@@ -1,5 +1,5 @@
|
||||
import os, sys
|
||||
import csv, json
|
||||
import json
|
||||
import argparse
|
||||
import matplotlib.pyplot as plt
|
||||
from datetime import datetime
|
||||
@@ -96,7 +96,7 @@ if __name__ == "__main__":
|
||||
"-d" option prints the hashtag frequencies on the shell
|
||||
"-p" option plots the hashtag frequencies and saves as a png file in the folder /data/imgs/
|
||||
|
||||
The function get_occurances is triggered to compute and return the top n occurances and the hashtags.
|
||||
The function get_occurrences is triggered to compute and return the top n occurrences and the hashtags.
|
||||
"""
|
||||
img_folder = global_data.IMAGES
|
||||
file_methods.check_file(img_folder, "dir")
|
||||
@@ -113,11 +113,10 @@ if __name__ == "__main__":
|
||||
|
||||
base = os.path.splitext(args.input_file)[0]
|
||||
path = f"./{base}_sorted_hashtags.csv"
|
||||
occs = get_occurrences(args.input_file, args.n)
|
||||
if args.plot:
|
||||
occs = get_occurrences(args.input_file, args.n)
|
||||
plot(args.n, occs, img_folder)
|
||||
else:
|
||||
occs = get_occurrences(args.input_file, args.n)
|
||||
print_occurrences(occs)
|
||||
else:
|
||||
print(f'ERROR: either {args.input_file} or {args.n} or both contains error.')
|
||||
|
||||
@@ -1,7 +1,4 @@
|
||||
import os
|
||||
from collections import namedtuple
|
||||
from datetime import datetime
|
||||
import global_data
|
||||
import file_methods
|
||||
|
||||
"""
|
||||
|
||||
@@ -24,7 +24,7 @@ def create_file(name, file_type):
|
||||
|
||||
def check_existence(file_path, file_type):
|
||||
"""
|
||||
Checks the existence of a file or a directory. If not found, returns a False, else returns a true.
|
||||
Checks the existence of a file or a directory. If not found, returns False, else returns True.
|
||||
"""
|
||||
if (file_type == "file"):
|
||||
return os.path.isfile(file_path)
|
||||
|
||||
@@ -1,6 +1,5 @@
|
||||
import os, sys
|
||||
import os
|
||||
import time
|
||||
import json
|
||||
import argparse
|
||||
|
||||
import global_data
|
||||
@@ -17,7 +16,7 @@ The run_downloader.py dowloads data using the tiktok-scraper (https://github.com
|
||||
5. "-f" option is used to read the list of hashtags from the user specified file
|
||||
|
||||
Example:
|
||||
1. The command "python3 run_downloader.py --h london paris newyork -p" will download posts for hashtags london, paris and newyork.
|
||||
1. The command "python3 run_downloader.py -t london paris newyork -p" will download posts for hashtags london, paris and newyork.
|
||||
2. The command "python3 run_downloader.py -f hashtag_list -p -v" will download posts and videos for hashtags in the file hashtag_list.
|
||||
|
||||
|
||||
@@ -46,8 +45,7 @@ hashtag_list - this file contains the list of hashtags that the user wants to do
|
||||
def get_hashtag_list(file_name):
|
||||
try:
|
||||
with open(file_name) as f:
|
||||
gn = (line.strip() for line in f if not line.startswith("#"))
|
||||
tags = list(line for line in gn if line)
|
||||
tags = list(filter(None, [line.strip() for line in f if not line.startswith("#")]))
|
||||
return tags
|
||||
except IOError as error:
|
||||
print(error)
|
||||
@@ -85,7 +83,7 @@ def set_download_settings(download_data_type):
|
||||
settings["post_ids"] = global_data.FILES["post_ids"]
|
||||
settings["data_file"] = global_data.FILES["data_file"]
|
||||
|
||||
if download_data_type == "videos":
|
||||
if download_data_type["videos"]:
|
||||
settings["videos"] = global_data.FILES["videos"]
|
||||
settings["video_ids"] = global_data.FILES["video_ids"]
|
||||
|
||||
@@ -159,7 +157,7 @@ def get_data(hashtags, download_data_type):
|
||||
if counter < total_hashtags_offset:
|
||||
time.sleep(settings["sleep"])
|
||||
|
||||
if download_data_type == "videos":
|
||||
if download_data_type["videos"]:
|
||||
settings = set_download_settings(download_data_type)
|
||||
while counter < total_hashtags:
|
||||
tag = hashtags[counter]
|
||||
@@ -179,24 +177,12 @@ def get_data(hashtags, download_data_type):
|
||||
return log_data
|
||||
|
||||
|
||||
|
||||
def get_hashtags(file_name, hashtag_list):
|
||||
"""
|
||||
Loads and returns the list of hashtags from user specified file.
|
||||
"""
|
||||
try:
|
||||
from hashtag_list import hashtag_list
|
||||
return hashtag_list
|
||||
except ImportError:
|
||||
raise ImportError(f"ERROR: something went wrong while reading the file {file_name}!")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = create_parser()
|
||||
args = parser.parse_args()
|
||||
|
||||
if not (args.t or args.f):
|
||||
parser.error("No hashtags were given, please use either --h option or -f to provide hashtags.")
|
||||
parser.error("No hashtags were given, please use either -t option or -f to provide hashtags.")
|
||||
|
||||
if not (args.p or args.v):
|
||||
parser.error("No argument given, please specify either -p for posts or -v videos or both.")
|
||||
@@ -209,7 +195,7 @@ if __name__ == "__main__":
|
||||
|
||||
print(hashtags)
|
||||
if not hashtags:
|
||||
raise Exception("No hashtags were given, please use either --h option or -f to provide hashtags.")
|
||||
raise Exception("No hashtags were given, please use either -t option or -f to provide hashtags.")
|
||||
|
||||
if (args.p and args.v):
|
||||
download_data_type = {
|
||||
|
||||