Mirror of https://github.com/bellingcat/tiktok-hashtag-analysis.git (synced 2026-06-07 19:08:32 +03:00)
updated README, made yt-dlp downloading more robust against errors, changed name of videos folder to media (since images and audio files are also downloaded now)
29
README.md
@@ -15,7 +15,7 @@ You should now be ready to start using it.
 ## About the tool
 ### Command-line arguments
 ```
-usage: tiktok-hashtag-analysis [-h] [--file FILE] [-d] [--number NUMBER] [-p] [-t] [--output-dir OUTPUT_DIR] [--log LOG] [hashtags ...]
+usage: tiktok-hashtag-analysis [-h] [--file FILE] [-d] [--number NUMBER] [-p] [-t] [--output-dir OUTPUT_DIR] [--config CONFIG] [--log LOG] [hashtags ...]

 Analyze hashtags within posts scraped from TikTok.

@@ -31,6 +31,7 @@ optional arguments:
   -t, --table           Print a table of the most common co-occurring hashtags
   --output-dir OUTPUT_DIR
                         Directory to save scraped data and visualizations to
+  --config CONFIG       File name of configuration file to store TikTok credentials to
   --log LOG             File to write logs to
 ```

@@ -38,23 +39,23 @@ optional arguments:
 ```
 $ tree ../data
 ../data
-├── ids
-│   └── post_ids.json
 ├── london
-│   └── posts
-│       └── data.json
+│   ├── plots
+│   ├── posts.json
+│   └── media
 ├── newyork
-│   └── posts
-│       └── data.json
+│   ├── plots
+│   ├── posts.json
+│   └── media
 └── paris
-    └── posts
-        └── data.json
+│   ├── plots
+│   ├── posts.json
+│   └── media
 ```


 The `data` folder contains all the downloaded data as shown in the tree diagram above.
-- The `ids` folder contains two files `post_ids.json` and `video_ids.json` that record the ids of the downloaded posts and videos for each hashtag.
-- Each hashtag has a folder with two subfolders `posts` and `videos` that store posts and videos respectively. The posts are stored in the `data.json` file in the `posts` folder, and videos are stored as the `.mp4` files in the `videos` folder.
+- Each hashtag has a folder with two subfolders `plots` and `media` that store plots of the most common co-occurring hashtags, and media downloaded from the posts. The posts are stored in the `posts.json` file, and downloaded media is stored as `.mp4` files (for videos) or audio and image files (for image galleries) in the `media` folder.


 ## How to use
@@ -75,8 +76,8 @@ and will produce an output similar to the following log:
 - The list of hashtags to scrape is specified as a positional argument

 ### Video downloading
-Running the `tiktok-hashtag-analysis` script with the following options will scrape trending videos containing the hashtag `#london`:
-`tiktok-hashtag-analysis download london --download`
+Running the `tiktok-hashtag-analysis` script with the following options will scrape trending posts containing the hashtag `#london`:
+`tiktok-hashtag-analysis london --download`

 - The `--download` flag specifies that video files for scraped posts should be downloaded

@@ -84,7 +85,7 @@ Note that video downloading is a time and data rate consuming task, as a result

 ## Analyzing results
 ### Most common co-occurring hashtags
-In addition to scraping data and downloading videos, the `tiktok-hashtag-analysis` script can also analyze the frequencies of the most common co-occurring hashtags in a given set of posts.
+In addition to scraping data and downloading media, the `tiktok-hashtag-analysis` script can also analyze the frequencies of the most common co-occurring hashtags in a given set of posts.

 Assume we want to analyze the 20 most frequently co-occurring hashtags in the downloaded posts of the `#london` hashtag.

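Review note on the README text above: counting co-occurring hashtags can be sketched with `collections.Counter`. This is only an illustration under stated assumptions, not the tool's implementation. The post structure (a dict with a `hashtags` list of strings) and the function name `top_cooccurring` are hypothetical and do not come from the repository.

```python
from collections import Counter


def top_cooccurring(posts, target, n=20):
    """Return the n hashtags that most often appear alongside `target`.

    Assumption (hypothetical): each post is a dict with a "hashtags" key
    holding a list of hashtag strings without the leading "#".
    """
    counts = Counter()
    for post in posts:
        # Normalize case and de-duplicate hashtags within a single post
        tags = {tag.lower() for tag in post.get("hashtags", [])}
        if target in tags:
            # Count every other hashtag once per post
            counts.update(tags - {target})
    return counts.most_common(n)


posts = [
    {"hashtags": ["london", "travel", "uk"]},
    {"hashtags": ["london", "travel"]},
    {"hashtags": ["paris", "travel"]},
]
print(top_cooccurring(posts, "london"))  # [('travel', 2), ('uk', 1)]
```

Using a set per post means a hashtag repeated within one caption is counted once, which keeps the frequencies interpretable as "number of posts".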
@@ -155,7 +155,7 @@ class TikTokDownloader:

         # Define file containing post data and directory to save videos to
         hashtag_file = self.data_dir / hashtag / "posts.json"
-        video_dir = self.data_dir / hashtag / "videos"
+        video_dir = self.data_dir / hashtag / "media"
         video_dir.mkdir(exist_ok=True)

         # Get list of post IDs that have previously had their media downloaded
@@ -191,8 +191,8 @@ class TikTokDownloader:

         # Download video files for all video posts
         if len(urls_to_download) > 0:
-            logging.info(f"Downloading videos for hashtag {hashtag}")
-            ydl_opts = {"outtmpl": os.path.join(video_dir, "%(id)s.%(ext)s")}
+            logging.info(f"Downloading media for hashtag {hashtag}")
+            ydl_opts = {"outtmpl": os.path.join(video_dir, "%(id)s.%(ext)s"), "ignore_errors": True}
             with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                 ydl.download(urls_to_download)

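Review note on the two Python hunks above: the folder layout and the download options can be sketched in isolation as follows. This is a minimal sketch, not the project's code. The helper names `media_dir_for`, `build_ydl_opts` and `download_media` are hypothetical. One point worth checking: yt-dlp's `YoutubeDL` documents the error-tolerance option as `ignoreerrors` (no underscore), so the sketch uses that spelling. It differs from the `ignore_errors` key in the commit, which may therefore not have the intended effect.

```python
from pathlib import Path


def media_dir_for(data_dir, hashtag):
    """Return (and create) the per-hashtag media directory."""
    media_dir = Path(data_dir) / hashtag / "media"
    # parents=True also creates the hashtag folder if it does not exist yet
    media_dir.mkdir(parents=True, exist_ok=True)
    return media_dir


def build_ydl_opts(media_dir):
    """Build yt-dlp options that name files by post ID and skip failures."""
    return {
        "outtmpl": str(Path(media_dir) / "%(id)s.%(ext)s"),
        # Documented yt-dlp key: continue with the next URL if one fails
        "ignoreerrors": True,
    }


def download_media(urls, media_dir):
    """Download every URL into media_dir; requires the yt-dlp package."""
    import yt_dlp  # imported lazily so the helpers above work without it

    with yt_dlp.YoutubeDL(build_ydl_opts(media_dir)) as ydl:
        ydl.download(urls)
```

Keeping the option dictionary in its own function makes the configuration testable without network access or yt-dlp being installed.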