updated README, made yt-dlp downloading more robust against errors, changed name of videos folder to media (since images and audio files are also downloaded now)

Tristan Lee
2023-09-04 13:51:28 -05:00
parent 5ae9624968
commit 8c32a3cf16
2 changed files with 18 additions and 17 deletions

@@ -15,7 +15,7 @@ You should now be ready to start using it.
## About the tool
### Command-line arguments
```
-usage: tiktok-hashtag-analysis [-h] [--file FILE] [-d] [--number NUMBER] [-p] [-t] [--output-dir OUTPUT_DIR] [--log LOG] [hashtags ...]
+usage: tiktok-hashtag-analysis [-h] [--file FILE] [-d] [--number NUMBER] [-p] [-t] [--output-dir OUTPUT_DIR] [--config CONFIG] [--log LOG] [hashtags ...]
Analyze hashtags within posts scraped from TikTok.
@@ -31,6 +31,7 @@ optional arguments:
   -t, --table           Print a table of the most common co-occurring hashtags
   --output-dir OUTPUT_DIR
                         Directory to save scraped data and visualizations to
+  --config CONFIG       File name of configuration file to store TikTok credentials to
   --log LOG             File to write logs to
```
@@ -38,23 +39,23 @@ optional arguments:
```
 $ tree ../data
 ../data
-├── ids
-│   └── post_ids.json
 ├── london
-│   └── posts
-│       └── data.json
+│   ├── plots
+│   ├── posts.json
+│   └── media
 ├── newyork
-│   └── posts
-│       └── data.json
+│   ├── plots
+│   ├── posts.json
+│   └── media
 └── paris
-    └── posts
-        └── data.json
+    ├── plots
+    ├── posts.json
+    └── media
```
The `data` folder contains all the downloaded data as shown in the tree diagram above.
-- The `ids` folder contains two files `post_ids.json` and `video_ids.json` that record the ids of the downloaded posts and videos for each hashtag.
-- Each hashtag has a folder with two subfolders `posts` and `videos` that store posts and videos respectively. The posts are stored in the `data.json` file in the `posts` folder, and videos are stored as the `.mp4` files in the `videos` folder.
+- Each hashtag has a folder with two subfolders `plots` and `media` that store plots of the most common co-occurring hashtags, and media downloaded from the posts. The posts are stored in the `posts.json` file, and downloaded media is stored as `.mp4` files (for videos) or audio and image files (for image galleries) in the `media` folder.
## How to use
@@ -75,8 +76,8 @@ and will produce an output similar to the following log:
- The list of hashtags to scrape is specified as a positional argument
### Video downloading
-Running the `tiktok-hashtag-analysis` script with the following options will scrape trending videos containing the hashtag `#london`:
-`tiktok-hashtag-analysis download london --download`
+Running the `tiktok-hashtag-analysis` script with the following options will scrape trending posts containing the hashtag `#london`:
+`tiktok-hashtag-analysis london --download`
- The `--download` flag specifies that video files for scraped posts should be downloaded
@@ -84,7 +85,7 @@ Note that video downloading is a time and data rate consuming task, as a result
## Analyzing results
### Most common co-occurring hashtags
-In addition to scraping data and downloading videos, the `tiktok-hashtag-analysis` script can also analyze the frequencies of the most common co-occurring hashtags in a given set of posts.
+In addition to scraping data and downloading media, the `tiktok-hashtag-analysis` script can also analyze the frequencies of the most common co-occurring hashtags in a given set of posts.
Assume we want to analyze the 20 most frequently co-occurring hashtags in the downloaded posts of the `#london` hashtag.
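The co-occurrence count described above can be sketched roughly as follows. This is an illustration only, not the tool's implementation: the function name `co_occurring_hashtags` and the input shape (one list of hashtag strings per post) are assumptions, since the layout of `posts.json` is not shown here.

```python
from collections import Counter


def co_occurring_hashtags(posts, target, top_n=20):
    """Count hashtags that appear alongside `target`, most common first.

    `posts` is assumed to be an iterable of hashtag lists, one per post.
    """
    target = target.lower()
    counts = Counter()
    for hashtags in posts:
        # Use a set so a hashtag repeated within one post is counted once
        tags = {tag.lower() for tag in hashtags}
        if target in tags:
            counts.update(tags - {target})
    return counts.most_common(top_n)


# Hypothetical sample data: three posts, two of which contain #london
posts = [
    ["london", "travel", "uk"],
    ["london", "travel"],
    ["paris", "travel"],
]
top = co_occurring_hashtags(posts, "london")
print(top)  # [('travel', 2), ('uk', 1)]
```

Posts that do not contain the target hashtag are skipped, so `#travel` in the `paris` post does not add to the count.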

@@ -155,7 +155,7 @@ class TikTokDownloader:
         # Define file containing post data and directory to save videos to
         hashtag_file = self.data_dir / hashtag / "posts.json"
-        video_dir = self.data_dir / hashtag / "videos"
+        video_dir = self.data_dir / hashtag / "media"
         video_dir.mkdir(exist_ok=True)
         # Get list of post IDs that have previously had their media downloaded
@@ -191,8 +191,8 @@ class TikTokDownloader:
         # Download video files for all video posts
         if len(urls_to_download) > 0:
-            logging.info(f"Downloading videos for hashtag {hashtag}")
-            ydl_opts = {"outtmpl": os.path.join(video_dir, "%(id)s.%(ext)s")}
+            logging.info(f"Downloading media for hashtag {hashtag}")
+            ydl_opts = {"outtmpl": os.path.join(video_dir, "%(id)s.%(ext)s"), "ignoreerrors": True}
             with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                 ydl.download(urls_to_download)
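
A minimal standalone sketch of the download step, for reference. It assumes the option key in yt-dlp's Python API is `ignoreerrors` (the counterpart of the `--ignore-errors` command-line flag); unrecognized keys in the options dictionary appear to be ignored without warning, so the spelling matters. The helper names `build_ydl_opts` and `download_media` are hypothetical and not part of `TikTokDownloader`.

```python
from pathlib import Path


def build_ydl_opts(media_dir: Path) -> dict:
    """Build yt-dlp options that name files by post ID and skip failed downloads."""
    return {
        # Save each file as <media_dir>/<post id>.<extension>
        "outtmpl": str(media_dir / "%(id)s.%(ext)s"),
        # Keep going when one URL fails instead of aborting the whole batch
        "ignoreerrors": True,
    }


def download_media(urls: list, media_dir: Path):
    """Download every URL into media_dir and return whatever yt-dlp's download() returns."""
    import yt_dlp  # imported lazily so build_ydl_opts works without yt-dlp installed

    media_dir.mkdir(parents=True, exist_ok=True)
    with yt_dlp.YoutubeDL(build_ydl_opts(media_dir)) as ydl:
        return ydl.download(urls)


opts = build_ydl_opts(Path("data") / "london" / "media")
```

Keeping the options in a separate function makes it easy to check the keys without touching the network.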