diff --git a/README.md b/README.md
index 75e5e26..2c51e2e 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ You should now be ready to start using it.
 ## About the tool
 ### Command-line arguments
 ```
-usage: tiktok-hashtag-analysis [-h] [--file FILE] [-d] [--number NUMBER] [-p] [-t] [--output-dir OUTPUT_DIR] [--log LOG] [hashtags ...]
+usage: tiktok-hashtag-analysis [-h] [--file FILE] [-d] [--number NUMBER] [-p] [-t] [--output-dir OUTPUT_DIR] [--config CONFIG] [--log LOG] [hashtags ...]
 
 Analyze hashtags within posts scraped from TikTok.
 
@@ -31,6 +31,7 @@ optional arguments:
   -t, --table           Print a table of the most common co-occurring hashtags
   --output-dir OUTPUT_DIR
                         Directory to save scraped data and visualizations to
+  --config CONFIG       File name of configuration file to store TikTok credentials to
   --log LOG             File to write logs to
 ```
 
@@ -38,23 +39,23 @@ optional arguments:
 ```
 $ tree ../data
 ../data
-├── ids
-│   └── post_ids.json
 ├── london
-│   └── posts
-│       └── data.json
+│   ├── plots
+│   ├── posts.json
+│   └── media
 ├── newyork
-│   └── posts
-│       └── data.json
+│   ├── plots
+│   ├── posts.json
+│   └── media
 └── paris
-    └── posts
-        └── data.json
+    ├── plots
+    ├── posts.json
+    └── media
 ```
 
 The `data` folder contains all the downloaded data as shown in the tree diagram above.
 
-- The `ids` folder contains two files `post_ids.json` and `video_ids.json` that record the ids of the downloaded posts and videos for each hashtag.
-- Each hashtag has a folder with two subfolders `posts` and `videos` that store posts and videos respectively. The posts are stored in the `data.json` file in the `posts` folder, and videos are stored as the `.mp4` files in the `videos` folder.
+- Each hashtag has a folder with two subfolders `plots` and `media` that store plots of the most common co-occurring hashtags, and media downloaded from the posts. The posts are stored in the `posts.json` file, and downloaded media is stored as `.mp4` files (for videos) or audio and image files (for image galleries) in the `media` folder.
 
 ## How to use
 
@@ -75,8 +76,8 @@ and will produce an output similar to the following log:
 - The list of hashtags to scrape is specified as a positional argument
 
 ### Video downloading
-Running the `tiktok-hashtag-analysis` script with the following options will scrape trending videos containing the hashtag `#london`:
-`tiktok-hashtag-analysis download london --download`
+Running the `tiktok-hashtag-analysis` script with the following options will scrape trending posts containing the hashtag `#london`:
+`tiktok-hashtag-analysis london --download`
 
 - The `--download` flag specifies that video files for scraped posts should be downloaded
 
@@ -84,7 +85,7 @@ Note that video downloading is a time and data rate consuming task, as a result
 
 ## Analyzing results
 ### Most common co-occurring hashtags
-In addition to scraping data and downloading videos, the `tiktok-hashtag-analysis` script can also analyze the frequencies of the most common co-occurring hashtags in a given set of posts.
+In addition to scraping data and downloading media, the `tiktok-hashtag-analysis` script can also analyze the frequencies of the most common co-occurring hashtags in a given set of posts.
 
 Assume we want to analyze the 20 most frequently co-occurring hashtags in the downloaded posts of the `#london` hashtag.
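Reviewer note: the README change describes counting the most common co-occurring hashtags, so a small illustration of the idea may help readers of this diff. The sketch below is not the tool's implementation. The function name `top_cooccurring` is hypothetical, and it assumes each post has already been reduced to a plain list of hashtag strings, since the schema of `posts.json` is not shown here.

```python
from collections import Counter


def top_cooccurring(post_hashtags, target, n=20):
    """Return the n hashtags that most often appear alongside `target`.

    `post_hashtags` is assumed to be an iterable of hashtag lists, one per
    post. This input shape is an illustration, not the posts.json schema.
    """
    counts = Counter()
    for tags in post_hashtags:
        # Normalize case and drop duplicates within a single post
        unique = {tag.lower() for tag in tags}
        if target in unique:
            # Count every other hashtag in posts that contain the target
            counts.update(unique - {target})
    return counts.most_common(n)


posts = [
    ["london", "travel", "uk"],
    ["london", "travel"],
    ["paris", "travel"],
]
result = top_cooccurring(posts, "london", n=2)
```

Deduplicating within a post means a hashtag repeated in one caption is counted once for that post, which is one reasonable reading of "co-occurring"; the tool itself may count differently.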
diff --git a/tiktok_hashtag_analysis/base.py b/tiktok_hashtag_analysis/base.py
index e059dbb..c6aed7e 100644
--- a/tiktok_hashtag_analysis/base.py
+++ b/tiktok_hashtag_analysis/base.py
@@ -155,7 +155,7 @@ class TikTokDownloader:
 
         # Define file containing post data and directory to save videos to
         hashtag_file = self.data_dir / hashtag / "posts.json"
-        video_dir = self.data_dir / hashtag / "videos"
+        video_dir = self.data_dir / hashtag / "media"
         video_dir.mkdir(exist_ok=True)
 
         # Get list of post IDs that have previously had their media downloaded
@@ -191,8 +191,8 @@ class TikTokDownloader:
 
         # Download video files for all video posts
         if len(urls_to_download) > 0:
-            logging.info(f"Downloading videos for hashtag {hashtag}")
-            ydl_opts = {"outtmpl": os.path.join(video_dir, "%(id)s.%(ext)s")}
+            logging.info(f"Downloading media for hashtag {hashtag}")
+            ydl_opts = {"outtmpl": os.path.join(video_dir, "%(id)s.%(ext)s"), "ignoreerrors": True}
             with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                 ydl.download(urls_to_download)
 
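Reviewer note: yt-dlp spells its skip-on-failure option `ignoreerrors`, with no underscore, and unrecognized keys in the options dictionary are not reported as errors. The sketch below shows one way the options could be assembled so that a single failed download does not abort the batch. The helper name `build_ydl_opts` is hypothetical and not part of this module.

```python
from pathlib import Path


def build_ydl_opts(media_dir):
    """Build a yt-dlp options dict that saves files as <post id>.<extension>.

    Hypothetical helper for illustration; the module in this diff builds
    the dictionary inline.
    """
    return {
        # Output template: one file per post, named by its ID
        "outtmpl": str(Path(media_dir) / "%(id)s.%(ext)s"),
        # Keep going when an individual download fails
        "ignoreerrors": True,
    }


opts = build_ydl_opts("data/london/media")

# Usage with yt-dlp (requires the yt_dlp package and network access):
# with yt_dlp.YoutubeDL(opts) as ydl:
#     ydl.download(urls_to_download)
```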