diff --git a/README.md b/README.md index baf9d8a..7d13107 100644 --- a/README.md +++ b/README.md @@ -2,14 +2,19 @@ The tool helps to download posts and videos from TikTok for a given set of hashtags. It uses the [tiktok-scraper](https://github.com/drawrowfly/tiktok-scraper) Node package to download the posts and videos. ## Pre-requisites -1. Make sure you have Python 3.6 or a later version installed. +1. Make sure you have Python 3.6 or a later version installed 2. Download and install TikTok scraper: https://github.com/drawrowfly/tiktok-scraper -3. (Optional) create and activate a virtual environment for this tool, for example by executing the following command, which creates the `env` virtual environment: +3. (Optional) create and activate a virtual environment for this tool, for example by executing the following command, which creates the `.env` virtual environment in the tool's root directory: - `python3 -m venv env` -4. Start your virtual environment - `source ./env/bin/activate` -5. Run `pip install -r requirements.txt` + `python3 -m venv .env` + +4. Start your virtual environment + - On Unix-like operating systems (macOS, Linux), this can be done using the command `source .env/bin/activate` + - On Windows, this can be done using the command `.env\Scripts\activate.bat` + +5. Install the Python package dependencies for this tool by executing the command: + + `pip install -r requirements.txt` You should now be ready to start using the tool. @@ -36,8 +41,6 @@ $ tree ../data ../data ├── ids │ └── post_ids.json -├── log -│ └── log.json ├── london │ └── posts │ └── data.json @@ -51,7 +54,6 @@ $ tree ../data The `data` folder contains all the downloaded data as shown in the tree diagram above. -- (Depricated: logging info is now found in logfile.py in the project folder.) The `log` folder contains the `log.json` file, which records the total number of downloaded posts and videos for the hashtags against the timestamp of when the script was run. - The `ids` folder contains two files `post_ids.json` and `video_ids.json` that record the ids of the downloaded posts and videos for each hashtag. - Each hashtag has a folder with two subfolders `posts` and `videos` that store posts and videos respectively. The posts are stored in the `data.json` file in the `posts` folder, and videos are stored as the `.mp4` files in the `videos` folder. @@ -65,32 +67,23 @@ Running the `run_downloader.py` script with the following options will scrape po and will produce an output similar to the following log: $ python3 run_downloader.py -t london paris newyork -p - ['london', 'paris', 'newyork'] - SUCCESS - 962 entries added to ../data/london/posts/data.json!!! - SUCCESS - 962 entries added to ../data/ids/post_ids.json!!! - Successfully deleted /Users/work/Documents/development_projects/Tiktok/tiktok/data/london/posts/london_1651533070680.json!!! - Total posts for the hashtag london are: 962 - SUCCESS - 961 entries added to ../data/paris/posts/data.json!!! - SUCCESS - 961 entries added to ../data/ids/post_ids.json!!! - Successfully deleted /Users/work/Documents/development_projects/Tiktok/tiktok/data/paris/posts/paris_1651533102789.json!!! - Total posts for the hashtag paris are: 961 - SUCCESS - 941 entries added to ../data/newyork/posts/data.json!!! - SUCCESS - 941 entries added to ../data/ids/post_ids.json!!! - Successfully deleted /Users/work/Documents/development_projects/Tiktok/tiktok/data/newyork/posts/newyork_1651533125549.json!!! - Total posts for the hashtag newyork are: 941 - Successfully logged 2864 entries!!!! + Hashtags to scrape: ['london', 'paris', 'newyork'] + Scraped 963 posts containing the hashtag 'london' + Scraped 961 posts containing the hashtag 'paris' + Scraped 940 posts containing the hashtag 'newyork' + Successfully scraped 2864 total entries - The `-t` flag allows a space-separated list of hashtags to be specified as a command line argument - The `-p` flag specifies that posts, not videos, will be downloaded ### Video downloading -Running the `run_downloader.py` script with the following options will scrape trending videos containing the hashtags `#london`, `#paris`, or `#newyork`: +Running the `run_downloader.py` script with the following options will scrape trending videos containing the hashtag `#london`: ` python3 run_downloader.py -t london -v` - The `-t` flag allows a space-separated list of hashtags to be specified as a command line argument - The `-v` flag specifies that videos, not posts, will be downloaded -Note that video downloading is a time and data rate consuming task, as a result we strongly recommend using one hashtag at a time when using the `-v` flag to avoid complications. +Note that video downloading is a time and data rate consuming task, as a result we recommend using one hashtag at a time when using the `-v` flag to avoid complications. ## Analyzing results ### Top n hashtag occurrences @@ -110,17 +103,18 @@ optional arguments: -d, --print List top n hashtags ``` -Assume we want to analyze the top 20 occurring hashtags in the downloaded posts of the `#london` hashtag. +Assume we want to analyze the 20 most frequently occurring hashtags in the downloaded posts of the `#london` hashtag. - The results can be plotted and saved as a PNG file by executing the following command: `python3 hashtag_frequencies.py -p ../data/london/posts/data.json 20` which will produce a figure similar to that shown below: - - ![Top 20 most frequent hashtags in posts containing the #london hashtag!](https://user-images.githubusercontent.com/72805812/155770710-0d167bbb-4c44-44d2-ba1c-fa57026afea8.png) - - Clearly, the highest occurrence will be of the `#london` hashtag, as all posts in the file `data/london/posts/data.json` contain the hashtag `#london`. +

+ Top 20 most frequent common hashtags in posts containing the #london hashtag +

+ + In the above plot, the highest occurrence is the `#fyp` hashtag, which is tagged in more than half of all posts containing the `#london` hashtag. - The results can be displayed in tabular form by executing the following command: @@ -128,7 +122,7 @@ Assume we want to analyze the top 20 occurring hashtags in the downloaded posts which will produce a terminal output similar to the following: ``` - Rank Hashtag Occurrences Frequency (Occurrences/Total-Posts(total_posts)) + Rank Hashtag Occurrences Frequency 0 london 962 1.0 1 fyp 493 0.5124740124740125 2 uk 238 0.24740124740124741 @@ -151,4 +145,4 @@ Assume we want to analyze the top 20 occurring hashtags in the downloaded posts 19 america 20 0.02079002079002079 ``` - The `Frequency` column shows the ratio of the occurrence to the total number of downloaded posts. \ No newline at end of file + The `Frequency` column shows the ratio of the occurrence to the total number of downloaded posts. diff --git a/logging.ini b/logging.ini deleted file mode 100644 index d56122e..0000000 --- a/logging.ini +++ /dev/null @@ -1,32 +0,0 @@ -[loggers] -keys=root - -[handlers] -keys=consoleHandler,fileHandler - -[formatters] -keys=consoleFormatter,fileFormatter - -[logger_root] -level=INFO -handlers=consoleHandler,fileHandler - -[handler_consoleHandler] -class=StreamHandler -level=DEBUG -formatter=consoleFormatter -args=(sys.stdout,) - -[handler_fileHandler] -class=FileHandler -level=WARNING -formatter=fileFormatter -args=("../logfile.log",) - -[formatter_fileFormatter] -format=%(asctime)s - %(name)s - %(levelname)s - %(message)s -datefmt= - -[formatter_consoleFormatter] -format=%(levelname)s - %(message)s -datefmt= diff --git a/requirements.txt b/requirements.txt index 8039c23..9a8d369 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1 +1,2 @@ -matplotlib==3.5.2 \ No newline at end of file +matplotlib +seaborn \ No newline at end of file diff --git a/tiktok_downloader/data_methods.py b/tiktok_downloader/data_methods.py index ec9cd5d..b638ff0 100644 --- a/tiktok_downloader/data_methods.py +++ b/tiktok_downloader/data_methods.py @@ -1,16 +1,12 @@ from collections import namedtuple +import warnings +import logging + import file_methods -# setting up the logging -import logging -from logging.config import fileConfig - -fileConfig('../logging.ini') logger = logging.getLogger() - - """ The file contains several functions that perform data processing related tasks. """ @@ -62,8 +58,7 @@ def extract_posts(settings, file_name, tag): ids.append(post["id"]) if not ids: - logger.warn(f"WARNING: no posts were found for {tag} in the file - {file_name}") - return + warnings.warn(f"No posts were found for {tag} in the file - {file_name}") status = file_methods.check_existence(settings["post_ids"], "file") if not status: @@ -72,8 +67,7 @@ def extract_posts(settings, file_name, tag): else: new_ids = get_difference(tag, settings["post_ids"], ids) if not new_ids: - logger.warn(f"WARNING: No new posts were found in the downloaded file - {file_name}") - return + warnings.warn(f"No new posts were found in the downloaded file - {file_name}") elif new_ids.filter_posts: new_posts = [post for post in posts if post['id'] in new_ids.ids] new_data = (new_ids.ids, new_posts) @@ -94,8 +88,8 @@ def extract_videos(settings, tag, download_list): else: new_videos = get_difference(tag, settings["video_ids"], download_list) if not new_videos: - logger.warn(f"WARNING: No new videos were found for the {tag} in the downloaded folder.") - return + warnings.warn(f"No new videos were found for the {tag} in the downloaded folder.") + return None else: return new_videos.ids @@ -104,15 +98,12 @@ def update_posts(file_path, file_type, new_data, tag=None): """ Updates the list of post ids (in the file ids/post_ids.json) with the ids of the new posts. """ - try: - status = file_methods.check_existence(file_path, file_type) - if not tag: - file_methods.post_writer(file_path, new_data, status) - else: - log = file_methods.id_writer(file_path, new_data, tag, status) - return log - except: - raise + status = file_methods.check_existence(file_path, file_type) + if not tag: + file_methods.post_writer(file_path, new_data, status) + else: + scraped_data = file_methods.id_writer(file_path, new_data, tag, status) + return scraped_data def update_videos(settings, new_data, tag): @@ -147,8 +138,6 @@ def print_total(file_path, tag, data_type): """ total = get_total_posts(file_path, tag) if (total.total == total.unique): - logger.info(f"Total {data_type} for the hashtag {tag} are: {total.total}") - return + logger.info(f"Scraped {total.total} {data_type} containing the hashtag '{tag}'") else: - logger.warn(f"WARNING: out of total {data_type} for the hashtag {tag} {total.total}, only {total.unique} are unique. Something is going wrong...") - return + warnings.warn(f"Out of total {data_type} for the hashtag {tag} {total.total}, only {total.unique} are unique. Something is going wrong...") diff --git a/tiktok_downloader/file_methods.py b/tiktok_downloader/file_methods.py index ed807af..b041130 100644 --- a/tiktok_downloader/file_methods.py +++ b/tiktok_downloader/file_methods.py @@ -1,18 +1,17 @@ -import os, json, subprocess +import os +import json +import subprocess from datetime import datetime -import global_data import shutil +import warnings - -# setting up the logging import logging -from logging.config import fileConfig -fileConfig('../logging.ini') +logging.basicConfig( + level = logging.INFO, + format = '%(message)s') logger = logging.getLogger() - - """ The file contains the functions that operate on files, such as writing or reading from files etc. """ @@ -27,8 +26,7 @@ def create_file(name, file_type): elif (file_type == "file"): with open(name, "w"): pass else: - logger.exception(f"{file_type} has to be a 'dir' or a 'file'!!!") - return + raise ValueError(f"{file_type} has to be either 'dir' or 'file'") def check_existence(file_path, file_type): @@ -40,7 +38,7 @@ def check_existence(file_path, file_type): elif (file_type == "dir"): return os.path.isdir(file_path) else: - logger.exception(f"{file_type} has to be a 'dir' or a 'file'!!!") + raise ValueError(f"{file_type} has to be either 'dir' or 'file'") def check_file(file_path, file_type): @@ -51,8 +49,6 @@ def check_file(file_path, file_type): if not status: create_file(file_path, file_type) - return - def download_posts(settings, tag): """ @@ -62,18 +58,15 @@ def download_posts(settings, tag): """ path = os.path.join(settings["data"], tag, settings["posts"]) os.chdir(path) - try: - tiktok_command = f"tiktok-scraper hashtag {tag} -t 'json'" - result = subprocess.check_output(tiktok_command, shell=True) - new_file = result.decode('utf-8').split()[-1] - if ("json" in new_file): - os.chdir("../../../tiktok_downloader") - return new_file - else: - logger.warn(f"WARNING: Something's wrong with what is returned by tiktok-scraper for the hashtag {tag} - *{new_file}* is not a json file!!!!") - os.chdir("../../../tiktok_downloader") - return - except: raise + tiktok_command = f"tiktok-scraper hashtag {tag} -t 'json'" + output = subprocess.check_output(tiktok_command, shell=True, encoding = 'utf-8') + new_file = output.split()[-1] + if ("json" in new_file): + os.chdir("../../../tiktok_downloader") + return new_file + else: + warnings.warn(f"Something's wrong with what is returned by tiktok-scraper for the hashtag {tag} - *{new_file}* is not a json file.\n\ntiktok-scraper returned {output}") + os.chdir("../../../tiktok_downloader") @@ -85,27 +78,22 @@ def download_videos(settings, tag): """ path = os.path.join(settings["data"], tag, settings["videos"]) os.chdir(path) - try: - # tiktok_command = f"tiktok-scraper hashtag {tag} -n {settings['number_of_videos']} -d" - tiktok_command = f"tiktok-scraper hashtag {tag} -d" - result = subprocess.check_output(tiktok_command, shell=True) - downloaded_list_tmp = os.listdir(f"./#{tag}") - if downloaded_list_tmp: - downloaded_list = [] - for file in downloaded_list_tmp: - file = file.split('.')[0] - downloaded_list.append(file) - - os.chdir("../../../tiktok_downloader") - return downloaded_list - else: - print(f"WARNING: No video files were downloaded for the hashtag {tag}.") - os.chdir("../../../tiktok_downloader") - shutil.rmtree(settings['videos_delete']) - #subprocess.call(f"rm -rf {settings['videos_delete']}", shell=True) + tiktok_command = f"tiktok-scraper hashtag {tag} -d" + result = subprocess.check_output(tiktok_command, shell=True) + downloaded_list_tmp = os.listdir(f"./#{tag}") + if downloaded_list_tmp: + downloaded_list = [] + for file in downloaded_list_tmp: + file = file.split('.')[0] + downloaded_list.append(file) + + os.chdir("../../../tiktok_downloader") + return downloaded_list + else: + warnings.warn(f"No video files were downloaded for the hashtag {tag}.") + os.chdir("../../../tiktok_downloader") + shutil.rmtree(settings['videos_delete']) - except: raise - def get_data(file_path): """ @@ -122,7 +110,6 @@ def dump_data(file_path, data): """ with open(file_path, "w", encoding = "utf-8") as f: json.dump(data, f) - return def log_writer(log_data): """ @@ -131,78 +118,67 @@ def log_writer(log_data): Writes the dictionary to the log file (logs/log.json). """ total = 0 - try: - log_dict = {} - for ele in log_data: - if ele[0] in log_dict: - if ele[1][0] in log_dict[ele[0]]: - log_dict[ele[0]][ele[1][0]] += ele[1][1] - else: - log_dict[ele[0]][ele[1][0]] = ele[1][1] - total += ele[1][1] + scraped_summary_dict = {} + for hashtag, (data_type, count) in log_data: + if hashtag in scraped_summary_dict: + if data_type in scraped_summary_dict[hashtag]: + scraped_summary_dict[hashtag][data_type] += count else: - log_dict[ele[0]] = { ele[1][0] : ele[1][1] } - total += ele[1][1] + scraped_summary_dict[hashtag][data_type] = count + total += count + else: + scraped_summary_dict[hashtag] = {data_type : count} + total += count - now = datetime.now() - now_str = now.strftime("%d-%m-%Y %H:%M:%S") - data = { now_str : log_dict } + now = datetime.now() + now_str = now.strftime("%d-%m-%Y %H:%M:%S") + data = { now_str : scraped_summary_dict } - logger.warn(data) - logger.info(f"Successfully logged {total} entries!!!!") - return - except: - logger.exception() + logger.debug(f"Logged post data: {data}") + logger.info(f"Successfully scraped {total} total entries") def id_writer(file_path, new_data, tag, status): """ - Writes the list of new ids to the post_ids or video_ds files. + Writes the list of new ids to the post_ids or video_ids files. """ - try: - total = len(new_data) - if status: - try: - data = get_data(file_path) - if tag in data: - data[tag] += new_data - else: - data[tag]= new_data - dump_data(file_path, data) - except json.decoder.JSONDecodeError: - data = { tag : new_data } - dump_data(file_path, data) - else: + total = len(new_data) + if status: + try: + data = get_data(file_path) + if tag in data: + data[tag] += new_data + else: + data[tag]= new_data + dump_data(file_path, data) + except json.decoder.JSONDecodeError: data = { tag : new_data } dump_data(file_path, data) - logger.info(f"SUCCESS - {total} entries added to {file_path}!!!") - log_data = (tag, total) - return log_data - except: - logger.exception() + else: + data = { tag : new_data } + dump_data(file_path, data) + logger.debug(f"SUCCESS - {total} entries added to {file_path}") + number_scraped = (tag, total) + return number_scraped def post_writer(file_path, new_data, status): """ Writes the new posts in the post file of the given hashtag (/data/{hashtag}/posts/data.json) """ - try: - total = len(new_data) - if status: - try: - data = get_data(file_path) - data += new_data - dump_data(file_path, data) - except json.decoder.JSONDecodeError: - data = new_data - dump_data(file_path, data) - else: + total = len(new_data) + if status: + try: + data = get_data(file_path) + data += new_data + dump_data(file_path, data) + except json.decoder.JSONDecodeError: data = new_data dump_data(file_path, data) - logger.info(f"SUCCESS - {total} entries added to {file_path}!!!") - return - except: - logger.exception() + else: + data = new_data + dump_data(file_path, data) + logger.debug(f"SUCCESS - {total} entries added to {file_path}") def delete_file(file_path, file_type): @@ -210,17 +186,15 @@ def delete_file(file_path, file_type): Deletes the directory or the file. """ if not check_existence(file_path, file_type): - logger.exception(f"ERROR: Attempt to delete failed. {file_path} does not exist!!!") + raise OSError(f"Attempt to delete file failed: {file_path} does not exist") elif (file_type == "file"): os.remove(file_path) - logger.info(f"Successfully deleted {file_path}!!!") - return + logger.debug(f"Successfully deleted {file_path}") elif (file_type == "dir"): os.rmdir(file_path) - logger.info(f"Successfully deleted {file_path}!!!") - return + logger.debug(f"Successfully deleted {file_path}") else: - logger.exception(f"OSError: {file_type} needs to be either 'file' or 'dir' !!!") + raise OSError("{file_type} needs to be either 'file' or 'dir'") def clean_video_files(settings, tag, new_data=None): @@ -228,13 +202,10 @@ def clean_video_files(settings, tag, new_data=None): Moves the new videos from the tiktok-scraper video folder to /data/{hashtag}/videos/ Deletes the residual tiktok-scraper video folder. """ - try: - if new_data: - for file in new_data: - settings["videos_from"] = settings['data'] + f"/{tag}/videos/#{tag}/{file}.mp4" - shutil.move(settings['videos_from'], settings['videos_to']) - - shutil.rmtree(settings['videos_delete']) - logger.info(f"Successfully deleted the folder {settings['videos_delete']} folder of videos.") - except: - raise + if new_data: + for file in new_data: + settings["videos_from"] = settings['data'] + f"/{tag}/videos/#{tag}/{file}.mp4" + shutil.move(settings['videos_from'], settings['videos_to']) + + shutil.rmtree(settings['videos_delete']) + logger.debug(f"Successfully deleted the folder {settings['videos_delete']} folder of videos.") diff --git a/tiktok_downloader/hashtag_frequencies.py b/tiktok_downloader/hashtag_frequencies.py index 9e9e1f9..7130558 100644 --- a/tiktok_downloader/hashtag_frequencies.py +++ b/tiktok_downloader/hashtag_frequencies.py @@ -1,22 +1,27 @@ -import os, sys +import os import json import argparse -import matplotlib.pyplot as plt from datetime import datetime -from file_methods import check_file -from global_data import IMAGES +import warnings +warnings.filterwarnings("ignore", message="Glyph (.*) missing from current font") +import logging +import matplotlib.pyplot as plt +import matplotlib.ticker as mtick +import seaborn as sns +sns.set_theme(style="darkgrid") + +from file_methods import check_file, check_existence +from global_data import IMAGES """ Plots the frequency of hashtags appearing in the set of given posts. """ - def get_hashtags(obj): if not obj: - print(f'ERROR: Empty item, no hashtags to be extracted.') - return + raise ValueError(f'Empty item, no hashtags to be extracted.') else: hashtags = {} tags = [ [tag['name'] for tag in ele['hashtags']] for ele in obj ] @@ -50,15 +55,21 @@ def get_occurrences(filename, n=1 , sort=True): def plot(n, occs, img_folder): - plt.scatter(occs["top_n"][0], occs["top_n"][1]) - plt.tight_layout() - plt.xticks(rotation=45) - plt.title(f'Hashtag Distribution') - plt.xlabel(f'Top {n} hashtags from {occs["total"]} posts.') - plt.ylabel(f'Number of occurrences') + y_pos = list(reversed(range(n - 1))) + max_count = occs["top_n"][1][0] + freqs = [count/max_count * 100 for count in occs["top_n"][1][1:]] + labels = occs["top_n"][0][1:] + + fig, ax = plt.subplots(figsize = (5, 6.66)) + ax.barh(y_pos, freqs) + ax.set_yticks(y_pos) + ax.set_yticklabels(labels) + ax.grid(axis = 'y') + ax.set_xlabel('Percent of posts with common hashtag') + ax.set_ylim(min(y_pos)-1, max(y_pos)+1) + ax.set_title(f'Common hashtags for #{occs["top_n"][0][0]} posts') + ax.xaxis.set_major_formatter(mtick.PercentFormatter(decimals = 0)) save_plot(img_folder) - plt.show(block=None) - return def print_occurrences(occs): @@ -67,26 +78,22 @@ def print_occurrences(occs): """ row_number = 0 total_posts = occs["total"] - print ("{:<8} {:<15} {:<15} {:<15}".format("Rank", 'Hashtag','Occurrences',f'Frequency (Occurrences/Total-Posts(total_posts))')) + print ("{:<8} {:<15} {:<15} {:<15}".format("Rank", 'Hashtag','Occurrences','Frequency')) for key,value in zip(occs["top_n"][0], occs["top_n"][1]): ratio = value/total_posts print ("{:<8} {:<15} {:<15} {:<15}".format(row_number, key, value, ratio)) row_number += 1 - return def save_plot(img_folder): """ Saves the plot to a png file in the folder /data/imgs/ """ - try: - now = datetime.now() - current_time = now.strftime("%Y_%m_%d_%H_%M_%S") - plt.savefig(f"{img_folder}/{current_time}.png") - - return - except: raise - + now = datetime.now() + current_time = now.strftime("%Y_%m_%d_%H_%M_%S") + filename = f"{img_folder}/{current_time}.png" + logging.info(f'Plot saved to file: {filename}') + plt.savefig(filename, bbox_inches = 'tight', facecolor = 'white', dpi = 300) if __name__ == "__main__": @@ -105,17 +112,14 @@ if __name__ == "__main__": parser.add_argument("-p", "--plot", help="Plot the occurrences", action="store_true") parser.add_argument("-d", "--print", help="List top n hashtags", action="store_true") args = parser.parse_args() - if args.input_file and args.n: - if args.n < 1: - print(f"Please make sure the number of top occurrences is a positive integer.") - sys.exit() - - base = os.path.splitext(args.input_file)[0] - path = f"./{base}_sorted_hashtags.csv" - occs = get_occurrences(args.input_file, args.n) - if args.plot: - plot(args.n, occs, img_folder) - else: - print_occurrences(occs) + if args.n < 1: + raise ValueError(f"Specified argument `n` (the number of hashtags to analyze) must be greater than zero, not: {args.n}.") + if not check_existence(args.input_file, 'file'): + raise FileNotFoundError(f"Specified argument `input_file` ({args.input_file}) does not exist.") + base = os.path.splitext(args.input_file)[0] + path = f"./{base}_sorted_hashtags.csv" + occs = get_occurrences(args.input_file, args.n) + if args.plot: + plot(args.n, occs, img_folder) else: - print(f'ERROR: either {args.input_file} or {args.n} or both contains error.') + print_occurrences(occs) diff --git a/tiktok_downloader/run_downloader.py b/tiktok_downloader/run_downloader.py index 0d0c68d..529b336 100644 --- a/tiktok_downloader/run_downloader.py +++ b/tiktok_downloader/run_downloader.py @@ -1,61 +1,20 @@ import os import time import argparse +import logging import global_data import file_methods import data_methods -# setting up the logging -import logging -from logging.config import fileConfig - -fileConfig('../logging.ini') logger = logging.getLogger() - -""" -The run_downloader.py dowloads data using the tiktok-scraper (https://github.com/drawrowfly/tiktok-scraper). -1. "-p" option is used by the user to download posts only -2. "-v" option is use to download videos only -3. "-p -v" is used to download posts and videos -4. "-t" is used to specify a list of hashtags as arguments -5. "-f" option is used to read the list of hashtags from the user specified file - -Example: - 1. The command "python3 run_downloader.py -t london paris newyork -p" will download posts for hashtags london, paris and newyork. - 2. The command "python3 run_downloader.py -f hashtag_list -p -v" will download posts and videos for hashtags in the file hashtag_list. - - -The downloaded data is stored in the the data folder. The data is folder is organized as follows: - 1. the log subfolder contains the log.json that records total downloads (posts and videos) for each hashtag with a timestamp of when the script was run. - 2. the ids subfolder contains post_ids.json and video_ids.json that keep the record of post and video ids that are currently in the data set. This helps to filter out only new posts every time tiktok-scraper is run and only those new posts (or videos) are then stored in the data folder. - 3. Each hashtag has a subfolder by its name containing two subfolders, one each for posts and videos. - - -This scripts runs the function get_data in main which in turn triggers the following sequence: - 1. get_posts function is triggered if the user wants to download posts - 2. get_videos function is triggered if the user wants to download videos - 3. both functions above are sequentially triggered if the user wants to download both posts and videos. - 4. After the data is downloaded the log_writer is triggered to log the total number of posts and videos downloaded. - - -------------Files-------------- -global_data - contains global constants relating to paths etc. -data_methods - this file contains data processing methods -file_methods - this file contains methods to write and update data in files -hashtag_list - this file contains the list of hashtags that the user wants to download data for. -""" - - - def get_hashtag_list(file_name): - try: - with open(file_name) as f: - tags = list(filter(None, [line.strip() for line in f if not line.startswith("#")])) - return tags - except IOError: - logger.exception(f"IOError") + if not file_methods.check_existence(file_name, 'file'): + raise OSError(f"{file_name} does not exist") + with open(file_name) as f: + tags = list(filter(None, [line.strip() for line in f if not line.startswith("#")])) + return tags def create_parser(): @@ -102,16 +61,16 @@ def get_posts(settings, tag): 3. calls update_posts from data_methods.py to update the id-list with the ids of newly downloaded posts. """ file_path = file_methods.download_posts(settings, tag) - log = () + number_scraped = () if file_path: new_data = data_methods.extract_posts(settings, file_path, tag) if new_data: data_file = os.path.join(settings["data"], tag, settings["posts"], settings["data_file"]) data_methods.update_posts(data_file, "file", new_data[1]) - log = data_methods.update_posts(settings["post_ids"], "file", new_data[0], tag) + number_scraped = data_methods.update_posts(settings["post_ids"], "file", new_data[0], tag) file_methods.delete_file(file_path, "file") - return log + return number_scraped @@ -122,16 +81,16 @@ def get_videos(settings, tag): 3. calls update_videos from data_methods.py to update the id-list with the ids of newly downloaded videos. 4. the clean_video_files function deletes the residual video folder after the data processing """ - log = () + number_scraped = () download_list = file_methods.download_videos(settings, tag) if download_list: new_data = data_methods.extract_videos(settings, tag, download_list) if new_data: - log = data_methods.update_videos(settings, new_data, tag) + number_scraped = data_methods.update_videos(settings, new_data, tag) else: file_methods.clean_video_files(settings, tag) - return log + return number_scraped @@ -143,7 +102,7 @@ def get_data(hashtags, download_data_type): counter = 0 total_hashtags = len(hashtags) total_hashtags_offset = total_hashtags - 1 - log_data = [] + scraped_summary_list = [] if download_data_type["posts"]: settings = set_download_settings(download_data_type) @@ -153,8 +112,8 @@ def get_data(hashtags, download_data_type): file_methods.check_file(os.path.join(settings["data"], tag, settings["posts"], settings["data_file"]), "file") res = get_posts(settings, tag) if res: - log = ( res[0], ( "posts", res[1] ) ) - log_data.append(log) + number_scraped = ( res[0], ( "posts", res[1] ) ) + scraped_summary_list.append(number_scraped) data_methods.print_total(settings["post_ids"], tag, "posts") counter += 1 @@ -171,14 +130,14 @@ def get_data(hashtags, download_data_type): res = get_videos(settings, tag) if res: res = ( res[0], ( "videos", res[1])) - log_data.append(res) + scraped_summary_list.append(res) data_methods.print_total(settings["video_ids"], tag, "videos") counter += 1 if counter < total_hashtags_offset: time.sleep(settings["sleep"]) - return log_data + return scraped_summary_list if __name__ == "__main__": @@ -197,29 +156,15 @@ if __name__ == "__main__": file_name = args.f hashtags = get_hashtag_list(file_name) - print(hashtags) + logger.info(f"Hashtags to scrape: {hashtags}") if not hashtags: - logger.exception("No hashtags were given, please use either -t option or -f to provide hashtags.") + raise ValueError("No hashtags were specified: please use either the -t flag to specify a sspace-separated list of one or more hashtags as a command-line argument, or use the -f flag to specify a text file of newline-separated hashtags.") - if (args.p and args.v): - download_data_type = { - "posts": True, - "videos": True - } - elif args.p: - download_data_type = { - "posts": True, - "videos": False - } - else: - download_data_type = { - "posts": False, - "videos": True + download_data_type = { + "posts": args.p, + "videos": args.v } - try: - log_data = get_data(hashtags, download_data_type) - if log_data: - file_methods.log_writer(log_data) - except: - logger.exception(f"ERROR") + scraped_summary_list = get_data(hashtags, download_data_type) + if scraped_summary_list: + file_methods.log_writer(scraped_summary_list)