From 14c52e5d758f14549928620f4181143f94f88299 Mon Sep 17 00:00:00 2001
From: Tristan Lee <tristan@bellingcat.com>
Date: Thu, 5 May 2022 02:23:50 -0500
Subject: [PATCH 1/4] simplified logging, used warnings.warn and raised
 exceptions rather than logging them, various code cleanups and clarifications

---
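Note on the error-handling convention used in this patch: recoverable
conditions (for example, a scrape that returns nothing new) go through
`warnings.warn`, while invalid arguments raise an exception instead of being
passed to `logger.exception`. The sketch below only illustrates that split.
The function names `validate_file_type` and `new_ids_or_warn` are
hypothetical and do not exist in the project.

```python
import warnings


def validate_file_type(file_type):
    """Raise on an invalid argument so the caller cannot silently continue."""
    if file_type not in ("dir", "file"):
        raise ValueError(f"{file_type} has to be either 'dir' or 'file'")
    return file_type


def new_ids_or_warn(known_ids, scraped_ids, tag):
    """Return the ids not seen before; warn, but keep running, if there are none."""
    known = set(known_ids)
    new_ids = [post_id for post_id in scraped_ids if post_id not in known]
    if not new_ids:
        # A warning is visible to the user but does not abort the run.
        warnings.warn(f"No new posts were found for {tag}")
    return new_ids
```

Callers that need to treat a warning as fatal can still opt in with
`warnings.simplefilter("error")`, which a plain log message would not allow.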
 README.md                                |  39 ++---
 logging.ini                              |  32 ----
 requirements.txt                         |   3 +-
 tiktok_downloader/data_methods.py        |  41 ++---
 tiktok_downloader/file_methods.py        | 203 ++++++++++-------------
 tiktok_downloader/hashtag_frequencies.py |  80 ++++-----
 tiktok_downloader/run_downloader.py      | 105 +++---------
 7 files changed, 187 insertions(+), 316 deletions(-)
 delete mode 100644 logging.ini
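Note on the summary built in `log_writer`: the patch unpacks each entry as
`hashtag, (data_type, count)` and accumulates counts per hashtag and data
type. A possible further simplification, shown here only as a sketch, is to
let `collections.defaultdict` handle the missing-key branches. The function
name `summarize` is hypothetical and the sample counts are taken from the
README output in this patch.

```python
from collections import defaultdict


def summarize(scraped):
    """Aggregate (hashtag, (data_type, count)) tuples into nested counts.

    Returns a (summary, total) pair, where summary maps
    hashtag -> data_type -> count and total is the sum of all counts.
    """
    summary = defaultdict(lambda: defaultdict(int))
    total = 0
    for hashtag, (data_type, count) in scraped:
        summary[hashtag][data_type] += count
        total += count
    # Convert back to plain dicts so the result serializes cleanly to JSON.
    return {tag: dict(counts) for tag, counts in summary.items()}, total


if __name__ == "__main__":
    entries = [
        ("london", ("posts", 963)),
        ("paris", ("posts", 961)),
        ("newyork", ("posts", 940)),
    ]
    summary, total = summarize(entries)
    print(f"Successfully scraped {total} total entries")
```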

diff --git a/README.md b/README.md
index baf9d8a..4a01a1d 100644
--- a/README.md
+++ b/README.md
@@ -4,12 +4,17 @@ The tool helps to download posts and videos from TikTok for a given set of hasht
 ## Pre-requisites
 1. Make sure you have Python 3.6 or a later version installed.
 2. Download and install TikTok scraper: https://github.com/drawrowfly/tiktok-scraper 
-3. (Optional) create and activate a virtual environment for this tool, for example by executing the following command, which creates the `env` virtual environment:
+3. (Optional) create and activate a virtual environment for this tool, for example by executing the following command, which creates the `.env` virtual environment in the project directory:
 
-    `python3 -m venv env`
-4. Start your virtual environment 
-    `source ./env/bin/activate`
-5. Run `pip install -r requirements.txt`
+    `python3 -m venv .env`
+
+4. Start your virtual environment
+    - On Unix-like operating systems (macOS, Linux), this can be done using the command `source .env/bin/activate`
+    - On Windows, this can be done using the command `.env\Scripts\activate.bat`
+    
+5. Install the Python package dependencies for this tool by executing the command: 
+
+    `pip install -r requirements.txt`
 
 You should now be ready to start using the tool.
 
@@ -36,8 +41,6 @@ $ tree ../data
 ../data
 ├── ids
 │   └── post_ids.json
-├── log
-│   └── log.json
 ├── london
 │   └── posts
 │       └── data.json
@@ -51,7 +54,6 @@ $ tree ../data
 
 
 The `data` folder contains all the downloaded data as shown in the tree diagram above. 
-- (Depricated: logging info is now found in logfile.py in the project folder.) The `log` folder contains the `log.json` file, which records the total number of downloaded posts and videos for the hashtags against the timestamp of when the script was run.
 - The `ids` folder contains two files `post_ids.json` and `video_ids.json` that record the ids of the downloaded posts and videos for each hashtag.
 - Each hashtag has a folder with two subfolders `posts` and `videos` that store posts and videos respectively. The posts are stored in the `data.json` file in the `posts` folder, and videos are stored as the `.mp4` files in the `videos` folder.
 
@@ -65,20 +67,11 @@ Running the `run_downloader.py` script with the following options will scrape po
 and will produce an output similar to the following log:
 
     $ python3 run_downloader.py -t london paris newyork -p
-    ['london', 'paris', 'newyork']
-    SUCCESS - 962 entries added to ../data/london/posts/data.json!!!
-    SUCCESS - 962 entries added to ../data/ids/post_ids.json!!!
-    Successfully deleted /Users/work/Documents/development_projects/Tiktok/tiktok/data/london/posts/london_1651533070680.json!!!
-    Total posts for the hashtag london are: 962
-    SUCCESS - 961 entries added to ../data/paris/posts/data.json!!!
-    SUCCESS - 961 entries added to ../data/ids/post_ids.json!!!
-    Successfully deleted /Users/work/Documents/development_projects/Tiktok/tiktok/data/paris/posts/paris_1651533102789.json!!!
-    Total posts for the hashtag paris are: 961
-    SUCCESS - 941 entries added to ../data/newyork/posts/data.json!!!
-    SUCCESS - 941 entries added to ../data/ids/post_ids.json!!!
-    Successfully deleted /Users/work/Documents/development_projects/Tiktok/tiktok/data/newyork/posts/newyork_1651533125549.json!!!
-    Total posts for the hashtag newyork are: 941
-    Successfully logged 2864 entries!!!!
+    Hashtags to scrape: ['london', 'paris', 'newyork']
+    Scraped 963 posts containing the hashtag 'london'
+    Scraped 961 posts containing the hashtag 'paris'
+    Scraped 940 posts containing the hashtag 'newyork'
+    Successfully scraped 2864 total entries
 
 - The `-t` flag allows a space-separated list of hashtags to be specified as a command line argument
 - The `-p` flag specifies that posts, not videos, will be downloaded
@@ -128,7 +121,7 @@ Assume we want to analyze the top 20 occurring hashtags in the downloaded posts
 
     which will produce a terminal output similar to the following:
     ```
-    Rank     Hashtag         Occurrences     Frequency (Occurrences/Total-Posts(total_posts))
+    Rank     Hashtag         Occurrences     Frequency
     0        london          962             1.0            
     1        fyp             493             0.5124740124740125
     2        uk              238             0.24740124740124741
diff --git a/logging.ini b/logging.ini
deleted file mode 100644
index d56122e..0000000
--- a/logging.ini
+++ /dev/null
@@ -1,32 +0,0 @@
-[loggers]
-keys=root
-
-[handlers]
-keys=consoleHandler,fileHandler
-
-[formatters]
-keys=consoleFormatter,fileFormatter
-
-[logger_root]
-level=INFO
-handlers=consoleHandler,fileHandler
-
-[handler_consoleHandler]
-class=StreamHandler
-level=DEBUG
-formatter=consoleFormatter
-args=(sys.stdout,)
-
-[handler_fileHandler]
-class=FileHandler
-level=WARNING
-formatter=fileFormatter
-args=("../logfile.log",)
-
-[formatter_fileFormatter]
-format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
-datefmt=
-
-[formatter_consoleFormatter]
-format=%(levelname)s - %(message)s
-datefmt=
diff --git a/requirements.txt b/requirements.txt
index 8039c23..9a8d369 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1 +1,2 @@
-matplotlib==3.5.2
\ No newline at end of file
+matplotlib
+seaborn
\ No newline at end of file
diff --git a/tiktok_downloader/data_methods.py b/tiktok_downloader/data_methods.py
index ec9cd5d..b638ff0 100644
--- a/tiktok_downloader/data_methods.py
+++ b/tiktok_downloader/data_methods.py
@@ -1,16 +1,12 @@
 from collections import namedtuple
+import warnings
+import logging
+
 import file_methods
 
-# setting up the logging
-import logging
-from logging.config import fileConfig
-
-fileConfig('../logging.ini')
 logger = logging.getLogger()
 
 
-
-
 """
 The file contains several functions that perform data processing related tasks.
 """
@@ -62,8 +58,7 @@ def extract_posts(settings, file_name, tag):
         ids.append(post["id"])
 
     if not ids:
-        logger.warn(f"WARNING: no posts were found for {tag} in the file - {file_name}")
-        return
+        warnings.warn(f"No posts were found for {tag} in the file - {file_name}")
    
     status = file_methods.check_existence(settings["post_ids"], "file")
     if not status:
@@ -72,8 +67,7 @@ def extract_posts(settings, file_name, tag):
     else:
         new_ids = get_difference(tag, settings["post_ids"], ids)
         if not new_ids:
-            logger.warn(f"WARNING: No new posts were found in the downloaded file - {file_name}")
-            return
+            warnings.warn(f"No new posts were found in the downloaded file - {file_name}")
         elif new_ids.filter_posts:
             new_posts = [post for post in posts if post['id'] in new_ids.ids]
             new_data = (new_ids.ids, new_posts)
@@ -94,8 +88,8 @@ def extract_videos(settings, tag, download_list):
     else:
         new_videos = get_difference(tag, settings["video_ids"], download_list)
         if not new_videos:
-            logger.warn(f"WARNING: No new videos were found for the {tag} in the downloaded folder.")
-            return
+            warnings.warn(f"No new videos were found for the {tag} in the downloaded folder.")
+            return None
         else:
             return new_videos.ids
 
@@ -104,15 +98,12 @@ def update_posts(file_path, file_type, new_data, tag=None):
     """
     Updates the list of post ids (in the file ids/post_ids.json) with the ids of the new posts.
     """
-    try:
-        status = file_methods.check_existence(file_path, file_type)
-        if not tag:
-            file_methods.post_writer(file_path, new_data, status)
-        else:
-            log = file_methods.id_writer(file_path, new_data, tag, status)
-            return log
-    except:
-        raise
+    status = file_methods.check_existence(file_path, file_type)
+    if not tag:
+        file_methods.post_writer(file_path, new_data, status)
+    else:
+        scraped_data = file_methods.id_writer(file_path, new_data, tag, status)
+        return scraped_data
 
 
 def update_videos(settings, new_data, tag):
@@ -147,8 +138,6 @@ def print_total(file_path, tag, data_type):
     """
     total = get_total_posts(file_path, tag)
     if (total.total == total.unique):
-        logger.info(f"Total {data_type} for the hashtag {tag} are: {total.total}")
-        return
+        logger.info(f"Scraped {total.total} {data_type} containing the hashtag '{tag}'")
     else:
-        logger.warn(f"WARNING: out of total {data_type} for the hashtag {tag} {total.total}, only {total.unique} are unique. Something is going wrong...")
-        return
+        warnings.warn(f"Of the {total.total} total {data_type} for the hashtag '{tag}', only {total.unique} are unique. Something is going wrong...")
diff --git a/tiktok_downloader/file_methods.py b/tiktok_downloader/file_methods.py
index ed807af..b041130 100644
--- a/tiktok_downloader/file_methods.py
+++ b/tiktok_downloader/file_methods.py
@@ -1,18 +1,17 @@
-import os, json, subprocess
+import os
+import json
+import subprocess
 from datetime import datetime
-import global_data
 import shutil
+import warnings
 
-
-# setting up the logging
 import logging
-from logging.config import fileConfig
 
-fileConfig('../logging.ini')
+logging.basicConfig(
+    level = logging.INFO,
+    format = '%(message)s')
 logger = logging.getLogger()
 
-
-
 """
 The file contains the functions that operate on files, such as writing or reading from files etc.
 """
@@ -27,8 +26,7 @@ def create_file(name, file_type):
     elif (file_type == "file"):
         with open(name, "w"): pass
     else:
-        logger.exception(f"{file_type} has to be a 'dir' or a 'file'!!!")
-    return
+        raise ValueError(f"{file_type} has to be either 'dir' or 'file'")
 
 
 def check_existence(file_path, file_type):
@@ -40,7 +38,7 @@ def check_existence(file_path, file_type):
     elif (file_type == "dir"):
         return os.path.isdir(file_path)
     else:
-        logger.exception(f"{file_type} has to be a 'dir' or a 'file'!!!")
+        raise ValueError(f"{file_type} has to be either 'dir' or 'file'")
 
 
 def check_file(file_path, file_type):
@@ -51,8 +49,6 @@ def check_file(file_path, file_type):
     if not status:
         create_file(file_path, file_type)    
 
-    return
-
 
 def download_posts(settings, tag):
     """
@@ -62,18 +58,15 @@ def download_posts(settings, tag):
     """
     path = os.path.join(settings["data"], tag, settings["posts"])
     os.chdir(path)
-    try:
-        tiktok_command = f"tiktok-scraper hashtag {tag} -t 'json'" 
-        result = subprocess.check_output(tiktok_command, shell=True)
-        new_file = result.decode('utf-8').split()[-1]
-        if ("json" in new_file):
-            os.chdir("../../../tiktok_downloader")
-            return new_file 
-        else:
-            logger.warn(f"WARNING: Something's wrong with what is returned by tiktok-scraper for the hashtag {tag} - *{new_file}* is not a json file!!!!")
-            os.chdir("../../../tiktok_downloader")
-            return
-    except: raise
+    tiktok_command = f"tiktok-scraper hashtag {tag} -t 'json'" 
+    output = subprocess.check_output(tiktok_command, shell=True, encoding = 'utf-8')
+    new_file = output.split()[-1]
+    if ("json" in new_file):
+        os.chdir("../../../tiktok_downloader")
+        return new_file 
+    else:
+        warnings.warn(f"Something's wrong with what is returned by tiktok-scraper for the hashtag {tag} - *{new_file}* is not a json file.\n\ntiktok-scraper returned {output}")
+        os.chdir("../../../tiktok_downloader")
 
 
 
@@ -85,27 +78,22 @@ def download_videos(settings, tag):
     """
     path = os.path.join(settings["data"], tag, settings["videos"])
     os.chdir(path)
-    try:
-        # tiktok_command = f"tiktok-scraper hashtag {tag} -n {settings['number_of_videos']} -d" 
-        tiktok_command = f"tiktok-scraper hashtag {tag} -d" 
-        result = subprocess.check_output(tiktok_command, shell=True)
-        downloaded_list_tmp = os.listdir(f"./#{tag}")
-        if downloaded_list_tmp:
-            downloaded_list = []
-            for file in downloaded_list_tmp:
-                file = file.split('.')[0]
-                downloaded_list.append(file)
-            
-            os.chdir("../../../tiktok_downloader")
-            return downloaded_list
-        else:
-            print(f"WARNING: No video files were downloaded for the hashtag {tag}.")
-            os.chdir("../../../tiktok_downloader")
-            shutil.rmtree(settings['videos_delete'])
-            #subprocess.call(f"rm -rf {settings['videos_delete']}", shell=True)
+    tiktok_command = f"tiktok-scraper hashtag {tag} -d" 
+    result = subprocess.check_output(tiktok_command, shell=True)
+    downloaded_list_tmp = os.listdir(f"./#{tag}")
+    if downloaded_list_tmp:
+        downloaded_list = []
+        for file in downloaded_list_tmp:
+            file = file.split('.')[0]
+            downloaded_list.append(file)
+        
+        os.chdir("../../../tiktok_downloader")
+        return downloaded_list
+    else:
+        warnings.warn(f"No video files were downloaded for the hashtag {tag}.")
+        os.chdir("../../../tiktok_downloader")
+        shutil.rmtree(settings['videos_delete'])
         
-    except: raise
-
 
 def get_data(file_path):
     """
@@ -122,7 +110,6 @@ def dump_data(file_path, data):
     """
     with open(file_path, "w", encoding = "utf-8") as f:
         json.dump(data, f)
-        return            
 
 def log_writer(log_data):
     """
@@ -131,78 +118,67 @@ def log_writer(log_data):
     Writes the dictionary to the log file (logs/log.json).
     """
     total = 0
-    try:
-        log_dict = {}
-        for ele in log_data:
-            if ele[0] in log_dict:
-                if ele[1][0] in log_dict[ele[0]]:
-                    log_dict[ele[0]][ele[1][0]] += ele[1][1]
-                else:
-                    log_dict[ele[0]][ele[1][0]] = ele[1][1]
-                total += ele[1][1]
+    scraped_summary_dict = {}
+    for hashtag, (data_type, count) in log_data:
+        if hashtag in scraped_summary_dict:
+            if data_type in scraped_summary_dict[hashtag]:
+                scraped_summary_dict[hashtag][data_type] += count
             else:
-                log_dict[ele[0]] = { ele[1][0] : ele[1][1] }
-                total += ele[1][1]
+                scraped_summary_dict[hashtag][data_type] = count
+            total += count
+        else:
+            scraped_summary_dict[hashtag] = {data_type : count}
+            total += count
 
-        now = datetime.now()
-        now_str = now.strftime("%d-%m-%Y %H:%M:%S")
-        data = { now_str : log_dict }
+    now = datetime.now()
+    now_str = now.strftime("%d-%m-%Y %H:%M:%S")
+    data = { now_str : scraped_summary_dict }
 
-        logger.warn(data)
-        logger.info(f"Successfully logged {total} entries!!!!")
-        return
-    except:
-        logger.exception()
+    logger.debug(f"Logged post data: {data}")
+    logger.info(f"Successfully scraped {total} total entries")
 
 
 def id_writer(file_path, new_data, tag, status):
     """
-    Writes the list of new ids to the post_ids or video_ds files.
+    Writes the list of new ids to the post_ids or video_ids files.
     """
-    try:
-        total = len(new_data)
-        if status:
-            try:
-                data = get_data(file_path)
-                if tag in data:
-                    data[tag] += new_data
-                else:
-                    data[tag]= new_data 
-                dump_data(file_path, data)
-            except json.decoder.JSONDecodeError:
-                data = { tag : new_data }
-                dump_data(file_path, data)
-        else:
+    total = len(new_data)
+    if status:
+        try:
+            data = get_data(file_path)
+            if tag in data:
+                data[tag] += new_data
+            else:
+                data[tag]= new_data 
+            dump_data(file_path, data)
+        except json.decoder.JSONDecodeError:
             data = { tag : new_data }
             dump_data(file_path, data)
-        logger.info(f"SUCCESS - {total} entries added to {file_path}!!!")
-        log_data = (tag, total)
-        return log_data
-    except:
-        logger.exception()
+    else:
+        data = { tag : new_data }
+        dump_data(file_path, data)
+    logger.debug(f"SUCCESS - {total} entries added to {file_path}")
+    number_scraped = (tag, total)
+    return number_scraped
 
 
 def post_writer(file_path, new_data, status):
     """
     Writes the new posts in the post file of the given hashtag (/data/{hashtag}/posts/data.json)
     """
-    try:
-        total = len(new_data)
-        if status:
-            try:
-                data = get_data(file_path)
-                data += new_data
-                dump_data(file_path, data)
-            except json.decoder.JSONDecodeError:
-                data = new_data
-                dump_data(file_path, data)
-        else:
+    total = len(new_data)
+    if status:
+        try:
+            data = get_data(file_path)
+            data += new_data
+            dump_data(file_path, data)
+        except json.decoder.JSONDecodeError:
             data = new_data
             dump_data(file_path, data)
-        logger.info(f"SUCCESS - {total} entries added to {file_path}!!!")
-        return
-    except:
-        logger.exception()
+    else:
+        data = new_data
+        dump_data(file_path, data)
+    logger.debug(f"SUCCESS - {total} entries added to {file_path}")
 
 
 def delete_file(file_path, file_type):
@@ -210,17 +186,15 @@ def delete_file(file_path, file_type):
     Deletes the directory or the file.
     """
     if not check_existence(file_path, file_type):
-        logger.exception(f"ERROR: Attempt to delete failed. {file_path} does not exist!!!")
+        raise OSError(f"Attempt to delete file failed: {file_path} does not exist")
     elif (file_type == "file"):
         os.remove(file_path)
-        logger.info(f"Successfully deleted {file_path}!!!")
-        return
+        logger.debug(f"Successfully deleted {file_path}")
     elif (file_type == "dir"):
         os.rmdir(file_path)
-        logger.info(f"Successfully deleted {file_path}!!!")
-        return
+        logger.debug(f"Successfully deleted {file_path}")
     else:
-        logger.exception(f"OSError: {file_type} needs to be either 'file' or 'dir' !!!")
+        raise OSError(f"{file_type} needs to be either 'file' or 'dir'")
 
 
 def clean_video_files(settings, tag, new_data=None):
@@ -228,13 +202,10 @@ def clean_video_files(settings, tag, new_data=None):
     Moves the new videos from the tiktok-scraper video folder to /data/{hashtag}/videos/
     Deletes the residual tiktok-scraper video folder.
     """
-    try:
-        if new_data:
-            for file in new_data:
-                settings["videos_from"] = settings['data'] + f"/{tag}/videos/#{tag}/{file}.mp4"
-                shutil.move(settings['videos_from'], settings['videos_to'])
-             
-        shutil.rmtree(settings['videos_delete'])
-        logger.info(f"Successfully deleted the folder {settings['videos_delete']} folder of videos.")
-    except:
-        raise
+    if new_data:
+        for file in new_data:
+            settings["videos_from"] = settings['data'] + f"/{tag}/videos/#{tag}/{file}.mp4"
+            shutil.move(settings['videos_from'], settings['videos_to'])
+            
+    shutil.rmtree(settings['videos_delete'])
+    logger.debug(f"Successfully deleted the video folder {settings['videos_delete']}.")
diff --git a/tiktok_downloader/hashtag_frequencies.py b/tiktok_downloader/hashtag_frequencies.py
index 9e9e1f9..7130558 100644
--- a/tiktok_downloader/hashtag_frequencies.py
+++ b/tiktok_downloader/hashtag_frequencies.py
@@ -1,22 +1,27 @@
-import os, sys
+import os
 import json
 import argparse
-import matplotlib.pyplot as plt
 from datetime import datetime
-from file_methods import check_file
-from global_data import IMAGES
+import warnings
+warnings.filterwarnings("ignore", message="Glyph (.*) missing from current font")
+import logging
 
+import matplotlib.pyplot as plt
+import matplotlib.ticker as mtick
+import seaborn as sns
+sns.set_theme(style="darkgrid")
+
+from file_methods import check_file, check_existence
+from global_data import IMAGES
 
 """
 Plots the frequency of hashtags appearing in the set of given posts.
 """
 
 
-
 def get_hashtags(obj):
     if not obj:
-        print(f'ERROR: Empty item, no hashtags to be extracted.')
-        return
+        raise ValueError('Empty item, no hashtags to be extracted.')
     else:
         hashtags = {}
         tags = [ [tag['name'] for tag in ele['hashtags']] for ele in obj ]
@@ -50,15 +55,21 @@ def get_occurrences(filename, n=1 , sort=True):
 
 
 def plot(n, occs, img_folder):
-    plt.scatter(occs["top_n"][0], occs["top_n"][1])
-    plt.tight_layout()
-    plt.xticks(rotation=45)
-    plt.title(f'Hashtag Distribution')
-    plt.xlabel(f'Top {n} hashtags from {occs["total"]} posts.')
-    plt.ylabel(f'Number of occurrences')
+    y_pos = list(reversed(range(n - 1)))
+    max_count = occs["top_n"][1][0]
+    freqs = [count/max_count * 100 for count in occs["top_n"][1][1:]]
+    labels = occs["top_n"][0][1:]
+
+    fig, ax = plt.subplots(figsize = (5, 6.66))
+    ax.barh(y_pos, freqs)
+    ax.set_yticks(y_pos)
+    ax.set_yticklabels(labels)
+    ax.grid(axis = 'y')
+    ax.set_xlabel('Percent of posts with common hashtag')
+    ax.set_ylim(min(y_pos)-1, max(y_pos)+1)
+    ax.set_title(f'Common hashtags for #{occs["top_n"][0][0]} posts')
+    ax.xaxis.set_major_formatter(mtick.PercentFormatter(decimals = 0))
     save_plot(img_folder)
-    plt.show(block=None)
-    return
 
 
 def print_occurrences(occs):
@@ -67,26 +78,22 @@ def print_occurrences(occs):
     """
     row_number = 0
     total_posts = occs["total"]
-    print ("{:<8} {:<15} {:<15} {:<15}".format("Rank", 'Hashtag','Occurrences',f'Frequency (Occurrences/Total-Posts(total_posts))'))
+    print ("{:<8} {:<15} {:<15} {:<15}".format("Rank", 'Hashtag','Occurrences','Frequency'))
     for key,value in zip(occs["top_n"][0], occs["top_n"][1]):
         ratio = value/total_posts 
         print ("{:<8} {:<15} {:<15} {:<15}".format(row_number, key, value, ratio))
         row_number += 1
-    return
 
 
 def save_plot(img_folder):
     """
     Saves the plot to a png file in the folder /data/imgs/
     """
-    try:
-        now = datetime.now()
-        current_time = now.strftime("%Y_%m_%d_%H_%M_%S")
-        plt.savefig(f"{img_folder}/{current_time}.png")
-
-        return
-    except: raise
-
+    now = datetime.now()
+    current_time = now.strftime("%Y_%m_%d_%H_%M_%S")
+    filename = f"{img_folder}/{current_time}.png"
+    plt.savefig(filename, bbox_inches = 'tight', facecolor = 'white', dpi = 300)
+    logging.info(f'Plot saved to file: {filename}')
 
 
 if __name__ == "__main__":
@@ -105,17 +112,14 @@ if __name__ == "__main__":
     parser.add_argument("-p", "--plot", help="Plot the occurrences", action="store_true")
     parser.add_argument("-d", "--print", help="List top n hashtags", action="store_true")
     args = parser.parse_args()
-    if args.input_file and args.n:
-        if args.n < 1:
-            print(f"Please make sure the number of top occurrences is a positive integer.")
-            sys.exit()
-
-        base = os.path.splitext(args.input_file)[0]
-        path = f"./{base}_sorted_hashtags.csv"
-        occs = get_occurrences(args.input_file, args.n)
-        if args.plot:
-            plot(args.n, occs, img_folder)
-        else:
-            print_occurrences(occs)
+    if args.n < 1:
+        raise ValueError(f"Specified argument `n` (the number of hashtags to analyze) must be greater than zero, not: {args.n}.")
+    if not check_existence(args.input_file, 'file'):
+        raise FileNotFoundError(f"Specified argument `input_file` ({args.input_file}) does not exist.")
+    base = os.path.splitext(args.input_file)[0]
+    path = f"./{base}_sorted_hashtags.csv"
+    occs = get_occurrences(args.input_file, args.n)
+    if args.plot:
+        plot(args.n, occs, img_folder)
     else:
-        print(f'ERROR: either {args.input_file} or {args.n} or both contains error.')
+        print_occurrences(occs)
diff --git a/tiktok_downloader/run_downloader.py b/tiktok_downloader/run_downloader.py
index 0d0c68d..529b336 100644
--- a/tiktok_downloader/run_downloader.py
+++ b/tiktok_downloader/run_downloader.py
@@ -1,61 +1,20 @@
 import os
 import time
 import argparse
+import logging
 
 import global_data
 import file_methods
 import data_methods
 
-# setting up the logging
-import logging
-from logging.config import fileConfig
-
-fileConfig('../logging.ini')
 logger = logging.getLogger()
 
-
-"""
-The run_downloader.py dowloads data using the tiktok-scraper (https://github.com/drawrowfly/tiktok-scraper).
-1. "-p" option is used by the user to download posts only
-2. "-v" option is use to download videos only
-3. "-p -v" is used to download posts and videos
-4. "-t" is used to specify a list of hashtags as arguments
-5. "-f" option is used to read the list of hashtags from the user specified file
-
-Example: 
-    1. The command "python3 run_downloader.py -t london paris newyork -p" will download posts for hashtags london, paris and newyork.  
-    2. The command "python3 run_downloader.py -f hashtag_list -p -v" will download posts and videos for hashtags in the file hashtag_list.
-
-
-The downloaded data is stored in the the data folder. The data is folder is organized as follows:
-    1. the log subfolder contains the log.json that records total downloads (posts and videos) for each hashtag with a timestamp of when the script was run.
-    2. the ids subfolder contains post_ids.json and video_ids.json that keep the record of post and video ids that are currently in the data set. This helps to filter out only new posts every time tiktok-scraper is run and only those new posts (or videos) are then stored in the data folder.
-    3. Each hashtag has a subfolder by its name containing two subfolders, one each for posts and videos.
-
-
-This scripts runs the function get_data in main which in turn triggers the following sequence:
-    1. get_posts function is triggered if the user wants to download posts
-    2. get_videos function is triggered if the user wants to download videos
-    3. both functions above are sequentially triggered if the user wants to download both posts and videos.
-    4. After the data is downloaded the log_writer is triggered to log the total number of posts and videos downloaded.
-
-
-------------Files--------------
-global_data - contains global constants relating to paths etc.
-data_methods - this file contains data processing methods
-file_methods - this file contains methods to write and update data in files
-hashtag_list - this file contains the list of hashtags that the user wants to download data for.
-"""
-
-
-
 def get_hashtag_list(file_name):
-    try:
-        with open(file_name) as f:
-            tags = list(filter(None, [line.strip() for line in f if not line.startswith("#")]))
-            return tags
-    except IOError:
-        logger.exception(f"IOError")
+    if not file_methods.check_existence(file_name, 'file'):
+        raise OSError(f"{file_name} does not exist")
+    with open(file_name) as f:
+        tags = list(filter(None, [line.strip() for line in f if not line.startswith("#")]))
+        return tags
 
 
 def create_parser():
@@ -102,16 +61,16 @@ def get_posts(settings, tag):
     3. calls update_posts from data_methods.py to update the id-list with the ids of newly downloaded posts.
     """
     file_path = file_methods.download_posts(settings, tag)
-    log = ()
+    number_scraped = ()
     if file_path:
         new_data = data_methods.extract_posts(settings, file_path, tag)
         if new_data:
             data_file = os.path.join(settings["data"], tag, settings["posts"], settings["data_file"])
             data_methods.update_posts(data_file, "file", new_data[1])
-            log = data_methods.update_posts(settings["post_ids"], "file", new_data[0], tag)
+            number_scraped = data_methods.update_posts(settings["post_ids"], "file", new_data[0], tag)
         file_methods.delete_file(file_path, "file")
     
-    return log
+    return number_scraped
 
 
 
@@ -122,16 +81,16 @@ def get_videos(settings, tag):
     3. calls update_videos from data_methods.py to update the id-list with the ids of newly downloaded videos.
     4. the clean_video_files function deletes the residual video folder after the data processing 
     """
-    log = ()
+    number_scraped = ()
     download_list = file_methods.download_videos(settings, tag)
     if download_list:
         new_data = data_methods.extract_videos(settings, tag, download_list)
         if new_data:
-            log = data_methods.update_videos(settings, new_data, tag)
+            number_scraped = data_methods.update_videos(settings, new_data, tag)
         else:
             file_methods.clean_video_files(settings, tag)
 
-    return log
+    return number_scraped
 
 
 
@@ -143,7 +102,7 @@ def get_data(hashtags, download_data_type):
     counter = 0
     total_hashtags = len(hashtags)
     total_hashtags_offset = total_hashtags - 1
-    log_data = []
+    scraped_summary_list = []
    
     if download_data_type["posts"]:
         settings = set_download_settings(download_data_type)
@@ -153,8 +112,8 @@ def get_data(hashtags, download_data_type):
             file_methods.check_file(os.path.join(settings["data"], tag, settings["posts"], settings["data_file"]), "file")
             res = get_posts(settings, tag)
             if res:
-                log = ( res[0], ( "posts", res[1] ) )
-                log_data.append(log)
+                number_scraped = ( res[0], ( "posts", res[1] ) )
+                scraped_summary_list.append(number_scraped)
                 data_methods.print_total(settings["post_ids"], tag, "posts")
             
             counter += 1
@@ -171,14 +130,14 @@ def get_data(hashtags, download_data_type):
             res = get_videos(settings, tag)
             if res:
                 res = ( res[0], ( "videos", res[1]))
-                log_data.append(res)
+                scraped_summary_list.append(res)
                 data_methods.print_total(settings["video_ids"], tag, "videos")
  
             counter += 1
             if counter < total_hashtags_offset:
                 time.sleep(settings["sleep"])
 
-    return log_data
+    return scraped_summary_list
 
 
 if __name__ == "__main__":
@@ -197,29 +156,15 @@ if __name__ == "__main__":
         file_name = args.f
         hashtags = get_hashtag_list(file_name)
 
-    print(hashtags)
+    logger.info(f"Hashtags to scrape: {hashtags}")
     if not hashtags:
-        logger.exception("No hashtags were given, please use either -t option or -f to provide hashtags.")
+        raise ValueError("No hashtags were specified: please use either the -t flag to specify a space-separated list of one or more hashtags as a command-line argument, or use the -f flag to specify a text file of newline-separated hashtags.")
 
-    if (args.p and args.v):
-        download_data_type = {
-                "posts": True,
-                "videos": True
-                }
-    elif args.p:
-        download_data_type = {
-                "posts": True,
-                "videos": False
-                }
-    else:
-        download_data_type = {
-                "posts": False,
-                "videos": True
+    download_data_type = {
+                "posts": args.p,
+                "videos": args.v
                 }
    
-    try: 
-        log_data = get_data(hashtags, download_data_type)
-        if log_data:
-            file_methods.log_writer(log_data)
-    except:
-        logger.exception(f"ERROR")
+    scraped_summary_list = get_data(hashtags, download_data_type)
+    if scraped_summary_list:
+        file_methods.log_writer(scraped_summary_list)

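Review note on the `__main__` changes above: the sketch below illustrates the simplified flag handling in isolation. It is not the code from `run_downloader.py`. The parser setup and the helper name `build_download_data_type` are assumptions made for illustration, and it assumes `-p` and `-v` are boolean `store_true` flags, which the hunk does not show.

```python
import argparse


def build_download_data_type(argv):
    """Hypothetical helper mirroring the simplified logic in the patch."""
    # Assumed parser definition; the real one is outside this hunk.
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", nargs="+", default=[])
    parser.add_argument("-p", action="store_true")
    parser.add_argument("-v", action="store_true")
    args = parser.parse_args(argv)

    # Fail fast with an exception instead of logging and continuing.
    if not args.t:
        raise ValueError("No hashtags were specified.")

    # The flags map directly onto the dictionary; no if/elif/else chain.
    return {"posts": args.p, "videos": args.v}


if __name__ == "__main__":
    print(build_download_data_type(["-t", "london", "-p"]))
```

Under these assumptions, passing neither `-p` nor `-v` yields `False` for both keys, whereas the removed `if`/`elif`/`else` chain fell back to downloading videos in that case. It may be worth confirming that this change in the default is intended.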
From 64354f60992ec681c6fac13471682bffcd0b812c Mon Sep 17 00:00:00 2001
From: Tristan Lee <tristan@bellingcat.com>
Date: Thu, 5 May 2022 02:32:32 -0500
Subject: [PATCH 2/4] Updated plot figure in README

---
 README.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 4a01a1d..d843a20 100644
--- a/README.md
+++ b/README.md
@@ -110,9 +110,10 @@ Assume we want to analyze the top 20 occurring hashtags in the downloaded posts
     `python3 hashtag_frequencies.py -p ../data/london/posts/data.json 20`
     
     which will produce a figure similar to that shown below:
-
-    ![Top 20 most frequent hashtags in posts containing the #london hashtag!](https://user-images.githubusercontent.com/72805812/155770710-0d167bbb-4c44-44d2-ba1c-fa57026afea8.png)
-
+    <p align="center">
+        <img src="https://user-images.githubusercontent.com/18430739/166878928-d146b352-b68c-4ab4-bd2c-feb2f0140df9.png" height="500" alt="Top 20 most frequent common hashtags in posts containing the #london hashtag">
+    </p>
+    
     Clearly, the highest occurrence will be of the `#london` hashtag, as all posts in the file `data/london/posts/data.json` contain the hashtag `#london`.
 
 - The results can be displayed in tabular form by executing the following command:
@@ -144,4 +145,4 @@ Assume we want to analyze the top 20 occurring hashtags in the downloaded posts
     19       america         20              0.02079002079002079
     ```
 
-    The `Frequency` column shows the ratio of the occurrence to the total number of downloaded posts.
\ No newline at end of file
+    The `Frequency` column shows the ratio of the occurrence to the total number of downloaded posts.

From cd883eeeb13729fcb3d549c85893cb49484eb7d7 Mon Sep 17 00:00:00 2001
From: Tristan Lee <tristan@bellingcat.com>
Date: Thu, 5 May 2022 02:39:23 -0500
Subject: [PATCH 3/4] minor fixes in the README and LICENSE

---
 LICENSE   |  2 +-
 README.md | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/LICENSE b/LICENSE
index e76b7c5..f13ef72 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2021 Mohit Singh Thakur
+Copyright (c) 2022 Bellingcat
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/README.md b/README.md
index d843a20..b89ea8c 100644
--- a/README.md
+++ b/README.md
@@ -2,9 +2,9 @@
 The tool helps to download posts and videos from TikTok for a given set of hashtags. It uses the [tiktok-scraper](https://github.com/drawrowfly/tiktok-scraper) Node package  to download the posts and videos.
 
 ## Pre-requisites
-1. Make sure you have Python 3.6 or a later version installed.
+1. Make sure you have Python 3.6 or a later version installed
 2. Download and install TikTok scraper: https://github.com/drawrowfly/tiktok-scraper 
-3. (Optional) create and activate a virtual environment for this tool, for example by executing the following command, which creates the `.env` virtual environment in the project directory:
+3. (Optional) create and activate a virtual environment for this tool, for example by executing the following command, which creates the `.env` virtual environment in the tool's root directory:
 
     `python3 -m venv .env`
 
@@ -77,13 +77,13 @@ and will produce an output similar to the following log:
 - The `-p` flag specifies that posts, not videos, will be downloaded
 
 ### Video downloading
-Running the `run_downloader.py` script with the following options will scrape trending videos containing the hashtags `#london`, `#paris`, or `#newyork`:
+Running the `run_downloader.py` script with the following options will scrape trending videos containing the hashtag `#london`:
 ` python3 run_downloader.py -t london -v`
 
 - The `-t` flag allows a space-separated list of hashtags to be specified as a command line argument
 - The `-v` flag specifies that videos, not posts, will be downloaded
 
-Note that video downloading is a time and data rate consuming task, as a result we strongly recommend using one hashtag at a time when using the `-v` flag to avoid complications.
+Note that video downloading consumes considerable time and bandwidth; as a result, we recommend using one hashtag at a time when using the `-v` flag to avoid complications.
 
 ## Analyzing results 
 ### Top n hashtag occurrences 
@@ -103,7 +103,7 @@ optional arguments:
   -d, --print  List top n hashtags
   ```
 
-Assume we want to analyze the top 20 occurring hashtags in the downloaded posts of the `#london` hashtag.
+Assume we want to analyze the 20 most frequently occurring hashtags in the downloaded posts of the `#london` hashtag.
 
 - The results can be plotted and saved as a PNG file by executing the following command: 
 
@@ -114,7 +114,7 @@ Assume we want to analyze the top 20 occurring hashtags in the downloaded posts
         <img src="https://user-images.githubusercontent.com/18430739/166878928-d146b352-b68c-4ab4-bd2c-feb2f0140df9.png" height="500" alt="Top 20 most frequent common hashtags in posts containing the #london hashtag">
     </p>
     
-    Clearly, the highest occurrence will be of the `#london` hashtag, as all posts in the file `data/london/posts/data.json` contain the hashtag `#london`.
+    In the above plot, the most frequently occurring hashtag is `#fyp`, which appears in more than half of all posts containing the `#london` hashtag.
 
 - The results can be displayed in tabular form by executing the following command:
 

From af5bcc9433a138bedf7285268e4423bd2c1c6f67 Mon Sep 17 00:00:00 2001
From: Tristan Lee <tristan@bellingcat.com>
Date: Thu, 5 May 2022 02:58:42 -0500
Subject: [PATCH 4/4] fixed typo in Windows venv activation command

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b89ea8c..7d13107 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ The tool helps to download posts and videos from TikTok for a given set of hasht
 
 4. Start your virtual environment
     - On Unix-like operating systems (macOS, Linux), this can be done using the command `source .env/bin/activate`
-    - On Windows, this can be done using the command `.env\activate.bat`
+    - On Windows, this can be done using the command `.env\Scripts\activate.bat`
     
 5. Install the Python package dependencies for this tool by executing the command: