Commit Graph

  • 8cf81e9bfc Fix twitter-profile scraper JustAnotherArchivist 2020-09-25 02:45:07 +00:00
  • d90f06b389 Extract more information on users from Twitter JustAnotherArchivist 2020-09-24 18:39:32 +00:00
  • c519832755 Clarify twitter-list-posts argument value JustAnotherArchivist 2020-09-24 18:37:37 +00:00
  • 397a0b988e Remove Twitter list member scraper JustAnotherArchivist 2020-09-24 18:34:15 +00:00
  • f1428fa0e0 Fix crash on nested quoted tweets JustAnotherArchivist 2020-09-24 02:45:49 +00:00
  • 7d2c546ee5 Deprecate hacky fields in Tweet objects JustAnotherArchivist 2020-09-24 02:00:45 +00:00
  • 2332c30e26 Replace locale-dependent strptime date parsing with email.utils.parsedate_to_datetime JustAnotherArchivist 2020-09-24 02:00:21 +00:00
  • b78bf3e642 Fix crash on banner-less profiles and nested descriptionUrls JustAnotherArchivist 2020-09-24 01:58:38 +00:00
  • 1a09f9b9a3 Extract more information from Twitter JustAnotherArchivist 2020-09-24 01:45:08 +00:00
  • 5ae5ec7bcd Bump Python version classifier JustAnotherArchivist 2020-09-23 22:25:38 +00:00
  • c0ff6631aa Update README JustAnotherArchivist 2020-09-22 22:30:08 +00:00
  • ae60a4d0fd Add Weibo scraper JustAnotherArchivist 2020-09-13 02:27:35 +00:00
  • 800cfd5be0 Add support for disabling following redirects JustAnotherArchivist 2020-09-13 00:52:26 +00:00
  • f296f9d21d Refactor post extraction of VK again to work around their weird behaviours JustAnotherArchivist 2020-09-12 02:00:50 +00:00
  • 8265ffc19e Work around geoblocked posts on VK JustAnotherArchivist 2020-09-12 01:24:19 +00:00
  • f8efe98608 Fix post order on VK: reinsert pinned post at the correct location in the stream JustAnotherArchivist 2020-09-12 00:03:29 +00:00
  • 2b5444f89e Restrict --max-results to zero or positive values; use zero to indicate fetching only the entity JustAnotherArchivist 2020-09-11 15:37:22 +00:00
  • 07d446fd19 Fix crash in VK scraper JustAnotherArchivist 2020-09-10 21:05:03 +00:00
  • a25426043b Fix Telegram username canonicalisation JustAnotherArchivist 2020-09-09 09:33:57 +00:00
  • 84692846b9 Fix crash in Telegram scraper JustAnotherArchivist 2020-09-09 09:22:00 +00:00
  • 039b2c6719 Restructure Twitter classes since the 'common' scraper is only used for the old design anymore JustAnotherArchivist 2020-09-07 02:38:27 +00:00
  • 70a3d9ba3a Fix infinite loop at the end of profile pages JustAnotherArchivist 2020-09-01 04:01:27 +00:00
  • bd619bf4e9 Log and ignore tweets which are not contained in the globalObjects JustAnotherArchivist 2020-09-01 03:45:23 +00:00
  • 072519f539 Fix pagination on profile pages JustAnotherArchivist 2020-09-01 03:23:45 +00:00
  • d9572ec450 Correctly serialise nested NamedTuples JustAnotherArchivist 2020-09-01 03:16:25 +00:00
  • ba250aabf2 Extract retweeted tweet if present JustAnotherArchivist 2020-09-01 03:15:21 +00:00
  • 0cc4f0c016 Add support for Twitter profile pages JustAnotherArchivist 2020-09-01 03:13:49 +00:00
  • 1a2e367a87 Cache entities JustAnotherArchivist 2020-09-01 02:34:21 +00:00
  • 4f24843f89 Extract user ID JustAnotherArchivist 2020-09-01 02:26:13 +00:00
  • bfb92a47b9 Move Tweet object generation to TwitterAPIScraper JustAnotherArchivist 2020-09-01 02:25:00 +00:00
  • dc5d55004b Refactor API interaction into something cleaner and more reusable JustAnotherArchivist 2020-09-01 01:56:07 +00:00
  • d8e7f96d4d Add support for Reddit JustAnotherArchivist 2020-08-31 03:38:20 +00:00
  • bb83d1d72f Validate Twitter usernames JustAnotherArchivist 2020-08-24 19:03:52 +00:00
  • 1480260e47 Handle Telegram channels without public posts JustAnotherArchivist 2020-08-24 17:54:30 +00:00
  • c8d688d39f Fix crash on Telegram pages without a description JustAnotherArchivist 2020-08-24 17:53:50 +00:00
  • 9df4352089 Fix crash on VK pages without an info div JustAnotherArchivist 2020-08-24 17:42:33 +00:00
  • dd25fd0526 Add support for extracting the entity behind a scrape JustAnotherArchivist 2020-08-24 01:38:27 +00:00
  • c90fd54b6b Make datetime.date serialisable JustAnotherArchivist 2020-08-24 01:12:38 +00:00
  • 9528df48cd Refactor base URL handling JustAnotherArchivist 2020-08-24 01:12:06 +00:00
  • 924c35f883 Refactor guest token extraction code JustAnotherArchivist 2020-08-22 22:59:43 +00:00
  • 588ec415ff Force TwitterThreadScraper to fetch the old design (take 2) JustAnotherArchivist 2020-08-12 17:19:42 +00:00
  • bf229414ba Add JSONL output format JustAnotherArchivist 2020-08-12 15:09:02 +00:00
  • afa819547d Update README JustAnotherArchivist 2020-08-11 22:18:04 +00:00
  • dbcdc159ef Add support for scraping Facebook page visitor posts aka 'Community' JustAnotherArchivist 2020-08-11 22:14:27 +00:00
  • 30f945897a Clean Facebook group post URLs JustAnotherArchivist 2020-08-11 20:48:14 +00:00
  • eee5794ff9 Extract Facebook group post in chronological order (instead of by last comment) JustAnotherArchivist 2020-08-11 20:47:42 +00:00
  • 966a6ebd8e Skip promoted tweets/ads JustAnotherArchivist 2020-08-11 20:28:35 +00:00
  • 4d3d0fe0d7 Update search API parameter values to the ones currently used on Twitter JustAnotherArchivist 2020-08-11 20:26:56 +00:00
  • 7b967ff82a Twitter reverted their guest token change (90f9598e) (tag: v0.3.4) JustAnotherArchivist 2020-07-08 22:07:18 +00:00
  • 90f9598ecc Adjust to Twitter's new method of handing out guest tokens (tag: v0.3.3) JustAnotherArchivist 2020-06-24 21:22:58 +00:00
  • 7b3c7deb28 Catch login redirects on Instagram (tag: v0.3.2) JustAnotherArchivist 2020-05-30 00:56:34 +00:00
  • 040a11656c Update README JustAnotherArchivist 2020-05-30 00:47:03 +00:00
  • 1459245258 Consistently raise ScraperException on fatal errors JustAnotherArchivist 2020-05-30 00:40:04 +00:00
  • dbe4c5ce55 Remove Google+ module JustAnotherArchivist 2020-05-30 00:35:06 +00:00
  • 80491ecc2c Remove Gab module JustAnotherArchivist 2020-05-30 00:23:33 +00:00
  • 1a71b58101 Add support for Telegram JustAnotherArchivist 2020-05-29 23:44:01 +00:00
  • 0ce37a69d4 Log exception details on crashes JustAnotherArchivist 2020-05-29 22:29:23 +00:00
  • 722bfd5f7c Handle Twitter tombstones JustAnotherArchivist 2020-05-29 22:12:37 +00:00
  • b6cc3180d9 Force TwitterThreadScraper and TwitterListMembersScraper to fetch the old design (tag: v0.3.1) JustAnotherArchivist 2020-03-04 00:40:23 +00:00
  • 613395d1c2 Port TwitterSearchScraper to redesign JustAnotherArchivist 2020-03-04 00:33:48 +00:00
  • 82a87b7b5a Merge pull request #53 from JackDallas/add-more-insta-fields JustAnotherArchivist 2020-02-09 23:48:59 +00:00
  • 9568028bf9 Update changed fields Jack Dallas 2020-02-07 11:30:16 +00:00
  • 6df351772e Fix crash in Facebook scraper on link-less entries JustAnotherArchivist 2020-02-05 16:15:10 +00:00
  • 541173b0c8 Merge pull request #54 from jodizzle/fix/vkontakte-user JustAnotherArchivist 2020-02-05 14:56:12 +00:00
  • b6772d3778 vkontakte-user: Handle additional un-scrapeable profile case Jody Leonard 2019-10-30 22:21:29 -04:00
  • 20ea117a2c Fix vkontakte-user pagination Jody Leonard 2019-10-30 20:39:27 -04:00
  • ff54c350bc Add more fields to the instagram scraper JackDallas 2019-08-30 12:43:02 +01:00
  • e6aae35304 Use setuptools_scm for versioning through git tags (tag: v0.3.0) JustAnotherArchivist 2019-07-01 17:13:13 +00:00
  • b698a201f5 Update scraper list JustAnotherArchivist 2019-07-01 16:05:21 +00:00
  • 7fe72cf708 Add a note about reporting issues with proper debugging information JustAnotherArchivist 2019-07-01 16:01:11 +00:00
  • 4651cde447 Refactor CLI logging and add --dump-locals for better debugging JustAnotherArchivist 2019-07-01 15:46:10 +00:00
  • c99cc4b5d3 Remove existing logging handlers JustAnotherArchivist 2019-07-01 15:42:06 +00:00
  • 628074d6fc Print contents when ignoring a link-less entry JustAnotherArchivist 2019-07-01 01:35:00 +00:00
  • 64b293bd9e Add support for media sets JustAnotherArchivist 2019-07-01 01:34:17 +00:00
  • 180f4dfeb7 Add support for photo.php URLs JustAnotherArchivist 2019-06-30 18:36:39 +00:00
  • 6d6e3fa16c Fix crash on (some?) inexistent groups JustAnotherArchivist 2019-06-30 18:36:30 +00:00
  • 5f7e6936c1 Add support for Facebook groups JustAnotherArchivist 2019-06-30 17:16:09 +00:00
  • e2c05c9e0c Split common code off into FacebookCommonScraper and refactor odd link detection in preparation of group scraping JustAnotherArchivist 2019-06-30 16:28:33 +00:00
  • 14e11b28d2 Add support for Twitter lists JustAnotherArchivist 2019-06-30 14:39:29 +00:00
  • 1a07b3b7e8 Add support for Twitter threads JustAnotherArchivist 2019-06-30 02:11:46 +00:00
  • 4d8cc7bdb9 Extract outlinks from Facebook JustAnotherArchivist 2019-06-27 15:29:05 +00:00
  • eec83f181e Check HTTP status code before attempting parsing JustAnotherArchivist 2019-06-27 15:25:26 +00:00
  • fae7432c64 Log details about failed JSON parsing JustAnotherArchivist 2019-06-27 15:25:08 +00:00
  • 757818474d Add tweet ID and username fields to Tweet items JustAnotherArchivist 2019-06-23 11:48:54 +00:00
  • e6c934c0b8 Retrieve as many posts at once as possible for Instagram hashtags JustAnotherArchivist 2019-06-21 09:56:12 +00:00
  • d2315feec1 Add support for Instagram locations JustAnotherArchivist 2019-06-21 09:55:30 +00:00
  • 765ceeeb10 More complete and more readable exception dump JustAnotherArchivist 2019-06-18 14:25:38 +00:00
  • 731a2e8c8b Check that Instagram returned valid JSON, take 2 JustAnotherArchivist 2019-06-10 15:03:15 +00:00
  • 7d1916292c Twitter: stop recursion based on whether the server returns the same position instead of detecting an empty feed JustAnotherArchivist 2019-06-10 14:38:25 +00:00
  • 0d509c4ba0 Check that Instagram returned valid JSON (fixes #22) JustAnotherArchivist 2019-05-30 15:04:05 +00:00
  • 907a003a59 Fix crash when Twitter search produces no results (fixes #41) JustAnotherArchivist 2019-05-24 11:51:50 +00:00
  • 8ada279b57 Add warning if Twitter module gets no results JustAnotherArchivist 2019-05-24 11:50:39 +00:00
  • 900eae54a6 Ignore branded content link on Facebook silently JustAnotherArchivist 2019-05-24 11:49:44 +00:00
  • 7989af27b5 Handle tweets by temporarily blocked accounts (which show up in the search results but don't have a date or content) JustAnotherArchivist 2019-05-21 22:37:43 +00:00
  • e528ca3f26 Dump locals only for snscrape modules (closes #39) JustAnotherArchivist 2019-05-18 01:08:49 +00:00
  • 32a427dac3 Fix pagination on Twitter (fixes #40) JustAnotherArchivist 2019-05-18 01:08:00 +00:00
  • 7001983556 Skip timeline entries that don't have a link (fixes #36) JustAnotherArchivist 2019-05-16 23:17:46 +00:00
  • 64438afc92 Work around tweet URLs that don't have a data-expanded-url attribute (fixes #38) JustAnotherArchivist 2019-05-16 22:51:22 +00:00
  • 9e6538556a Dump also the deeper frames, not just the get_items one JustAnotherArchivist 2019-05-16 22:48:35 +00:00
  • 9c8bbf051c Fix order of processing in Twitter module for more useful locals dump output JustAnotherArchivist 2019-05-16 22:22:53 +00:00