Commit Graph

84 Commits

Author SHA1 Message Date
Jody Leonard
20ea117a2c Fix vkontakte-user pagination 2019-10-30 22:29:49 -04:00
JustAnotherArchivist
e6aae35304 Use setuptools_scm for versioning through git tags (tag: v0.3.0) 2019-07-01 17:41:18 +00:00
JustAnotherArchivist
b698a201f5 Update scraper list 2019-07-01 16:05:21 +00:00
JustAnotherArchivist
7fe72cf708 Add a note about reporting issues with proper debugging information 2019-07-01 16:01:11 +00:00
JustAnotherArchivist
4651cde447 Refactor CLI logging and add --dump-locals for better debugging 2019-07-01 15:46:10 +00:00
JustAnotherArchivist
c99cc4b5d3 Remove existing logging handlers 2019-07-01 15:42:06 +00:00
JustAnotherArchivist
628074d6fc Print contents when ignoring a link-less entry 2019-07-01 01:35:00 +00:00
JustAnotherArchivist
64b293bd9e Add support for media sets
Closes #48
2019-07-01 01:34:17 +00:00
JustAnotherArchivist
180f4dfeb7 Add support for photo.php URLs
Fixes #42
2019-06-30 18:36:39 +00:00
JustAnotherArchivist
6d6e3fa16c Fix crash on (some?) inexistent groups 2019-06-30 18:36:30 +00:00
JustAnotherArchivist
5f7e6936c1 Add support for Facebook groups
Closes #47
2019-06-30 17:16:09 +00:00
JustAnotherArchivist
e2c05c9e0c Split common code off into FacebookCommonScraper and refactor odd link detection in preparation of group scraping 2019-06-30 16:28:33 +00:00
JustAnotherArchivist
14e11b28d2 Add support for Twitter lists
Closes #46
2019-06-30 14:39:29 +00:00
JustAnotherArchivist
1a07b3b7e8 Add support for Twitter threads 2019-06-30 02:11:46 +00:00
JustAnotherArchivist
4d8cc7bdb9 Extract outlinks from Facebook 2019-06-27 15:29:05 +00:00
JustAnotherArchivist
eec83f181e Check HTTP status code before attempting parsing 2019-06-27 15:25:26 +00:00
JustAnotherArchivist
fae7432c64 Log details about failed JSON parsing 2019-06-27 15:25:08 +00:00
JustAnotherArchivist
757818474d Add tweet ID and username fields to Tweet items 2019-06-23 11:48:54 +00:00
JustAnotherArchivist
e6c934c0b8 Retrieve as many posts at once as possible for Instagram hashtags 2019-06-21 09:56:12 +00:00
JustAnotherArchivist
d2315feec1 Add support for Instagram locations 2019-06-21 09:55:30 +00:00
JustAnotherArchivist
765ceeeb10 More complete and more readable exception dump 2019-06-18 14:25:38 +00:00
JustAnotherArchivist
731a2e8c8b Check that Instagram returned valid JSON, take 2
Fixes #22
2019-06-10 15:03:15 +00:00
JustAnotherArchivist
7d1916292c Twitter: stop recursion based on whether the server returns the same position instead of detecting an empty feed
Fixes #37
2019-06-10 14:38:25 +00:00
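Note on 7d1916292c: the stop condition described in this commit message can be illustrated with a minimal sketch. It is not the project's actual code. The names `paginate` and `fetch_page` are hypothetical, and the sketch uses a loop where the commit message speaks of recursion. It only shows the idea of ending pagination when the server hands back the position that was just requested, so that an empty page in the middle of a feed does not end the scrape early.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical signature: takes a position (None for the first page) and
# returns the items on that page plus the position of the next page.
FetchPage = Callable[[Optional[str]], Tuple[List[str], str]]


def paginate(fetch_page: FetchPage) -> Iterator[str]:
    """Yield items page by page until the server repeats the position."""
    position: Optional[str] = None
    while True:
        items, next_position = fetch_page(position)
        yield from items
        # Stop when the position no longer advances; an empty page alone
        # is not treated as the end of the feed.
        if next_position == position:
            break
        position = next_position
```

With a fake `fetch_page` that returns an empty page halfway through, the loop keeps going and only stops once the returned position equals the requested one.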
JustAnotherArchivist
0d509c4ba0 Check that Instagram returned valid JSON (fixes #22) 2019-05-30 15:04:05 +00:00
JustAnotherArchivist
907a003a59 Fix crash when Twitter search produces no results (fixes #41) 2019-05-24 11:51:50 +00:00
JustAnotherArchivist
8ada279b57 Add warning if Twitter module gets no results 2019-05-24 11:50:39 +00:00
JustAnotherArchivist
900eae54a6 Ignore branded content link on Facebook silently 2019-05-24 11:49:44 +00:00
JustAnotherArchivist
7989af27b5 Handle tweets by temporarily blocked accounts (which show up in the search results but don't have a date or content) 2019-05-21 22:37:43 +00:00
JustAnotherArchivist
e528ca3f26 Dump locals only for snscrape modules (closes #39) 2019-05-18 01:08:49 +00:00
JustAnotherArchivist
32a427dac3 Fix pagination on Twitter (fixes #40) 2019-05-18 01:08:00 +00:00
JustAnotherArchivist
7001983556 Skip timeline entries that don't have a link (fixes #36) 2019-05-16 23:17:46 +00:00
JustAnotherArchivist
64438afc92 Work around tweet URLs that don't have a data-expanded-url attribute (fixes #38) 2019-05-16 22:51:22 +00:00
JustAnotherArchivist
9e6538556a Dump also the deeper frames, not just the get_items one 2019-05-16 22:48:35 +00:00
JustAnotherArchivist
9c8bbf051c Fix order of processing in Twitter module for more useful locals dump output 2019-05-16 22:22:53 +00:00
JustAnotherArchivist
c6a11298ac Fix missing linebreak in locals dump output 2019-05-16 22:22:21 +00:00
JustAnotherArchivist
02cbf6ddf6 Dump locals to a temporary file in case of an exception 2019-05-16 18:29:30 +00:00
JustAnotherArchivist
3817aa59d4 Add support for extracting links from tweets (including cards)
Both the t.co and the original URLs can be extracted. Note that card links are always t.co since Twitter's HTML does not include the original URL for those.
2019-05-16 16:42:52 +00:00
JustAnotherArchivist
46a51008f8 Fix Instagram signature calculation 2019-05-16 16:19:51 +00:00
JustAnotherArchivist
f91979eb32 Add --max-position option to twitter-search scraper as a workaround for pagination stopping early (#37)
The value needs to be of the format 'TWEET-<seenID>-<newestID>' where <seenID> is the last result that was returned by a previous scrape and <newestID> is the first result returned by the initial scrape.
2019-05-10 17:30:15 +00:00
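Note on f91979eb32: as an illustration of the value format described in this commit message, the sketch below assembles a 'TWEET-<seenID>-<newestID>' string from two tweet IDs. The helper name `build_max_position` is hypothetical and not part of the project; the positive-integer check is an assumption added for the example.

```python
def build_max_position(seen_id: int, newest_id: int) -> str:
    """Return a string of the form 'TWEET-<seenID>-<newestID>'.

    seen_id:   last result returned by a previous scrape
    newest_id: first result returned by the initial scrape
    """
    # Assumption for this sketch: tweet IDs are positive integers.
    if seen_id <= 0 or newest_id <= 0:
        raise ValueError("tweet IDs must be positive integers")
    return f"TWEET-{seen_id}-{newest_id}"
```

The resulting string is what would be passed as the value of --max-position to the twitter-search scraper.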
JustAnotherArchivist
85fff319bc Disable Twitter's spelling correction
src=typd means "this is what was typed in and could be incorrect". src=spxr is "no, I really mean that". src=sprv appears to be an alias of spxr that is no longer used.
2019-05-10 16:43:59 +00:00
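Note on 85fff319bc: a rough sketch of what setting the src parameter looks like when a query string is built. Only the src values come from the commit message. The function name `build_search_query` and the `q` parameter name for the search terms are assumptions made for illustration, and the sketch deliberately builds only the query string, not a full URL.

```python
from urllib.parse import urlencode


def build_search_query(terms: str, disable_spelling_correction: bool = True) -> str:
    """Build a URL query string for a search request.

    'q' as the name of the search-terms parameter is an assumption.
    """
    # Per the commit message: 'typd' marks the query as typed (and possibly
    # incorrect), while 'spxr' means the query is really meant as written.
    src = "spxr" if disable_spelling_correction else "typd"
    return urlencode({"q": terms, "src": src})
```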
JustAnotherArchivist
6b145526b7 Update README with new modules 2019-04-21 23:10:32 +02:00
JustAnotherArchivist
abf31764b1 Version 0.2.0 (tag: v0.2.0) 2019-04-21 23:03:21 +02:00
JustAnotherArchivist
64693f74bb Update Instagram query hash 2019-04-19 01:47:38 +02:00
JustAnotherArchivist
a7d08ed51c Remove leftover debugging print 2019-04-19 01:40:29 +02:00
JustAnotherArchivist
f48ca7726e Add support for Gab 2019-04-19 00:40:43 +02:00
JustAnotherArchivist
78c295f7e0 Add support for VKontakte (fixes #13) 2019-04-18 18:39:21 +02:00
JustAnotherArchivist
a5aca1a14f Add support for Instagram hashtags (fixes #29) 2019-04-18 16:14:54 +02:00
JustAnotherArchivist
96f7d871c1 Ignore Scraper subclasses which don't set a name 2019-04-18 16:14:26 +02:00
JustAnotherArchivist
b5dfd37949 Support unix timestamps in --since 2019-04-18 16:01:35 +02:00
JustAnotherArchivist
b511397791 Add --since option to return only results newer than a certain date (fixes #19) 2019-04-18 15:12:29 +02:00
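
Note on b5dfd37949 and b511397791: one way an option like --since could accept either a date or a unix timestamp is sketched below. This is not the project's implementation. The function name `parse_since`, the YYYY-MM-DD date format, and the choice of UTC are all assumptions for the example; the commit messages do not say which date formats are accepted.

```python
import datetime


def parse_since(value: str) -> datetime.datetime:
    """Parse a cutoff given as a unix timestamp or a YYYY-MM-DD date."""
    if value.isdigit():
        # All digits: treat the value as seconds since the epoch.
        return datetime.datetime.fromtimestamp(int(value), tz=datetime.timezone.utc)
    # Otherwise assume a plain calendar date at midnight UTC.
    parsed = datetime.datetime.strptime(value, "%Y-%m-%d")
    return parsed.replace(tzinfo=datetime.timezone.utc)


def is_newer(item_date: datetime.datetime, since: datetime.datetime) -> bool:
    """Return True for items strictly newer than the cutoff."""
    return item_date > since
```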