44 Commits

Author SHA1 Message Date
JustAnotherArchivist
6b145526b7 Update README with new modules 2019-04-21 23:10:32 +02:00
JustAnotherArchivist
abf31764b1 Version 0.2.0 (tag: v0.2.0) 2019-04-21 23:03:21 +02:00
JustAnotherArchivist
64693f74bb Update Instagram query hash 2019-04-19 01:47:38 +02:00
JustAnotherArchivist
a7d08ed51c Remove leftover debugging print 2019-04-19 01:40:29 +02:00
JustAnotherArchivist
f48ca7726e Add support for Gab 2019-04-19 00:40:43 +02:00
JustAnotherArchivist
78c295f7e0 Add support for VKontakte (fixes #13) 2019-04-18 18:39:21 +02:00
JustAnotherArchivist
a5aca1a14f Add support for Instagram hashtags (fixes #29) 2019-04-18 16:14:54 +02:00
JustAnotherArchivist
96f7d871c1 Ignore Scraper subclasses which don't set a name 2019-04-18 16:14:26 +02:00
JustAnotherArchivist
b5dfd37949 Support unix timestamps in --since 2019-04-18 16:01:35 +02:00
JustAnotherArchivist
b511397791 Add --since option to return only results newer than a certain date (fixes #19) 2019-04-18 15:12:29 +02:00
JustAnotherArchivist
536fcb3303 Return proper items from scrapers including clean URLs (fixes #9 and #10) 2019-04-18 14:44:21 +02:00
JustAnotherArchivist
f8d812f799 Include permalink.php, events, and notes (fixes #32) 2019-04-18 04:22:47 +02:00
JustAnotherArchivist
c2cebd9166 Accept-Language header to get an English response unconditionally 2019-04-18 03:58:37 +02:00
JustAnotherArchivist
73bc99596f Treat Twitter responses without a Content-Type header as invalid (fixes #21) 2019-04-18 02:24:35 +02:00
JustAnotherArchivist
8458c12218 Rewrite link extraction on Facebook (fixes #17)
Facebook's returned HTML has a large number of inconsistencies; some (most) pages include a <link rel="canonical" /> but some don't, for example. This was at the root of the failing post extraction for some Facebook pages (#17). The previous link extraction technique was also quite poor for other reasons though. The new method uses the relevant CSS classes instead. Despite probably being the result of a CSS minimiser or similar, these seem to be quite stable: they haven't changed in the past two years (but the more readable ones have!).
2019-04-18 02:14:21 +02:00
JustAnotherArchivist
b59c7e8d8f Merge pull request #28 from peterk/master
Adds socks proxy support (via requests)
2019-03-11 13:32:07 +01:00
Peter Krantz
3ceb849d98 Adds socks proxy support (via requests) 2019-01-10 22:54:42 +01:00
JustAnotherArchivist
f5ee1f7ac5 Merge pull request #26 from ludios/avoid-twitter-bans
twitter: randomize user agent to avoid Twitter's (IP, UA)-keyed bans
2018-12-25 02:19:17 +01:00
Ivan Kozik
1984110f78 twitter: randomize user agent to avoid Twitter's (IP, UA)-keyed bans 2018-12-24 08:03:33 +00:00
JustAnotherArchivist
c5a5dcb92c snscrape is now on PyPI 2018-10-09 17:26:03 +02:00
JustAnotherArchivist
cfb1c9a2aa Version 0.1.3 (tag: v0.1.3) 2018-10-01 03:26:22 +02:00
JustAnotherArchivist
d0d3c8b2a6 Better log output for temporary failures (fixes #2) 2018-10-01 03:24:29 +02:00
JustAnotherArchivist
4d0350e541 Disable "quality filter" on Twitter (fixes #3) 2018-10-01 02:51:33 +02:00
JustAnotherArchivist
d17aa15bcb Version 0.1.2 (tag: v0.1.2) 2018-09-11 12:44:07 +02:00
JustAnotherArchivist
d1ef280d6e Fix snscrape.modules not getting installed 2018-09-11 12:43:10 +02:00
JustAnotherArchivist
2823272e0b Version 0.1.1 (tag: v0.1.1) 2018-09-11 12:30:35 +02:00
JustAnotherArchivist
540f557002 Fix typo in setup.py preventing installation 2018-09-11 12:30:21 +02:00
JustAnotherArchivist
5fc60fe978 Version 0.1 (tag: v0.1) 2018-09-10 22:15:11 +02:00
JustAnotherArchivist
cf36e8be97 Add README, LICENSE, and metadata 2018-09-10 22:15:03 +02:00
JustAnotherArchivist
0350ab0692 Fix Facebook scraper returning strings instead of Items 2018-09-10 19:38:43 +02:00
JustAnotherArchivist
6b6ae3d33b Rename from socialmediascraper to snscrape 2018-08-21 22:54:14 +02:00
JustAnotherArchivist
9fb3ac6013 Add support for Google+ user profiles 2018-08-21 18:58:43 +02:00
JustAnotherArchivist
897f5bebe6 Add support for POST requests 2018-08-21 18:58:09 +02:00
JustAnotherArchivist
e28a2cdb4b Fix Instagram again
- __a=1 is no longer supported, so we need to extract the JSON from the HTML page instead.
- There is now a X-Instagram-GIS header that needs to be set correctly.
2018-08-21 18:55:40 +02:00
JustAnotherArchivist
5a084af85c Fix Instagram
Instagram dropped the max_id parameter, so it is no longer possible to iterate over the posts so easily. Switch to GraphQL instead, which is what's used in the browser as well.
2018-08-21 18:50:00 +02:00
JustAnotherArchivist
14831d4137 Add support for Facebook user profiles 2018-08-21 18:48:34 +02:00
JustAnotherArchivist
6d54655a7f Add support for Instagram user profiles 2018-08-21 18:47:44 +02:00
JustAnotherArchivist
3ab69a1a0f Merge Twitter user and hashtag into one, and add support for generic Twitter search scrapes 2018-08-21 18:46:34 +02:00
JustAnotherArchivist
d03c82d413 Support nested inheritance from socialmediascraper.base.Scraper 2018-08-21 18:44:15 +02:00
JustAnotherArchivist
02473876d7 Add milliseconds to the log timestamps 2018-08-21 18:43:44 +02:00
JustAnotherArchivist
e3190ee541 Add support for Twitter hashtags 2018-08-21 18:40:42 +02:00
JustAnotherArchivist
606b81e066 Use a session for proper cookie handling, and add exponential backoff in case of errors 2018-08-21 18:39:39 +02:00
JustAnotherArchivist
d085018a5f Split up into modules 2018-08-21 18:35:40 +02:00
JustAnotherArchivist
1ae006b268 Initial commit 2018-08-21 18:28:16 +02:00
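
Note on 606b81e066 ("add exponential backoff in case of errors"): the log does not show how the retry logic is implemented, so the following is only a minimal sketch of the general technique, not the project's code. The function name `retry_with_backoff` and the parameters `max_attempts`, `base_delay`, and `sleep` are hypothetical and chosen for illustration. The session and cookie handling mentioned in the same commit is left out.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure.

    Hypothetical helper for illustration only; the names and defaults are
    assumptions, not taken from the repository.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between
            # attempts; no wait after the final failed attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter lets a caller substitute a recording function for `time.sleep`, so the delay schedule can be checked without actually waiting.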