JustAnotherArchivist
32a427dac3
Fix pagination on Twitter (fixes #40)
2019-05-18 01:08:00 +00:00
JustAnotherArchivist
7001983556
Skip timeline entries that don't have a link (fixes #36)
2019-05-16 23:17:46 +00:00
JustAnotherArchivist
64438afc92
Work around tweet URLs that don't have a data-expanded-url attribute (fixes #38)
2019-05-16 22:51:22 +00:00
JustAnotherArchivist
9e6538556a
Dump also the deeper frames, not just the get_items one
2019-05-16 22:48:35 +00:00
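A minimal sketch of the idea behind the locals-dump commits in this log: on an exception, write the local variables of every frame in the traceback to a temporary file, then re-raise. The wrapper name, the file prefix, and the `dump_path` attribute are hypothetical and not taken from snscrape.

```python
import tempfile
import traceback


def dump_locals_on_exception(func, *args, **kwargs):
    """Call func; if it raises, dump the locals of each frame to a temp file and re-raise."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        # delete=False keeps the file around after it is closed so it can be inspected later.
        with tempfile.NamedTemporaryFile('w', prefix='locals-dump-', suffix='.log', delete=False) as fp:
            # tb_next skips this wrapper's own frame; walk_tb then yields (frame, lineno)
            # from the outermost called frame down to the one that raised.
            for frame, lineno in traceback.walk_tb(exc.__traceback__.tb_next):
                fp.write(f'{frame.f_code.co_filename}:{lineno} in {frame.f_code.co_name}\n')
                for name, value in list(frame.f_locals.items()):
                    fp.write(f'  {name} = {value!r}\n')
                fp.write('\n')  # blank line between frames
            exc.dump_path = fp.name  # hypothetical attribute so callers can find the dump
        raise
```

Walking the whole traceback rather than a single frame is what makes the dump useful when the failure happens several calls below the entry point.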
JustAnotherArchivist
9c8bbf051c
Fix order of processing in Twitter module for more useful locals dump output
2019-05-16 22:22:53 +00:00
JustAnotherArchivist
c6a11298ac
Fix missing linebreak in locals dump output
2019-05-16 22:22:21 +00:00
JustAnotherArchivist
02cbf6ddf6
Dump locals to a temporary file in case of an exception
2019-05-16 18:29:30 +00:00
JustAnotherArchivist
3817aa59d4
Add support for extracting links from tweets (including cards)
Both the t.co and the original URLs can be extracted. Note that card links are always t.co since Twitter's HTML does not include the original URL for those.
2019-05-16 16:42:52 +00:00
JustAnotherArchivist
46a51008f8
Fix Instagram signature calculation
2019-05-16 16:19:51 +00:00
JustAnotherArchivist
f91979eb32
Add --max-position option to twitter-search scraper as a workaround for pagination stopping early (#37)
The value needs to be of the format 'TWEET-<seenID>-<newestID>' where <seenID> is the last result that was returned by a previous scrape and <newestID> is the first result returned by the initial scrape.
2019-05-10 17:30:15 +00:00
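A small sketch of how a --max-position value in the format described above could be assembled. The helper name and the example IDs are hypothetical; only the 'TWEET-<seenID>-<newestID>' layout comes from the commit message.

```python
def build_max_position(seen_id: int, newest_id: int) -> str:
    """Build a 'TWEET-<seenID>-<newestID>' string.

    seen_id:   ID of the last result returned by the previous scrape.
    newest_id: ID of the first result returned by the initial scrape.
    """
    if seen_id <= 0 or newest_id <= 0:
        raise ValueError('tweet IDs must be positive integers')
    return f'TWEET-{seen_id}-{newest_id}'


# Made-up IDs, purely for illustration.
print(build_max_position(1000, 2000))
```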
JustAnotherArchivist
85fff319bc
Disable Twitter's spelling correction
src=typd means "this is what was typed in and could be incorrect". src=spxr is "no, I really mean that". src=sprv appears to be an alias of spxr that is no longer used.
2019-05-10 16:43:59 +00:00
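For illustration, a sketch of building a search query string that carries src=spxr. Only the `src` values come from the commit message; the `q` parameter name and the helper itself are assumptions, and no endpoint URL is implied.

```python
from urllib.parse import urlencode


def build_search_query(query: str, literal: bool = True) -> str:
    """Return a URL query string for a search.

    literal=True sends src=spxr ("I really mean that"); literal=False sends
    src=typd ("typed in, could be incorrect"). The 'q' key is an assumption.
    """
    src = 'spxr' if literal else 'typd'
    return urlencode({'q': query, 'src': src})
```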
JustAnotherArchivist
6b145526b7
Update README with new modules
2019-04-21 23:10:32 +02:00
JustAnotherArchivist
abf31764b1
Version 0.2.0
v0.2.0
2019-04-21 23:03:21 +02:00
JustAnotherArchivist
64693f74bb
Update Instagram query hash
2019-04-19 01:47:38 +02:00
JustAnotherArchivist
a7d08ed51c
Remove leftover debugging print
2019-04-19 01:40:29 +02:00
JustAnotherArchivist
f48ca7726e
Add support for Gab
2019-04-19 00:40:43 +02:00
JustAnotherArchivist
78c295f7e0
Add support for VKontakte (fixes #13)
2019-04-18 18:39:21 +02:00
JustAnotherArchivist
a5aca1a14f
Add support for Instagram hashtags (fixes #29)
2019-04-18 16:14:54 +02:00
JustAnotherArchivist
96f7d871c1
Ignore Scraper subclasses which don't set a name
2019-04-18 16:14:26 +02:00
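A sketch of the kind of subclass discovery this commit and the nested-inheritance commit suggest: walk the class tree recursively and skip classes whose name is unset. The class names and the `iter_named_scrapers` helper are hypothetical stand-ins, not snscrape's actual code.

```python
class Scraper:
    name = None  # concrete scrapers are expected to override this


class TwitterCommonScraper(Scraper):
    """Hypothetical shared base class; sets no name, so it should be ignored."""


class TwitterUserScraper(TwitterCommonScraper):
    name = 'twitter-user'


class TwitterSearchScraper(TwitterCommonScraper):
    name = 'twitter-search'


def iter_named_scrapers(base=Scraper):
    """Yield every direct or indirect subclass of base that has a name."""
    for cls in base.__subclasses__():
        if cls.name is not None:
            yield cls
        # Recurse so that subclasses of intermediate base classes are found too.
        yield from iter_named_scrapers(cls)
```

With this layout, the intermediate class can hold shared logic without showing up as a selectable scraper.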
JustAnotherArchivist
b5dfd37949
Support unix timestamps in --since
2019-04-18 16:01:35 +02:00
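One way a --since option could accept both dates and unix timestamps, sketched with argparse. The accepted date formats, the UTC assumption, and the `parse_since` name are illustrative choices rather than what snscrape actually does.

```python
import argparse
import datetime


def parse_since(value: str) -> datetime.datetime:
    """Parse a unix timestamp or a 'YYYY-MM-DD[ HH:MM:SS]' string as a UTC datetime."""
    if value.isdigit():
        # All digits: treat as seconds since the epoch.
        return datetime.datetime.fromtimestamp(int(value), tz=datetime.timezone.utc)
    for fmt in ('%Y-%m-%d %H:%M:%S', '%Y-%m-%d'):
        try:
            parsed = datetime.datetime.strptime(value, fmt)
        except ValueError:
            continue
        return parsed.replace(tzinfo=datetime.timezone.utc)
    raise argparse.ArgumentTypeError(f'unrecognised --since value: {value!r}')


parser = argparse.ArgumentParser()
parser.add_argument('--since', type=parse_since, help='only return results newer than this')
```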
JustAnotherArchivist
b511397791
Add --since option to return only results newer than a certain date (fixes #19)
2019-04-18 15:12:29 +02:00
JustAnotherArchivist
536fcb3303
Return proper items from scrapers including clean URLs (fixes #9 and #10)
2019-04-18 14:44:21 +02:00
JustAnotherArchivist
f8d812f799
Include permalink.php, events, and notes (fixes #32)
2019-04-18 04:22:47 +02:00
JustAnotherArchivist
c2cebd9166
Accept-Language header to get an English response unconditionally
2019-04-18 03:58:37 +02:00
JustAnotherArchivist
73bc99596f
Treat Twitter responses without a Content-Type header as invalid (fixes #21)
2019-04-18 02:24:35 +02:00
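A sketch of a response check along these lines: a response whose headers lack Content-Type is reported as invalid so the caller can retry. The function name and the (ok, message) return shape are assumptions.

```python
def check_response_headers(headers):
    """Return (True, None) if a Content-Type header is present, else (False, reason)."""
    # Header names are case-insensitive, so normalise the keys before looking one up.
    normalised = {key.lower(): value for key, value in headers.items()}
    if not normalised.get('content-type'):
        return False, 'missing Content-Type header'
    return True, None
```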
JustAnotherArchivist
8458c12218
Rewrite link extraction on Facebook (fixes #17)
Facebook's returned HTML has a large number of inconsistencies; some (most) pages include a <link rel="canonical" /> but some don't, for example. This was at the root of the failing post extraction for some Facebook pages (#17). The previous link extraction technique was also quite poor for other reasons though. The new method uses the relevant CSS classes instead. Despite probably being the result of a CSS minimiser or similar, these seem to be quite stable: they haven't changed in the past two years (but the more readable ones have!).
2019-04-18 02:14:21 +02:00
JustAnotherArchivist
b59c7e8d8f
Merge pull request #28 from peterk/master
Adds socks proxy support (via requests)
2019-03-11 13:32:07 +01:00
Peter Krantz
3ceb849d98
Adds socks proxy support (via requests)
2019-01-10 22:54:42 +01:00
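For context, requests accepts SOCKS proxy URLs through its `proxies` argument when installed with the `socks` extra. The helper below is a hypothetical sketch that only builds the mapping; the host and port are placeholders.

```python
def socks_proxies(host: str, port: int, remote_dns: bool = True) -> dict:
    """Build a proxies mapping suitable for requests' `proxies` argument.

    With remote_dns=True the 'socks5h' scheme is used, which asks the proxy
    to resolve hostnames instead of resolving them locally.
    """
    scheme = 'socks5h' if remote_dns else 'socks5'
    url = f'{scheme}://{host}:{port}'
    return {'http': url, 'https': url}


# Usage (requires `pip install requests[socks]`; placeholder address):
#   import requests
#   requests.get('https://example.org/', proxies=socks_proxies('127.0.0.1', 9050))
```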
JustAnotherArchivist
f5ee1f7ac5
Merge pull request #26 from ludios/avoid-twitter-bans
twitter: randomize user agent to avoid Twitter's (IP, UA)-keyed bans
2018-12-25 02:19:17 +01:00
Ivan Kozik
1984110f78
twitter: randomize user agent to avoid Twitter's (IP, UA)-keyed bans
2018-12-24 08:03:33 +00:00
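A sketch of user-agent randomisation: pick one string per scraper run so that a ban keyed on (IP, UA) does not carry over to the next run. The strings listed are placeholders and the helper name is hypothetical.

```python
import random

# Placeholder user agents for illustration only.
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0',
]


def random_headers(rng=random) -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': rng.choice(USER_AGENTS)}
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible for testing.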
JustAnotherArchivist
c5a5dcb92c
snscrape is now on PyPI
2018-10-09 17:26:03 +02:00
JustAnotherArchivist
cfb1c9a2aa
Version 0.1.3
v0.1.3
2018-10-01 03:26:22 +02:00
JustAnotherArchivist
d0d3c8b2a6
Better log output for temporary failures (fixes #2)
2018-10-01 03:24:29 +02:00
JustAnotherArchivist
4d0350e541
Disable "quality filter" on Twitter (fixes #3)
2018-10-01 02:51:33 +02:00
JustAnotherArchivist
d17aa15bcb
Version 0.1.2
v0.1.2
2018-09-11 12:44:07 +02:00
JustAnotherArchivist
d1ef280d6e
Fix snscrape.modules not getting installed
2018-09-11 12:43:10 +02:00
JustAnotherArchivist
2823272e0b
Version 0.1.1
v0.1.1
2018-09-11 12:30:35 +02:00
JustAnotherArchivist
540f557002
Fix typo in setup.py preventing installation
2018-09-11 12:30:21 +02:00
JustAnotherArchivist
5fc60fe978
Version 0.1
v0.1
2018-09-10 22:15:11 +02:00
JustAnotherArchivist
cf36e8be97
Add README, LICENSE, and metadata
2018-09-10 22:15:03 +02:00
JustAnotherArchivist
0350ab0692
Fix Facebook scraper returning strings instead of Items
2018-09-10 19:38:43 +02:00
JustAnotherArchivist
6b6ae3d33b
Rename from socialmediascraper to snscrape
2018-08-21 22:54:14 +02:00
JustAnotherArchivist
9fb3ac6013
Add support for Google+ user profiles
2018-08-21 18:58:43 +02:00
JustAnotherArchivist
897f5bebe6
Add support for POST requests
2018-08-21 18:58:09 +02:00
JustAnotherArchivist
e28a2cdb4b
Fix Instagram again
- __a=1 is no longer supported, so we need to extract the JSON from the HTML page instead.
- There is now a X-Instagram-GIS header that needs to be set correctly.
2018-08-21 18:55:40 +02:00
JustAnotherArchivist
5a084af85c
Fix Instagram
Instagram dropped the max_id parameter, so it is no longer possible to iterate over the posts so easily. Switch to GraphQL instead, which is what's used in the browser as well.
2018-08-21 18:50:00 +02:00
JustAnotherArchivist
14831d4137
Add support for Facebook user profiles
2018-08-21 18:48:34 +02:00
JustAnotherArchivist
6d54655a7f
Add support for Instagram user profiles
2018-08-21 18:47:44 +02:00
JustAnotherArchivist
3ab69a1a0f
Merge Twitter user and hashtag into one, and add support for generic Twitter search scrapes
2018-08-21 18:46:34 +02:00
JustAnotherArchivist
d03c82d413
Support nested inheritance from socialmediascraper.base.Scraper
2018-08-21 18:44:15 +02:00