133 Commits

Author SHA1 Message Date
JustAnotherArchivist
07d446fd19 Fix crash in VK scraper 2020-09-10 21:05:03 +00:00
JustAnotherArchivist
a25426043b Fix Telegram username canonicalisation 2020-09-09 09:33:57 +00:00
JustAnotherArchivist
84692846b9 Fix crash in Telegram scraper 2020-09-09 09:22:00 +00:00
JustAnotherArchivist
039b2c6719 Restructure Twitter classes since the 'common' scraper is only used for the old design anymore 2020-09-07 02:38:27 +00:00
JustAnotherArchivist
70a3d9ba3a Fix infinite loop at the end of profile pages 2020-09-01 04:01:27 +00:00
JustAnotherArchivist
bd619bf4e9 Log and ignore tweets which are not contained in the globalObjects
Fixes #61
2020-09-01 03:45:23 +00:00
JustAnotherArchivist
072519f539 Fix pagination on profile pages 2020-09-01 03:23:45 +00:00
JustAnotherArchivist
d9572ec450 Correctly serialise nested NamedTuples 2020-09-01 03:16:25 +00:00
JustAnotherArchivist
ba250aabf2 Extract retweeted tweet if present 2020-09-01 03:15:21 +00:00
JustAnotherArchivist
0cc4f0c016 Add support for Twitter profile pages
Closes #5
2020-09-01 03:13:49 +00:00
JustAnotherArchivist
1a2e367a87 Cache entities 2020-09-01 02:34:21 +00:00
JustAnotherArchivist
4f24843f89 Extract user ID 2020-09-01 02:26:13 +00:00
JustAnotherArchivist
bfb92a47b9 Move Tweet object generation to TwitterAPIScraper 2020-09-01 02:25:00 +00:00
JustAnotherArchivist
dc5d55004b Refactor API interaction into something cleaner and more reusable 2020-09-01 01:56:07 +00:00
JustAnotherArchivist
d8e7f96d4d Add support for Reddit
Closes #15
2020-08-31 03:38:20 +00:00
JustAnotherArchivist
bb83d1d72f Validate Twitter usernames
Closes #55
2020-08-24 19:03:52 +00:00
JustAnotherArchivist
1480260e47 Handle Telegram channels without public posts 2020-08-24 17:54:30 +00:00
JustAnotherArchivist
c8d688d39f Fix crash on Telegram pages without a description 2020-08-24 17:53:50 +00:00
JustAnotherArchivist
9df4352089 Fix crash on VK pages without an info div 2020-08-24 17:42:33 +00:00
JustAnotherArchivist
dd25fd0526 Add support for extracting the entity behind a scrape
Closes #11

Backwards incompatibility: snscrape.modules.twitter.Account is now called User. However, this was previously only used on the list member scraper, which has been broken for a while since the list member list is no longer publicly accessible.

For compatibility reasons, the CLI does not output the entity by default; the new option --with-entity enables it.
2020-08-24 01:38:27 +00:00
JustAnotherArchivist
c90fd54b6b Make datetime.date serialisable 2020-08-24 01:12:38 +00:00
JustAnotherArchivist
9528df48cd Refactor base URL handling 2020-08-24 01:12:06 +00:00
JustAnotherArchivist
924c35f883 Refactor guest token extraction code 2020-08-22 22:59:43 +00:00
JustAnotherArchivist
588ec415ff Force TwitterThreadScraper to fetch the old design (take 2) 2020-08-12 17:19:42 +00:00
JustAnotherArchivist
bf229414ba Add JSONL output format 2020-08-12 15:09:02 +00:00
JustAnotherArchivist
afa819547d Update README 2020-08-11 22:18:04 +00:00
JustAnotherArchivist
dbcdc159ef Add support for scraping Facebook page visitor posts aka 'Community'
Closes #18
2020-08-11 22:14:27 +00:00
JustAnotherArchivist
30f945897a Clean Facebook group post URLs
Most of the time, the URLs are already clean, but occasionally, Facebook includes tracking parameters (__xts__[0], __tn__)...
2020-08-11 20:48:14 +00:00
JustAnotherArchivist
eee5794ff9 Extract Facebook group post in chronological order (instead of by last comment)
Fixes #66
2020-08-11 20:47:42 +00:00
JustAnotherArchivist
966a6ebd8e Skip promoted tweets/ads
Fixes #67
2020-08-11 20:28:35 +00:00
JustAnotherArchivist
4d3d0fe0d7 Update search API parameter values to the ones currently used on Twitter 2020-08-11 20:26:56 +00:00
JustAnotherArchivist
7b967ff82a Twitter reverted their guest token change (90f9598e)
v0.3.4
2020-07-08 22:07:18 +00:00
JustAnotherArchivist
90f9598ecc Adjust to Twitter's new method of handing out guest tokens
Fixes #64
v0.3.3
2020-06-24 21:22:58 +00:00
JustAnotherArchivist
7b3c7deb28 Catch login redirects on Instagram
v0.3.2
2020-05-30 00:56:34 +00:00
JustAnotherArchivist
040a11656c Update README 2020-05-30 00:53:52 +00:00
JustAnotherArchivist
1459245258 Consistently raise ScraperException on fatal errors 2020-05-30 00:53:49 +00:00
JustAnotherArchivist
dbe4c5ce55 Remove Google+ module
Google+ was mostly shut down in early 2019. What remained (Google+ for G Suite) was renamed to Google Currents and is for internal communication only (and therefore out of scope for snscrape).
2020-05-30 00:35:06 +00:00
JustAnotherArchivist
80491ecc2c Remove Gab module
Since Gab's move to a fork of Mastodon in July 2019, the module had been broken, and a new module would better be written from scratch as the platform changed entirely.
2020-05-30 00:23:33 +00:00
JustAnotherArchivist
1a71b58101 Add support for Telegram
Closes #50
2020-05-29 23:44:01 +00:00
JustAnotherArchivist
0ce37a69d4 Log exception details on crashes 2020-05-29 22:29:23 +00:00
JustAnotherArchivist
722bfd5f7c Handle Twitter tombstones
Fixes #63
2020-05-29 22:12:37 +00:00
JustAnotherArchivist
b6cc3180d9 Force TwitterThreadScraper and TwitterListMembersScraper to fetch the old design
v0.3.1
2020-03-04 00:40:49 +00:00
JustAnotherArchivist
613395d1c2 Port TwitterSearchScraper to redesign
Fixes #57
2020-03-04 00:40:49 +00:00
JustAnotherArchivist
82a87b7b5a Merge pull request #53 from JackDallas/add-more-insta-fields
Add more fields to the instagram scraper
2020-02-09 23:48:59 +00:00
Jack Dallas
9568028bf9 Update changed fields 2020-02-07 11:30:16 +00:00
JustAnotherArchivist
6df351772e Fix crash in Facebook scraper on link-less entries 2020-02-05 16:15:10 +00:00
JustAnotherArchivist
541173b0c8 Merge pull request #54 from jodizzle/fix/vkontakte-user
Fix vkontakte-user: pagination returns JSON now, and handle some unscrapable profiles.
2020-02-05 14:56:12 +00:00
Jody Leonard
b6772d3778 vkontakte-user: Handle additional un-scrapeable profile case 2019-10-31 16:01:29 -04:00
Jody Leonard
20ea117a2c Fix vkontakte-user pagination 2019-10-30 22:29:49 -04:00
JackDallas
ff54c350bc Add more fields to the instagram scraper 2019-08-30 12:43:02 +01:00