Commit Graph

318 Commits

Author SHA1 Message Date
Tristan Lee
c18ca0f047 Merge branch 'master' into telegram-media 2022-05-09 09:21:40 -05:00
Tristan Lee
5648e957d0 improved consistency of code formatting and added _STYLE_MEDIA_URL_PATTERN as variable 2022-04-27 16:41:24 -05:00
Tristan Lee
21f7b620ec moved forward finding out of tgme_widget_message_text clause, since it wasn't correctly getting the forwarding information in forwarded posts that contained attachments but no text 2022-04-21 18:26:31 -05:00
Tristan Lee
9b3faec980 added additional attributes for hashtags and user mentions, removed redundant outlinks 2022-04-21 18:06:43 -05:00
Tristan Lee
97d38e5cde added additional termination criteria to Telegram scraper 2022-04-21 09:41:53 -05:00
Tristan Lee
b276c3cc27 fixed issue where some videos and photos weren't being scraped (because they weren't in a post containing a 'tgme_widget_message_text' div) 2022-04-17 06:50:43 -05:00
Tristan Lee
1e4e0c278d fixed issue where Telegram scraper terminated early because some pages didn't have a next page link (added reasonable default) 2022-04-17 04:33:22 -05:00
Tristan Lee
babcddda19 made Telegram scraper not return full channel info for forwarded_from attribute; fixed video edge cases. 2022-04-17 03:55:37 -05:00
Tristan Lee
f978954bb3 Merge branch 'JustAnotherArchivist:master' into master 2022-04-03 01:49:28 -05:00
Tristan Lee
2ce014ade4 fixed edge case for videos that have data-link-attr but no href attribute 2022-04-03 01:45:25 -05:00
JustAnotherArchivist
5d156c6a15 Detect and raise error on redirect from GraphQL endpoint to login
#165
2022-04-03 02:34:30 +00:00
Tristan Lee
4e59638e7c added a forwardedUrl attribute to TelegramPost and made forwarded attribute type Channel. 2022-03-30 21:33:03 -05:00
Tristan Lee
a7eb54d226 implemented Media dataclasses for Telegram, and added variable for extracting a post's view count 2022-03-30 21:07:17 -05:00
Tristan Lee
d32c9add8a added capability to scrape multiple videos from a single post 2022-03-30 18:13:15 -05:00
Tristan Lee
fb8d73ac95 handled case where channel has no profile image 2022-03-29 13:15:53 -05:00
Tristan Lee
ed829163a0 added capability to extract the number of channel members when the string in membersDiv has the word 'subscribers' rather than 'members'. 2022-03-29 01:12:07 -05:00
JustAnotherArchivist
694657ef80 Fix broken exception references 2022-03-09 01:01:47 +00:00
JustAnotherArchivist
1ab0f4fccb Fix missing quoted tweet reference in certain buggy cases 2022-03-07 22:16:58 +00:00
JustAnotherArchivist
3a92b5bf0d Add log message for guest token file deletion 2022-02-26 19:32:55 +00:00
JustAnotherArchivist
2480b173f4 Fix crash on race condition in CLI guest token manager resets
Fixes #414
2022-02-26 19:31:08 +00:00
Logan Williams
de4ebed81f Fix KeyError caused by retweets without URLs in TwitterProfileScraper 2022-02-24 18:08:12 +01:00
Logan Williams
72b26f2373 Scrape images, video, and post forwarding information for Telegram channel posts 2022-02-24 15:31:02 +01:00
JustAnotherArchivist
77bbb9f61f Remove useless pass 2022-02-20 18:54:51 +00:00
JustAnotherArchivist
57a624c618 Merge pull request #410 from AccentuSoft/master
Fix Vkontakte-user module crash on users with millions of followers
2022-02-18 06:01:35 +00:00
AccentuSoft
b1cfd51121 Implementing changes 2022-02-17 21:52:15 +02:00
AccentuSoft
ace2c16f54 Fix Vkontakte-user module crash on users with millions of followers 2022-02-17 15:42:46 +02:00
JustAnotherArchivist
2f9c0457df Convert t.co card URLs to unshortened when possible 2022-02-17 01:50:15 +00:00
JustAnotherArchivist
878f2a3c7a Handle cards without descriptions and thumbnails
Fixes #407
2022-02-17 01:49:32 +00:00
JustAnotherArchivist
25ee014e29 Extract cards 2022-02-16 02:59:21 +00:00
JustAnotherArchivist
a192dc6236 Handle TweetWithVisibilityResults
Fixes #400
2022-02-14 18:08:59 +00:00
JustAnotherArchivist
a7242f340b Remove obsolete TODO
There is no retweetedTweetRef in Twitter's JS.
2022-02-14 18:08:29 +00:00
JustAnotherArchivist
359cc25cdf Fix crash on entity attribute when scraping suspended users
Fixes #396
2022-02-10 04:22:59 +00:00
JustAnotherArchivist
01799a7391 Detect when CLI guest token from file has expired 2022-02-08 19:38:45 +00:00
JustAnotherArchivist
b0753c34ed Fix forgotten method name changes in 7d939c11
Fixes #393
2022-02-08 15:35:49 +00:00
JustAnotherArchivist
7f78fa0bc0 Recurse through all tweets encountered, not only ones with a positive replyCount
Fixes #266
2022-02-07 18:13:56 +00:00
JustAnotherArchivist
8702a9c7e2 Add Reddit submission scraper
Closes #312
2022-02-07 04:43:54 +00:00
JustAnotherArchivist
8ac1fd3ea8 Refactor Pushshift code to separate the general things from the search 2022-02-07 04:43:19 +00:00
JustAnotherArchivist
9235890f9a Fix KeyError crash on attempting to scrape inexistent tweet ID 2022-02-07 04:04:21 +00:00
JustAnotherArchivist
7d939c110c Port profile and tweet scrapers to GraphQL API
Fixes #367
2022-02-07 03:49:14 +00:00
JustAnotherArchivist
8e95e9a9a7 Fix crash on places without a bounding box
Fixes #374
2022-02-07 00:38:22 +00:00
JustAnotherArchivist
aa7d7d3dc3 Refactor automatic importing in snscrape.modules to something less hacky
Cf. #357
2022-02-05 03:22:55 +00:00
JustAnotherArchivist
560c78c5cf Make all optional scraper arguments keyword-only and fix Mastodon argument style to conform with the other scrapers
Cf. #376
2022-01-30 00:21:18 +00:00
JustAnotherArchivist
107c3c71c2 Remove unnecessary f-strings
Cf. #370
2022-01-28 21:22:13 +00:00
JustAnotherArchivist
7f88678253 Merge pull request #359 from own3dh2so4/master
Added proxy option to Scraper base
2022-01-13 23:08:28 +00:00
David Garcia Alvarez
52e4f9fb69 Added proxy option to Scraper base 2022-01-13 16:56:00 +01:00
JustAnotherArchivist
eebdfc1c55 Refactor username vs ID mess
Closes #354
2022-01-12 22:36:26 +00:00
JustAnotherArchivist
e6076353c8 Fix user ID being a string instead of an int on the entity 2022-01-12 22:35:50 +00:00
JustAnotherArchivist
a32d79fab2 Fix crash on certain mblogs that lack the raw_text attribute 2022-01-12 22:31:49 +00:00
JustAnotherArchivist
65391297f6 Move CLI methods to end of class definition for consistent code style 2022-01-12 21:09:38 +00:00
JustAnotherArchivist
deb2659dd6 Prefix CLI-related methods with an underscore
Closes #355
2022-01-12 21:07:10 +00:00