355 Commits

Author SHA1 Message Date
JustAnotherArchivist
0336ce13ed Add support for fetching a guest token from the API 2021-12-23 04:26:50 +00:00
JustAnotherArchivist
193d4f80d6 Fix user agent in API headers staying constant 2021-12-23 04:25:23 +00:00
JustAnotherArchivist
e7d35ec1eb Fix date parsing on quoted posts 2021-12-15 16:55:14 +00:00
JustAnotherArchivist
8540045658 Fix typo 2021-12-15 16:36:28 +00:00
JustAnotherArchivist
1f1c1bd8af Fix docstring style 2021-12-14 20:05:51 +00:00
JustAnotherArchivist
7fdc8bcb53 Randomise user agent when the guest token can't be found 2021-12-14 20:04:46 +00:00
JustAnotherArchivist
4b3c6aefe7 Add default values to user and tweet scrapers for a more intuitive usage 2021-12-12 04:57:16 +00:00
JustAnotherArchivist
525cd71225 Retry guest token retrieval
Fixes #325 (hopefully)
2021-12-12 00:10:59 +00:00
JustAnotherArchivist
72abff9e5c Reuse guest tokens across scrapes
Cf. #326
2021-12-11 23:18:42 +00:00
JustAnotherArchivist
bcaa477b3d Update list of scrapers 2021-12-08 08:29:02 +00:00
JustAnotherArchivist
66d4c99f82 Remove dev version notice 2021-12-08 08:25:21 +00:00
JustAnotherArchivist
0ac50f1383 Add README to package metadata 2021-12-08 08:18:25 +00:00
JustAnotherArchivist
c2257ad16e Add Python 3.10 classifier 2021-12-08 08:15:05 +00:00
JustAnotherArchivist
58f654405f Add --citation
Closes #229
2021-12-08 07:51:28 +00:00
JustAnotherArchivist
35fb61a327 Fix crash on dumping scopes which have a variable pointing to a dataclass 2021-11-24 03:39:06 +00:00
JustAnotherArchivist
a6b6f3faaa Throw an error on empty arguments
Fixes #290
2021-10-10 17:43:27 +00:00
JustAnotherArchivist
5e829e2541 Refactor class instantiation to remove the need to repeat 'retries' everywhere 2021-09-30 09:58:10 +00:00
JustAnotherArchivist
d4567da23c Improve list of scrapers on --help output
Don't list all scrapers in the usage line, and provide a sorted readable list instead.
2021-09-30 09:35:17 +00:00
JustAnotherArchivist
e5e0da25a0 Remove unused imports 2021-09-30 09:24:18 +00:00
JustAnotherArchivist
821326bcfb Fix a few f-strings 2021-09-30 09:23:56 +00:00
JustAnotherArchivist
4bf9ef239c Restructure usage section 2021-09-30 09:18:43 +00:00
JustAnotherArchivist
e382891642 Fix Twitter trends not having a str representation 2021-09-21 21:40:50 +00:00
JustAnotherArchivist
e5f4389464 Add Twitter trend scraper
Due to restrictions on Twitter's side, it is not possible to get trends from a custom location as that would require using an account and/or their API.

Closes #206
2021-09-21 21:28:41 +00:00
JustAnotherArchivist
d91f971f51 Refactor user label implementation and add support for bot accounts
Closes #281
2021-09-21 19:39:40 +00:00
JustAnotherArchivist
67e8295293 Merge pull request #280 from edsu/master
User Labels
2021-09-19 03:35:49 +00:00
JustAnotherArchivist
5fc2562642 Add user label support on entity retrieval 2021-09-19 03:32:35 +00:00
JustAnotherArchivist
2825bd0a73 Remove accidental empty line 2021-09-19 03:31:56 +00:00
Ed Summers
9831f2a4a0 missing ext
While doing some long term data collection I found some user objects
that lack the key 'ext'. This would cause an exception unless it's
checked for before trying to dig out results.
2021-09-16 13:31:47 -04:00
Ed Summers
a11eef6b06 User label url
Each label also has a URL which is used for learning more about the
label. While there are more label descriptions than label URLs the URLs
do seem to group language variants of the same label. For example
https://help.twitter.com/rules-and-policies/state-affiliated-china is
used for all of the following label descriptions:

* Média affilié à un État, Chine
* China state-affiliated media
* 中国官方媒体
* Çin devletine bağlı medya
* China government official

In some analysis contexts it could be useful to group these together.
2021-09-16 13:04:57 -04:00
Ed Summers
3fb731ade1 User Labels
In August of 2020 Twitter started to label the accounts of government
officials and state-affiliated media entities:

https://blog.twitter.com/en_us/topics/product/2020/new-labels-for-government-and-state-affiliated-media-accounts

This information is extremely important for researchers who are studying
the impact of social media on political discourse, especially because it is not
currently available through either Twitter's v1.1 or v2 API endpoints.

The code in this small PR may seem a bit brittle but I've been using it
to collect data with each of the twitter subcommands and it seems to
work reliably. While there are image and page URLs associated with each
label I chose to only collect the text description of the label since it
should be sufficient for finding the additional information later if
needed.
2021-09-16 08:06:05 -04:00
JustAnotherArchivist
c76f1637ce Handle 403s from Twitter search
Closes #269
2021-08-30 23:29:20 +00:00
JustAnotherArchivist
ed117e8891 Log response status code and redirects 2021-08-29 18:26:00 +00:00
JustAnotherArchivist
f9a3fafb3f Fix --cursor on twitter-search 2021-08-01 20:59:16 +00:00
JustAnotherArchivist
660b8c7a0a Retry empty result sets from Twitter as a workaround for random early stops
#37
2021-07-18 23:59:52 +00:00
JustAnotherArchivist
0c22608dc7 Extract video view count
Also fix the broken ext values sent to Twitter

Closes #246
2021-07-01 17:58:45 +00:00
JustAnotherArchivist
2bb706feda Dump request and response attributes of RequestExceptions
Cf. #243
2021-06-30 21:44:02 +00:00
JustAnotherArchivist
5e6bc4ec50 Fix type of content field (may be None on text-less posts) 2021-05-27 00:33:12 +00:00
JustAnotherArchivist
57d0aaafc1 Remove dirtyUrl which does not appear to be used anymore by Instagram
#234
2021-05-27 00:32:03 +00:00
JustAnotherArchivist
157e4d4265 Fix default value of username field
#234
2021-05-27 00:29:33 +00:00
JustAnotherArchivist
54588e9c42 Add support for fetching top instead of live/chronological tweets
Closes #109
2021-05-23 03:24:30 +00:00
JustAnotherArchivist
9e7274f3d7 Clean up params dict construction 2021-05-23 03:24:11 +00:00
JustAnotherArchivist
ac4e335bdb Clean up duplicated default values 2021-05-23 03:03:32 +00:00
JustAnotherArchivist
1d255de48d Add hashtags and cashtags 2021-05-23 02:51:38 +00:00
JustAnotherArchivist
9c1dcd37f9 Add Tweet.{inReplyToTweetId,inReplyToUser}
This makes User.displayname optional because the replied-to user is not always present in the user mentions.
2021-05-23 02:44:40 +00:00
JustAnotherArchivist
f8dac183d0 Fix type of User.id 2021-05-23 02:43:53 +00:00
JustAnotherArchivist
45d1fa27de Add twitter-tweet scraper for retrieving tweets by ID, including scroll and recursion modes
Closes #51, closes #137
2021-05-23 02:12:13 +00:00
JustAnotherArchivist
98b798b0e5 Remove obsolete twitter-thread scraper
It was still based on the old, deprecated Twitter UI and broke a long time ago.

Closes #176
2021-05-22 22:37:21 +00:00
JustAnotherArchivist
f18b64e7da Add support for scraping Twitter users by ID
Closes #222
2021-05-22 21:17:14 +00:00
JustAnotherArchivist
460be9d581 Add _type attribute on all JSON objects, remove separate attribute on Twitter media 2021-05-22 18:14:54 +00:00
JustAnotherArchivist
97c8caea48 Set Accept-Language header on API requests to English 2021-04-20 01:50:14 +00:00