* [instagram] Add support for GraphSidecar media types
Refactor _extract_postpage() to always return a list of medias.
Fetch common keywords and gracefully handle GraphSidecar media type
by extracting each single media and adding `sidecar_media_id' and
`sidecar_shortcode' keywords to indicate the parent of sidecar
childrens.
While here join the copyright comment lines in a single one.
Closes#178.
* [instagram] Use `yield from' instead of `for ... yield' (thanks @mikf)!
* [instagram] Adjust filename for GraphSidecar medias
Add a possible leading `media_id' of the sidecar for GraphSidecar
media.
Thanks to @mikf for the suggestion!
* [instagram] Add extra metadata for youtube-dl in GraphSidecar childrens
GraphSidecar children ytdl: URLs when consumed by youtube-dl
redirects to the URL of their parent. In GraphSidecar-s with
multiple GraphVideo-s this leads to downloading the same video
multiple times.
Add a `_ytdl_index' field to indicate the index of the youtube-dl
playlist corresponding the children of the sidecar.
This will be used by the `ytdl' downloader.
- use original image if available
- support video formats
- remove user info for ImageExtractor (it is no longer possible to get
image owner information for a single image)
An URL alone isn't good enough to distinguish between a gallery or a
gallery-listing, so the new extractor decides what to do based on the
page's content.
- Sometimes an ad interfered when trying to get a download URL
- Resolving "www.hentai-foundry.com" yields an invalid(?) IPv6 address
(2607:5300:60:ca9e:feed:dead:beef:1) and urllib3 only tries to connect
to the IPv4 variant after a rather long wait time
Instead of getting a complete 'filename' from an URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway)
Example: "https://example.org/path/filename.ext"
before:
- filename : filename.ext
- name : filename
- extension: ext
now:
- filename : filename
- extension: ext
This allows for stuff like "{extractor.url}" and "{extractor.category}"
in logging format strings.
Accessing 'extractor' and 'job' in any way will return "None" if those
fields aren't defined, i.e. in general logging messages.
Child extractors are now directly constructed with Extractor.from_url()
if the extractor class is known beforehand, instead of using
extractor.find() and searching through all possible extractor classes.
Instead of a strict list of (URL, RESULTS)-tuples, extractor result
tests can now be a single (URL, RESULTS)-tuple, if it's just one test,
and "only matching" tests can now be a simple string.
HTML structure for gallery pages changed quite a bit, so it is now using
the embedded JSON data. This changes a lot of metadata field names, but
'gallery_id', 'title', and 'user' are still provided for backwards
compatibility.
The internal API endpoint for user galleries also changed its data
structure, but nothing too major.
- allow instances to specify their own 'category'
- improve config lookup:
- first look into extractor.<category>.*
- and afterwards look into extractor.mastodon.<instance>.*
- add a default entry for pawoo.net in a way that actually works
- add an 'instance' keyword and turn 'tags' into a usable list
The former implementation would produce a complete list of all subalbums
for each (sub)album extraction. This would for example result in a
level 2 subalbum getting "extracted" twice: once through the root-album
(level 0) and once through its parent album on level 1.
In the current implementation only the next level of subalbums are
returned, which themselves will handle their next level in a recursive
fashion.
Extractors for Mastodon instances can now be dynamically generated,
based on the instance names in the 'extractor.mastodon.*' config path.
Example:
{
"extractor": {
"mastodon": {
"pawoo.net": { ... },
"mastodon.xyz": { ... },
"tabletop.social": { ... },
...
}
}
}
Each entry requires an 'access-token' value, which can be generated with
'gallery-dl oauth:mastodon:<instance URL>'.
An 'access-token' (as well as a 'client-id' and 'client-secret') for
pawoo.net is always available, but can be overwritten as necessary.
Using the same base-dict for each asset of a project causes unwanted
side effects like re-using image filename extensions for videos,
resulting in errors with the youtube-dl downloader.
... via HTTP Basic Auth with username and "password".
The password value in this case is not the account password itself,
but the"api_key" found in your user profile.
Hidden / dashboard-only blogs are pretty straightforward and "only"
require a valid 'access-token' and 'access-token-secret' for the given
'api-key' and 'api-secret', so that signed OAuth1.0 requests are possible.
Private / password protected blogs on the other hand are a bit
cumbersome. In addition to a valid 'access-token' and
'access-token-secret', they also require the account belonging to those
tokens to be a member of the blog itself. Knowing the password and
entering it in the website isn't enough to access a blog through the
API. Following a private blog is also impossible, so that option can't
work either.