How to download 800gb of Manga and put it all onto a MicroSD card

Note: If you are not comfortable with tech / Python, please do not follow this guide or ask me for help. I am too busy immersing to teach someone to use Python, but if you are experienced you can ask :)

Why do this?

I work in tech and love tech.

When you love tech, sometimes you make something just because you can. There is no real reason to do this, but perhaps, if you are not familiar with having fun, there are some reasons:

  1. You may be in the military and need to go 1+ years without internet.
  2. You are in a country with good internet and have to go back to a country with slow internet.
  3. You work as a researcher in Antarctica and only have access to the internet for a few minutes a day, and most of that time will be needed for downloading research datasets.
  4. You plan to travel Japan for a long time and will have limited internet.

Downloading the catalogue

💡
Only do this if it's legal in your country.

Many people say to use wget.

Partly right, but it's quite slow since it's not multi-threaded, and there's a bunch of absolute shit in the catalogue that will take 500 years to download.

Here's a rough script:

import subprocess
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def download_directory_list(url):
    # we fetch the listing one directory at a time, so different threads of wget don't end up downloading the same files
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        anchor_tags = soup.find_all('a', href=True)
        return anchor_tags
    else:
        print(f"Failed to retrieve directory list from {url}")
        return []

def download_files_in_directory(directory_url):
    try:
        subprocess.check_call(["wget", "-r", "-nc", "--no-parent", "--reject", "db", "--reject-regex", ".*&quot.*", directory_url])
        print(f"Download successful from {directory_url}")
    # if a directory fails to download, that's a bug; log it so we can work out why and not end up with broken manga
    except subprocess.CalledProcessError as e:
        print(f"Error downloading from {directory_url}: {e}")

if __name__ == "__main__":
    base_url = "https://mokuro.moe/manga/"
    anchor_tags = download_directory_list(base_url)
    # the first link in the listing is the parent directory, so drop it
    anchor_tags.pop(0)

    if anchor_tags:
        # ThreadPoolExecutor's default worker count is based on the number of CPU cores; pass max_workers if you want it to go faster
        with ThreadPoolExecutor() as executor:
            executor.map(download_files_in_directory, [urljoin(base_url, tag.get('href')) for tag in anchor_tags])


💡
This script took me maybe a day to download everything. It was surprisingly fast.

There's a massive db file that we want to ignore, and a bunch of broken &quot files too.

Turns out, that's not the only thing we want to ignore.

I learnt this the hard way...

Please edit the above script with the information you'll find if you read on.

Deleting stuff we don't need

  • There is an /audio directory. We want manga, not audio. Delete this.
  • Delete android.db.
  • Some of the manga has been processed with a literal potato and mokuro reader barely works on it. Some of it is such low quality that you will actually have to go back to nyaa to find the originals and reprocess them. Do not expect this catalogue to be perfect.
  • There's a folder z_bad_formatting of what appears to be badly formatted manga. At least they admit to it. Delete this too (a quick deletion sketch follows this list).
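
If you want to script those deletions, here's a minimal sketch. It assumes the catalogue was mirrored to the same path used in the find command further down, and that the audio directory, z_bad_formatting folder and android.db all sit directly under that root; adjust the paths to match your download.

import shutil
from pathlib import Path

# assumed download root, taken from the find command below; adjust to wherever your mirror lives
ROOT = Path("/mnt/d/Manga/mokuro.moe/manga")

# directories we don't want at all
for junk_dir in ("audio", "z_bad_formatting"):
    shutil.rmtree(ROOT / junk_dir, ignore_errors=True)

# the stray database file
(ROOT / "android.db").unlink(missing_ok=True)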

Now, let's talk about the real crap in there.

There are 1.7 million files which do not contribute to the manga (at least in my download). These are mostly json files.

This 800gb catalogue balloons to about 2tb if you include the json files. But... not really? Basically, size on disk != size.

Imagine paper. You want to print the word "hello". You print it on a whole bit of paper.

You just used a whole bit of paper to print "hello". Now, whether or not you fill that paper up with more data is up to you. But most of the time, it will just store 1 item.

Some printers let you use different size paper, so you can write "hello" on a card and save space. Others use massive bits of paper.

My SD card used big bits of paper. This meant that a very small, mostly empty file would take up a whole bit of paper.

So although in theory the catalogue is 800gb, in reality it would be more like 2tb, depending on how much paper your storage device uses.
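
In filesystem terms, the "paper size" is the allocation unit (cluster) size: every file takes up a whole number of clusters, so a tiny json file still costs a full cluster. Here's a rough sketch for estimating the gap, assuming a hypothetical 128 KiB cluster size (check what your SD card is actually formatted with):

import os

CLUSTER = 128 * 1024  # assumed allocation unit size in bytes; adjust to your card

def logical_vs_on_disk(root):
    logical = on_disk = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            logical += size
            # every file gets rounded up to a whole number of clusters
            on_disk += -(-size // CLUSTER) * CLUSTER
    return logical, on_disk

logical, on_disk = logical_vs_on_disk("/mnt/d/Manga/mokuro.moe/manga")
print(f"size: {logical / 1e9:.0f} GB, estimated size on disk: {on_disk / 1e9:.0f} GB")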

So we want to delete all the files not useful to us.

Here's a list of all file types in the Mokuro catalogue.

txt
jpg
html
mokuro
JPG
PNG
png
zip
jpeg
avif
url
bat
URL
json
gif
csv
info
ini
nomedia
webp
bmp
rar
VIX
ico
py
sha256
dat
db
torrent
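
If you want to reproduce this list from your own copy, here's a quick sketch (the path is the same assumed download root as in the delete command further down):

from collections import Counter
from pathlib import Path

root = Path("/mnt/d/Manga/mokuro.moe/manga")  # assumed download root
counts = Counter(p.suffix.lstrip(".") for p in root.rglob("*") if p.is_file())
for ext, count in counts.most_common():
    print(ext or "(no extension)", count)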

You can safely delete (or choose to ignore when downloading):

  • txt
  • html (mokuro reader uses the .mokuro files, not the .html files. if you use migaku keep .html since mokuro reader is broken for migaku)
  • url
  • bat (throwback to the 90s with this file extension)
  • URL
  • json
  • gif (it is an image.... but mokuro does not work with gifs. only png / jpg / jpeg / webp https://github.com/kha-white/mokuro/blob/ad8af0e374361c1d56f50cc24af5cd6f1dba9328/mokuro/volume.py#L109)
  • csv
  • info
  • bmp
  • rar (mokuro reader only uses zip folders, not rar)
  • nomedia
  • ini
  • db - all the local yomitan audio / android.db stuff.
  • vix
  • ico
  • py
  • sha256
  • dat (I presume half-downloaded files)
  • torrent (these do not have seeders. I checked loads of them)

In total we have 3.1 million files.

I deleted these extensions with the command below (if you are still downloading, just exclude these files from your download instead):

find /mnt/d/Manga/mokuro.moe/manga -type f \( \
-name "*.txt" -o \
-name "*.url" -o \
-name "*.bat" -o \
-name "*.URL" -o \
-name "*.json" -o \
-name "*.gif" -o \
-name "*.csv" -o \
-name "*.info" -o \
-name "*.bmp" -o \
-name "*.rar" -o \
-name "*.nomedia" -o \
-name "*.ini" -o \
-name "*.db" -o \
-name "*.vix" -o \
-name "*.ico" -o \
-name "*.py" -o \
-name "*.sha256" -o \
-name "*.dat" -o \
-name "*.torrent" \
\) -delete
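
If you would rather never download the junk in the first place, wget's --reject flag takes a comma-separated list of suffixes, so the download_files_in_directory function in the earlier script could be extended along these lines (a sketch built from the extension list above):

import subprocess

# junk suffixes from the list above; --reject patterns are case-sensitive, hence both url and URL
REJECT = ",".join([
    "txt", "url", "bat", "URL", "json", "gif", "csv", "info", "bmp", "rar",
    "nomedia", "ini", "db", "vix", "ico", "py", "sha256", "dat", "torrent",
])

def download_files_in_directory(directory_url):
    subprocess.check_call([
        "wget", "-r", "-nc", "--no-parent",
        "--reject", REJECT,             # skip the junk extensions entirely
        "--reject-regex", ".*&quot.*",  # skip the broken &quot files
        directory_url,
    ])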

We saved 20gb of data by deleting 1.7 million files.

BUT! Now our size on disk == size (very roughly; there is a 5gb difference). So by deleting 20gb of files, we saved about 1.2tb by my calculations (I can't tell for sure because I ran out of storage space before the manga was completely transferred, so I'm just guessing).
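
As a sanity check on those numbers: every deleted file frees at least one allocation unit, so with a hypothetical 128 KiB cluster, 1.7 million files account for at least 1.7M × 128 KiB ≈ 220 GB of pure overhead, before counting the rounding-up on every remaining image file.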

I then used Teracopy to transfer them all onto this micro sd card:

SanDisk 1.5TB Ultra microSDXC card + SD adapter (up to 150 MB/s, A1 App Performance, UHS-I, Class 10, U1)

I told Teracopy not to validate the files or run checksums. This made it faster.

If I paused the transfer, it looked like size on disk started to spiral again. I am not sure why; maybe some low-level sector curse gets placed on your SD card if you pause it. So I ended up deleting everything and making the transfer one last time, except this time I never paused it.

After 1 day and 18 hours, it's finally done 🥳