Some pages link to a large number of articles, like this one. Looking for an easy way to convert all those articles into a single book readable on an e-ink reader like the Kindle, I first tried Calibre and its News feature. Just adding that HTML page as the source did not work: the created ebook contained only that page and nothing else.
Next, tried creating an RSS feed with fivefilters and giving that feed to Calibre. The feed was limited to 5 links, so not sure how it would work with the hundreds of old articles.
Then tried HTTrack to download the pages, but it was taking a long time because it also downloaded CSS, images, etc., and we were not really interested in images.
So, used DownloadThemAll, choosing to download only the *.htm and *.html files linked from the index page, and created a table of contents HTML page as noted in the FAQ. But Calibre complained of broken image links and stopped the conversion.
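To check which files a filter like "only *.htm and *.html linked from the index page" would pick up, the link targets can be listed from a saved copy of the index page. This is only a rough sketch: the function name `list_html_links` is made up for illustration, and it assumes links are written as lowercase `href="..."` with double quotes. It does not download anything.

```shell
# list_html_links FILE
# Print the unique link targets ending in .htm or .html found in FILE.
# Assumption: links look like href="..." (lowercase, double-quoted).
list_html_links() {
    grep -o 'href="[^"]*\.html\{0,1\}"' "$1" |
        sed -e 's/^href="//' -e 's/"$//' |
        sort -u
}
```

Calling `list_html_links index.html` on a saved index page (the file name is a placeholder) prints one link per line, which makes it easy to see how many articles there are before downloading.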
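Finally, converted the downloaded HTML files to plain text with HtmlAsText and combined the resulting text files with
cat *.txt > combinedfile.txt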
and converted combinedfile.txt with Calibre. This created a combined ebook, but with ugly hard line breaks every other line. Then enabled heuristic processing with all options and converted again. This time, got good results. With heuristic processing, the conversion takes longer: around 5-6 minutes for a 3 MB text file.
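The hard line breaks can also be removed before conversion. The sketch below is not what Calibre's heuristics do internally; it is a simple awk stand-in that joins the wrapped lines of each paragraph, assuming paragraphs are separated by truly empty lines. The function name `unwrap_paragraphs` is made up for illustration.

```shell
# unwrap_paragraphs [FILE...]
# Join the hard-wrapped lines of each paragraph into a single line.
# Assumption: paragraphs are separated by one or more empty lines.
unwrap_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" }
         { gsub(/[ \t]*\n[ \t]*/, " "); print }' "$@"
}
```

For example, `unwrap_paragraphs combinedfile.txt > unwrapped.txt` writes a copy with one line per paragraph. The heuristics route should also be available from the command line as `ebook-convert combinedfile.txt book.mobi --enable-heuristics`, though the conversion described here was not necessarily run that way.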
Tools used in the end:
- DownloadThemAll
- HtmlAsText
- cat
- Calibre with heuristics enabled
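One caveat with the plain `cat` step: if a text file does not end with a newline, its last line gets glued to the first line of the next article. A small variation, sketched here with a made-up function name `combine_texts`, puts at least one blank line between articles.

```shell
# combine_texts FILE...
# Concatenate the given files, leaving at least one blank line
# after each, so that articles never run into each other.
combine_texts() {
    for f in "$@"; do
        cat "$f"
        printf '\n\n'
    done
}
```

Usage would be `combine_texts *.txt > combinedfile.txt`. If combinedfile.txt is left over from an earlier run, the `*.txt` glob will match it too, so it is safer to delete it first or write the output to another directory.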