Friday, August 13, 2021

making a book readable on the Kindle from a set of articles

Some pages link to a large number of articles, like this one. I was looking for an easy way to convert all those articles into a single book readable on an e-ink reader like the Kindle, so my first trial was with Calibre and its News feature. Just adding that HTML page as the source did not work: the created ebook contained only that page and nothing else.

Next, I tried creating an RSS feed with fivefilters and giving that feed to Calibre. This was limited to 5 links, so I am not sure how it would work with the hundreds of old articles.

Then I tried HTTrack to download the pages, but it was taking a long time because it also downloaded CSS, images, etc., and we were not really interested in images.

So I used DownloadThemAll, choosing to download only the *.htm and *.html files linked from the index page, and created a table of contents HTML page as noted in the FAQ. But Calibre complained of broken image links and stopped the conversion.
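
For anyone who would rather script this step, here is a minimal Python sketch of the same idea: pick out only the *.htm and *.html links from an index page and write them into a bare table of contents page. It is not the method used above, only an illustration with the standard library. The names LinkCollector and build_toc and the sample file names are made up, and it assumes the linked pages have already been saved alongside the index.

```python
from html import escape
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for links ending in .htm or .html."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Ignore fragments and query strings when checking the extension.
            path = href.split("#")[0].split("?")[0]
            if path.lower().endswith((".htm", ".html")):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalise whitespace; fall back to the href if the link has no text.
            title = " ".join("".join(self._text).split()) or self._href
            self.links.append((self._href, title))
            self._href = None


def build_toc(index_html, title="Table of Contents"):
    """Return a minimal HTML page listing every article link in index_html."""
    collector = LinkCollector()
    collector.feed(index_html)
    items = "\n".join(
        f'<li><a href="{escape(href, quote=True)}">{escape(text)}</a></li>'
        for href, text in collector.links
    )
    return (
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body><h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul>\n</body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical index page: two articles and one image link.
    sample = '<a href="a1.html">First</a> <a href="logo.png">Logo</a> <a href="a2.htm">Second</a>'
    print(build_toc(sample))
```

Since only the article links are copied over, the resulting page contains no image references of its own. The articles themselves may still reference images, so this alone would not necessarily get past the broken image link complaint.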

Then I tried converting all the HTML to text using html2text. That did not work; maybe my command-line piping was incorrect? Next, I tried converting all the HTML to text using HtmlAsText, running it under Wine as
wine HtmlAsText.exe
This worked without a hitch. Then I made another table of contents HTML file, with every HTML file in the table of contents replaced by the corresponding txt file. But that conversion too ended up with only the table of contents being added to the ebook.
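
The HTML-to-text step can also be sketched with nothing but Python's standard library. The following is not what HtmlAsText or html2text do internally, just a rough illustration: keep the visible text, drop scripts and styles, and start a new line at common block-level tags. TextExtractor and html_to_text are hypothetical names, and the list of block tags is an assumption that may need adjusting for real pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps visible text and drops tags, scripts, and styles."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalise whitespace inside each line.
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    # Collapse runs of blank lines into a single blank line.
    out = []
    for line in lines:
        if line or (out and out[-1]):
            out.append(line)
    return "\n".join(out).strip() + "\n"


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Hello &amp; <b>world</b></p>"))
```

To convert a whole folder, html_to_text could be called on the contents of each downloaded *.html file and the result written to a matching .txt file.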

Next, I tried
cat *.txt > combinedfile.txt
and converted combinedfile.txt. This created a combined ebook, but with ugly hard line breaks every other line. I then enabled heuristic processing with all options and converted again. This time, I got good results. With heuristic processing the conversion takes longer, around 5-6 minutes for a 3 MB text file.
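
As a rough illustration of what unwrapping hard line breaks involves, here is a small Python sketch that concatenates the text files (like the cat command above) and joins wrapped lines back into paragraphs. It is not Calibre's heuristic processing, which does considerably more. The function names unwrap and combine are made up, and the sketch assumes that blank lines separate paragraphs, which may not hold for every converted file.

```python
from pathlib import Path


def unwrap(text):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line ends the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs) + "\n"


def combine(folder, output="combinedfile.txt"):
    """Concatenate every .txt file in folder (sorted by name) and unwrap the result."""
    out_path = Path(folder) / output
    # Skip the output file itself in case it exists from an earlier run.
    files = sorted(p for p in Path(folder).glob("*.txt") if p != out_path)
    text = "\n\n".join(p.read_text(encoding="utf-8", errors="replace") for p in files)
    out_path.write_text(unwrap(text), encoding="utf-8")
    return out_path


if __name__ == "__main__":
    print(unwrap("one\ntwo\n\nthree\nfour\n"))
```

Files are joined in sorted name order, which is close to but not guaranteed to match the order the shell uses for *.txt.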

So, to sum up - 
  • DownloadThemAll
  • HtmlAsText
  • cat
  • Calibre with heuristics enabled.

