These are the steps I take to convert 'HTML books' to PDF files with nice (clickable) tables of contents.
The tools used (ruby, wget, tidy, iconv and htmldoc) are all open source.
Note: this recipe works well for large text documents, but it spoils the layout of pages with images.
The following command downloads an HTML file together with the resources it needs (images, style sheets):
wget -p -k -F -e robots=off "http://..../....html"
When the HTML file is encoded as UTF-8 (check the charset declared in the HTML header), the character encoding has to be converted to something 'simple':
iconv -f UTF8 -t ISO-8859-1//TRANSLIT in.html > out.html
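The header check and the conversion can also be sketched in Ruby. This is only an illustration, not part of the recipe: the helper names declared_charset and utf8_to_latin1 are made up here, and unlike iconv's //TRANSLIT, Ruby's String#encode does not transliterate, so characters without an ISO-8859-1 equivalent simply become "?".

```ruby
# Hypothetical helpers (names are assumptions, not from the original recipe).

# Return the charset declared in a <meta> tag of the HTML header,
# or nil when no declaration is found.
def declared_charset(html)
  html[/<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_-]+)/i, 1]
end

# Convert UTF-8 text to ISO-8859-1. Characters that have no Latin-1
# equivalent are replaced by "?" (no transliteration is attempted).
def utf8_to_latin1(text)
  text.encode("ISO-8859-1", "UTF-8",
              invalid: :replace, undef: :replace, replace: "?")
end

# Small inline demo instead of reading a real file.
html = %(<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>caf\u00e9</body></html>)
charset = declared_charset(html)
if charset && charset.casecmp("utf-8").zero?
  latin1 = utf8_to_latin1(html)
  puts "converted #{html.bytesize} bytes to #{latin1.bytesize} bytes"
end
```

If the replacement by "?" loses too much, the iconv command above remains the better choice.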
The following command generates a PDF file that fits nicely on the screen of my eReader (iRex iLiad).
htmldoc -f book.pdf --header "" --footer "" --top 3mm --bottom 1mm --left 1mm --right 1mm --size 12x15cm out.html
Remarks:
* For books from 'Project Gutenberg' you may want to shorten the title (in the HTML 'title' tag) before converting.
* To clean up an HTML file in place (tidy rewrites it as well-formed XML): tidy -m -asxml file.html
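Shortening the title can be scripted as well. The following is a minimal sketch, assuming the title element has no attributes; shorten_title is a hypothetical helper name and the sample title is made up for illustration.

```ruby
# Hypothetical helper: replace the contents of the first <title> element.
# Assumes a plain <title>...</title> pair without attributes.
def shorten_title(html, new_title)
  # The block form avoids backslash interpretation in the replacement text.
  html.sub(%r{<title>.*?</title>}mi) { "<title>#{new_title}</title>" }
end

# Inline demo with a made-up title.
sample = "<head><title>The Project Gutenberg eBook of Some Very Long Title, by Some Author</title></head>"
puts shorten_title(sample, "Some Title")
```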
The following ad hoc script removes <hr> nodes and navigation tables from a downloaded book. It's just an example for reference.
#!/usr/bin/env ruby
require "rexml/document"
include REXML

# files are 1.html, 1.1.html, ...
# cleanup: tidy -m -asxml *.html
files = Dir['*.html']

# sort file names in 'floating point' order
files.sort! {|a,b| a.to_f <=> b.to_f}

files.each do |file|
  begin
    # we're only interested in the <body> node
    body = Document.new(IO.read(file)).root.elements["body"]
    # remove <hr>
    body.elements.each("//hr") { |e| e.remove}
    # remove <table class='nav'>
    body.elements.each("//table") do |e|
      e.remove if e.attributes["class"]=='nav'
    end
    # use what's left over
    body.elements.each do |e|
      print e.to_s
    end
  rescue Exception
    print "<h1>##### BADLUCK: "+file+" #####</h1>"
    $stderr.print "BADLUCK: "+file+" (#{$!})"
  end
end
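One caveat of sorting with to_f: "1.10.html".to_f is 1.1, the same value as for "1.1.html", and "1.2.1.html".to_f is 1.2. If a book has more than nine sections per level or deeper nesting (an assumption, the example above may not need it), comparing the numbers in each name as a list of integers is a possible alternative. section_key is a hypothetical helper name.

```ruby
# Hypothetical sort key: all integers found in the file name,
# e.g. "1.10.html" -> [1, 10]. Arrays compare element by element.
def section_key(name)
  name.scan(/\d+/).map(&:to_i)
end

# Inline demo with made-up file names.
files = ["1.10.html", "2.html", "1.2.html", "1.html", "1.2.1.html"]
puts files.sort_by { |f| section_key(f) }
```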
This script converts a plain-text file from Project Gutenberg to simple HTML:
#!/usr/bin/ruby
txt = IO.read(ARGV[0])
txt.gsub!(/\r/,'')
parts = txt.split(/\n\n\n\n/)
parts.shift
parts.pop
$stderr.print "%s: bytes=%d, parts=%d\n" % [ARGV[0], txt.size, parts.size]
print "<html>\n<head><title>#{ARGV[0]}</title></head>\n<body>\n"
parts.each do |part|
  pars = part.split(/\n\n+/)
  head = pars.shift
  print "<h1>#{head}</h1>\n"
  pars.each do |par|
    par.gsub!(/\[\d+\].+/, '')
    par.gsub!(/_(.*?)_/m, '<i>\1</i>')
    print "<p>#{par}</p>\n"
  end
end
print "</body></html>"
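The script above copies the text into the HTML unchanged, so a literal "&" or "<" in the book ends up as raw markup. A possible refinement, sketched here with the hypothetical helper paragraph_to_html, escapes those characters before adding the italics tags. The footnote and italics rules are the same as in the script.

```ruby
# Hypothetical helper: turn one plain-text paragraph into an HTML <p> element.
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

def paragraph_to_html(par)
  text = par.gsub(/\[\d+\].+/, "")              # drop footnote markers and the rest of that line
  text = text.gsub(/[&<>]/, HTML_ESCAPES)       # escape HTML special characters
  text = text.gsub(/_(.*?)_/m, '<i>\1</i>')     # _word_ becomes <i>word</i>
  "<p>#{text}</p>"
end

# Inline demo with a made-up paragraph.
puts paragraph_to_html("Tom & _Jerry_")
```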