Text -> Html -> Pdf

These are the steps I take to convert 'HTML books' to PDF files with nice (clickable) tables of contents.

The tools used (ruby, wget, tidy, iconv and htmldoc) are all open source.

Note: this recipe works well for large text documents, but it spoils the layout of pages with images.

Get html files


The following command downloads an HTML file together with the files it needs (images, style sheets):

wget -p -k -F -e robots=off "http://..../....html"

utf8 -> iso

When the HTML file is encoded in UTF-8 (check the charset declared in the HTML header), the character encoding has to be converted to something 'simple':

iconv -f UTF8 -t ISO-8859-1//TRANSLIT  in.html > out.html
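
When this step is part of a larger Ruby script, the same conversion can be sketched with String#encode. This is only a sketch: `to_latin1` is a hypothetical helper name, the input is assumed to be valid UTF-8, and unlike iconv's //TRANSLIT it does not transliterate; characters that have no ISO-8859-1 equivalent are simply replaced by "?".

```ruby
# Hypothetical helper: convert UTF-8 text to ISO-8859-1.
# Characters that cannot be represented are replaced by "?"
# (no transliteration, unlike iconv's //TRANSLIT).
def to_latin1(utf8_text)
  utf8_text.encode("ISO-8859-1", "UTF-8",
                   invalid: :replace, undef: :replace, replace: "?")
end

# Usage: ruby to_latin1.rb in.html out.html
if __FILE__ == $0 && ARGV.size == 2
  text = File.read(ARGV[0], encoding: "UTF-8")
  File.binwrite(ARGV[1], to_latin1(text))
end
```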

html -> pdf

The following command generates a pdf file that fits nicely on the screen of my eReader (Irex Iliad).

htmldoc -f book.pdf --header "" --footer "" --top 3mm --bottom 1mm --left 1mm --right 1mm --size 12x15cm out.html
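
For a different screen only the page size and margins change. A small Ruby sketch that builds the same htmldoc call with those values as parameters could look like this. `htmldoc_args` is a hypothetical helper; it uses only the options from the command above, with the same values as defaults.

```ruby
# Hypothetical helper: build the argument list for the htmldoc call above.
# Defaults are the values used for the Irex Iliad.
def htmldoc_args(input, output, size: "12x15cm",
                 top: "3mm", bottom: "1mm", left: "1mm", right: "1mm")
  ["htmldoc", "-f", output,
   "--header", "", "--footer", "",
   "--top", top, "--bottom", bottom,
   "--left", left, "--right", right,
   "--size", size, input]
end

# Usage: ruby make_pdf.rb out.html book.pdf
if __FILE__ == $0 && ARGV.size == 2
  # Passing the arguments as an array bypasses the shell,
  # so the empty header and footer strings need no quoting.
  system(*htmldoc_args(ARGV[0], ARGV[1])) or abort "htmldoc failed"
end
```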

Remark: for books from 'Project Gutenberg' you may want to shorten the title (in the html 'title' tag) before converting.
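
Shortening the title can also be scripted. The following is a minimal sketch, not a robust HTML rewriter: `shorten_title` is a hypothetical helper that replaces the contents of the first 'title' element with a regular expression and leaves the rest of the file alone.

```ruby
# Hypothetical helper: replace the contents of the first <title> element.
# Returns the HTML unchanged when there is no <title> element.
def shorten_title(html, new_title)
  html.sub(%r{(<title[^>]*>).*?(</title>)}im) { "#{$1}#{new_title}#{$2}" }
end

# Usage: ruby shorten_title.rb out.html "Short title" > short.html
if __FILE__ == $0 && ARGV.size == 2
  print shorten_title(File.read(ARGV[0]), ARGV[1])
end
```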

Html manipulation

html to clean xml

The following command converts a file to well-formed XML (XHTML) in place:

tidy -m -asxml file.html

Hacking XML with Ruby / REXML

The following ad hoc script removes <hr> nodes and navigation tables from a downloaded book. It's just an example for reference.

#!/usr/bin/env ruby
require "rexml/document"
include REXML
# files are 1.html, 1.1.html, ...
# cleanup: tidy -m -asxml *.html
files = Dir['*.html']
# sort file names in 'floating point' order
files.sort! {|a,b| a.to_f <=> b.to_f}
files.each do |file|
  begin
    # we're only interested in the <body> node
    body = Document.new(IO.read(file)).root.elements["body"]
    # remove <hr>
    body.elements.each("//hr") { |e| e.remove}
    # remove <table class='nav'>
    body.elements.each("//table") do |e|
      e.remove if e.attributes["class"]=='nav'
    end
    # use what's left over
    body.elements.each do |e|
      print e.to_s
    end
  rescue StandardError => err
    # a file that cannot be parsed is marked in the output and reported on stderr
    print "<h1>##### BADLUCK: "+file+" #####</h1>"
    $stderr.puts "BADLUCK: "+file+" (#{err.message})"
  end
end
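
The 'floating point' sort reads each file name as a decimal number, so 1.10.html compares equal to 1.1.html and sorts before 1.2.html. When a chapter has more than nine subsections, comparing the numeric parts as integers may work better. A sketch, where `chapter_key` is a hypothetical helper and the file names are made up for illustration:

```ruby
# Hypothetical helper: turn "1.10.html" into [1, 10] so that
# file names compare component by component as integers.
def chapter_key(name)
  name.scan(/\d+/).map(&:to_i)
end

# Illustrative file names, not taken from a real book.
files = ["1.10.html", "1.2.html", "2.html", "1.html", "1.1.html"]
sorted = files.sort_by { |f| chapter_key(f) }
puts sorted
```

Arrays compare element by element, so a chapter file such as 1.html still sorts before its subsections.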

text -> html

Project Gutenberg

This script converts a text file from Project Gutenberg to simple html:

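txt = IO.read(ARGV[0])
# chapters are separated by three blank lines
parts = txt.split(/\n\n\n\n/)
$stderr.print "%s: bytes=%d, parts=%d\n" % [ARGV[0], txt.bytesize, parts.size]
print "<html>\n<head><title>#{ARGV[0]}</title></head>\n<body>\n"
parts.each do |part|
  # paragraphs are separated by one or more blank lines
  pars = part.split(/\n\n+/)
  # the first paragraph of a part is used as its heading
  head = pars.shift
  print "<h1>#{head}</h1>\n"
  pars.each do |par|
    # drop footnote references: everything from "[number]" to the end of the line
    par.gsub!(/\[\d+\].+/, '')
    # _text_ becomes italic
    par.gsub!(/_(.*?)_/m, '<i>\1</i>')
    print "<p>#{par}</p>\n"
  end
end
print "</body></html>"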
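
The script writes the text into the HTML unescaped, so a stray & or < in the book ends up as markup. A sketch of the per-paragraph step with escaping added, assuming the same footnote and italics conventions as the script; `paragraph_to_html` is a hypothetical helper name.

```ruby
# Hypothetical helper: convert one plain-text paragraph to an HTML <p>.
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

def paragraph_to_html(par)
  # drop footnote references: everything from "[number]" to the end of the line
  text = par.gsub(/\[\d+\].+/, "")
  # escape characters that have a special meaning in HTML
  text = text.gsub(/[&<>]/, HTML_ESCAPES)
  # _text_ becomes italic (done after escaping so the <i> tags survive)
  text = text.gsub(/_(.*?)_/m, '<i>\1</i>')
  "<p>#{text}</p>"
end

puts paragraph_to_html("Tom & Jerry were _very_ happy")
```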
Last modified: 2009/08/10 16:52