Horrid html
First off, there’s an update to the YK Converter available from here. It’s a ‘related file’ for the Yoshikoder, for some reason. These updates are primarily because I’ve just finished working on a project that required scraping a lot of truly horrid web pages, and the current machinery wasn’t quite up to dealing with them. In fact, they bust everything except TagSoup.
So, now there’s a way to specify the encoding of the text or HTML page you’re working with, since we can’t always guess what that is. There’s also a cheeky way to grab a small section of text, which I’ve been thinking of as ‘Tuc’ because it deletes all the text that is not currently selected. It seemed just the thing for manually pulling a snippet of text out of an HTML-stripped page without a lot of editing effort.
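To make those two ideas concrete, here is a minimal sketch in plain Java. It is not the converter’s actual code: the class and method names (`EncodingSketch`, `decode`, `keepSelection`) are made up for illustration, and it only assumes the standard library. The first method decodes bytes with an encoding the user names rather than one we guess; the second keeps only a selected span and throws the rest away.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; names are not taken from the YK Converter.
class EncodingSketch {

    // Decode raw bytes using the encoding the user specified.
    // Charset.forName throws if the name is illegal or unsupported.
    static String decode(byte[] raw, String encodingName) {
        Charset charset = Charset.forName(encodingName);
        return new String(raw, charset);
    }

    // Keep only the selected span [start, end) and drop everything else.
    // Out-of-range offsets are clamped to the text bounds.
    static String keepSelection(String text, int start, int end) {
        int s = Math.max(0, Math.min(start, text.length()));
        int e = Math.max(s, Math.min(end, text.length()));
        return text.substring(s, e);
    }

    public static void main(String[] args) {
        // "café" stored as Latin-1 bytes, as an old web page might be.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the right encoding recovers the text.
        System.out.println(decode(raw, "ISO-8859-1"));

        // Decoding with the wrong one garbles the accented character.
        System.out.println(decode(raw, "UTF-8"));

        // Keep just the selected snippet of a stripped page.
        System.out.println(keepSelection("header | the snippet | footer", 9, 20));
    }
}
```

The point of taking the encoding as a parameter is that a wrong guess fails quietly: the bytes still decode, just into the wrong characters, so letting the user override the guess is the only reliable fix.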
Revisiting the conversion task reminded me that technology has moved on since I first wrote the code. In particular, the Apache POI project is now further along, and Tika, a general text scraping library, is in the Apache Incubator. Ultimately I’d like the converter to be a thin but pretty wrapper around Tika. In any case, I should update the POI components in the next release.