Horrid html
First off, there’s an update to the YK Converter available from here. It’s a ‘related file’ for the Yoshikoder, for some reason. These updates are primarily because I’ve just finished working on a project that required scraping a lot of truly horrid web pages, and the current machinery wasn’t quite up to dealing with them. In fact, they bust everything except TagSoup.
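For the curious, here is a rough sketch of the general idea behind forgiving parsers: take whatever tags turn up, never complain about the ones that are missing, and keep the text. This is not the converter's code and it is not TagSoup; it uses Python's standard `html.parser` module as a stand-in, and the names `TextExtractor` and `extract_text` are made up for the illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores whether tags are ever closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not script or stylesheet content.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string, however badly formed it is."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(extract_text("<p>Hello <b>world<p>unclosed <i>tags</p>"))
```

Truly horrid pages will still find ways to surprise a sketch like this, which is rather the point of the update.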
A little bit of infrastructure
OK. About the transparency thing. There’s now a proper place to send bug reports and feature requests for the Yoshikoder. I had considered using the many options available from Sourceforge, but plumped for something that was much simpler, arguably more elegant, and most importantly: blue.
And another one
Another preview (RC2) is available here. Most of the debugging happened on the Mac side, but it ought to go slightly better everywhere.
Sneaky preview
In a remarkably productive Christmas break I finally got working on the much neglected, at this point almost mythical ‘next version’ of the Yoshikoder. It’s not quite there yet, but I thought it might be nice to share a sneaky new year preview with you.
A small addition
Folk have asked for a way to run their dictionaries over all project documents at once to generate a ‘unified dictionary report’ mirroring the unified word frequency report. This function is now attached to a menu item in the latest preview release. It drops the resulting report straight into a CSV file, for easy import into whatever you like to do your data analysis in.
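If you are wondering what such a report looks like, here is a minimal sketch of the idea, assuming a dictionary is just a set of words per category and a document is just a string. None of this is Yoshikoder code or its file format; `tokenize`, `unified_dictionary_report` and `write_csv` are names invented for the example.

```python
import csv
import re
import sys


def tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def unified_dictionary_report(documents, dictionary):
    """Build one row per document with a match count for each category.

    documents:  mapping of document name to text (assumed shape)
    dictionary: mapping of category name to a set of words (assumed shape)
    """
    categories = sorted(dictionary)
    rows = [["document"] + categories]
    for name, text in documents.items():
        tokens = tokenize(text)
        counts = [sum(1 for t in tokens if t in dictionary[c]) for c in categories]
        rows.append([name] + counts)
    return rows


def write_csv(rows, out):
    """Write the report rows to any file-like object as CSV."""
    csv.writer(out).writerows(rows)


if __name__ == "__main__":
    docs = {"a.txt": "Tax cuts and more tax", "b.txt": "Peace, peace and war"}
    cats = {"economy": {"tax", "budget"}, "conflict": {"war", "peace"}}
    write_csv(unified_dictionary_report(docs, cats), sys.stdout)
```

Real dictionaries can hold patterns rather than plain words, so treat the exact-match lookup here as a simplification.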
JFreq – a little command line tool
Just a quick pointer to a bit of code that you might find useful, if you’re a command line kind of person.
JFreq is a simple word counter. It takes your text files, filters them in various ways, and spits out a table of counts organized word by document. The handy bit is probably the filtering. JFreq can currently stem in 12 languages (courtesy of the Lucene project). It can also remove currency references, number references, and stop words from a list you provide. Requires Java 5 or higher. Output is in UTF-8.
JFreq is already in use as part of the backend of the Stata implementation of Wordscores, and has been used by the Wordfish folk for research on the EU. I had expected a graphical interface to be a long way off, given the speed of Yoshikoder development lately, but JFreq now comes with a nice one if you don’t fancy the command line version.
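To make the ‘word by document’ table concrete, here is a small sketch of that kind of counting with currency, number and stop word filtering. It is not JFreq's code (JFreq is a Java tool) and it leaves out stemming entirely; the regular expressions and the names `count_words` and `word_by_document` are assumptions made for the illustration.

```python
import re
from collections import Counter

# Illustrative patterns only; a real tool will have its own rules.
CURRENCY = re.compile(r"[$£€¥]\s?\d[\d,.]*")
NUMBER = re.compile(r"\b\d[\d,.]*\b")
WORD = re.compile(r"[^\W\d_]+")


def count_words(text, stop_words=frozenset()):
    """Count words in one document after removing currency, numbers and stop words."""
    text = CURRENCY.sub(" ", text)
    text = NUMBER.sub(" ", text)
    words = WORD.findall(text.lower())
    return Counter(w for w in words if w not in stop_words)


def word_by_document(documents, stop_words=frozenset()):
    """Return (document names, table) where table maps each word to its
    counts in document order."""
    names = list(documents)
    counts = {name: count_words(documents[name], stop_words) for name in names}
    vocab = sorted(set().union(*counts.values()))
    table = {w: [counts[name][w] for name in names] for w in vocab}
    return names, table


if __name__ == "__main__":
    docs = {"d1": "The tax costs $5 and 10 more tax", "d2": "The war"}
    names, table = word_by_document(docs, stop_words={"the", "and"})
    print("word", *names)
    for word, row in table.items():
        print(word, *row)
```

Each row is a word and each column a document, which is the shape the scaling methods mentioned above tend to want as input.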
Minor converter update
There’s a tiny weeny little update to the converter, available from the usual spot. I’ve made the help a bit better, and it should feel slightly more native on Windows.
Batch dictionary reports
Folk have been asking about being able to run dictionary reports over all their documents rather than one at a time. Since the code for the next version of the Yoshikoder is in pieces around my bedroom, with several bits having rolled under the carpet or been borrowed by the cat, I’ve made a little program to do dictionary reports in a batch. This application currently lives here, and is wrapped up for Windows. Give it a project file from the Yoshikoder and it will run a dictionary report on every document in the project, and drop the results into a file. At least, that’s the idea.
Concordance reports in the preview
Yes, I bust the concordance reports in the latest Preview. They are now unbusted. Version 0.6.3-Preview.1 is a bug-fix release, available from the usual place.