Sometimes it’s necessary to convert a whole bunch of text files from one encoding to another, like when my Windows-using collaborators send me things in CP1252 or Mac folk send MacRoman. If for some reason you don’t feel like using the command line to convert a folder full of files, there is now the re-encoder to do it for you. Just identify your folder of text files, preview one or two of them to make sure your ‘from’ encoding is actually reasonable, pick a ‘to’ encoding, and press the button to get a folder full of new files. (The old ones aren’t deleted, in case that was a mistake you just made.)
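For the command line kind of person, the same job can be sketched in a few lines of Python. This is only an illustration of the idea, not how the re-encoder itself works: the function name `reencode_folder`, the `*.txt` filter and the default encodings are all assumptions made for the example.

```python
from pathlib import Path


def reencode_folder(src_dir, dst_dir, from_enc="cp1252", to_enc="utf-8"):
    """Copy every .txt file in src_dir to dst_dir, converting its encoding.

    Hypothetical helper: the originals are left untouched and the list of
    newly written files is returned.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.txt")):
        # A wrong from_enc may raise UnicodeDecodeError here, or may silently
        # produce mojibake, so previewing a file or two is still worthwhile.
        text = path.read_bytes().decode(from_enc)
        out = dst / path.name
        # Work in bytes throughout so that line endings are not translated.
        out.write_bytes(text.encode(to_enc))
        written.append(out)
    return written
```

Calling `reencode_folder("from_windows", "as_utf8")` would leave `from_windows` as it was and fill `as_utf8` with UTF-8 copies; pass `from_enc="mac_roman"` for the Mac folk.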
Not a particularly tricky task, but it might be useful to someone.
(0.6.4)
Feel free to poke it a bit. Menus have moved, lots of little things have changed, Mac users get a more OSX-ish feel, and we’re now all set up for the next move forward. The help is rewritten, but a bit sparse, and there’s a little bit more info in the README on the SourceForge pages.
And if you don’t like it, all your files will still work with the old version.
If you want to run a Yoshikoder dictionary over a large number of documents more quickly than the Yoshikoder would do it, you can now use the new version of JFreq. Plus you get stemming, stopword removal and other preprocessing steps too, if you want them. Just drag your documents into the window, upload your saved dictionary (not project), select your preprocessing options, pick an output file, and press Run.
First off, there’s an update to the YK Converter available from here. It’s a ‘related file’ for the Yoshikoder, for some reason. These updates are primarily because I’ve just finished working on a project that required scraping a lot of truly horrid web pages, and the current machinery wasn’t quite up to dealing with them. In fact, they bust everything except TagSoup.
OK. About the transparency thing. There’s now a proper place to send bug reports and feature requests for the Yoshikoder. I had considered using the many options available from Sourceforge, but plumped for something that was much simpler, arguably more elegant, and most importantly: blue.
Another preview (RC2) is available here. Most of the debugging happened on the Mac side, but it ought to go slightly better everywhere.
In a remarkably productive Christmas break I finally got working on the much neglected, at this point almost mythical ‘next version’ of the Yoshikoder. It’s not quite there yet, but I thought it might be nice to share a sneaky new year preview with you.
It seems the regressive imagery dictionaries for French and Portuguese were not in good shape to import into the Yoshikoder.
Now they are.
Thanks to Sophie for pointing that out.
Folk have requested the possibility to run their dictionaries over all project documents at once to generate a ‘unified dictionary report’ mirroring the unified word frequency report. This function is now attached to a menu item in the latest preview release. It drops the resulting report straight into a CSV file, for easy import into whatever you like to do your data analysis in.
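If you are wondering what such a report looks like, here is a rough Python sketch of the idea: one row per document, one column per dictionary category, written out as CSV. It is not the Yoshikoder’s own code or file format. The dictionary here is just a hypothetical mapping from category names to patterns, and the `*` wildcard matching via `fnmatch` is an assumption made for the example.

```python
import csv
import re
from fnmatch import fnmatchcase


def dictionary_report(docs, dictionary):
    """Count, per document, the tokens that match each category's patterns.

    docs is {document name: text}; dictionary is {category: [patterns]}.
    Both structures are illustrative, not the Yoshikoder's own formats.
    """
    rows = []
    for name, text in docs.items():
        tokens = re.findall(r"\w+", text.lower())
        row = {"document": name}
        for category, patterns in dictionary.items():
            # A token counts once for a category if any pattern matches it.
            row[category] = sum(
                any(fnmatchcase(tok, pat) for pat in patterns) for tok in tokens
            )
        rows.append(row)
    return rows


def write_report(rows, categories, handle):
    """Write the rows as CSV to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=["document", *categories])
    writer.writeheader()
    writer.writerows(rows)
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends, so that line endings are not doubled on Windows.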
Just a quick pointer to a bit of code that you might find useful, if you’re a command line kind of person.
JFreq is a simple word counter. It takes your text files, filters them in various ways, and spits out a table of counts organized word by document. The handy bit is probably the filtering. JFreq can currently stem in 12 languages (courtesy of the Lucene project). It can also remove currency references, number references, and stop words from a list you provide. Requires Java 5 or higher. Output is in UTF-8.
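To make the ‘word by document’ idea concrete, here is a small Python sketch of that kind of filtering and counting. It is not JFreq’s code and does not reproduce its options or output format: the tokenising rules, the currency symbols, and the function names `count_words` and `to_table` are assumptions for illustration, and stemming is left out entirely.

```python
import re
from collections import Counter

# Assumed tokeniser: an optional currency sign, word characters, and any
# trailing ".5" or ",000" style digit groups.
TOKEN = re.compile(r"[$£€]?\w+(?:[.,]\d+)*")
# Tokens that are purely a number, with or without a currency sign.
NUMERIC = re.compile(r"[$£€]?\d+(?:[.,]\d+)*")


def count_words(docs, stopwords=()):
    """Return (sorted vocabulary, {doc name: Counter}) for a {name: text} dict."""
    stop = {w.lower() for w in stopwords}
    counts = {}
    for name, text in docs.items():
        tokens = TOKEN.findall(text.lower())
        # Drop stop words, plain numbers and currency amounts.
        kept = [t for t in tokens if t not in stop and not NUMERIC.fullmatch(t)]
        counts[name] = Counter(kept)
    vocab = sorted(set().union(*counts.values()))
    return vocab, counts


def to_table(vocab, counts):
    """Render a tab-separated table: one row per word, one column per document."""
    names = list(counts)
    lines = ["\t".join(["word"] + names)]
    for word in vocab:
        lines.append("\t".join([word] + [str(counts[n][word]) for n in names]))
    return "\n".join(lines)
```

Writing the result of `to_table` to a file opened with `encoding="utf-8"` gives a table you can pull into whatever you do your analysis in.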
JFreq is already in use as part of the backend of the Stata implementation of Wordscores, and has been used by the Wordfish folk for research on the EU. One of these days it will get a nice graphical interface, but given the speed of Yoshikoder development lately, that’s unlikely to happen soon. And it now comes with a nice graphical interface, if you don’t fancy the command line version.