Wednesday, January 13, 2010

Analyzing the "linguistic fingerprints" of authors

Works by Herman Melville, Thomas Hardy, and D.H. Lawrence have been examined to see how many different words an author uses only once in that piece of writing.  Obviously, a longer work would tend to have more unique words, in a pattern that eventually forms a plateau based on the author's vocabulary, and in a shape that may be characteristic of that author.
The team suggests that a work by an unknown author could therefore be compared to prior works, with the curve acting as a linguistic "fingerprint".
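
For anyone who wants to try drawing such a curve, here is a minimal sketch in Python. It is not the team's actual method: the function name `vocabulary_curve`, the `step` parameter, and the simple rule for splitting text into words are all my own assumptions, and "unique words" is taken here to mean distinct words seen so far.

```python
import re

def vocabulary_curve(text, step=1000):
    """Return (words_read, distinct_words_seen) pairs, sampled every `step` words.

    Assumption: a "word" is any run of letters or apostrophes, lowercased.
    """
    words = re.findall(r"[a-z']+", text.lower())
    seen = set()
    curve = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        # Record a point at every `step` words, plus one at the very end.
        if i % step == 0 or i == len(words):
            curve.append((i, len(seen)))
    return curve
```

Plotting the pairs for two different texts on the same axes gives a rough visual comparison of how quickly each one's vocabulary grows.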

"It doesn't matter if I pull out 10,000 words from a book of 100,000 or from a book of 200,000, I get the same behaviour; you always simply pull a piece out of your very, very big 'meta book', which is just a representation of your style," said Sebastian Bernhardsson, who led the work.
For an interesting comparison piece, see the post I wrote about the aging-related changes in the vocabulary of Agatha Christie.

And on a tangentially related matter, one year ago I tested the "readability level" of this blog, the results of which suggested the readership would be quite well educated.  That particular test is no longer available, so I tried a different one this morning and got the results below.  The Gunning fog index of 14 is a "rough measure of how many years of schooling it would take someone to understand the content" of the blog.  The test apparently just sampled the front page (last 25 posts) of TYWKIWDBI, so the number would change from time to time (and I rather suspect it also samples the sidebar, which would greatly skew the results downward).
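
For the curious, the Gunning fog index is usually given as 0.4 × (average words per sentence + percentage of "complex" words, meaning those with three or more syllables). Below is a rough Python sketch of that arithmetic. I don't know how the online test actually counts things, so the helper names `count_syllables` and `gunning_fog`, the vowel-group syllable guess, and the sentence-splitting rule are assumptions of mine; expect its numbers to differ from the tool's.

```python
import re

def count_syllables(word):
    # Crude assumption: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those with three or more (estimated) syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    words_per_sentence = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (words_per_sentence + percent_complex)
```

Calling `gunning_fog` on the text of a post gives a number on the same "years of schooling" scale, though the syllable guess is crude enough that it should only be read as a ballpark figure.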

Got a blog?  Test your blog's readability here.
