Corpus Analytics: Automated Tag Creation

When I first read on Niall Kennedy’s blog that Yahoo had released their Term Extraction API, I immediately thought, “Cool, if I can rig up a method of inserting the extracted terms into my blog posts, I can get all of the benefits of Tags, without the added time and hassle of creating them all by hand.”

The first step for this project was looking at the Yahoo API, and more specifically, the examples of the terms that it extracts. And that’s where the experiment ended.

The problem is, of course, that the Term Extractor is just that, not a Meaningful Term Extractor. While I strongly believe that corpus analytics holds amazing promise (sorry, couldn’t resist the tease), Yahoo’s freebie offering is far too crude a tool to move straight to automated tag creation.

I just read (via Technorati’s David Sifry) that Jonas Luster has built the solution I only imagined (kudos Jonas). While the results bear out my assumptions, they are nonetheless a valuable starting point.

If we assume that “meaningful” is hard to automate near-term (it is), need we give up? Not necessarily. Rather than late binding the extracted terms to the post (computationally expensive, as Jonas notes, and with mixed results, as above)… what if we could invoke Jonas’s service before publishing a post (or even more interestingly, dynamically, as we’re typing)?

  1. That would allow me (the publisher and editor) to winnow the relevant terms from the automated result set, including them as tags in the post;
  2. What if the winnowed terms were then automatically passed to a process that would return all of my blog posts (likely as hyperlinked headlines) that use those terms? Well, with that, I could again winnow the result set so that I could in-line a “buzzhit!’s related articles” offering (or draw examples, previous ideas and writing, etc., from the returned articles to strengthen the post under creation);
  3. Similarly, through another service, I could receive and in-line “articles from around the web” (and/or “from my reading list” [OPML file], and/or “from my social network” [XFN et al.]).
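To make steps 1 and 2 concrete, here is a minimal sketch in Python. It assumes the candidate terms have already come back from some extraction service (no real API call is made here), and the function names `winnow` and `related_posts`, along with the sample posts, are hypothetical and purely illustrative. Matching is a plain whole-phrase text search, which is far cruder than anything you would want in production.

```python
import re

def winnow(candidates, keep):
    """Step 1: keep only the extracted terms the author approved."""
    approved = {term.lower() for term in keep}
    return [term for term in candidates if term.lower() in approved]

def related_posts(posts, tags):
    """Step 2: rank post titles by how many approved tags they mention.

    `posts` maps a headline to its body text (a stand-in for a blog archive).
    """
    scores = {}
    for title, body in posts.items():
        text = body.lower()
        # Count tags that appear in the body as whole words or phrases.
        count = sum(
            1 for tag in tags
            if re.search(r"\b" + re.escape(tag.lower()) + r"\b", text)
        )
        if count:
            scores[title] = count
    # Most matching tags first; ties broken alphabetically by title.
    return sorted(scores, key=lambda title: (-scores[title], title))

# Hypothetical data: terms an extractor might return, and a tiny archive.
candidates = ["Yahoo", "term extraction", "released", "API"]
archive = {
    "A": "Yahoo released a term extraction API",
    "B": "Tags and folksonomy",
    "C": "Yahoo tags",
}

tags = winnow(candidates, keep={"yahoo", "term extraction"})
headlines = related_posts(archive, tags)
```

The point of the design is that the human stays in the loop twice: once to approve the tags, and again to pick from the ranked headlines before anything is in-lined.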

Lots of potential here (and more broadly for mashups resulting in automated meta-data creation). On the latter topic, I’ve got a few ideas that I’d love to “reduce to practice” if any of my more savvy dev buddies have a few spare cycles. ;-p Drop me a note…

Update:

Greg Linden (of Findory, a cool service) drops by and gently asks “What’s the point?” (my words) and suggests a more considered approach.

Agreed. Remember that this post is focused on building something of use off of what Jonas has built out in the open (i.e., prototyping or as is trendy, ‘hacking’) with publicly available APIs. To your point, there are definitely smarter ways of satisfying the ‘needs’ that I’m expressing.

Given that, and the stated assumption that people need to be involved in the process to get the best results, the basic thinking is that, well, “people are lazy” and “human memory is lossy”. I’m looking for a suite of services that would automate the process of assisting me as an author/editor by recommending relevant pieces of content and meta-data during the blog post creation process. Specifically, I could use help with tag recommendations/mgmt, and surfacing what I and others have written about this topic/entity in the past. (Am I the only one who hates manually invoking these activities in a bunch of extra windows?!)

But hell, I’m still waiting for basic stuff, like a NOFOLLOW checkbox in the hyperlink dialog. 😉

4 comments on “Corpus Analytics: Automated Tag Creation”
  1. Greg Linden says:

    I’m not sure I understand the idea of automated tag creation.

    If the goal is to find all the posts that contain specific keywords, how is a tag search over automatically generated tags different than a search on a search engine like Feedster or Google?

    If the goal is to find related posts, shouldn’t we use the full information available, not just a couple tags, to find relevant content?

    I’m not trying to be argumentative. I really don’t get it. Can you help me out, Tony? What’s the goal of automated tag extraction? What are we trying to help users do?

  2. Jeff Clavier says:

    Automatic generation/association would make sense for information like locations (London = Londres, etc.) or entities (IBM = International Business Machines).

    Automatic categorization has been around for a number of years (Autonomy being a leader in the space), but these technologies relate to a taxonomy. So are you suggesting defining a personal taxonomy and “tagging” documents accordingly?

  3. Hey Tony,

    Check out my automated tagger. It trains itself on already-labelled tags, so it’s better able to give meaningful tags. I tested it and it’s accurate 75% of the time.

    Check out the demo here. If the site is down, just check back later, since this is hosted on my home computer.
