buzzhit!: Tony Gentile's Internet Strategy Consultancy

Sunday, April 10, 2005

Corpus Analytics: Automated Tag Creation

When I first read on Niall Kennedy's blog that Yahoo had released their Term Extraction API, I immediately thought, "Cool, if I can rig up a method of inserting the extracted terms into my blog posts, I can get all of the benefits of Tags, without the added time and hassle of creating them all by hand."

The first step for this project was looking at the Yahoo API, and more specifically, the examples of the terms that it extracts. And that's where the experiment ended.

The problem is, of course, that the Term Extractor is just that, not a Meaningful Term Extractor. While I strongly believe that corpus analytics holds amazing promise (sorry, couldn't resist the tease), Yahoo's freebie offering is far too crude a tool to move straight to automated tag creation.

I just read (via Technorati's David Sifry) that Jonas Luster has built the solution I only imagined (kudos Jonas). While the results bear out my assumptions, they are nonetheless a valuable starting point.

If we assume that "meaningful" is hard to automate near-term (it is), need we give up? Not necessarily. Rather than late binding the extracted terms to the post (computationally expensive, as Jonas notes, and with mixed results, as above)... what if we could invoke Jonas's service before publishing a post (or even more interestingly, dynamically, as we're typing)?


  1. That would allow me (the publisher and editor) to winnow the relevant terms from the automated result set, including them as tags in the post;
  2. What if the winnowed terms were then automatically passed to a process that would return all of my blog posts (likely as hyperlinked headlines) that use those terms? Well, with that, I could again winnow the result set so that I could in-line a "buzzhit!'s related articles" offering (or, draw examples, previous ideas and writing, etc from the returned articles to strengthen the post under creation);
  3. Similarly, through another service, I could receive and in-line "articles from around the web" (and/or "from my reading list" [OPML file], and/or "from my social network" [XFN et al.]).
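
A rough sketch of steps 1 and 2 above, in Python. Everything here is an assumption made for illustration: `extract_terms` is a crude word-frequency stand-in, not Yahoo's Term Extraction API or Jonas's service, and `related_posts`, the stopword list, and the sample archive are hypothetical.

```python
import re
from collections import Counter

# Tiny stopword list for illustration only; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "i", "my"}

WORD = re.compile(r"[a-z][a-z'-]+")  # words of two or more characters


def extract_terms(text, limit=10):
    """Return the most frequent non-stopword words as candidate tags."""
    words = WORD.findall(text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(limit)]


def related_posts(tags, posts):
    """Return titles of posts whose body mentions any of the chosen tags."""
    wanted = set(tags)
    hits = []
    for title, body in posts.items():
        if wanted & set(WORD.findall(body.lower())):
            hits.append(title)
    return hits


draft = ("Yahoo released a term extraction API. "
         "Term extraction could automate tags for blog posts.")
candidates = extract_terms(draft)

# Step 1: the author winnows the candidates by hand; simulated here.
chosen = [t for t in candidates if t in {"extraction", "tags"}]

# Step 2: look up earlier posts that use the winnowed terms.
archive = {
    "Tagging by hand": "Creating tags by hand takes time.",
    "RSS readers": "Feed aggregators compared.",
}
print(related_posts(chosen, archive))
```

The point of the split is that the machine only proposes; the human choice in the middle is what keeps the tags (and the related-article list) meaningful.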

Lots of potential here (and more broadly for mashups resulting in automated meta-data creation). On the latter topic, I've got a few ideas that I'd love to "reduce to practice" if any of my more savvy dev buddies have a few spare cycles. ;-p Drop me a note...

Update:

Greg Linden (of Findory, a cool service) drops by and gently asks "What's the point?" (my words) and suggests a more considered approach.

Agreed. Remember that this post is focused on building something of use off of what Jonas has built out in the open (i.e., prototyping or as is trendy, 'hacking') with publicly available APIs. To your point, there are definitely smarter ways of satisfying the 'needs' that I'm expressing.

Given that, and the stated assumption that people need to be involved in the process to get the best results, the basic thinking is that, well, "people are lazy" and "human memory is lossy". I'm looking for a suite of services that would automate the process of assisting me as an author/editor by recommending relevant pieces of content and meta-data during the blog post creation process. Specifically, I could use help with tag recommendations/mgmt, and surfacing what I and others have written about this topic/entity in the past. (Am I the only one who hates manually invoking these activities in a bunch of extra windows?!)
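
For the "surfacing what I have written in the past" piece, Greg's suggestion of using the full text rather than a couple of tags might look something like the sketch below. It assumes plain word-count vectors and cosine similarity; the function names and sample posts are hypothetical, and a real service would likely weight terms (e.g. by rarity) rather than use raw counts.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic words in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_related(draft, posts):
    """Return titles of posts sharing any words with the draft, most similar first."""
    draft_vec = word_counts(draft)
    scored = [(cosine(draft_vec, word_counts(body)), title)
              for title, body in posts.items()]
    return [title for score, title in sorted(scored, reverse=True) if score > 0]


posts = {
    "Tags": "automated tag creation for blog posts",
    "Feeds": "subscribe with a feed aggregator",
}
print(rank_related("automated tag extraction", posts))
```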

But hell, I'm still waiting for basic stuff, like a NOFOLLOW checkbox in the hyperlink dialog. ;-)

4 Comments:

At 7:38 PM, April 10, 2005, Greg Linden said...

I'm not sure I understand the idea of automated tag creation.

If the goal is to find all the posts that contain specific keywords, how is a tag search over automatically generated tags different than a search on a search engine like Feedster or Google?

If the goal is to find related posts, shouldn't we use the full information available, not just a couple tags, to find relevant content?

I'm not trying to be argumentative. I really don't get it. Can you help me out, Tony? What's the goal of automated tag extraction? What are we trying to help users do?

 
At 9:46 AM, April 11, 2005, Jeff Clavier said...

Automatic generation/association would make sense for information like locations (London = Londres, etc.) or entities (IBM = International Business Machines).

Automatic categorization has been around for a number of years (Autonomy being a leader in the space), but these technologies relate to a taxonomy. So are you suggesting defining a personal taxonomy and "tagging" documents accordingly?

 
At 7:43 PM, May 27, 2005, Hisham Al-Shurafa said...

Hey Tony,

Check out my automated tagger. It trains itself with already labelled tags, so it's more able to give meaningful tags. I tested it and it's accurate 75% of the time.

Check out the demo here. If the site is down, just check back later, since this is hosted on my home computer.

 

 

About This Blog

Analysis of online business and technology trends, including: Search and Directory, Digital Media, Social Networking, RSS, and E-commerce. Written by buzzhit!'s Tony Gentile.

Copyright 2002-2005 Tony Gentile. All Rights Reserved Worldwide.