App developers, start your engines.
Diffbot is a new form of visual-based content extraction technology that views and understands Web content the same way human beings do. The technology identifies and extracts the important objects on any Web page using artificial intelligence, computer vision, machine learning and natural language processing.
And the technology is for everyone. Diffbot’s APIs give application developers a way to instantly utilize data from any Web page in their own applications, effectively turning the entire Web into a usable database. Diffbot is now processing 100 million API calls per month on behalf of its customers, who are using it for Web site mobilization, content management system migration, tag generation, article grouping/clustering and a host of other functions.
How is it being used so far? Diffbot's website gives the following example:
Editions by AOL uses Diffbot to pull relevant content from the Web and lay it out as an easy-to-use iPad magazine. Using Diffbot, AOL is able to identify headlines, full text, authors and related images; deliver new content by tracking article changes; generate tags for every article; and automatically group similar items using Diffbot clustering algorithms.
Diffbot just secured a $2 million investment from technology veterans, including Sky Dayton, founder of EarthLink; Andy Bechtolsheim, co-founder of Sun Microsystems; Joi Ito, director of the MIT Media Lab; Brad Garlinghouse, CEO of YouSendIt; and other top executives and founders from Facebook, Twitter and Yahoo, with participation from Matrix Partners.
“Diffbot is an incredibly sophisticated tool for developers to rapidly build compelling applications around Web content,” Dayton said. “The more developers use Diffbot, the more it learns about and adds structure to data on the Web. This technology is becoming the basis for a new kind of Web experience enhanced by machine interpretation of content.”
Diffbot has categorized the Web into about 20 different page types that can be visually analyzed using layout and contextual cues, including everything from product and review pages to social networking profiles and recipes. Amazingly, this visual-based processing lets Diffbot instantly understand and extract the content on any page, in any language. To date, the company has released developer APIs for two of the most commonly consumed page types, Front Pages and Articles. The Front Page API is designed for analyzing home and index pages using common layout markers (headlines, bylines, images, articles, ads and more), while the Article API is used to extract clean article text, related images and videos and generate unique cross-referenced tags from news and blog Web pages.
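For developers wondering what a call to the Article API might look like, here is a minimal Python sketch. The article does not give an endpoint, parameter names or a response format, so everything specific here is an assumption: the endpoint URL and the `token` and `url` query parameters are taken to match Diffbot's publicly documented v3 Article API, and the response field names (`objects`, `title`, `text`, `images`, `tags`) are hypothetical placeholders for the headline, clean text, related images and tags described above. Check Diffbot's current documentation before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed endpoint; the article does not name one. Verify against Diffbot's docs.
ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_request(page_url: str, token: str) -> str:
    """Return a GET URL asking the Article API to analyze page_url.

    The 'token' and 'url' parameter names are assumptions.
    """
    query = urlencode({"token": token, "url": page_url})
    return f"{ARTICLE_ENDPOINT}?{query}"


def summarize_article(response: dict) -> dict:
    """Pick out the pieces the article mentions from a parsed JSON response.

    The response shape (an 'objects' list whose first item carries 'title',
    'text', 'images' and 'tags') is hypothetical.
    """
    objects = response.get("objects") or [{}]
    first = objects[0]
    return {
        "title": first.get("title"),
        "text": first.get("text"),
        "images": [img.get("url") for img in first.get("images", [])],
        "tags": [tag.get("label") for tag in first.get("tags", [])],
    }
```

To try it, pass the URL from `build_article_request` to any HTTP client, parse the JSON body and hand the resulting dictionary to `summarize_article`. The sketch itself makes no network calls, so it runs without an API token.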
“Our goal with Diffbot is to understand every corner of the Web, and make every bit of it accessible for developers trying to create new, rich applications and experiences,” said Michael Tung, Diffbot Founder and CEO.
Visit the website to test-drive Diffbot right now.