KNOWLEDGEWISE Winter 2007

Welcome to KNOWLEDGEWISE, as we continue our mission to provide our perspective on the latest trends in Content, Knowledge and Publishing. This issue focuses on techniques and tactics for helping readers find content, including tips on integrating subject matter experts into an automated indexing approach and an article by Innodata Isogen Consultant Bob DuCharme on the best ways to set up taxonomies and ontologies that take advantage of RDF/OWL. As always, we welcome your feedback and suggestions on how we can make KNOWLEDGEWISE even better. Enjoy.

The Editors, KNOWLEDGEWISE

Machine Aided Expertise... Four Keys for Integrating Subject Matter Experts into Automated Indexing Workflows


With the vast wealth of digital information now available both over the Internet and within corporate databases, enterprises and publishers alike are naturally viewing automated indexing as a cheaper and faster way to classify data.

But companies that decide to automate indexing without considering other options may be making a mistake. In fact, several search companies are retreating from full-scale support of automated indexing programs that scan documents and identify keywords, according to Dan Penny, an analyst for Electronic Publishing Services (EPS). For example, he noted that Autonomy, the first company to fully automate XML tagging for knowledge management applications, and Verity now embrace an approach in which human indexers review the output of the automatic information processing programs.

“While automated indexing programs are quite adept at tagging a lot of information quickly, taxonomies still tend to be quite broad,” says Penny. “Consequently, many readers still have to wade through a great deal of irrelevant information.”

Even so, the time and cost-saving benefits of automated indexing are hard to overlook. In response, some organizations have begun to deploy architectures that automate pieces of the indexing process, while maintaining a workflow that accommodates the unique and custom requirements of indexers. This architecture, or set of tools enhanced by manual workflow processes, is referred to as machine aided indexing, or MAI.

Designed to make human indexers more efficient and consistent, MAI uses an application framework that provides content workers with an interactive, intuitive tool for determining the precise subject area of documents’ content, and identifying keywords and terms for various uses.

The key is to design a combination of automated software components and manual workflow to manage the tagging process, while also ensuring that subject matter experts remain an integral part of the process.

“One cannot overlook the importance of subject matter experts who make a value-based judgment on classifying articles and other materials in ways that will pinpoint readers to the information they are after,” says Jan Palmen, vice president of Innodata Isogen’s Publishing Practice. “This is particularly important for primary or secondary publishers who need subject matter experts to insert applicable terms and keywords that will aid readers to exactly locate an article or report for download.”

In addition, trained professionals who have expertise in a particular topic also bring advanced language skills to the table, such as the ability to interpret phrases from different languages as well as to understand issues in context so they can add index terms that might not necessarily be found in the original material. Another complication with indexing content written in different languages is that nuances in sentence structure require an innate understanding of the material.

But with all automated and manual processes, it’s important to design a workflow that gives subject matter experts opportunities to modify and correct the system. A key step is to establish ground rules that build in checkpoints to ensure that the indexing process is accurate.

In fact, companies adopting machine aided indexing approaches may want to follow these four tips for ensuring that subject matter experts play a key role:

  • Identify experts within the organization who can set clearly defined rules for establishing taxonomies and ontologies
  • Implement clearly governed authority processes for enhancements and additions to existing taxonomies
  • Develop a pilot to test the workflow and ensure that experts can modify the process if necessary
  • Once the process begins, establish checkpoints that allow the subject matter experts to check back regularly to verify the accuracy of the keywords and metadata
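
As a rough illustration of how an automated step and an expert checkpoint might fit together, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the vocabulary, the `IndexRecord` structure and the function names are hypothetical and do not describe any vendor's product. The machine proposes terms from a managed list, and the subject matter expert then rejects wrong suggestions and adds terms the software missed.

```python
from dataclasses import dataclass, field


@dataclass
class IndexRecord:
    """Index terms for one document, split by review status."""
    doc_id: str
    suggested: list = field(default_factory=list)  # proposed by the machine
    approved: list = field(default_factory=list)   # confirmed or added by an expert


def suggest_terms(text, vocabulary):
    """Automated step: propose each preferred term whose trigger phrase appears.

    Plain substring matching is used for brevity; it can over-match
    (a short trigger may appear inside a longer word).
    """
    lowered = text.lower()
    return sorted(
        term for term, phrases in vocabulary.items()
        if any(phrase in lowered for phrase in phrases)
    )


def expert_review(record, reject=(), add=()):
    """Manual checkpoint: drop wrong suggestions and add missing terms."""
    kept = [term for term in record.suggested if term not in reject]
    record.approved = sorted(set(kept) | set(add))
    return record


# Hypothetical vocabulary: preferred term -> phrases that trigger a suggestion.
VOCABULARY = {
    "content management": ["content management", "cms"],
    "digitization": ["digitization", "scanning"],
    "localization": ["localization", "translation"],
}

record = IndexRecord(
    doc_id="doc-001",
    suggested=suggest_terms("Scanning archives before translation begins.", VOCABULARY),
)
expert_review(record, reject=["localization"], add=["records management"])
print(record.suggested)  # ['digitization', 'localization']
print(record.approved)   # ['digitization', 'records management']
```

Keeping the machine's suggestions separate from the approved terms leaves a record that experts can revisit at later checkpoints to verify how accurate the automated step has been.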

On the other end of the indexing spectrum, some enterprises are considering plans to build on the popularity of community-driven social networking sites like del.icio.us and Flickr by allowing users to tag content on their corporate sites.

While that may be an enticing prospect for some companies, EPS’ Penny warns that there are “inherent dangers in viewing user-generated metadata as a long-term solution.”

He cited a recent EPS article that revealed that a core group of 10 users had posted and tagged nearly 30 percent of the articles on digg.com. “Choosing good stories is a skill few people possess,” says Penny, “and this just demonstrates why organizations need to dedicate people to tagging content and ensure that it remains fresh and relevant to readers.”

For organizations that want to embrace the social networking approach for their sites, Penny suggests that they designate experts to create controlled vocabularies and taxonomies. This same strategy could also be extended to information providers that want to incorporate “folksonomies” or user-created taxonomies into their search architectures.

They can select editors from the public to serve as über taggers who will revisit the taxonomy and revise as necessary to ensure that data and content are characterized correctly. By providing clearly defined taxonomies, they can help users refine their search by adding their own terms, but within a carefully defined context.

“Trying to overcome information overload is a pressing concern for all information providers and publishers,” says Penny. “And in that sense, adapting a flexible approach that combines the best of both automated and manual approaches makes the most sense.”

Ontologies, RDF/OWL and Corporate Data


By Bob DuCharme, Innodata Isogen

RDF/OWL, or the Resource Description Framework Web Ontology Language, is gaining ground among tool developers as a W3C standard for storing ontologies. An increasing number of commercial and free RDF/OWL development tools are becoming available and more organizations appear to be eager to take advantage of RDF/OWL’s ability to describe and extract metadata stored within corporate databases and internal and external web sites. However, before embracing this new search option, many organizations may also need to take a hard look at their approaches for creating their taxonomies and related ontologies. In this article, Innodata Isogen’s Bob DuCharme offers some helpful tips for organizations seeking to improve the way they store data about data.

Businesses often need to assign information in the form of news articles, emails, or catalog entries to one or more categories to make it easier to search for them in a database. Someone in the workflow typically examines each item and enters a category name into a field on a screen, which is then associated with that item in an XML file or a relational database. A customer or another publishing process can then retrieve a group of items with something in common—for example, all catalog entries in the "outdoor furniture" category—because those entries have been assigned to a particular category.

The people doing the classifying, however, should rarely be allowed to make up their own category names as they go along, because an unmanaged list could lead to content falling through the cracks. For example, if one classifier assigns Adirondack chairs to "outdoor furniture" and another assigns a cedar bench to "garden furniture," then a customer who wants something to put in the garden may not know that Adirondack chairs are an option. To manage this, many organizations maintain a specific list of category names for classifiers to pick from, so that Adirondack chairs and cedar benches end up in the same category. This managed list, which we call a controlled vocabulary, can be a list of products or as simple as a list of words such as "yes, no, maybe" or "male, female."
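
A controlled vocabulary can be enforced at the point of data entry. The following Python sketch is only an illustration: the category list and the `assign_category` function are hypothetical, not part of any particular system. It accepts a category name only if it appears in the managed list.

```python
# Hypothetical controlled vocabulary for a hardware store catalog.
CONTROLLED_VOCABULARY = frozenset(
    {"outdoor furniture", "plants", "fertilizer", "hammers", "saws"}
)


def assign_category(catalog, item, category):
    """Record a category for an item, rejecting names outside the managed list."""
    normalized = category.strip().lower()
    if normalized not in CONTROLLED_VOCABULARY:
        raise ValueError(f"{category!r} is not in the controlled vocabulary")
    catalog.setdefault(normalized, []).append(item)


catalog = {}
assign_category(catalog, "Adirondack chair", "outdoor furniture")

try:
    # An ad hoc name is refused instead of silently creating a second category.
    assign_category(catalog, "cedar bench", "garden furniture")
except ValueError as err:
    print(err)

assign_category(catalog, "cedar bench", "Outdoor Furniture")
print(catalog["outdoor furniture"])  # ['Adirondack chair', 'cedar bench']
```

Because the ad hoc name is rejected, the classifier has to pick the managed term, and both items end up where a customer will find them together.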

Organizations that have a lot of categories might want to group the categories themselves into their own categories. A hardware store catalog might group the plant, fertilizer and outdoor furniture categories under "gardening" and the hammer and saw categories under "tools." Like the depth of a hard disk's directory structure, the number of levels of categorization depends on the complexity and variety of the items being categorized. If no single category has more than a handful of sub-categories, this hierarchical relationship can ease the job of both the category assigner and, more importantly, the customer trying to find a specific piece of content. When a controlled vocabulary clearly defines the hierarchical relationship, we call this a taxonomy.
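
One simple way to picture a taxonomy is as a table that records each category's parent. The Python sketch below uses the hardware store example; the `PARENT` table and both helper functions are hypothetical names chosen for illustration.

```python
# Hypothetical taxonomy: each category maps to its parent
# (None marks a top-level category).
PARENT = {
    "gardening": None,
    "tools": None,
    "plants": "gardening",
    "fertilizer": "gardening",
    "outdoor furniture": "gardening",
    "hammers": "tools",
    "saws": "tools",
}


def ancestors(category):
    """Return the chain of broader categories, nearest first."""
    chain = []
    parent = PARENT[category]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def descendants(category):
    """Return every narrower category at any depth."""
    found = []
    for child, parent in PARENT.items():
        if parent == category:
            found.append(child)
            found.extend(descendants(child))
    return found


print(ancestors("outdoor furniture"))     # ['gardening']
print(sorted(descendants("gardening")))   # ['fertilizer', 'outdoor furniture', 'plants']
```

With this structure, a search for "gardening" can be widened to every narrower category beneath it without the customer needing to know their names.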

While a controlled vocabulary tells us no more than the names of its members, a taxonomy uses its hierarchical structure to tell us about the relationships between some of its members—for example, that a cedar bench and an Adirondack chair are both outdoor furniture, and that outdoor furniture and fertilizers are both used in gardens. Ontologies go further than taxonomies by letting people describe even more about the relationships between the classes of things that we’re tracking—for example, that the two categories “duct tape” and “gaffer tape” have the same meaning, or that items in the patio lamps category require items from the bulb category to function.
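
The relationships described above can be written down as simple subject, predicate, object statements, which is the basic shape of RDF data. The Python sketch below is a simplification under stated assumptions: `owl:equivalentClass` and `rdfs:subClassOf` are real OWL and RDFS terms, but the `ex:` class names and the `ex:requires` property are hypothetical, and a full OWL model would probably express "requires" with a property restriction rather than a direct statement between classes.

```python
# Statements as (subject, predicate, object) triples.
# owl:equivalentClass and rdfs:subClassOf are real OWL/RDFS terms;
# ex:requires and all ex: class names are hypothetical.
TRIPLES = {
    ("ex:DuctTape", "owl:equivalentClass", "ex:GafferTape"),
    ("ex:PatioLamp", "ex:requires", "ex:Bulb"),
    ("ex:AdirondackChair", "rdfs:subClassOf", "ex:OutdoorFurniture"),
    ("ex:CedarBench", "rdfs:subClassOf", "ex:OutdoorFurniture"),
}


def objects(subject, predicate):
    """Return everything the subject is linked to by the given predicate."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


def equivalents(cls):
    """Return classes declared equivalent to cls, in either direction."""
    found = set()
    for s, p, o in TRIPLES:
        if p == "owl:equivalentClass":
            if s == cls:
                found.add(o)
            elif o == cls:
                found.add(s)
    return sorted(found)


print(equivalents("ex:GafferTape"))            # ['ex:DuctTape']
print(objects("ex:PatioLamp", "ex:requires"))  # ['ex:Bulb']
```

A search application could use the equivalence statement to return duct tape results for a gaffer tape query, and the "requires" statement to suggest bulbs alongside patio lamps.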

Businesses use ontologies for more than just tracking relationships among the categories in complex controlled vocabularies. For example, some use ontologies to track relationships between data in different databases. This makes it easier to integrate those databases, and to get better, more centralized control over the knowledge that they're managing. For example, after two companies merge, the combined company may have one database where a purchase order field is called "po" and another database where it's called "p_order". An ontology can specify that these two fields represent the same thing, making it possible for a single query to look through both when someone needs to search purchase orders. Some companies are also developing enterprise-wide ontologies of their business vocabulary to ensure that everyone means the same thing when they use a given term, making it easier for different divisions to work together without misunderstanding the terms of their relationship.
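
The purchase order example can be sketched in a few lines of Python using the standard `sqlite3` module. This is an illustration only: the two tables stand in for the two merged companies' databases, and the table names, sample rows and `FIELD_MAP` structure are hypothetical. The mapping plays the role of the ontology statement that "po" and "p_order" represent the same thing.

```python
import sqlite3

# Hypothetical mapping: one shared concept -> (table, column) in each legacy schema.
FIELD_MAP = {
    "purchase_order": [("orders_a", "po"), ("orders_b", "p_order")],
}

# Two tables in one in-memory database stand in for two separate databases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_a (po TEXT, customer TEXT);
    CREATE TABLE orders_b (p_order TEXT, customer TEXT);
    INSERT INTO orders_a VALUES ('PO-1001', 'Acme');
    INSERT INTO orders_b VALUES ('PO-2002', 'Globex');
""")


def find_by_concept(concept, value):
    """Search every column mapped to the concept; return (table, customer) hits."""
    hits = []
    for table, column in FIELD_MAP[concept]:
        # Identifiers come from the trusted mapping above;
        # the search value is bound as a query parameter.
        query = f"SELECT customer FROM {table} WHERE {column} = ?"
        hits.extend((table, row[0]) for row in conn.execute(query, (value,)))
    return hits


print(find_by_concept("purchase_order", "PO-2002"))  # [('orders_b', 'Globex')]
```

The person searching asks about purchase orders once and does not need to know which name each legacy schema uses.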

The first rule of ontology use is to check whether an existing one can serve or be adapted to various needs before creating a new one from scratch. More and more ontologies are being defined using the W3C standard RDF/OWL, which builds on the RDF metadata language to let developers define the classes of metadata that make up an ontology. Both free and commercial tools such as the University of Maryland's Swoop, Stanford University's Protégé, and TopQuadrant's TopBraid Composer are available to design and maintain RDF/OWL ontologies. RDF/OWL doesn’t lock systems into a dependency on one of these tools, however, because each can read and write ontology information created by the others. Libraries and tools are also available to put these ontologies to use—for example, to generate ontologies based on existing relational databases—so that a user can query the databases from tools that can read those ontologies.

These ontology development tools are helpful enough that many organizations use them to design and maintain taxonomies without getting into the extra power and complexity of ontology design. This is often a good first step, as these organizations work out the interaction of their controlled vocabulary with the systems, people, and workflows that need them. Once this is settled, they can begin to explore the extra power that the more sophisticated features of an ontology development tool can offer as they build on their taxonomy so that it can give them greater control over the data and metadata in their organization.

Innodata Isogen in the News


Innodata Isogen at Online Information 2006

Innodata Isogen hosted a series of workshops on editorial services outsourcing, large-scale digitization and on-demand composition services at its booth at Online Information 2006. The event, held Nov. 28 through 30 at the Olympia Grand Hall, London, annually draws some 11,000 information professionals from around the world.

In addition to the workshops, the Innodata Isogen booth also hosted presentations by partners Mark Logic, Tizra and Ontopia and industry analyst Electronic Publishing Services. EPS was also joined by representatives from the Software & Information Industry Association (SIIA).

A speech by Jack Abuhoff, Innodata Isogen CEO and chairman, on the topic of “Operational Excellence as Competitive Advantage in Publishing” highlighted the company's presentations to the general conference. Rich Schochler, a senior analyst with Innodata Isogen, delivered a presentation on “Maximizing Economic Value from Large-Scale Digitization Projects” and also on “Managing Value through Editorial Services Outsourcing.”

To view Mr. Schochler's presentations, click here. A podcast of Mr. Abuhoff's presentation will be available in early December. For more information about the show, contact Carolyn Muzyka at (201) 371-2522 or cmuzyka@innodata-isogen.com.

Innodata Isogen Named to EContent 100 for Second Consecutive Year

For the second consecutive year, Innodata Isogen, a leading provider of content-focused IT and BPO services, has been named to EContent magazine’s EContent 100, a prestigious compilation of the leading solutions providers in 13 categories, including content creation, content management and digital publishing.

"The EContent 100 recognizes leadership, expertise and innovation in the marketplace," said Michelle Manafy, editor of EContent. "With its broad range of content-related IT and BPO services, Innodata Isogen has established itself as a leader in its space."

Award winners range from publishing and information services giants such as ProQuest, Reed Elsevier, Reuters and EBSCO to technology innovators such as Google, IBM, Socialtext and Fast Search & Transfer.

"Our repeated selection to the EContent 100 is partial testament to the hard work, skill and enterprise of Innodata Isogen's talented knowledge workers," said Al Girardi, vice president of Marketing at Innodata Isogen. "Our reputation for methodical quality and continuous innovation in knowledge-related services and systems is a prize we've earned."

Jack Abuhoff offers tips on editorial services outsourcing in Global Services Magazine

The editorial services outsourcing market, pegged at $500 million in 2006, is expected to grow to more than $2 billion within five years, according to a cover story in Global Services magazine, which also featured an interview with Innodata Isogen CEO and Chairman Jack Abuhoff.

Turning to offshore providers does more than just save wages, Abuhoff said, noting that Innodata Isogen breaks the editorial process “down into its individual parts, eliminating any unnecessary stages and introducing new technology.”

Schochler describes evolution of DITA for TheContentWrangler

In an interview with TheContentWrangler, Rick Schochler, Innodata Isogen senior analyst, recounted Innodata Isogen’s role in the evolution of Darwin Information Typing Architecture. DITA is a topic-based, modular approach to authoring that is particularly effective for technical data.

As a member of the OASIS committee that developed DITA, Innodata Isogen has been involved with the technology from the outset. Schochler stressed the importance of adopting a tool-neutral format such as XML before choosing the content structure that best fits a project's requirements for single-source publishing.

Palmen talks on localization at SIIA global summit

Jan Palmen, vice president of Innodata Isogen’s Publishing Practice, delivered a presentation on localization at the Global Information Industry Summit of the Software & Information Industry Association (SIIA) in September. Palmen offered real world examples of how companies improved their localization efforts by reengineering their content supply chain. The event was SIIA’s first major conference in Europe.

For more information, visit SIIA’s Web site.

In the News


Asia digital media trend multiplies by the blogs

A rampant rise in interest in blogging in Japan suggests that the real growth area for the blogosphere is in Asia, according to a recent “Insight” column by Electronic Publishing Services’ Paul Woodward.

Some 20 percent of Japanese regularly read blogs and the remainder are being served by “blooks” or books that are based on blogs, he said, quoting a Wall Street Journal article. The recent publishing of “Demon Wife Diaries” is an example of how old media is responding to the new Internet-based media. The “blook,” based on a blogger writing as Kazuma about his unhappy marriage, also has spun off a TV series, comic book and video game.

Close on the heels of Japan’s trend is China, where about 17.5 million regular bloggers are followed by some 55 million readers.

Community sites are a small, small world

Publishers aiming to create Web-based community sites may need to offer incentives to attract the core group of super-users that develop and maintain the content, Electronic Publishing Services Associate Majied Robinson suggests in a recent Insight column.

Duggtrends.com, the news site where members submit and rate news stories, revealed that only 0.5 percent of its 450,000 members create content. Wikipedia founder Jimbo Wales similarly found that nearly 74 percent of all edits to the online encyclopedia are performed by 2 percent of its members, or about 1,400 persons. Some 98 percent of MySpace traffic is visitors reading, not contributing, profiles.

The battle to attract core users has already begun. Netscape-owned Weblogs Inc. claims to have wooed several core contributors from Duggtrends.com by offering them $1,000 a month to build and maintain content on its newly launched user-selected news service.

The ease of establishing community-based sites and the value that they can create make them attractive to aspiring publishers. YouTube, the video-sharing site, for example, was acquired recently by Google for $1.65 billion, less than two years after its February 2005 startup. However, publishers must be aware that a site concept may be too niche-focused to draw a critical mass of super-users, who may need incentives to be drawn to more mainstream concepts.

Google digitizing projects extend globally and beyond books

Google’s worldwide digitizing goals are growing wider with the creation of a Web site that pulls together its books, video, mapping and blogging services. The site is intended to help teachers and educational organizations share resources in an effort to fight illiteracy.

Complutense University of Madrid also recently became the first library in a non-English-speaking country to join Google’s project to scan every book in print. The library contains 3 million works in Spanish, French, German, Latin, Italian and English. The move vastly extends the project’s global scope, since more than 400 million people speak Spanish. Google said it aims to make the complete texts of all public domain books available online. The digitizing project is controversial and is the focus of several lawsuits by author and publisher groups trying to block it.

In addition to books, the new Literacy Project Web site is encouraging teachers to upload videos of successful classroom activities. Among the earliest to post material was an Indian organization that uses subtitled Bollywood films to teach reading.

Partner Corner


Astoria offers expanded DITA support

Astoria Software’s XML content management solutions now support the Darwin Information Typing Architecture (DITA) standard. Astoria’s support for DITA includes all document types specified by DITA, including topic, task, reference, concept and Ditabase. It also supports standard table models and custom or semantic tables.

The software supports DITA’s reuse model in which content can be referenced and reused from any other file along with full referential integrity. Astoria offers element level control and reuse, maximizing the benefits organizations can achieve by implementing the DITA standard.

TEMIS launches LUXID based on UIMA standard

TEMIS’ new Luxid information intelligence solution is based on the Unstructured Information Management Architecture. UIMA provides a standard framework with interoperability that allows the use of a combination of multiple technologies, enabling more powerful, cutting-edge information retrieval and analytics.

Luxid’s refinements enable organizations to easily integrate the product into their enterprise content management and information retrieval applications. The product is fully compliant with the IBM OmniFind enterprise search and discovery platform, enabling searches to be enhanced with semantics. It also allows Luxid to take advantage of any third-party UIMA-compliant annotators.

Upcoming Events


Sep 9 - Sep 11, 2008  Software and Information Industry Association (SIIA) Global Information Industry Summit
Join senior information and publishing executives from Europe, North America and India to gain insight on the global strategies of market leaders, identify new markets well-suited for your company, and meet the partners positioned to help you succeed.

Oct 16 - Oct 19, 2008  Frankfurt Book Fair
As the world's largest marketplace for trading in publishing rights and licenses, the Frankfurt Book Fair attracts many of the industry's leading figures, from authors and publishers to booksellers and art dealers.

Dec 2 - Dec 4, 2008  Online Information 08
The world's leading event for online content and information management solutions.

Additional Resources

Visit our Knowledge Center, a continually updated archive on all aspects of content creation, management and distribution.

KNOWLEDGEWISE Issues

KNOWLEDGEWISE, a report on content and knowledge management trends, knowledge services and publishing technologies from the work process and technology experts at Innodata Isogen.

Volume 3, Issue 2
Volume 3, Issue 1
Volume 2, Issue 3
Volume 2, Issue 2
Volume 2, Issue 1
Volume 1, Issue 2
Volume 1, Issue 1

Contact Us
(201) 371-2828
Request Information