
8 Benefits of having a Meta Data Dictionary for your Search Engine

When you’re deploying a search engine – big or small, for intranets or web sites – it’s worth creating a meta data dictionary (or meta data registry) and making sure you stick to it. This is important for many reasons, and matters most when content is being indexed into the search engine from more than one source, or when there will be a large amount of content. This article explains why, and describes eight of the benefits you can expect.

What’s a meta data dictionary?

Whenever content is indexed into a search engine, some of the information retrieved is meta data. Meta data usually consists of items such as author names, date of creation, location of content, and so on.

This list of meta data stored by a content system can get quite long – in a search engine where it comes together from several places, it can be even longer.

A meta data dictionary (sometimes called a meta data registry) is simply a definition of each of the meta fields in the search system, and what each one means for the content in the engine. It is much like any other kind of data dictionary.

Creating a meta dictionary involves defining a set of standard meta field names that are applicable to some or all pieces of content in the search engine. This allows each of the meta items available from a content source to be mapped into this standard name space where possible, and identifies an opportunity to extend the name space wherever additional meta data appears.

What’s the point?

There’s little point in creating a meta dictionary if, every time content is indexed (pulled) into the search engine, a new set of meta fields is created that is unique to the content source. The point of the meta dictionary is that it gives search engineers a guide for deciding where meta items in new content should be stored in the search engine.
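
As a rough sketch of that guide in use, the Python below renames the fields of two content sources to one standard set. The source names (`wiki`, `dms`), every field name, and the `to_standard` helper are invented for illustration and don’t come from any particular search engine.

```python
# Hypothetical per-source mappings: source field name -> standard field name.
FIELD_MAP = {
    "wiki": {"page_title": "title", "last_editor": "publisher", "url": "id"},
    "dms": {"DocName": "title", "CreatedBy": "creator", "Path": "id"},
}


def to_standard(source, record):
    """Rename a source record's fields to the standard names.

    Returns (mapped, unmapped). Unmapped fields are candidates for either
    extending the dictionary or being left out of the index.
    """
    mapping = FIELD_MAP[source]
    mapped = {"source": source}
    unmapped = {}
    for name, value in record.items():
        if name in mapping:
            mapped[mapping[name]] = value
        else:
            unmapped[name] = value
    return mapped, unmapped


mapped, unmapped = to_standard(
    "dms",
    {"DocName": "Q3 plan", "CreatedBy": "jsmith", "Path": "/docs/q3", "Colour": "red"},
)
print(mapped)
print(unmapped)
```

Anything that lands in `unmapped` is a prompt for a decision rather than a field that silently appears in the index.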

What’s the format of a meta dictionary?

There’s no standard format that I’m aware of to represent a meta dictionary (although I suppose you could technically mark one up in XSD or something like that). 

A meta dictionary should identify the name of the meta field, the valid values that can fall into the field where appropriate (the vocabulary), and what encoding is used (also very important). A clear, human-readable description of the field should be included too. Some fields might be required, and the dictionary should identify these as well.
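
Since there’s no standard format, here is one possible shape for a dictionary entry, sketched in Python. The `MetaField` class, the `problems` helper, the three sample entries and the choice of two-letter language codes are all my own assumptions for illustration, not a format any search engine prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetaField:
    name: str
    description: str
    encoding: str = "UTF-8 text"
    required: bool = False
    vocabulary: Optional[frozenset] = None  # None means free text

    def accepts(self, value):
        """True if the value is in the vocabulary, or there is no vocabulary."""
        return self.vocabulary is None or value in self.vocabulary


# Hypothetical dictionary with three entries, keyed by field name.
DICTIONARY = {
    f.name: f
    for f in [
        MetaField("id", "URI of the entry in the source system", "URI", required=True),
        MetaField("title", "Title of the document", required=True),
        MetaField("language", "Primary language of the document",
                  "two-letter language code", vocabulary=frozenset({"en", "fr", "de"})),
    ]
}


def problems(record):
    """List the ways a record breaks the dictionary."""
    found = []
    for name, spec in DICTIONARY.items():
        if spec.required and name not in record:
            found.append(f"missing required field: {name}")
    for name, value in record.items():
        if name not in DICTIONARY:
            found.append(f"unknown field: {name}")
        elif not DICTIONARY[name].accepts(value):
            found.append(f"bad value for {name}: {value!r}")
    return found


print(problems({"title": "Q3 plan", "language": "xx"}))
```

Keeping the description next to the name, vocabulary and encoding means the same entry can serve as documentation for people and as a check at indexing time.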

Examples of meta fields

The core set of meta data in your dictionary is likely to cover:

  • a unique identifier for the content (a URI to the entry in the source system, probably, as this is normally required in any case for a web based search tool)
  • the title of the document
  • an identifier for the source system from which the content was retrieved
  • the type of the content (file type such as HTML, Adobe PDF or Microsoft Word document – a MIME type is usually used in this field)

Other examples include the date on which the content was published, or the language each document is written in (many documents contain two languages, but there is usually a primary language).

Adding new meta

Perhaps another field exists that identifies the subject of the content. In this case, the subject identifies that a document belongs to a particular classification of content in a system. This is different from the title, in that it would use some kind of controlled vocabulary to determine the subject – there’s a finite list of subjects the content can relate to. The search engineer identifies that this type of meta information is not currently represented in the dictionary. A meaningful discussion can then be had about whether this information will be important to anyone using the search system. For example, would a future service or interface benefit if the search results could be filtered or displayed with this additional data? If so, the meta dictionary is extended with a new name to document where this type of information is stored. Some kind of standard vocabulary and encoding is identified (either created or reused from somewhere) for the field, and the dictionary is extended to show mappings between the content system’s values for a subject and the new standard vocabulary. Remember that this meta data field might be reused across multiple systems, so make sure the vocabulary will meet your needs in the future as well.

If this information is not useful in the search engine, filter it out from being pulled into the search engine – leave it in the content system where it may well have a use.
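
To make the two outcomes concrete, here is a small Python sketch: a subject value is translated into the standard vocabulary, and fields judged not useful are filtered out before indexing. The vocabulary, the subject labels, the ignored field names and the `prepare` function are all hypothetical.

```python
# Hypothetical standard vocabulary for the new "subject" field.
SUBJECTS = {"finance", "hr", "engineering"}

# Hypothetical mapping from one content system's labels to that vocabulary.
SUBJECT_MAP = {"Accounts": "finance", "Payroll": "finance", "People Team": "hr"}

# Hypothetical source fields judged not useful for search; they stay in the source.
IGNORED = {"workflow_state", "lock_owner"}


def prepare(record):
    """Drop ignored fields and translate the subject into the standard vocabulary."""
    out = {k: v for k, v in record.items() if k not in IGNORED}
    if "subject" in out:
        subject = SUBJECT_MAP.get(out["subject"])
        if subject not in SUBJECTS:
            raise ValueError(f"no standard subject for {out['subject']!r}")
        out["subject"] = subject
    return out


print(prepare({"title": "Pay run", "subject": "Payroll", "lock_owner": "jsmith"}))
```

Raising an error on an unmapped subject is one design choice among several; it forces the vocabulary question to be answered when the content is indexed rather than discovered later in search results.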

There are lots of ways to think about meta information and determine whether it’s going to be useful, but that would be the topic of another post entirely.

The author problem

A good way to see how important it is to define what meta data means is to think about the author of a document. If you’re indexing a content system, or even a web site like a blog – anywhere there’s an actual author or creator of the content – what does that mean? So often, the author really means the last person to upload a document or piece of content. In a blog, the author tends to be the original writer, and anyone else is an editor or contributor. Why isn’t that true in a content system? Perhaps it is in yours. But let’s be clear, there’s a big difference between the original author or creator, a contributor, an editor, and the publisher (or the last person to upload content). It might suffice, in your content system, to define the contributor, editor and publisher as the same person – but unless you make that clear, that person will more often than not be taken for the original author.

These distinctions only become really important when I’m searching for content. If I want content written or created by a certain individual, I really don’t want stuff that was just uploaded by them.

By clearly defining what these different meta fields mean, and mapping meta data in a content system into these fields, I’ve taken the hassle out of leveraging the search system and making use of the data. If I leave those decisions and mapping processes until later, I might not even be able to draw the distinction in the search system between one type of author and another.
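
A short Python sketch of the idea, assuming two hypothetical sources that both expose a field called `author` but mean different things by it. The source names, field names and the `people` helper are made up for illustration.

```python
# What each hypothetical source really means by its identity fields,
# expressed as source field name -> standard role name.
ROLE_MAP = {
    "blog": {"author": "creator", "edited_by": "editor"},
    "dms": {"author": "publisher"},  # here "author" is just the last uploader
}


def people(source, record):
    """Return the record's identity fields keyed by standard role name."""
    roles = {}
    for field, role in ROLE_MAP[source].items():
        if field in record:
            roles[role] = record[field]
    return roles


docs = [
    ("blog", {"author": "ana", "edited_by": "raj"}),
    ("dms", {"author": "ana"}),
]

# Only content actually created by ana, not content she merely uploaded.
written_by_ana = [src for src, rec in docs if people(src, rec).get("creator") == "ana"]
print(written_by_ana)
```

Without the mapping, a search on `author` would return both documents and there would be no way to tell them apart.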

Consistent view

At this point, we’ve created a consistent representation of meta information across multiple pieces of content, which might cover multiple content types from multiple systems. 

Why is this so important?

There are many reasons why using and sticking to a meta dictionary is important.

Reducing the number of fields that look the same, and making it clear what each meta field is, has lots of benefits. The obvious results of a reduced set of consistent meta fields relate to simple data management and performance.

  1. Improved load time: the search engine isn’t loading up multiple, often repeated and useless content fields on start up.
  2. Fewer resources required (disk, memory and CPU): less data means less disk! Log files (which can become huge) tend to end up being smaller too. If your search engine is on a different machine from your search interface, the details of the results being pulled across the wire will be smaller – network latency can really slow down performance on large data sets. The search engine needs less memory and CPU for fewer meta fields, there are fewer handles to different meta indexes, and of course any intermediate APIs, services and tools will also be dealing with and passing around less data. All this means less demand on supporting hardware.
  3. Complexity management: when a search engineer, developer, or infrastructure engineer is troubleshooting the system or viewing logs, there’s less complexity to look at. It’s immediately obvious what a field is for (if not, it can quickly be looked up). There’s not a bunch of spurious fields messing up the debug output – often a field might look like the one you need in terms of its content, and even have a similar name, but it could be the developer is using the wrong field entirely.
  4. Long term maintenance: knowing which fields are relevant to search tools and which aren’t, whether a field should be kept up to date, and what information is needed in the case that the source content system that’s being indexed changes, are all important benefits.

Then there are the slightly less obvious reasons.

  1. I’ve seen situations where there are several fields indexed from a content system that contain an identity of someone – all of them similarly named as either author or updater. In this case, it’s hard to know which field to use and what it’s really telling you. Developers spend time figuring out which meta is relevant – often the people running the search engine don’t even know which should be used or what they mean. Make it easy by identifying the one or two that really mean something and defining precisely what those meanings are.
  2. Indexes can be applied to meta fields quickly and sensibly, and will apply to all documents at once. There’s no confusion working out whether we need to index two or three different meta fields (such as three different meta fields that all refer to the document type) just to make searching nice and fast for our primary search tool.
  3. If you’re sensible enough to wrap a search engine in a series of services (or even just the native search engine API), the contract for this service just got a lot simpler. There’s no need now to allow arbitrary fields to be passed or returned if you’d rather have a well defined and fixed contract for supported fields. You know which ones your search engine has in it thanks to the meta dictionary. In addition, any user trying to leverage the search service knows precisely what you mean when you define the author of a document vs. its contributors.
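
The fixed contract in the last point could look something like this Python sketch. The field set, the `build_query` function and the shape of the returned query are assumptions for illustration; a real service would pass the result on to whatever API its search engine provides.

```python
# Hypothetical set of field names published in the meta dictionary.
SUPPORTED_FIELDS = frozenset({"id", "title", "source", "type", "creator", "publisher"})


def build_query(text, filters=None, return_fields=("id", "title")):
    """Build a query description, rejecting any field outside the contract."""
    filters = dict(filters or {})
    unknown = (set(filters) | set(return_fields)) - SUPPORTED_FIELDS
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    return {"text": text, "filters": filters, "fields": list(return_fields)}


print(build_query("budget", {"creator": "ana"}))

try:
    build_query("budget", {"updater": "ana"})
except ValueError as err:
    print(err)
```

A caller who asks for `updater` gets an immediate, explicit error instead of an empty result set from a field that was never indexed.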

And perhaps the biggest reason of all.

  1. It helps define what to ask for. If you have a clear set of required and possible meta fields, then when a new content system is to be indexed, it helps with planning what to pull from that system and what each meta field will really mean. The process of managing and mapping to the meta dictionary forces you to think about what the source system meta content really means, and what we really know about it.
    More often than not, we’re not going to be able to truly say we know what we thought we knew about the source content. Perhaps we actually don’t know the author, just the publisher. Is it okay to call them one and the same? Should we map the meta into a different meta field in the search engine instead?
    The answers to questions like these depend on your use of the search system, the use you will make of your content, how your source systems are used, and on and on. There’s definitely at least one good answer, possibly several good ways to look at it – but at least the meta dictionary forces you to look at it, think about it, and make that decision early on, when it matters and when the meta data still means something and can still be useful to whoever might be able to use it.
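
One way to picture this planning step is a simple gap report for a new source, sketched below in Python. The dictionary contents, the source field names and the `plan_source` function are hypothetical.

```python
# Hypothetical dictionary: standard field name -> is it required?
DICTIONARY_FIELDS = {"id": True, "title": True, "creator": False, "publisher": False}


def plan_source(proposed_mapping, source_fields):
    """Compare a proposed mapping for a new source against the dictionary.

    proposed_mapping: source field name -> standard field name
    source_fields: every field name the source system exposes
    """
    targets = set(proposed_mapping.values())
    return {
        # Required standard fields that nothing maps to yet.
        "missing_required": sorted(
            name for name, required in DICTIONARY_FIELDS.items()
            if required and name not in targets
        ),
        # Mapping targets that are not defined in the dictionary.
        "not_in_dictionary": sorted(targets - set(DICTIONARY_FIELDS)),
        # Source fields nobody has made a decision about.
        "undecided": sorted(set(source_fields) - set(proposed_mapping)),
    }


report = plan_source(
    {"DocName": "title", "UploadedBy": "publisher"},
    ["DocName", "UploadedBy", "Path", "Dept"],
)
print(report)
```

Here the report shows that nothing has been mapped to the required `id` field yet, and that `Path` and `Dept` still need a decision: map them, extend the dictionary, or leave them in the source system.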

In an ideal world, the same thought processes, the same planning and the same meta dictionary are all applied directly to the source systems we’re going to index, before we even index them. In fact, that’s really another secret benefit to having the dictionary defined: perhaps you can convince your data architects to own it and apply it to the whole organization. Data cleanliness and planning are simply good practice – it’s just that meta data frequently gets left out of the scope of concern for data architects, especially because these data fields are usually only applicable to unstructured content systems. The search platform, however, is supposed to offer additional services to the enterprise around finding and presenting information, and often this might be the one opportunity to think ahead and apply these practices and approaches.

Because a search engine is a service overlaying multiple unstructured data systems, it’s in a unique position to impose this consistent structure. In addition, the search engine is often responsible for extracting meta data where none exists in the source platform (or where it’s not readily made available). As such, the meta data planning exercise becomes even more important – the search engine in this case is the source of new meta information (or at least the place where this information is exposed).

So my advice is, plan your meta dictionary now, and save time and headaches in the future.

The next article I write on search will cover good approaches for meta dictionary design and implementation, and include some tips and advice on how best to define your dictionary and name your meta fields.

Are there any other benefits to meta dictionaries that come to mind for you? Have you suffered from messy meta fields in the past, or have you successfully implemented some control? Feel free to share your thoughts, and thanks for reading!
