Friday, May 6, 2011

Python to get Media Metadata

At Babo Labs, we're interested in eliminating work for our digital merchants by providing them with enabling technologies. An enabling technology is one that helps a user complete a task more productively and efficiently while minimizing intrusiveness and inconvenience. One example is Google's instant search bar, which shows search results in real time as you type your query (statistics show this service saves 2-5 seconds per query on average).

One way our social e-commerce platform, Babolog, accomplishes this is by passively and dynamically collecting metadata about the digital media files our merchants upload, and then displaying these specifications to their customers.

Over the past month, Stephen and I have tested a variety of Python modules for extracting metadata from media files. Here are a few good ones:

1. kaa.metadata (freevo multimedia kaa metadata package)
    • Documentation
    • Installation: Ubuntu apt install
      • sudo apt-get install python-kaa-metadata 
import kaa.metadata

def getKaaMetadata(filepath):
    meta = kaa.metadata.parse(filepath)
    print meta
    return meta 

2. pyPdf (for pdf files)
    • Documentation
    • Installation: easy_install
      • sudo python -m easy_install pypdf
from pyPdf import PdfFileReader

def getPdfMetadata(filename):
    # Open in binary mode; pyPdf reads from the stream on demand.
    pdf = PdfFileReader(open(filename, "rb"))
    basic_info = pdf.getDocumentInfo()
    preview = []

    # Collect the top-level outline (bookmark) entries as a rough preview.
    for outline in pdf.outlines:
        preview.append(outline)

    return basic_info, preview

3. ID3 (for mp3 ID3 metadata)
    • Documentation
    • Installation: Ubuntu apt install
      • sudo apt-get install python-id3
# -*- coding: utf-8 -*-
from ID3 import *

try:
    id3info = ID3('/some/file/moxy.mp3')
    print id3info
    # Tags can be read and assigned like dictionary entries.
    id3info['TITLE'] = "Green Eggs and Ham"
    id3info['ARTIST'] = "Moxy Früvous"
    for k, v in id3info.items():
        print k, ":", v
except InvalidTagError, message:
    print "Invalid ID3 tag:", message
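
For readers who cannot install python-id3, the older ID3v1 format is simple enough to read with the standard library alone: the tag is the last 128 bytes of the file, starting with the marker `TAG`, followed by fixed-width title, artist, album, year, comment, and genre fields. The sketch below is not the ID3 module's implementation; `read_id3v1` and `build_id3v1_tag` are hypothetical helpers, the code targets Python 3, and Latin-1 decoding is an assumption that will not hold for every file.

```python
import os
import tempfile


def read_id3v1(path):
    """Return a dict of ID3v1 fields, or None if the file has no ID3v1 tag."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < 128:
            return None
        f.seek(-128, os.SEEK_END)
        tag = f.read(128)
    if tag[:3] != b"TAG":
        return None

    def text(raw):
        # Fields are padded with NUL bytes or spaces; Latin-1 is assumed.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "TITLE": text(tag[3:33]),
        "ARTIST": text(tag[33:63]),
        "ALBUM": text(tag[63:93]),
        "YEAR": text(tag[93:97]),
        "COMMENT": text(tag[97:127]),
        "GENRE": ord(tag[127:128]),  # numeric genre code
    }


def build_id3v1_tag(title, artist):
    """Build a 128-byte ID3v1 tag, for demonstration only."""
    def pad(value, width):
        return value.encode("latin-1")[:width].ljust(width, b"\x00")

    return (b"TAG" + pad(title, 30) + pad(artist, 30) + pad("", 30)
            + pad("", 4) + pad("", 30) + b"\xff")


# Demo on a throwaway file: 256 filler bytes stand in for the audio data.
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as demo:
    demo.write(b"\x00" * 256 + build_id3v1_tag("Green Eggs and Ham", "Moxy Früvous"))
demo_info = read_id3v1(demo.name)
os.remove(demo.name)
print(demo_info)
```

Files tagged only with ID3v2 (stored at the start of the file) will return None here, which is one reason a dedicated module is still the better choice in practice.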

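4. Magic (MIME inference)
import magic

def getMimeType(filename):
    """Note that the magic package has been marked as deprecated.
    We still find it useful for our needs."""
    # Assumes the libmagic Python binding's magic.open() API.
    m = magic.open(magic.MAGIC_MIME)
    m.load()
    return m.file(filename)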
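
Since the magic package is marked as deprecated, it can be handy to have a fallback that needs nothing beyond the standard library. The sketch below uses Python's `mimetypes` module, which is a different technique from magic: it guesses from the file extension only and never inspects the file's contents, so a misnamed file will be misreported. `guess_mime_type` and its `default` parameter are hypothetical names for illustration.

```python
import mimetypes


def guess_mime_type(filename, default="application/octet-stream"):
    """Guess a MIME type from the file extension alone (contents are not read)."""
    mime, _encoding = mimetypes.guess_type(filename)
    # guess_type returns None for unknown extensions; use the default instead.
    return mime or default


for name in ("moxy.mp3", "paper.pdf", "mystery"):
    print(name, "->", guess_mime_type(name))
```

A reasonable arrangement is to try content-based detection first and fall back to the extension only when that is unavailable.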

We've found the kaa.metadata module to be pretty __awesome__. It provides valuable metadata for a variety of file formats and media types, including jpg, avi, and mp3 (with ID3 and EXIF data). It's a great tool if you are looking for an easy all-in-one solution. For our purposes, we parse the results of several of these modules to obtain a solution tailored to our platform.
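
Combining the results of several modules can be sketched roughly as follows. This is not the Babolog implementation, only an illustration under stated assumptions: each extractor is a function that takes a file path and returns a dict, extractors are registered per MIME-type prefix, and earlier extractors take priority over later ones. `collect_metadata`, `EXTRACTORS`, and the `fake_*` functions are hypothetical names, and the code is written for Python 3.

```python
def collect_metadata(filepath, mime_type, extractors):
    """Merge the output of every extractor registered for mime_type.

    `extractors` maps a MIME-type prefix (e.g. "audio/") to a list of
    callables, each taking a file path and returning a dict.
    """
    merged = {}
    for prefix, funcs in extractors.items():
        if not mime_type.startswith(prefix):
            continue
        for func in funcs:
            try:
                result = func(filepath) or {}
            except Exception as error:
                # One failing module should not block the others.
                merged.setdefault("errors", []).append(
                    "%s: %s" % (func.__name__, error))
                continue
            for key, value in result.items():
                # Earlier extractors win; later ones only fill gaps.
                merged.setdefault(key, value)
    return merged


# Hypothetical extractors standing in for the real modules.
def fake_kaa(path):
    return {"title": "Green Eggs and Ham", "length": 183}


def fake_id3(path):
    return {"title": "green eggs", "artist": "Moxy Früvous"}


def broken(path):
    raise IOError("unreadable")


EXTRACTORS = {"audio/": [fake_kaa, fake_id3, broken]}
result = collect_metadata("/some/file/moxy.mp3", "audio/mpeg", EXTRACTORS)
print(result)
```

Keeping failures in an "errors" list rather than raising lets an upload succeed even when one module cannot parse the file.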

If you'd like to learn more about getting metadata for a specific media type, or more about what metadata these modules can fetch, just leave a comment!

- Michael E. Karpeles
- Stephen A. Balaban
