Within the ISO/IEC 14496 MPEG-4 standard there are several components that make up the structure for the storage of moving images or audio assets. They are all based on and derived from the ISO Base Media File Format, which is published as ISO/IEC 14496-12 and, with technically identical text, as part of the JPEG 2000 standards (ISO/IEC 15444-12).
A container format allows you to combine streams (most of the time audio and video) into one single file.
Well-known multimedia containers include AVI (.avi), MPEG (.mpg, .mpeg), Matroska (.mkv, .mka), OGM (.ogm), QuickTime (.mov) and RealMedia (.rm, .rmvb).
MP4 is the file extension for the official container format defined in the MPEG-4 standard (ISO/IEC 14496-14).
What extensions does MP4 use?
– .mp4: the only official extension, for all audio, video and advanced content files (and combinations)
Other related extensions:
– .m4v: .mp4 files with the WRONG extension, introduced by Apple for video+audio files; .m4v can safely be renamed to .mp4
– .m4a: .mp4 files with the WRONG extension, introduced by Apple for audio-only files; .m4a can safely be renamed to .mp4
– .m4p: DRM-protected files sold in iTunes, using the DRM scheme developed by Apple
– .m4e: renamed .sdp files used by Envivio for streaming
– .m4v, .mp4v, .cmp, .divx, .xvid, .264: normally raw MPEG-4 video streams (not inside MP4)
– .aac: raw AAC audio streams/ADTS (not inside MP4)
– .3gp, .3g2: used by mobile phones, can also include content not defined for .mp4 (H.263, AMR), see question 20
– .mov: technologically similar container, but not the same as MP4, see question 20
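Since the extension alone can mislead, a file in this family can also be identified by the brand in its leading 'ftyp' box. Here is a minimal sketch in Python, assuming the file starts with an 'ftyp' box that uses the ordinary 32-bit size field. The function name and the hand-built sample bytes are made up for illustration, and 'M4A ' is only one of the brands registered with the MP4 registration authority.

```python
import struct

def read_major_brand(data: bytes):
    """Return (major_brand, compatible_brands) from a leading 'ftyp' box, or None."""
    if len(data) < 16:
        return None
    # Box header: 32-bit big-endian size, then a four-character type.
    size, box_type = struct.unpack(">I4s", data[:8])
    if box_type != b"ftyp" or size < 16 or size > len(data):
        return None
    major = data[8:12].decode("ascii", "replace")
    # Bytes 12..16 hold minor_version; compatible brands follow, 4 bytes each.
    compat = [data[i:i + 4].decode("ascii", "replace") for i in range(16, size, 4)]
    return major, compat

# Hypothetical header built by hand: size 24, major brand 'M4A ', two compatible brands.
sample = struct.pack(">I4s4sI4s4s", 24, b"ftyp", b"M4A ", 0, b"mp42", b"isom")
print(read_major_brand(sample))  # ('M4A ', ['mp42', 'isom'])
```

In practice you would pass in the first few dozen bytes read from the file. Treat the brand as a hint about what the writer intended, not a guarantee about what is inside.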
H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC) is a video compression format that is currently one of the most commonly used formats for the recording, compression, and distribution of video content. – Wikipedia, H.264/MPEG-4 AVC
H.264 is an industry standard for video compression (converting digital video into a format that takes up less capacity when stored or transmitted). Video compression (or coding) is a critical technology for internet video streaming. A shared compression standard makes it possible for products from different manufacturers (e.g. encoders, players and storage systems) to speak a common language. An encoder converts video into a compressed format and a decoder converts compressed video back into an uncompressed format. MPEG-4 defines such ‘compressed formats’.
– Baseline Profile offers I/P-Frames, supports progressive and CAVLC only
– Extended Profile offers I/P/B/SP/SI-Frames, supports progressive and interlaced, and CAVLC only
– Main Profile offers I/P/B-Frames, supports progressive and interlaced, and offers CAVLC or CABAC
– High Profile (aka FRExt) adds to Main Profile: 8×8 transform and 8×8 intra prediction, custom quantization matrices, and, in the higher FRExt profiles, lossless video coding and more YUV formats (4:2:2, 4:4:4)
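In a bitstream the profile is signalled as a number, profile_idc, in the sequence parameter set. A small lookup for the four profiles listed above might look like the sketch below. The function and table names are my own, and only the entropy-coding column of the list is modelled.

```python
# profile_idc values as carried in the H.264 sequence parameter set.
PROFILE_NAMES = {66: "Baseline", 77: "Main", 88: "Extended", 100: "High"}
# Main and High may use CABAC; Baseline and Extended are CAVLC only.
CABAC_CAPABLE = {77, 100}

def describe_profile(profile_idc: int) -> str:
    """Return a short human-readable description for a profile_idc value."""
    name = PROFILE_NAMES.get(profile_idc)
    if name is None:
        return f"unknown profile_idc {profile_idc}"
    entropy = "CAVLC or CABAC" if profile_idc in CABAC_CAPABLE else "CAVLC only"
    return f"{name} Profile ({entropy})"

print(describe_profile(66))   # Baseline Profile (CAVLC only)
print(describe_profile(100))  # High Profile (CAVLC or CABAC)
```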
MPEG-4 (ISO/IEC 14496): a broad open standard developed by the Moving Picture Experts Group (MPEG), a working group of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), which also produced the well-known MPEG-1 (MP3, VCD) and MPEG-2 (DVD, SVCD) standards. MPEG-4 brings together all sorts of audio/video compression formats and much more.
The MPEG-4 Standard wasn’t designed to be a single product (e.g. one codec, comparable to the MPEG-2 video used on DVD) but to cover a broad range of sub-standards. Developers can choose which parts of the standard to follow, according to what they need for what they’re doing. See? Simple: ‘There is no spoon.’ </sarcasm>
The MPEG-4 file is designed to contain structural and media data for timed presentations of multimedia such as audio, video, images, closed captioning … This structure is left ‘open’ or at the very least ‘vague’, so that by structuring files in different ways the same root spec could be used for multiple applications.
MPEG-4 is engineered to deliver ‘timed media’ information in a flexible, malleable format that facilitates interchange, management, editing, and presentation. The presentation may be a file on your local computer or mobile device, or may be coming via a network (such as streaming on the internet).
The presentation, or files, have a logical structure, a time structure, and a physical structure, and these structures are not required to be coupled or used together. I do believe this is what throws a lot of people off understanding how this format ticks.
The logical structure of the file is of a movie that in turn contains a set of time-parallel tracks.
The time structure of the file is that the tracks contain sequences of samples in time, and those sequences are mapped into the timeline of the overall movie by optional edit lists.
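To make the time structure a bit more concrete: in the base format, sample durations are stored run-length encoded in a time-to-sample table (the 'stts' box) as (sample_count, sample_delta) pairs, counted in ticks of the track's timescale. The sketch below expands such a table into per-sample start times. The function name and the numbers are invented for illustration, and edit lists are deliberately left out.

```python
def sample_times(stts_entries, timescale):
    """Expand (sample_count, sample_delta) runs into per-sample start times in seconds."""
    times = []
    ticks = 0
    for count, delta in stts_entries:
        for _ in range(count):
            times.append(ticks / timescale)
            ticks += delta
    return times

# Hypothetical track: 3 samples lasting 1000 ticks each, then 2 lasting 500 ticks,
# with a timescale of 1000 ticks per second.
print(sample_times([(3, 1000), (2, 500)], 1000))
# [0.0, 1.0, 2.0, 3.0, 3.5]
```

An edit list, when present, would then shift or trim these track times before they land on the movie timeline.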
The physical structure of the file separates the data needed for logical, time, and structural de-composition, from the media data samples themselves. This structural information is concentrated in a movie box, possibly extended in time by movie fragment boxes. The movie box documents the logical and timing relationships of the samples, and also contains pointers to where they are located. Those pointers may be into the same file or another one, referenced by a URL.
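Physically, everything in the file lives in boxes: each starts with a 32-bit big-endian size and a four-character type. A size of 1 means a 64-bit size follows, and a size of 0 means the box runs to the end of the file. Here is a rough sketch of walking the top-level boxes. The helper names iter_boxes and box are mine, the demo bytes are built by hand rather than taken from a real file, and there is no descent into child boxes.

```python
import io
import struct

def iter_boxes(stream):
    """Yield (type, payload_offset, payload_size) for each box from the current position."""
    while True:
        start = stream.tell()
        header = stream.read(8)
        if len(header) < 8:
            return
        size, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:  # a 64-bit size follows the type
            ext = stream.read(8)
            if len(ext) < 8:
                return
            size = struct.unpack(">Q", ext)[0]
            header_len = 16
        elif size == 0:  # the box extends to the end of the file
            stream.seek(0, io.SEEK_END)
            size = stream.tell() - start
        if size < header_len:
            return  # malformed size; stop rather than loop forever
        yield box_type.decode("ascii", "replace"), start + header_len, size - header_len
        stream.seek(start + size)

def box(box_type, payload=b""):
    """Build a box with a 32-bit size field (illustration helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

demo = io.BytesIO(
    box(b"ftyp", b"isom\x00\x00\x00\x00") + box(b"moov", box(b"mvhd")) + box(b"mdat", b"\x00" * 4)
)
print([(t, size) for t, _, size in iter_boxes(demo)])
# [('ftyp', 8), ('moov', 8), ('mdat', 4)]
```

The same walk can be repeated on the payload of a container such as 'moov' to reach the track structures, which is where the separation between structural data and the media data in 'mdat' becomes visible.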
Each media stream is contained in a track specialized for that media type (audio, video etc.), and is further parameterized by a sample entry. The sample entry contains the ‘name’ of the exact media type (i.e., the type of the decoder needed to decode the stream) and any parameterization of that decoder needed. The name takes the form of a four-character code (for example ‘avc1’ for AVC video or ‘mp4a’ for MPEG-4 audio). There are defined sample entry formats not only for MPEG-4 media, but also for the media types used by other organizations using this file format family. They are registered at the MP4 registration authority.
Protected streams are also supported by the file format (e.g. streams encrypted for use in a digital rights management system, or DRM). There is a general structure for protected streams, which documents the underlying format, and also documents the protection system applied and the operational parameters it requires. Basically what we’re talking about is metadata surrounding the images that make up the video.
Support for this metadata takes two forms. First, timed metadata may be stored in a track, synchronized in any number of ways with what it is describing. Secondly, there is general support for non-timed metadata attached to the movie or to an individual track. The structural support is general, and allows for the storage of metadata resources elsewhere in the file … or in another file. These resources may be named, and may be protected.
These generalized metadata structures may also be used at the file level, above or parallel with, or in the absence of, the movie box. In that last case the metadata box is the primary entry into the presentation. This structure is used for MPEG-21 files, and other bodies are using it to wrap together other integration specifications (e.g. SMIL) with the media integrated.
Sometimes the samples within a track have different characteristics or need to be specially identified. One of the most common and important characteristics is the synchronization point (often a video I-frame). These points are identified by a special table in each track. More generally, the nature of dependencies between track samples can also be documented.
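The special table mentioned here is the sync sample table (the 'stss' box), which lists the 1-based numbers of the sync samples in increasing order. A player seeking to an arbitrary sample typically backs up to the nearest sync sample at or before it. Below is a small sketch of that lookup. The function name and the table contents are made up, and a track without this table (where every sample counts as a sync sample) is not handled.

```python
import bisect

def sync_sample_for(stss, target_sample):
    """Return the last sync sample number <= target_sample (1-based), or None."""
    i = bisect.bisect_right(stss, target_sample)
    return stss[i - 1] if i else None

# Hypothetical table: a sync sample every 30 samples.
stss = [1, 31, 61, 91]
print(sync_sample_for(stss, 45))  # 31
print(sync_sample_for(stss, 61))  # 61
```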
Finally, there is a concept of named sample groups. These permit the documentation of arbitrary characteristics that are shared by some of the samples in a track. In the AVC file format, sample groups are used to support the concept of layering and sub-sequences.
MP4 File Format
MP4 files are generally used to contain MPEG-4 media, including not only MPEG-4 audio and/or video, but also MPEG-4 presentations. When a complete or partial presentation is stored in an MP4 file, there are specific structures that document that presentation.
MPEG-4 presentations are scenes, described by the scene language MPEG-4 BIFS. Within those scenes media objects can be placed; these media objects might be audio, video, or entire sub-scenes. Each object is described by an object descriptor, and within the object descriptor the streams that make up that object are described. The entire scene is described by an initial object descriptor (IOD). This is stored in a special box within the movie atom in MP4 files. The scene and the object descriptors it uses are stored in tracks — a scene track, and an object descriptor track; for files that comprise a full MPEG-4 presentation this IOD and these two tracks are required.
Each stream is described by an elementary stream descriptor. When a complete scene is delivered, these are delivered as part of the object descriptor stream. However, for ease of composition, and to manage files that contain only media streams, these elementary stream descriptors are stored with the media streams themselves — in the descriptive track structures — in MP4 files.
MPEG-21 File Format
As described above, the general meta-box can be used at the file level to contain a description and its associated or included data. This structure is used for MPEG-21 files. A file-level meta-box is used to hold an MPEG-21 Digital Item Declaration (DID). The meta-box also contains a list of attached resources, which may have local names, and may be located within the same file or in another file.
When media is delivered over a streaming protocol it often must be transformed from the way it is represented in the file. The most obvious example of this is the way media is transmitted over the Real-time Transport Protocol (RTP). In the file, for example, each frame of video is stored contiguously as a file-format sample. In RTP, packetization rules specific to the codec used must be obeyed to place these frames in RTP packets.
A streaming server may calculate such packetization at run-time if it wishes. However, the file format can also assist streaming servers: special tracks called hint tracks may be placed in the files. Hint tracks contain general instructions for streaming servers as to how to form packet streams, from media tracks, for a specific protocol. Because the form of these instructions is media-independent, servers do not have to be revised when new codecs are introduced. In addition, the encoding and editing software can be unaware of streaming servers. Once editing is finished on a file, a piece of software called a hinter may be used to add hint tracks to the file before it is placed on a streaming server. There is a defined hint track format for RTP streams in the MP4 file format specification.
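To give a feel for the work a hint track saves the server, here is a deliberately naive sketch of splitting one stored frame into RTP packets. The 12-byte fixed header layout follows RFC 3550. Everything else is an assumption for illustration: the function names, the payload type 96, the SSRC value, and above all the plain byte-splitting, which ignores the codec-specific payload rules a real hinter has to obey.

```python
import struct

def rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Pack the 12-byte fixed RTP header (version 2, no padding, no extension, no CSRCs)."""
    byte0 = 2 << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack(">BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

def packetize(frame, max_payload, payload_type, seq, timestamp, ssrc):
    """Split one frame into RTP packets; the marker bit is set on the last packet only."""
    chunks = [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)]
    last = len(chunks) - 1
    return [
        rtp_header(payload_type, seq + n, timestamp, ssrc, marker=(n == last)) + chunk
        for n, chunk in enumerate(chunks)
    ]

# Hypothetical 2500-byte frame, at most 1000 payload bytes per packet.
packets = packetize(b"x" * 2500, 1000, 96, 10, 90000, 0x1234)
print([len(p) for p in packets])  # [1012, 1012, 512]
```

A hint track stores the outcome of this kind of calculation ahead of time, so the server can assemble packets by following instructions instead of understanding the codec.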
The following table contains a summary of some of the common file types in the ISO Base Media File Format Family. The formal registration authorities (e.g. the MP4 registration authority for brands, or the Internet Assigned Numbers Authority for MIME types) and the appropriate specifications should be consulted for definitive information.
 ISO/IEC 14496-12, ISO Base Media File Format; technically identical to ISO/IEC 15444-12
 ISO/IEC 14496-14, MP4 File Format
 ISO/IEC 14496-15, Advanced Video Coding (AVC) file format
 ISO/IEC 14496-10, Advanced Video Coding
 ISO/IEC 21000-9, MPEG-21 File Format
 ISO/IEC 15444-1, JPEG 2000 Image Coding System
 The MP4 Registration Authority, http://www.mp4ra.org/
 ISO/IEC 9834-8:2004 Information Technology, “Procedures for the operation of OSI Registration of Universally Unique Identifiers (UUIDs) and their use as ASN.1 Object Identifier components” ITU-T Rec. X.667, 2004
 SMIL: Synchronized Multimedia Integration Language; World-Wide Web Consortium (W3C) http://www.w3.org/TR/SMIL2/
 ISO/IEC 21000-2 Digital Item Declaration
 RTP: A Transport Protocol for Real-Time Applications; IETF RFC 3550, http://www.ietf.org/rfc/rfc3550.txt
 The Internet Assigned Numbers Authority http://www.iana.org/