CSCI 1710
Basic Audio and Video on the Web

Multimedia is simply a different way to say that we want to reproduce real life on the web, music and movies in particular. There is a problem though: music and movies require a tremendous amount of bandwidth. For example, the words "Mary had a little lamb" use 22 bytes in this HTML file. To say them, however, in this audio file takes 27,498 bytes, over 1200 times more space! If the phrase above has only 5 words, imagine what it would take to store this file of over 2,000 words in audio format!

And to have a motion picture of me reading "Mary had a little lamb" would hog even more bandwidth. Using a very small 120 pixel by 100 pixel screen, an uncompressed video of the 1.24 seconds it took me to read "Mary had a little lamb" would take:

1.24 seconds * 24 frames/second * 12,000 pixels/frame * 3 color bytes/pixel = 1,071,360 bytes

Wow! That's over 1 Meg for about a second of video!
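
As a quick check, the arithmetic above can be reproduced in a few lines of Python. This is only a sketch using the figures already given; the variable names are made up for illustration.

```python
# Uncompressed video size, using the figures from the calculation above.
duration_s = 1.24            # seconds it took to read the phrase
frames_per_second = 24
pixels_per_frame = 120 * 100
bytes_per_pixel = 3          # one byte each for red, green, and blue

video_bytes = duration_s * frames_per_second * pixels_per_frame * bytes_per_pixel
text_bytes = len("Mary had a little lamb")  # one byte per character

print(round(video_bytes))   # 1071360
print(text_bytes)           # 22
```

Dividing video_bytes by text_bytes shows how many times more space the video takes than the text it represents.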

After looking at this data, one might then say, "But we've had television for years, and it shows 'real life' motion and audio just fine. Why can't we have the same on the computer?"

There are a couple of differences between television and the Internet. First, television has dedicated channels, i.e., everyone is watching the same thing at the same time. This means that the television studio or cable company only has to provide one stream of data. In addition, this stream of data does not share its "wire" with any other channels. Servers on the Internet, however, provide on-demand downloads to millions of users.

That means that one second of video coming from a cable company to 1,000 users would result in about 1 Meg of data being sent along one cable. That same second of video on the Internet to 1,000 users would require 1,000 requests to the server for the data and 1,000 * 1 Meg which is equal to 1 Gig of data being sent from the server per second.
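
The comparison above can be sketched as follows, assuming "1 Meg" is rounded to 1,000,000 bytes and "1 Gig" to 1,000,000,000 bytes. The names are illustrative only.

```python
# Server load for one second of video: broadcast vs. on-demand delivery.
viewers = 1_000
bytes_per_second_of_video = 1_000_000   # "about 1 Meg" (rounded assumption)

# Cable: one stream is sent once and shared by every viewer.
broadcast_load = bytes_per_second_of_video

# Internet: each viewer requests and receives a separate copy.
on_demand_load = viewers * bytes_per_second_of_video

print(broadcast_load)   # 1000000
print(on_demand_load)   # 1000000000
```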

In addition, the network being used by the cable company is owned by the cable company. No other systems use it. The Internet, however, is a shared network which causes frequent interruptions in service due to other packets of information going through the same wires.

In addition, the method that is used to pass information to your television has been standardized for a number of years. One format means only one type of television receiver is needed to receive and display any television broadcast.

The Internet, however, is still trying to resolve all of the proprietary methods for the transmission of multimedia.

Basic Concepts of Multimedia

Computers think in what is referred to as digital data. In other words, they are only capable of storing a limited number of values that have a limited resolution, i.e., a limited number of digits after a decimal point. Data from the real world, however, is analog. This is a fancy way of saying that it can take on an infinite number of values, each with an infinite number of digits after the decimal point. This is a problem for the computer since we have yet to invent a hard drive with infinite storage.

For example, temperatures do not take on quantized levels such as the integers 32°, 33°, 34°, and so on. Rather a temperature can be one of an infinite number of possibilities taken out to infinite decimal places. Looking around us we see that all measurements in the real world are analog: color, light intensity, volume, etc. We may try to force a value into integer values such as the 256 possible levels of blue in the 24-bit colors we've been using on our web pages.

Converting an analog signal into a digital value does two things:

limits the high and low values so that the number must fit into a range e.g., 0 for the lowest level of blue and 255 for the highest and
limits resolution, e.g., we can't define a level 124.3425 of blue for a web page.

For the most part this is all right, but in some cases, a degradation in the quality of the media is apparent.
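
The two effects of conversion listed above can be sketched in Python. The function name quantize_blue is hypothetical, and rounding to the nearest integer is just one possible way to limit resolution.

```python
def quantize_blue(level: float) -> int:
    """Force an analog level into one of the 256 integer levels of blue."""
    rounded = round(level)              # limits resolution: no fractional levels
    return max(0, min(255, rounded))    # limits range: nothing below 0 or above 255

print(quantize_blue(124.3425))  # 124
print(quantize_blue(300.7))     # 255
print(quantize_blue(-5.0))      # 0
```

The difference between 124.3425 and 124 is the information that is lost in the conversion.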

It all begins with sampling. Sampling involves taking measurements of an analog signal at regular intervals.

Analog Signal

Sampled Signal
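
Sampling as described above might look something like this in Python. The 2 Hz sine wave and the rate of 8 samples per second are made-up values chosen only to keep the output short.

```python
import math

def sample(signal, duration_s, sample_rate_hz):
    """Measure signal(t) at regular intervals of 1 / sample_rate_hz seconds."""
    count = int(duration_s * sample_rate_hz)
    return [signal(n / sample_rate_hz) for n in range(count)]

def tone(t):
    # Hypothetical analog signal: a 2 Hz sine wave.
    return math.sin(2 * math.pi * 2 * t)

samples = sample(tone, duration_s=1.0, sample_rate_hz=8)
print(len(samples))                       # 8
print([round(s, 3) for s in samples])
```

Everything the signal does between two measurements is simply not recorded.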

The quality of the sampled analog signal can be improved by:

taking the samples quicker (faster sampling);
using more digits after the decimal point (better resolution); or
increasing the number of signals being recorded (e.g., stereo recording rather than mono).

Each of these methods, however, has a direct effect on the size of the resulting multimedia file.
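
To see how sampling rate, resolution, and number of signals drive file size, here is a rough sketch for uncompressed audio. The function name and the sample values (22,050 samples per second, 8 bits, mono) are assumptions picked for illustration, and any file header is ignored.

```python
def pcm_bytes(seconds, sample_rate_hz, bits_per_sample, channels):
    """Approximate size in bytes of uncompressed sampled audio."""
    return seconds * sample_rate_hz * (bits_per_sample // 8) * channels

base = pcm_bytes(10, 22_050, 8, 1)
print(base)                              # 220500
print(pcm_bytes(10, 44_100, 8, 1))       # faster sampling:   441000
print(pcm_bytes(10, 22_050, 16, 1))      # better resolution: 441000
print(pcm_bytes(10, 22_050, 8, 2))       # stereo, not mono:  441000
```

Doubling any one of the three doubles the size; doubling all three multiplies it by eight.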

Compression

To combat the large file size problem, we need to create algorithms that allow us to shrink the size of the file for storage and transmission, then "unshrink" it for playback. This compression and decompression almost always degrades the quality of the original signal, but the improvements in download time are usually well worth it.

There are many ways to compress streams of data.

We can remove parts of the signal that are not detectable by the human eye or ear. For example, remove audio frequencies that are outside the range of our hearing. Why bother storing data that only dogs can hear?
We can also develop new ways to store the information. For example, remove redundant data by indicating how many times it repeats rather than storing the same value multiple times.
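
The idea of storing a repeat count instead of repeated values is commonly called run-length encoding. The sketch below is a minimal illustration, not the method used by any particular audio or video format; the function names are made up. Unlike most multimedia compression, this particular technique loses nothing.

```python
def rle_encode(values):
    """Replace each run of equal values with a (value, count) pair."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # same value again: bump the count
        else:
            runs.append([v, 1])       # new value: start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Rebuild the original sequence from (value, count) pairs."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

data = [7, 7, 7, 7, 0, 0, 3]
packed = rle_encode(data)
print(packed)                       # [(7, 4), (0, 2), (3, 1)]
print(rle_decode(packed) == data)   # True
```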

Playing Multimedia

Plug-ins expand browser capability by adding functions that the browsers cannot handle on their own. They are usually used for multimedia.

In 1995, the web was still new and had primitive methods for displaying files other than text and graphics. Netscape Navigator 1.2 allowed a rudimentary form of multimedia, but it didn't work well and did not provide for any technological advances. And unlike television, no standards for web-based multimedia had been developed that Netscape could use in their browser.

Their solution was to create a generic interface that allowed an HTML web page to pass data to an executable designed by third party multimedia experts. This would be done with the <embed> tag. The browser would look through the plug-ins folder to see if a plug-in was available for the type of data the HTML file was requesting, and if not, prompt the user to download a new plug-in.

There were a few problems with this arrangement:

Broken links for older or obsolete plug-ins or plug-ins no longer supported by the manufacturer were a hassle.
Lengthy download time for large plug-ins was annoying and many times the user would give up and forgo being able to view the multimedia.
If the user wanted a plug-in, they had to go to another page, download the plug-in, shut down the browser, install the plug-in, restart the machine, restart the browser, and then find the site again that needed the plug-in.

Microsoft addressed the problem by expanding their Internet Explorer to automatically download multimedia code using ActiveX. It did basically the same thing plug-ins did except that it automatically downloaded the code and the user wasn't forced to go through the lengthy installation process.

It had problems too, however.

ActiveX wasn't platform independent, i.e., it only ran on PCs running Windows.
It still didn't solve the lengthy download time for large ActiveX programs.
People did not like the fact that Microsoft was taking control of their machines and downloading code for them.

Non-streaming Audio Files:

Okay, enough with the history, let's get an introduction to some basic multimedia.

PC WAV

This is an uncompressed (read "large"), digitally sampled recording. Quality is usually very good but download times are unrealistic for anything greater than 5 seconds. (Sample from www.dailywav.com)

MIDI

Developed by musicians who wanted a way to have their synthesizers communicate with other equipment. Instead of sampling and storing analog values, MIDI stores the type of instrument, the note, and the duration of the note. This seriously reduces the file size and increases download speed. Unfortunately, this is at a terrible cost to quality. The playback quality is synthesized, mechanical, and typically poor. (A portion of Mozart's "Piano Sonata in C major" from The Classical Piano Midi Page)

MP3

This audio standard is popular for a number of reasons. This is a compressed format, the development of which began in the mid-1980s. Unlike Unisys charging patent fees for their GIF format, the patent holders for MPEG made the technology freely available. MPEG stands for Moving Picture Experts Group (similar to the Joint Photographic Experts Group), and the three refers to Layer III of the MPEG audio standard.

The typical compression rate for MP3 is 10:1 which translates to about 5 megabytes for a song recording. This is quite a savings over the 50 megabytes for a similar WAV recording.

MP3 is also gaining popularity because of its expanded use through products such as the Rio portable MP3 player. Devices such as this have replaced the old-fashioned Sony Walkman™, allowing a user to download music to a hand-held device for playback. The absence of moving parts in these new players makes for more reliable operation and lower power consumption.

MP3 has also allowed for the strong emergence of bootleg recording practices, such as the file sharing that took place on the old Napster.

There are also a number of plug-in players for the PC and the Mac. For the PC, there's Winamp, Sonique, Musicmatch, and Real Jukebox. This list has a number of resources for the Mac.

For more legitimate sources of MP3 files, check out the mp3.com web site.

Incorporating Audio Into Your Web Site

The easiest way to link a sound file to your web page is to simply provide a link. Assuming the browser is set up to play the file, it will automatically play it with whatever default player it was set up to use.

Streaming vs. Non-streaming:

As for how multimedia is delivered to the web, there are two methods:

Non-streaming (static) -- The browser downloads the sound file in its entirety to the client's (browser's) hard drive cache. Once it's finished downloading, it plays the sound with whatever plug-in or application the browser is configured to use.

Streaming -- Instead of loading an entire file into a plug-in, it plays the sound back to the user as each piece of data is received. No data is actually stored to the user's cache or disk.

Streaming is sort of like reading from flash cards being revealed one at a time and non-streaming is like reading from a book. You have to have the whole book before you can begin reading in non-streaming. In streaming, however, if the person flipping the flash cards to you drops one or pauses, your reading will be paused too.
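
The difference between the two delivery methods can be sketched in Python. This is a toy model only: receive stands in for a network connection, play stands in for whatever plug-in makes the sound, and all of the names are hypothetical.

```python
def receive(chunks):
    """Stand-in for a network connection delivering a file piece by piece."""
    for chunk in chunks:
        yield chunk

def play_non_streaming(chunks, play):
    # The "book": collect every piece first, then play the whole file.
    data = b"".join(receive(chunks))
    play(data)

def play_streaming(chunks, play):
    # The "flash cards": play each piece as it arrives and keep nothing.
    for chunk in receive(chunks):
        play(chunk)

pieces = [b"ab", b"cd", b"ef"]
whole, parts = [], []
play_non_streaming(pieces, whole.append)
play_streaming(pieces, parts.append)
print(whole)   # [b'abcdef']
print(parts)   # [b'ab', b'cd', b'ef']
```

In the streaming version, a late or missing chunk would stall playback at that point, just like a dropped flash card.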

There are a number of important differences between non-streaming and streaming.

Advantages of non-streaming:

No special software is needed at the server side other than the basic http server software. The user, however, will still need to have a plug-in.

Disadvantages of non-streaming:

Very long download times are required before the multimedia can begin playing.
Once the file has been downloaded, the end user has a copy of the multimedia file on their hard drive. This would make copyright violations easier.

Advantages of streaming:

To the end user, it seems like streaming drastically reduces download time by starting playback once the stream begins rather than waiting for the whole file to download.
Streaming does not save the file to the client's machine thereby protecting copyrights.

Disadvantages of streaming:

Streaming requires additional server software to drive all the streams being requested. This also puts quite a load on the server in terms of maintaining all of the streams.
Faults in the connection leave dead times in playback.

Non-Streaming Applications -- Shorter files such as sound effects or short sound bites

Streaming Applications -- Large files, continuous "live" feeds, or copyrighted material that the owner does not want stored on the client's hard drive

Video

Video is slightly different from audio. Audio takes "samples" of the data at equal intervals. Each sample is a numeric value that represents the voltage level of the audio signal. Video also samples the signal. The difference is that the video signal is a picture, not a voltage level, therefore every sample is an image. In a matter of seconds, you could have a heck of a lot of data! In fact, 10 seconds of raw television video would take 300 megabytes to store! Even commercials are longer than ten seconds.

Each "sample" of video (i.e., each captured picture) is called a frame. When these frames are displayed one after the other, the human eye interprets it as a moving picture.

A number of variables affect the file size of video multimedia.

Duration of the video clip (of course)
Frame size -- every time you double the horizontal and vertical dimensions, you quadruple image size. The most common video window size is 160x120 (a fourth of the width and height of a 640x480 screen, or one sixteenth of its area).
Frame rate -- This represents the number of images (frames) displayed per second. Due to bandwidth constraints, web-based video is typically 10 to 15 frames per second.
Quality -- Compression of the original video signal can be greater if a lesser quality is allowed.
Color Bit Depth -- Just like with audio, if you use smaller numbers to represent the data, your file size will be reduced. Black and white (gray scale) takes far less to store than 16-bit color for example.
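
The variables listed above (other than quality, which depends on the compression used) can be combined into a rough estimate of uncompressed video size. This is a sketch with a made-up function name; the 15 frames per second and 16-bit color are example values taken from the ranges mentioned above.

```python
def raw_video_bytes(seconds, width, height, frames_per_second, bits_per_pixel):
    """Approximate size in bytes of uncompressed video (no audio track)."""
    return seconds * frames_per_second * width * height * bits_per_pixel // 8

small = raw_video_bytes(10, 160, 120, 15, 16)
print(small)                                     # 5760000
print(raw_video_bytes(10, 320, 240, 15, 16))     # 23040000, four times larger
print(raw_video_bytes(10, 160, 120, 15, 8))      # 2880000, half the size
```

Doubling both dimensions quadruples the size, while halving the bit depth only halves it, so frame size is usually the first thing to look at.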

Compression

All web-based video is compressed. As with the audio, there is a two-step process called CODEC (for compression/decompression) which takes a large file, makes it smaller to transmit across the network, then reproduces the original file with varying amounts of accuracy or quality for playback to the user.

So how do people create video small enough to play over the Web? In the past few years, there have been big advancements in the development of codecs, which allow video to be reduced to a reasonable file size for use on CD-ROMs and for Web delivery.

CODECs can be created from software (an application running on a computer) or from hardware (physical components on a circuit board for example).

Benefits of a Hardware CODEC

By far the fastest and highest quality.
Doesn't require that the original video be stored to a hard drive. It is stored immediately as compressed.

Drawbacks of a Hardware CODEC

Very expensive
Requires both client and server to have the same CODEC

Typical Hardware CODEC Applications

Video conferencing systems

Benefits of a Software CODEC

Very inexpensive (freeware available)
Doesn't require additional hardware to connect to your system; just download and run.

Drawbacks of a Software CODEC

Very slow; they draw heavily on CPU cycles.
Must have raw video stored to a hard drive.

Typical Software CODEC Applications

Internet applications.

There are two compression methods for video: temporal and spatial. Both methods remove redundant data or data that the human eye cannot decipher.

Temporal -- This method of compression is best for "talking heads", i.e., people who have their mouths moving, but nothing much else in the image is changing. This causes successive frames to simply copy redundant information for items such as the background. The compression in this case is achieved by only storing data representing the changes from one frame to the next (e.g., the moving mouth).

This compression starts with a reference frame called the key frame. This frame stores the entire image. Subsequent frames, called delta frames, store any changes from the key frame. Any significant changes to the picture cause the compression technique to store another key frame followed by delta frames that use the new key frame as a reference.
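
The key frame and delta frame scheme described above can be sketched as follows. This is a toy illustration and not any real CODEC: each frame is just a short list of pixel values, and the rule that more than half of the pixels changing counts as a "significant change" is an assumption made for the example.

```python
def encode(frames, key_threshold=0.5):
    """Store a full key frame, then only the pixels that differ from it."""
    encoded, key = [], None
    for frame in frames:
        changes = None
        if key is not None:
            changes = {i: p for i, p in enumerate(frame) if p != key[i]}
        if changes is None or len(changes) > key_threshold * len(frame):
            key = list(frame)                     # significant change: new key frame
            encoded.append(("key", key))
        else:
            encoded.append(("delta", changes))    # only the changed pixels
    return encoded

def decode(encoded):
    """Rebuild every frame from key frames and delta frames."""
    frames, key = [], None
    for kind, data in encoded:
        if kind == "key":
            key = data
            frames.append(list(key))
        else:
            frame = list(key)
            for i, p in data.items():
                frame[i] = p
            frames.append(frame)
    return frames

frames = [[1, 1, 1, 1], [1, 1, 2, 1], [1, 1, 3, 1], [9, 9, 9, 9]]
packed = encode(frames)
print([kind for kind, _ in packed])   # ['key', 'delta', 'delta', 'key']
print(decode(packed) == frames)       # True
```

In the "talking head" case, the two delta frames each store a single pixel instead of four, and the scene change at the end triggers a new key frame.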

Spatial -- This method of compression is similar to the methods of compression for still images, reducing the amount of data used to represent a single frame. (This is according to our textbook.) Other sources I have seen say that spatial compression also looks for areas of a frame that remain the same from frame to frame like temporal. The difference is that spatial represents the data that stays the same with a coordinate system while temporal does it on a pixel-by-pixel basis.

If the resulting file from either of these methods does not give you a small enough file size, then you need to look back at the original data and see if you can reduce frame rate, color depth, size, or duration.
