The compressed data stream produced by satellites is bits per day. Two steps need to be taken to convert this
information into a form appropriate for queries: (1) some signal
processing of the image data needs to be performed, and (2) the image
at each grid point needs to be converted into information indicating
the type of ground cover (wheat, rice, rock, etc.), and the state of
the ground cover (dormant, germinating, ready for harvest, etc.).
This final form we call a model, which consists of an identifier
and a state. It is this form that is of most use to those making
queries of the database. Since perfect matches with models will be
rare, the database will need to associate with each grid point
several ``closest matching models.'' For simplicity, we assume that
there will be
different identifier/state combinations, and
that
closest matching models will be retained for each grid
point.
The resulting data rates generated by the satellites are summarized in
Figure 4.2. Notice that if the downlink rates per
satellite were increased to bits per second, nearly every
grid point could be sampled each day.
The signal processing will typically require computing the Fourier transform of the image and performing some filtering operations on the resulting set of spectral coefficients. Using some distance metric, these filtered spectral coefficients are then compared against a table of spectral coefficients for the different identifier/state combinations, and the ``closest matching models'' are selected.
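
The pipeline just described (transform, filter, compare, keep the closest models) can be sketched on a toy scale. The following is only an illustration under stated assumptions: the 4 x 4 images, the model labels, the crude low-pass filter that keeps coefficients with small indices, and the Euclidean distance metric are all hypothetical choices, and a naive discrete Fourier transform stands in for a fast Fourier transform to keep the example dependency-free.

```python
import cmath
import heapq
import math


def dft2(image):
    """Naive 2-D discrete Fourier transform of a square list-of-lists image."""
    n = len(image)
    spectrum = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for x in range(n):
                for y in range(n):
                    angle = -2.0 * math.pi * (u * x + v * y) / n
                    total += image[x][y] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def filtered_features(image, keep):
    """Magnitudes of the coefficients with indices u, v < keep (a crude low-pass filter)."""
    spectrum = dft2(image)
    return [abs(spectrum[u][v]) for u in range(keep) for v in range(keep)]


def closest_models(features, model_table, k):
    """Return the k smallest (distance, label) pairs, nearest first."""
    scored = (
        (math.dist(features, coeffs), label)
        for label, coeffs in model_table.items()
    )
    return heapq.nsmallest(k, scored)


# Hypothetical reference images, one per identifier/state combination.
reference_images = {
    ("wheat", "dormant"): [[1] * 4 for _ in range(4)],
    ("wheat", "ready for harvest"): [[3] * 4 for _ in range(4)],
    ("rice", "germinating"): [[row] * 4 for row in range(4)],
}
model_table = {
    label: filtered_features(img, keep=2)
    for label, img in reference_images.items()
}

# An observed image: the "dormant wheat" reference with one perturbed pixel.
observed = [[1] * 4 for _ in range(4)]
observed[0][0] = 2

nearest = closest_models(filtered_features(observed, keep=2), model_table, k=2)
for distance, (identifier, state) in nearest:
    print(f"{identifier}/{state}: distance {distance:.3f}")
```

Because no perfect match exists for the perturbed image, the sketch returns a ranked list of candidates rather than a single label, which mirrors the idea of retaining several closest matching models per grid point.
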
For simplicity, we assume that the signal processing required is a
small integer multiple of the cost of performing a Fourier transform
of the image attached to each grid point. Each of these images is a
array of intensities. Assuming an
point fast Fourier transform takes
floating point
operations, the signal processing required per grid point is
floating point operations. Since the satellites can transmit
images per second, we will need roughly
floating
point operations per second just to perform the basic signal
processing. This computation, however, is highly parallelizable and
requires little inter-processor communication, since each image can be
processed independently.
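
The shape of this estimate can be written out as a short calculation. This is a sketch, not the authors' exact arithmetic: it assumes the commonly quoted approximation of 5 N log2 N floating point operations for an N-point complex FFT, and the image size, the integer multiple, and the image rate passed in below are hypothetical placeholders rather than the figures used in this section.

```python
import math


def fft2_flops(n, flops_constant=5.0):
    """Rough operation count for a 2-D FFT of an n x n image.

    Assumes the approximation c * N * log2(N) with N = n * n points.
    """
    points = n * n
    return flops_constant * points * math.log2(points)


def signal_processing_rate(n, multiple, images_per_second):
    """Sustained floating point operations per second for the whole image stream."""
    per_image = multiple * fft2_flops(n)
    return per_image * images_per_second


# Hypothetical inputs: 32 x 32 images, 3 FFT-equivalents of work per image,
# 1000 images arriving per second.
per_image = 3 * fft2_flops(32)
rate = signal_processing_rate(32, multiple=3, images_per_second=1000)
print(f"{per_image:.3e} operations per image, {rate:.3e} operations per second")
```

Since each image is independent, the total rate divides evenly across processors: with p processors, each one needs to sustain roughly 1/p of the rate computed here.
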
For template matching, we need to match against total models.
Each grid point consists of
spectral coefficients, and
the distance between these coefficients and the corresponding
coefficients of each model must be computed. The simple-minded
approach of computing the distance between the image and each of the
models will require a total of
operations for each
grid point. Since
images are provided by the
satellites each second,
comparisons will be
required per second. Unless the comparison algorithm is improved, this
cost dominates the signal processing cost. Again, this computation is
highly parallelizable. Storing the data required by the
closest
matching models will require
bits.
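
The brute-force matching cost scales as the product of the number of models and the number of coefficients per grid point. The sketch below makes that scaling explicit. It assumes a squared Euclidean distance costing three operations per coefficient (subtract, multiply, accumulate), and every numeric input is a hypothetical placeholder, not a value from this section.

```python
def matching_ops_per_grid_point(num_models, num_coefficients, ops_per_term=3):
    """Operations to compare one image against every model, coefficient by coefficient."""
    return num_models * num_coefficients * ops_per_term


def comparisons_per_second(num_models, images_per_second):
    """Image-to-model comparisons needed to keep up with the incoming stream."""
    return num_models * images_per_second


# Hypothetical inputs: 1000 models, 1024 coefficients per grid point,
# 500 images arriving per second.
ops = matching_ops_per_grid_point(num_models=1000, num_coefficients=1024)
comparisons = comparisons_per_second(num_models=1000, images_per_second=500)
print(f"{ops} operations per grid point, {comparisons} comparisons per second")
```

Because the cost is linear in the number of models, any scheme that prunes the candidate set before the full distance computation reduces it proportionally.
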
A database consisting of one set of models per grid point would
require
bits. The satellites will produce
model bits per day (since not every grid point is
modeled each day). Thirty years of modeled data will require
bits of modeled data, or one petabyte.
It is not sufficient to preserve the transformed (modeled) versions of
the data. Over time, improved templates will be developed, and it
will then be useful to redo the template matching. Each day, the data
produced by the satellites in compressed grid point images is bits. The raw data produced by the satellites over a
30-year period is at least
bits, or one exabyte.
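
The storage totals follow from multiplying a daily volume by the length of the archive and converting bits to bytes. A minimal sketch of that conversion is given below. It assumes decimal units (one petabyte is 10^15 bytes, one exabyte is 10^18 bytes) and a 365-day year, and the daily volume in the example is a hypothetical placeholder.

```python
BITS_PER_BYTE = 8
PETABYTE = 10**15  # bytes, decimal convention
EXABYTE = 10**18   # bytes, decimal convention


def archive_bits(bits_per_day, years, days_per_year=365):
    """Total bits accumulated over the archive lifetime."""
    return bits_per_day * days_per_year * years


def bits_to_unit(bits, unit_bytes):
    """Express a bit count in the given byte unit (for example PETABYTE)."""
    return bits / BITS_PER_BYTE / unit_bytes


# Hypothetical daily volume of 10**12 bits, kept for 30 years.
total = archive_bits(10**12, years=30)
print(f"{total:.3e} bits = {bits_to_unit(total, PETABYTE):.2f} PB")
```

Under these conventions one petabyte corresponds to 8 x 10^15 bits and one exabyte to 8 x 10^18 bits, so order-of-magnitude statements about bits and bytes differ by a little less than a factor of ten.
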