Teradata Architecture
The Teradata database currently runs normally on NCR Corporation's WorldMark Systems in the UNIX MP-RAS environment. Some of these systems consist of a single processing node (computer) while others are several hundred nodes working together in a single system. The NCR nodes are based entirely on industry-standard processor chips, standard internal and external bus architectures like PCI and SCSI, and standard memory modules with 4-way interleaving for speed.
At the same time, Teradata can run in a single-node environment on any hardware server running Microsoft Windows NT or Windows 2000. This single node may be any computer from a large server to a laptop.
Whether
the system consists of a single node or is a massively parallel system
with hundreds of nodes, the Teradata RDBMS uses the exact same
components executing on all the nodes in parallel. The only difference
between small and large systems is the number of processing components.
When these components exist on different nodes, it is essential that the components communicate with each other at high speed. To facilitate this communication, multi-node systems use the BYNET interconnect. It is a high-speed, multi-path, dual-redundant communications channel. Another amazing capability of the BYNET is that its bandwidth increases with each node added to the system. There is more detail on the BYNET later in this chapter.
Teradata Components
As previously mentioned, Teradata is the superior product today because of its parallel operations, which are based on its architectural design. It is the parallel processing by the major components that provides the power to move mountains of data. Teradata works much like the early Egyptians who built the pyramids without heavy equipment, using parallel, coordinated human efforts. It uses smaller nodes running several processing components, all working together on the same user request. Therefore, a monumental task is completed in record time.
Teradata operates with three major components to achieve its parallel operations. These components are called the Parsing Engine Processors, the Access Module Processors and the Message Passing Layer. The role of each component is discussed in the next sections to provide a better understanding of Teradata. Once we understand how Teradata works, we will pursue the SQL that allows storage and access of the data.
Parsing Engine Processor (PEP or PE)
The Parsing Engine Processor (PEP), or Parsing Engine (PE) for short, is one of the two primary types of processing tasks used by Teradata. It provides the entry point into the database for users on mainframe and networked computer systems. It is the primary director task within Teradata.
As users log on to the database they establish a Teradata session. Each PE can manage 120 concurrent user sessions. Within each of these sessions, users submit SQL as a request for the database server to take an action on their behalf. The PE will then parse the SQL statement to establish which database objects are involved. For now, let's assume that the database object is a table. A table is a two-dimensional array that consists of rows and columns. A row represents an entity stored in a table, and it is defined using columns. An example of a row might be the sale of an item, with columns for the UPC, a description and the quantity sold.
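To make that example concrete, the following Teradata SQL is a minimal sketch of such a table. The table name, column names and data types are invented for illustration only:

    CREATE TABLE Sales_Table
      (UPC            DECIMAL(12,0)   /* identifies the item sold   */
      ,Description    CHAR(30)        /* describes the item         */
      ,Quantity_Sold  INTEGER)        /* how many units were sold   */
    PRIMARY INDEX (UPC);

    /* One row, representing one sale, stored in the table */
    INSERT INTO Sales_Table VALUES (123456789012, 'Widget', 3);

The PRIMARY INDEX clause shown here relates to the indices discussed later in this chapter.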
Any
action a user requests must also go through a security check to
validate their privileges as defined by the database administrator.
Once their authorization at the object level is verified, the PE will
verify that the columns requested actually exist within the objects
referenced.
Next,
the PE optimizes the SQL to create an execution plan that is as
efficient as possible based on the amount of data in each table, the
indices defined, the type of indices, the selectivity level of the
indices, and the number of processing steps needed to retrieve the data.
The PE is responsible for passing the optimized execution plan to
other components as the best way to gather the data.
An
execution plan might use the primary index column assigned to the
table, a secondary index or a full table scan. The use of an index is
preferable and will be discussed later in this chapter. For now, it is
sufficient to say that a full table scan means that all rows in the
table must be read and compared to locate the requested data.
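In Teradata SQL, the plan that the PE builds can be seen by prefixing a request with the EXPLAIN keyword. A brief sketch, reusing the hypothetical Sales_Table from earlier:

    EXPLAIN
    SELECT Description, Quantity_Sold
    FROM   Sales_Table
    WHERE  UPC = 123456789012;

    /* The result is English text, not data: it describes the steps
       the PE chose, such as a single-AMP retrieval using the primary
       index or an all-AMP full table scan.                          */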
Although
a full table scan sounds really bad, within the architecture of
Teradata, it is not necessarily a bad thing because the data is divided
up and distributed to multiple, parallel components throughout the
database. We will look next at the AMPs that perform the parallel disk
access using their file system logic. The AMPs manage all data storage
on disks. The PE has no disks.
Activities of a PE:
· Convert incoming requests from EBCDIC to ASCII (if from an IBM mainframe)
· Parse the SQL to determine type and validity
· Validate user privileges
· Optimize the access path(s) to retrieve the rows
· Build an execution plan with necessary steps for row access
· Send the plan steps to the Access Module Processors (AMPs) involved
Access Module Processor (AMP)
The
next major component of Teradata's parallel architecture is called an
Access Module Processor (AMP). It stores and retrieves the distributed
data in parallel. Ideally, the data rows of each table are distributed
evenly across all the AMPs. The AMPs read and write data and are the
workhorses of the database. Their job is to receive the optimized plan
steps, built by the PE after it completes the optimization, and execute
them. The AMPs are designed to work in parallel to complete the request
in the shortest possible time.
Optimally, every AMP should contain a subset of all the rows loaded into every table. By dividing up the data, Teradata automatically divides up the work of retrieving the data. Remember, all work comes as a result of a user's SQL request. If the SQL asks for a specific row, that row exists in its entirety (all columns) on a single AMP, and other rows exist on the other AMPs.
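On a running system, Teradata's hash functions can be used to see how the rows of a table are spread across the AMPs. A sketch, again using the hypothetical Sales_Table (UPC is its primary index column):

    SELECT  HASHAMP(HASHBUCKET(HASHROW(UPC))) AS Owning_AMP
           ,COUNT(*)                          AS Row_Count
    FROM    Sales_Table
    GROUP BY 1
    ORDER BY 1;

    /* HASHROW hashes the primary index value, HASHBUCKET maps that
       hash to a bucket, and HASHAMP identifies the AMP that owns the
       bucket. One row per AMP is returned, showing how evenly the
       table's rows are distributed.                                 */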
If the user request asks for all of the rows in a table, every AMP should participate along with all the other AMPs to complete the retrieval of all rows. This type of processing is called an all-AMP operation and an all-rows scan. However, each AMP is only responsible for its rows, not the rows that belong to a different AMP. As far as each AMP is concerned, it owns all of the rows. Within Teradata, the AMP environment is a "shared nothing" configuration. The AMPs cannot access each other's data rows, and there is no need for them to do so.
Once
the rows have been selected, the last step is to return them to the
client program that initiated the SQL request. Since the rows are
scattered across multiple AMPs, they must be consolidated before
reaching the client. This consolidation process is accomplished as a
part of the transmission to the client so that a final comprehensive
sort of all the rows is never performed. Instead, all AMPs sort only
their rows (at the same time, in parallel) and the Message Passing
Layer is used to merge the rows as they are transmitted from all the
AMPs.
Therefore, when a client wishes to sequence the rows of an answer set, this technique causes the sort of all the rows to be done in parallel. Each AMP sorts only its subset of the rows at the same time that all the other AMPs sort their rows. Once all of the individual sorts are complete, the BYNET merges the sorted rows. Pretty brilliant!
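In SQL terms, nothing special is required to take advantage of this. An ordinary sorted request such as the sketch below (again using the hypothetical Sales_Table) is handled exactly this way:

    SELECT   UPC
            ,Description
            ,Quantity_Sold
    FROM     Sales_Table
    ORDER BY Description;

    /* Each AMP sorts only the rows it owns; the sorted subsets are
       merged as they are returned, so no single component ever sorts
       the entire table.                                              */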
Activities of the AMP:
· Store and retrieve data rows using the file system
· Aggregate data
· Join processing between multiple tables (a brief example follows this list)
· Convert ASCII returned data to EBCDIC (IBM mainframes only)
· Sort and format output data
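To make the aggregation and join activities a little more concrete, here is a brief sketch. It joins the hypothetical Sales_Table to an equally hypothetical Item_Table (its name and columns are invented for this example):

    SELECT   i.Dept_Name
            ,SUM(s.Quantity_Sold) AS Total_Sold
    FROM     Sales_Table s
    INNER JOIN Item_Table i
            ON s.UPC = i.UPC
    GROUP BY i.Dept_Name;

    /* Each AMP performs its portion of the join and the aggregation
       in parallel; the partial results are then combined into the
       final answer set returned to the client.                      */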
Message Passing Layer (BYNET)
The Message Passing Layer varies depending on the specific hardware on which the Teradata database is executing. In the latter part of the 20th
century, most Teradata database systems executed under the UNIX
operating system. However, in 1998, Teradata was released on Microsoft's Windows NT operating system, and today it also executes under Windows 2000. The initial release of Teradata on the Microsoft systems is for a single node.
When using the UNIX operating system, Teradata supports up to 512 nodes. This massively parallel architecture establishes the basis for Teradata to store and retrieve data for the largest commercial databases in the world. Today, the largest system in the world consists of 176 nodes. There is much room for growth as the databases begin to exceed 40 or 50 terabytes.
For the NCR UNIX systems, the Message Passing Layer is called the BYNET.
The amazing thing about the BYNET is its capacity. Instead of a fixed bandwidth that is shared among multiple nodes, the bandwidth of the BYNET increases as the number of nodes increases. This feat is accomplished by using virtual circuits instead of a single fixed cable or a twisted-pair configuration.
To understand the workings of the BYNET,
think of a telephone switch used by local and long distance carriers.
As more and more people place phone calls, no one needs to speak slower.
As one switch becomes saturated, another switch is automatically used.
When your phone call is routed through a different switch, you do not
need to speak slower. If a natural or other type of disaster occurs and
a switch is destroyed, all subsequent calls are routed through other
switches. The BYNET is designed to work like a telephone switching
network.
An additional aspect of the BYNET is that it is really two connection paths, like having two phone lines for a business. The redundancy allows for two different aspects of its performance. The first aspect is speed. Each path of the BYNET provides bandwidth of 10 megabytes (MB) per second with Version 1 and 60 MB per second with Version 2. Therefore, the aggregate speed of the two connections is 20MB/second or 120MB/second. However, as mentioned earlier, the bandwidth grows linearly as more nodes are added. Using Version 1, any two nodes communicate at 40MB/second (10MB/second * 2 BYNETs * 2 nodes). Therefore, 10 nodes can utilize 200MB/second and 100 nodes have 2000MB/second available between them. When using the Version 2 BYNET, the same 100 nodes communicate at 12,000MB/second (60MB/second * 2 BYNETs * 100 nodes).
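The arithmetic above can be summarized as a simple rule of thumb:

    aggregate bandwidth = per-path rate * 2 BYNET paths * number of nodes

where the per-path rate is 10MB/second for Version 1 and 60MB/second for Version 2.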
The second and equally important aspect of the BYNET is the use of the two connections for availability. Regardless of the speed associated with each BYNET connection, if one of the connections should fail, the second is completely independent and can continue to function at its individual speed without the other connection. Therefore, communications continue to pass between all nodes.
Although the BYNET is performing at half its capacity during an outage, it is still operational and SQL requests are able to complete without failing. In reality, even when the BYNET is performing at only 10MB/second per node, it is still a lot faster than many conventional networks, which typically transfer messages at 10 megabits (not megabytes) per second.
All messages going across the BYNET have guaranteed delivery. So, any messages not successfully delivered because of a failure on one connection are automatically routed across the other connection. While half of the BYNET is not working, the bandwidth is reduced by half. However, when the failed connection is returned to service, it is automatically configured back into the topology and begins transferring messages along with the other connection. Once this occurs, the capacity returns to normal.