Teradata Architecture
The Teradata database currently runs normally on NCR Corporation's WorldMark Systems in the UNIX MP-RAS environment. Some of these systems consist of a single processing node (computer) while others are several hundred nodes working together in a single system. The NCR nodes are based entirely on industry-standard processor chips, standard internal and external bus architectures like PCI and SCSI, and standard memory modules with 4-way interleaving for speed.
At the same time, Teradata can run in a single-node environment on any hardware server running Microsoft Windows NT or Windows 2000. This single node may be any computer from a large server to a laptop.
Whether
the system consists of a single node or is a massively parallel system
with hundreds of nodes, the Teradata RDBMS uses the exact same
components executing on all the nodes in parallel. The only difference
between small and large systems is the number of processing components.
When these components exist on different nodes, it is essential that the components communicate with each other at high speed. To facilitate this communication, multi-node systems use the BYNET interconnect. It is a high-speed, multi-path, dual-redundant communications channel. Another amazing capability of the BYNET is that its bandwidth increases with each node added to the system. There is more detail on the BYNET later in this chapter.
Teradata Components
As previously mentioned, Teradata is the superior product today because of its parallel operations, which are based on its architectural design. It is the parallel processing by the major components that provides the power to move mountains of data. Teradata works much like the early Egyptians who built the pyramids without heavy equipment, using parallel, coordinated human efforts. It uses smaller nodes running several processing components, all working together on the same user request. Therefore, a monumental task is completed in record time.
Teradata operates with three major components to achieve its parallel operations. These components are called the Parsing Engine Processors, the Access Module Processors and the Message Passing Layer. The role of each component is discussed in the next sections to provide a better understanding of Teradata. Once we understand how Teradata works, we will pursue the SQL that allows storage and access of the data.
Parsing Engine Processor (PEP or PE)
The Parsing Engine Processor (PEP), or Parsing Engine (PE) for short, is one of the two primary types of processing tasks used by Teradata. It provides the entry point into the database for users on mainframe and networked computer systems. It is the primary director task within Teradata.
As users log on to the database they establish a Teradata session. Each PE can manage 120 concurrent user sessions. Within each of these sessions, users submit SQL as a request for the database server to take an action on their behalf. The PE will then parse the SQL statement to establish which database objects are involved. For now, let's assume that the database object is a table. A table is a two-dimensional array that consists of rows and columns. A row represents an entity stored in a table, and it is defined using columns. An example of a row might be the sale of an item, with columns for the UPC, a description and the quantity sold.
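To make that example concrete, the following Teradata SQL is a minimal sketch of such a table. The table name, column names and data types are invented for illustration only:

    CREATE TABLE Sales_Table
      (UPC            DECIMAL(12,0)   /* identifies the item sold   */
      ,Description    CHAR(30)        /* describes the item         */
      ,Quantity_Sold  INTEGER)        /* how many units were sold   */
    PRIMARY INDEX (UPC);

    /* One row, representing one sale, stored in the table */
    INSERT INTO Sales_Table VALUES (123456789012, 'Widget', 3);

The PRIMARY INDEX clause shown here relates to the indices discussed later in this chapter.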
Any
action a user requests must also go through a security check to
validate their privileges as defined by the database administrator.
Once their authorization at the object level is verified, the PE will
verify that the columns requested actually exist within the objects
referenced.
Next,
the PE optimizes the SQL to create an execution plan that is as
efficient as possible based on the amount of data in each table, the
indices defined, the type of indices, the selectivity level of the
indices, and the number of processing steps needed to retrieve the data.
The PE is responsible for passing the optimized execution plan to
other components as the best way to gather the data.
An
execution plan might use the primary index column assigned to the
table, a secondary index or a full table scan. The use of an index is
preferable and will be discussed later in this chapter. For now, it is
sufficient to say that a full table scan means that all rows in the
table must be read and compared to locate the requested data.
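In Teradata SQL, the plan that the PE builds can be seen by prefixing a request with the EXPLAIN keyword. A brief sketch, reusing the hypothetical Sales_Table from earlier:

    EXPLAIN
    SELECT Description, Quantity_Sold
    FROM   Sales_Table
    WHERE  UPC = 123456789012;

    /* The result is English text, not data: it describes the steps
       the PE chose, such as a single-AMP retrieval using the primary
       index or an all-AMP full table scan.                          */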
Although
a full table scan sounds really bad, within the architecture of
Teradata, it is not necessarily a bad thing because the data is divided
up and distributed to multiple, parallel components throughout the
database. We will look next at the AMPs that perform the parallel disk
access using their file system logic. The AMPs manage all data storage
on disks. The PE has no disks.
Activities of a PE:
· Convert incoming requests from EBCDIC to ASCII (if from an IBM mainframe)
· Parse the SQL to determine type and validity
· Validate user privileges
· Optimize the access path(s) to retrieve the rows
· Build an execution plan with necessary steps for row access
· Send the plan steps to the Access Module Processors (AMPs) involved
Access Module Processor (AMP)
The
next major component of Teradata's parallel architecture is called an
Access Module Processor (AMP). It stores and retrieves the distributed
data in parallel. Ideally, the data rows of each table are distributed
evenly across all the AMPs. The AMPs read and write data and are the
workhorses of the database. Their job is to receive the optimized plan
steps, built by the PE after it completes the optimization, and execute
them. The AMPs are designed to work in parallel to complete the request
in the shortest possible time.
Optimally, every AMP should contain a subset of all the rows loaded into every table. By dividing up the data, Teradata automatically divides up the work of retrieving the data. Remember, all work comes as a result of a user's SQL request. If the SQL asks for a specific row, that row exists in its entirety (all columns) on a single AMP, and other rows exist on the other AMPs.
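On a running system, Teradata's hash functions can be used to see how the rows of a table are spread across the AMPs. A sketch, again using the hypothetical Sales_Table (UPC is its primary index column):

    SELECT  HASHAMP(HASHBUCKET(HASHROW(UPC))) AS Owning_AMP
           ,COUNT(*)                          AS Row_Count
    FROM    Sales_Table
    GROUP BY 1
    ORDER BY 1;

    /* HASHROW hashes the primary index value, HASHBUCKET maps that
       hash to a bucket, and HASHAMP identifies the AMP that owns the
       bucket. One row per AMP is returned, showing how evenly the
       table's rows are distributed.                                 */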
If the user request asks for all of the rows in a table, every AMP should participate along with all the other AMPs to complete the retrieval of all rows. This type of processing is called an all-AMP operation and an all-rows scan. However, each AMP is only responsible for its rows, not the rows that belong to a different AMP. As far as each AMP is concerned, it owns all of the rows. Within Teradata, the AMP environment is a "shared nothing" configuration. The AMPs cannot access each other's data rows, and there is no need for them to do so.
Once
the rows have been selected, the last step is to return them to the
client program that initiated the SQL request. Since the rows are
scattered across multiple AMPs, they must be consolidated before
reaching the client. This consolidation process is accomplished as a
part of the transmission to the client so that a final comprehensive
sort of all the rows is never performed. Instead, all AMPs sort only
their rows (at the same time, in parallel) and the Message Passing
Layer is used to merge the rows as they are transmitted from all the
AMPs.
Therefore, when a client wishes to sequence the rows of an answer set, this technique causes the sort of all the rows to be done in parallel. Each AMP sorts only its subset of the rows at the same time that all the other AMPs sort their rows. Once all of the individual sorts are complete, the BYNET merges the sorted rows. Pretty brilliant!
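In SQL terms, nothing special is required to take advantage of this. An ordinary sorted request such as the sketch below (again using the hypothetical Sales_Table) is handled exactly this way:

    SELECT   UPC
            ,Description
            ,Quantity_Sold
    FROM     Sales_Table
    ORDER BY Description;

    /* Each AMP sorts only the rows it owns; the sorted subsets are
       merged as they are returned, so no single component ever sorts
       the entire table.                                              */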
Activities of the AMP:
· Store and retrieve data rows using the file system
· Aggregate data
· Join processing between multiple tables (a brief example follows this list)
· Convert ASCII returned data to EBCDIC (IBM mainframes only)
· Sort and format output data
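To make the aggregation and join activities a little more concrete, here is a brief sketch. It joins the hypothetical Sales_Table to an equally hypothetical Item_Table (its name and columns are invented for this example):

    SELECT   i.Dept_Name
            ,SUM(s.Quantity_Sold) AS Total_Sold
    FROM     Sales_Table s
    INNER JOIN Item_Table i
            ON s.UPC = i.UPC
    GROUP BY i.Dept_Name;

    /* Each AMP performs its portion of the join and the aggregation
       in parallel; the partial results are then combined into the
       final answer set returned to the client.                      */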
Message Passing Layer (BYNET)
The Message Passing Layer varies depending on the specific hardware on which the Teradata database is executing. In the latter part of the 20th
century, most Teradata database systems executed under the UNIX
operating system. However, in 1998, Teradata was released on Microsoft's Windows NT operating system, and today it also executes under Windows 2000. The initial release of Teradata on the Microsoft systems is for a single node.
When using the UNIX operating system, Teradata supports up to 512 nodes. This massively parallel architecture establishes the basis for Teradata to store and retrieve data for the largest commercial databases in the world. Today, the largest system in the world consists of 176 nodes. There is much room for growth as the databases begin to exceed 40 or 50 terabytes.
For the NCR UNIX systems, the Message Passing Layer is called the BYNET.
The amazing thing about the BYNET is its capacity. Instead of a fixed bandwidth that is shared among multiple nodes, the bandwidth of the BYNET increases as the number of nodes increases. This feat is accomplished by using virtual circuits instead of a single fixed cable or a twisted-pair configuration.
To understand the workings of the BYNET,
think of a telephone switch used by local and long distance carriers.
As more and more people place phone calls, no one needs to speak slower.
As one switch becomes saturated, another switch is automatically used.
When your phone call is routed through a different switch, you do not
need to speak slower. If a natural or other type of disaster occurs and
a switch is destroyed, all subsequent calls are routed through other
switches. The BYNET is designed to work like a telephone switching
network.
An additional aspect of the BYNET is that it is really two connection paths, like having two phone lines for a business. The redundancy allows for two different aspects of its performance. The first aspect is speed. Each path of the BYNET provides bandwidth of 10 megabytes (MB) per second with Version 1 and 60 MB per second with Version 2. Therefore, the aggregate speed of the two connections is 20MB/second or 120MB/second. However, as mentioned earlier, the bandwidth grows linearly as more nodes are added. Using Version 1, any two nodes communicate at 40MB/second (10MB/second * 2 BYNETs * 2 nodes). Therefore, 10 nodes can utilize 200MB/second and 100 nodes have 2000MB/second available between them. When using the Version 2 BYNET, the same 100 nodes communicate at 12,000MB/second (60MB/second * 2 BYNETs * 100 nodes).
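The arithmetic above can be summarized as a simple rule of thumb:

    aggregate bandwidth = per-path rate * 2 BYNET paths * number of nodes

where the per-path rate is 10MB/second for Version 1 and 60MB/second for Version 2.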
The second and equally important aspect of the BYNET is the use of the two connections for availability. Regardless of the speed associated with each BYNET connection, if one of the connections should fail, the second is completely independent and can continue to function at its individual speed without the other connection. Therefore, communications continue to pass between all nodes.
Although the BYNET is performing at half its capacity during an outage, it is still operational and SQL requests are able to complete without failing. In reality, even when the BYNET is performing at only 10MB/second per node, it is still a lot faster than many conventional networks, which typically transfer messages at 10 megabits (not megabytes) per second.
All messages going across the BYNET have guaranteed delivery. So, any messages not successfully delivered because of a failure on one connection are automatically routed across the other connection. While half of the BYNET is not working, the bandwidth is reduced by half. However, when the failed connection is returned to service, it is automatically configured back into the topology and begins transferring messages along with the other connection. Once this occurs, the capacity returns to normal.