ANNA UNIVERSITY, CHENNAI – JUNE 2010 & DECEMBER 2010
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SUB CODE/SUB NAME: CS9221 / DATABASE TECHNOLOGY
Part A – (10 * 2 = 20 Marks)
1. What is fragmentation?
(JUNE 2010)
Fragmentation is a database
server feature that allows you to control where data is stored at the table
level. Fragmentation enables you to define groups of rows or index keys within
a table according to some algorithm or scheme. You use SQL statements to create
the fragments and assign them to dbspaces.
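As an illustrative sketch (the function and the boundary values are hypothetical, not from any particular DBMS), range-based fragmentation can be pictured as assigning each row to a fragment by an expression over a column:

```python
# Hypothetical range-fragmentation scheme: rows with id < 100 go to the first
# fragment, id < 500 to the second, and the rest to a remainder fragment.
def assign_fragment(row, boundaries):
    """Return the index of the fragment whose range contains row['id']."""
    for i, upper in enumerate(boundaries):
        if row["id"] < upper:
            return i
    return len(boundaries)          # remainder fragment

rows = [{"id": 5}, {"id": 150}, {"id": 999}]
print([assign_fragment(r, [100, 500]) for r in rows])  # [0, 1, 2]
```

In a real system the same decision is expressed declaratively in the table's fragmentation clause, and the server routes each row to the matching dbspace automatically.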
2. What is Concurrency
control? (JUNE 2010) (NOV/ DEC 2010)
Concurrency control is the activity of coordinating concurrent accesses to a database in a multiuser system. Concurrency control allows users to access a database in a multiprogrammed fashion while preserving the consistency of the data.
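As a minimal sketch (the account scenario is illustrative), a lock is the simplest concurrency-control mechanism: it serializes conflicting writes so that concurrent users leave the data consistent:

```python
import threading

balance = 0
lock = threading.Lock()

def deposit(amount, times):
    """Each call is a stream of small updates from one 'user'."""
    global balance
    for _ in range(times):
        with lock:              # serialize conflicting writes
            balance += amount

threads = [threading.Thread(target=deposit, args=(1, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)  # 4000: consistent despite concurrent access
```

Real DBMSs use finer-grained schemes (two-phase locking, timestamps, multiversioning), but the goal is the same: the interleaved execution is equivalent to some serial one.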
3. What is Persistence? (JUNE 2010) (NOV/ DEC 2010)
Persistence is the property of an object through which its existence transcends time (i.e., the object continues to exist after its creator ceases to exist) and/or space (i.e., the object's location moves from the address space in which it was created).
4. What is Transaction
Processing? (JUNE 2010)
A Transaction Processing System (TPS) is an information system that processes data transactions in a database system and monitors transaction programs (a special kind of program). For example, when an electronic payment is made, the amount must be both withdrawn from one account and added to the other; it cannot complete only one of those steps. Either both must occur, or neither. In case of a failure preventing transaction completion, the partially executed transaction must be 'rolled back' by the TPS.
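The all-or-nothing behaviour described above can be sketched as follows (the snapshot-based rollback and the account names are illustrative, not a real TPS implementation):

```python
class TransferError(Exception):
    pass

def transfer(accounts, src, dst, amount):
    """All-or-nothing transfer: on any failure the partial update is rolled back."""
    snapshot = dict(accounts)              # remember state before the transaction
    try:
        if accounts[src] < amount:
            raise TransferError("insufficient funds")
        accounts[src] -= amount            # withdraw ...
        accounts[dst] += amount            # ... and deposit: both or neither
    except Exception:
        accounts.clear()
        accounts.update(snapshot)          # roll back to the snapshot
        raise

accounts = {"A": 100, "B": 50}
transfer(accounts, "A", "B", 30)
print(accounts)  # {'A': 70, 'B': 80}
```

A failed transfer (e.g. insufficient funds) leaves the accounts exactly as they were, which is the rollback behaviour a TPS guarantees.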
5. What is Client/Server
model? (JUNE 2010)
· The server in a client/server model is simply the DBMS, whereas the client is the database application serviced by the DBMS.
· The client/server model of a database system is classified into the basic and the distributed client/server model.
6. What is the difference
between data warehousing and data mining? (JUNE 2010)
Data warehousing: It is the process that is used to integrate and combine data from multiple sources and format it into a single unified schema. So it provides the enterprise with a storage mechanism for its huge amounts of data.
Data mining: It is the process of extracting interesting patterns and knowledge from huge amounts of data. So we can apply data mining techniques on the data warehouse of an enterprise to discover useful patterns.
7. Why do we need
Normalization? (JUNE 2010)
Normalization is a process followed for eliminating redundant data and establishing meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and also for performance tuning.
8. What is Integrity? (JUNE 2010) (NOV/ DEC 2010)
Integrity refers to the process of ensuring that a database remains an accurate reflection of the universe of discourse it is modeling or representing. In other words, there is a close correspondence between the facts stored in the database and the real world it models.
9. Give two features of
Multimedia Databases (JUNE
2010) (NOV/ DEC 2010)
· Multimedia database systems are used when it is required to administer huge amounts of multimedia data objects of different media types (optical storage, video, tapes, audio records, etc.) so that they can be used (that is, efficiently accessed and searched) for as many applications as needed.
· The objects of multimedia data are text, images, graphics, sound recordings, video recordings, signals, etc., that are digitized and stored.
10. What are Deductive
Databases? (JUNE 2010) (NOV/ DEC 2010)
A Deductive Database is the
combination of a conventional database containing facts, a knowledge base
containing rules, and an inference engine which allows the derivation of information
implied by the facts and rules.
A deductive database system specifies rules through a declarative language - a language in which we specify what to achieve rather than how to achieve it. An inference engine within the system can deduce new facts from the database by interpreting these rules. The model used for deductive databases is related to the relational data model, and also to the field of logic programming and the Prolog language.
11. What is query
processing? (NOV/ DEC 2010)
Query processing is the set of activities involved in obtaining the result of a query expressed in a high-level language. These activities include parsing the query, translating it into expressions that can be implemented at the physical level of the file system, optimizing the internal form to obtain suitable execution strategies, and then actually executing the query to get the result.
12. Give two features of
object-oriented databases. (NOV/ DEC 2010)
The features of Object Oriented Databases are,
1. It provides persistent storage for objects.
2. They may provide one or more of the following: a query language;
indexing; transaction support with rollback and commit; the possibility of
distributing objects transparently over many servers.
13. What is Data
warehousing? (NOV/ DEC
2010)
Data warehousing: It is the process that is used to integrate and combine data from multiple sources and format it into a single unified schema. So it provides the enterprise with a storage mechanism for its huge amounts of data.
14. What is Normalization?
(NOV/ DEC 2010)
Normalization is a process followed for eliminating redundant data and establishing meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and also for performance tuning.
15. Mention two features
of parallel Databases. (NOV/ DEC 2010)
1. It is used to provide speedup, where queries are executed faster
because more resources, such as processors and disks, are provided.
2. It is also used to provide scaleup, where increasing workloads are
handled without increased response time, via an increase in the degree of
parallelism.
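A quick sketch of the two metrics (the function names are mine): speedup compares the same task on small and large systems, while scaleup compares a proportionally grown task on a proportionally grown system:

```python
def speedup(time_small_system, time_large_system):
    """How much faster the same task completes when resources are added."""
    return time_small_system / time_large_system

def scaleup(time_small_task_small_system, time_large_task_large_system):
    """1.0 means an N-times-larger workload finishes in the same time
    on an N-times-larger system (linear scaleup)."""
    return time_small_task_small_system / time_large_task_large_system

print(speedup(100.0, 25.0))  # 4.0: linear speedup on a 4x system
print(scaleup(60.0, 60.0))   # 1.0: linear scaleup
```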
Part B – (5*16=80 Marks) (JUNE 2010) & (DECEMBER 2010)
1. (a)Explain the
architecture of Distributed Databases. (16) (JUNE 2010)
Or
(b) Discuss in detail the architecture of
distributed database. (16)
(NOV/DEC 2010)
(b)Write notes on the following:
(i) Query processing. (8) (JUNE 2010)
(ii) Transaction processing. (8) (JUNE 2010)
2. (a)Discuss the
Modeling and design approaches for Object Oriented Databases
(JUNE 2010)
Or
(b) Describe modeling and design approaches
for object oriented database. (16)
(NOV/DEC 2010)
(b) Explain the Multi-Version Locks and
Recovery in Query Languages. (16) (JUNE 2010)
Or
(a) Explain the Multi-Version Locks and Recovery in Query Languages.
(DECEMBER 2010)
3. (a) Discuss in detail
Data Warehousing and Data Mining. (JUNE 2010)
Or
(a) Explain the features of data
warehousing and data mining. (16)
(NOV/DEC 2010)
Data Warehouse:
Large organizations have complex internal organizations, and have
data stored at different
locations, on different operational (transaction
processing) systems, under different
schemas
Data sources often store only current data, not historical data
Corporate decision making requires a unified view of all
organizational data, including
historical data
A data warehouse is a repository (archive) of information gathered
from multiple sources,
stored under a unified schema, at a single site
· Greatly simplifies
querying, permits study of historical trends
· Shifts decision support
query load away from transaction processing systems
When and how to gather data
· Source driven architecture:
data sources transmit new information to warehouse, either
continuously or periodically (e.g. at night)
· Destination driven
architecture: warehouse periodically requests new information from
data sources
· Keeping warehouse exactly
synchronized with data sources (e.g. using two-phase commit)
is too expensive
· Usually OK to have slightly
out-of-date data at warehouse
· Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use
· Schema integration
Data cleansing
· E.g. correct mistakes in
addresses
· E.g. misspellings, zip code
errors
Merge address lists from different sources and
purge duplicates
· Keep only one address
record per household (“householding”)
How to propagate updates
· Warehouse schema may be a
(materialized) view of schema from data sources
· Efficient techniques for
update of materialized views
What data to summarize
· Raw data may be too large
to store on-line
· Aggregate values
(totals/subtotals) often suffice
· Queries on raw data can often be transformed by
query optimizer to use aggregate values.
Typically warehouse data is multidimensional,
with very large fact tables
· Examples of dimensions:
item-id, date/time of sale, store where sale was made, customer
identifier
· Examples of measures:
number of items sold, price of items
Dimension values are usually encoded using small
integers and mapped to full values via
dimension tables
Resultant schema is called a star schema
More complicated schema structures
· Snowflake schema: multiple
levels of dimension tables
· Constellation: multiple
fact tables
Data Mining
Broadly speaking, data mining is the process of semi-automatically
analyzing large
databases to find useful patterns.
Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
It differs from machine learning in that it deals with large volumes of data stored primarily on disk.
Some types of knowledge discovered from a database can be
represented by a set of rules.
e.g.,: “Young women with annual incomes greater
than $50,000 are most likely to buy
sports cars”.
Other types of knowledge represented by equations, or by
prediction functions.
Some manual intervention is usually required
· Pre-processing of data, choice of which type of
pattern to find, postprocessing to
find novel patterns
Applications of Data Mining
Prediction based on
past history
· Predict if a credit card
applicant poses a good credit risk, based on some attributes
(income, job type, age, ..) and past history
· Predict if a customer is
likely to switch brand loyalty
· Predict if a customer is
likely to respond to “junk mail”
· Predict if a pattern of
phone calling card usage is likely to be fraudulent
Some examples of prediction mechanisms:
Classification
· Given a training set consisting of items
belonging to different classes, and a new
item whose class is unknown, predict which class
it belongs to.
Regression formulae
· Given a set of
parameter-value to function-result mappings for an unknown
function, predict the function-result for a new
parameter-value
Descriptive Patterns
Associations
· Find books that are often
bought by the same customers. If a new customer buys
one such book, suggest that he buys the others
too.
· Other similar applications:
camera accessories, clothes, etc.
Associations may also be used as a first step in
detecting causation
· E.g. association between
exposure to chemical X and cancer, or new medicine and
cardiac problems
Clusters
· E.g. typhoid cases were
clustered in an area surrounding a contaminated well
· Detection of clusters remains
important in detecting epidemics
Classification Rules
Classification rules help assign new objects to a set of classes.
E.g., given a new
automobile insurance applicant, should he or she
be classified as low risk, medium risk or
high risk?
Classification rules for above example could use a variety of
knowledge, such as
educational level of applicant, salary of
applicant, age of applicant, etc.
∀ person P, P.degree = masters and P.income > 75,000
    ⇒ P.credit = excellent
∀ person P, P.degree = bachelors and
    (P.income ≥ 25,000 and P.income ≤ 75,000)
    ⇒ P.credit = good
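The two rules above can be written directly as code (a sketch; the fall-through value for unmatched cases is my addition):

```python
def credit_rating(degree, income):
    """Direct encoding of the two example rules; other cases are left unclassified."""
    if degree == "masters" and income > 75000:
        return "excellent"
    if degree == "bachelors" and 25000 <= income <= 75000:
        return "good"
    return "unclassified"

print(credit_rating("masters", 80000))    # excellent
print(credit_rating("bachelors", 50000))  # good
```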
Rules are not necessarily exact: there may be some
misclassifications
Classification rules can be compactly shown as a decision tree.
Decision Tree
· Training set: a data sample in which the grouping for each
tuple is already known.
· Consider credit risk example: Suppose degree is chosen to partition the data at the root.
Since degree has a small number of possible values, one child is
created for each
value.
· At each child node of the root, further
classification is done if required. Here, partitions are
defined by income.
Since income is a continuous attribute, some number of intervals
are chosen, and
one child created for each interval.
· Different classification
algorithms use different ways of choosing which attribute to
partition on at each node, and what the
intervals, if any, are.
· In general
Different branches of the tree could grow to different levels.
Different nodes at the same level may use different partitioning attributes.
· Greedy top down generation of decision trees.
Each internal node of the tree partitions the
data into groups based on a partitioning
attribute, and a partitioning condition
for the node
More on choosing partitioning attribute/condition shortly
Algorithm is greedy: the choice is made once and not revisited as
more of the tree is
constructed
The data at a node is not partitioned further if
either
All (or most) of the items at the node belong to the same class,
or
All attributes have been considered, and no further partitioning
is possible.
Such a node is a leaf node.
Otherwise the data at the node is partitioned
further by picking an attribute for
partitioning data at the node.
Decision-Tree Construction Algorithm
Procedure Grow.Tree(S)
    Partition(S);

Procedure Partition(S)
    if (purity(S) > dp or |S| < ds) then
        return;                     -- pure enough or too small: leaf node
    for each attribute A
        evaluate splits on attribute A;
    use the best split found (across all attributes) to partition S into S1, S2, …, Sr;
    for i = 1, 2, …, r
        Partition(Si);
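A runnable Python sketch of this greedy procedure, assuming categorical attributes and purity measured as the majority-class fraction (both assumptions are mine, not from the pseudocode):

```python
from collections import Counter

def purity(rows):
    """Fraction of rows belonging to the majority class."""
    counts = Counter(r["class"] for r in rows)
    return max(counts.values()) / len(rows)

def best_split(rows, attrs):
    """Greedy choice: the attribute whose partition has the highest weighted purity."""
    def score(a):
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r)
        return sum(purity(g) * len(g) for g in groups.values()) / len(rows)
    return max(attrs, key=score)

def partition(rows, attrs, dp=0.9, ds=2):
    # leaf if the node is pure enough, too small, or no attributes remain
    if purity(rows) > dp or len(rows) < ds or not attrs:
        return Counter(r["class"] for r in rows).most_common(1)[0][0]
    a = best_split(rows, attrs)
    groups = {}
    for r in rows:
        groups.setdefault(r[a], []).append(r)
    remaining = [x for x in attrs if x != a]
    return {a: {v: partition(g, remaining, dp, ds) for v, g in groups.items()}}

training = [
    {"degree": "masters", "income": "high", "class": "excellent"},
    {"degree": "masters", "income": "high", "class": "excellent"},
    {"degree": "bachelors", "income": "mid", "class": "good"},
    {"degree": "bachelors", "income": "mid", "class": "good"},
]
tree = partition(training, ["degree", "income"])
print(tree)  # {'degree': {'masters': 'excellent', 'bachelors': 'good'}}
```

Note the greedy property: once an attribute is chosen at a node, the choice is never revisited, exactly as the pseudocode describes.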
Other Types of Classifiers
Further types of classifiers
Neural net classifiers
Bayesian classifiers
Neural net classifiers use the training data to
train artificial neural nets
· Widely studied in AI, won’t
cover here
Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where
p(cj | d) = probability of instance d being in class cj,
p(d | cj) = probability of generating instance d given class cj,
p(cj) = probability of occurrence of class cj, and
p(d) = probability of instance d occurring
Naive Bayesian Classifiers
Bayesian classifiers require
computation of p(d | cj)
precomputation of p(cj)
p(d) can be ignored since it is
the same for all classes
To simplify the task, naïve Bayesian classifiers
assume attributes have independent distributions,
and thereby estimate
p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj
· the histogram is computed
from the training instances
Histograms on multiple attributes are more
expensive to compute and store.
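A minimal naive Bayesian classifier along these lines (toy data; no smoothing for zero counts, which a real implementation would add):

```python
from collections import Counter, defaultdict

def train(instances):
    """Estimate class priors p(cj) and per-attribute histograms for p(di | cj)."""
    priors = Counter(c for _, c in instances)
    hists = defaultdict(Counter)          # (attribute index, class) -> value counts
    for attrs, c in instances:
        for i, v in enumerate(attrs):
            hists[(i, c)][v] += 1
    return priors, hists

def classify(attrs, priors, hists):
    """Pick the class maximizing p(cj) * prod_i p(di | cj); p(d) is ignored."""
    total = sum(priors.values())
    def score(c):
        p = priors[c] / total
        for i, v in enumerate(attrs):
            p *= hists[(i, c)][v] / priors[c]   # histogram estimate of p(di | cj)
        return p
    return max(priors, key=score)

training = [(("sunny", "hot"), "no"), (("rainy", "cool"), "yes"),
            (("sunny", "cool"), "no"), (("rainy", "hot"), "yes")]
priors, hists = train(training)
print(classify(("sunny", "cool"), priors, hists))  # no
```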
Regression
Regression deals with the prediction of a value,
rather than a class.
Given values for a set of variables, X1, X2, …, Xn, we wish to predict the
value of a variable
Y.
One way is to infer coefficients a0, a1, a2, …, an such that
Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn
Finding such a linear polynomial is called
linear regression.
In general, the process of finding a curve that
fits the data is also called curve fitting.
The fit may only be approximate
· because of noise in the
data, or
· because the relationship is
not exactly a polynomial
Regression aims to find coefficients that give
the best possible fit.
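For the single-variable case, the closed-form least-squares fit can be sketched as follows (the data values are illustrative):

```python
def linear_fit(xs, ys):
    """Closed-form least-squares fit of y = a0 + a1 * x (single predictor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Noise-free data generated by y = 1 + 2x, so the fit recovers it exactly
a0, a1 = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(a0, a1)  # 1.0 2.0
```

With noisy data the same formula gives the best-fitting line in the least-squares sense, i.e. the fit is only approximate, as noted above.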
Association Rules
Retail shops are often interested in
associations between different items that people buy.
Someone who buys bread is quite likely also to
buy milk
A person who bought the book Database System Concepts is quite likely also to buy the
book Operating System Concepts.
Associations information can be used in several
ways.
E.g. when a customer buys a particular book, an
online shop may suggest associated books.
Association rules:
bread ⇒ milk        DB-Concepts, OS-Concepts ⇒ Networks
Left hand side: antecedent, right hand side:
consequent
An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an
associated confidence.
Support is a measure of what fraction of the population satisfies both the
antecedent and the
consequent of the rule.
E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
We usually want rules with a reasonably high
support
Rules with low support are usually not very
useful
Confidence is a measure of how often the consequent is true
when the antecedent is true.
E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
Usually want rules with reasonably large
confidence.
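Both measures can be computed directly over a set of transactions (toy data, illustrative names):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent is true when the antecedent is true."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

purchases = [{"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"milk", "cereal"}]
print(support(purchases, {"bread", "milk"}))       # 0.5
print(confidence(purchases, {"bread"}, {"milk"}))  # about 0.667
```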
Finding Association Rule
We are generally only interested in association
rules with reasonably high support (e.g. support of
2% or greater)
Naïve algorithm
1. Consider all possible sets of relevant items.
2. For each set find its support (i.e. count how
many transactions purchase all items in
the set).
– Large itemsets: sets with sufficiently high support
3. Use large itemsets to generate association
rules.
– From itemset A generate the rule A − {b} ⇒ b for each b ∈ A.
– Support of rule = support(A).
– Confidence of rule = support(A) / support(A − {b})
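The naive algorithm above, sketched in Python (it enumerates every itemset, so it is exponential in the number of items; that inefficiency is exactly why practical systems use smarter, pruning-based algorithms):

```python
from itertools import combinations

def large_itemsets(transactions, minsup):
    """Enumerate every possible itemset and keep those with support >= minsup."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            sup = sum(set(combo) <= t for t in transactions) / n
            if sup >= minsup:
                result[frozenset(combo)] = sup
    return result

def association_rules(transactions, minsup, minconf):
    """From each large itemset A, generate A - {b} => b, filtered by confidence."""
    large = large_itemsets(transactions, minsup)
    out = []
    for A, sup in large.items():
        for b in A:
            antecedent = A - {b}
            if antecedent and sup / large[antecedent] >= minconf:
                out.append((set(antecedent), b, sup, sup / large[antecedent]))
    return out

sales = [{"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"milk"}]
for rule in association_rules(sales, minsup=0.5, minconf=0.6):
    print(rule)
```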
Other Types of Associations
Basic association rules have several limitations
Deviations from the expected probability are
more interesting
E.g. if many people purchase bread, and many
people purchase cereal, quite a few would be
expected to purchase both (prob1 * prob2)
We are interested in positive as well as negative
correlations between sets of items
Positive correlation: co-occurrence is higher
than predicted
Negative correlation: co-occurrence is lower
than predicted
Sequence associations/correlations
E.g. whenever bonds go up, stock prices go down
in 2 days
Deviations from temporal patterns
E.g. deviation from a steady growth
E.g. sales of winter wear go down in summer
Not surprising, part of a known pattern.
Look for deviation from value predicted using
past patterns
Clustering
· Clustering: Intuitively,
finding clusters of points in the given data such that similar points
lie in the same cluster
· Can be formalized using
distance metrics in several ways
E.g. Group points into k sets (for a given k) such that the average
distance of points from
the centroid of their assigned group is
minimized
· Centroid: point defined by
taking average of coordinates in each dimension.
Another metric: minimize average distance
between every pair of points in a cluster
· Has been studied
extensively in statistics, but on small data sets
Data mining systems aim at clustering techniques
that can handle very large data sets
E.g. the Birch clustering algorithm (more
shortly)
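A minimal k-means sketch for the centroid-based formalization above (1-D points and naive initialization, for brevity; these simplifications are mine):

```python
def kmeans(points, k, iters=10):
    """Assign each point to its nearest centroid, then recompute centroids."""
    centroids = points[:k]                      # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its assigned points
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

centers = kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)
print(sorted(centers))  # roughly [1.0, 10.0]
```

Algorithms like Birch differ chiefly in doing this incrementally and within bounded memory so very large data sets can be clustered.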
Hierarchical Clustering
Example from biological classification
Other examples: Internet directory systems (e.g.
Yahoo, more on this later)
Agglomerative clustering algorithms
Build small clusters, then cluster small
clusters into bigger clusters, and so on
Divisive clustering algorithms
Start with all items in a single cluster,
repeatedly refine (break) clusters into smaller ones
(b) Discuss the features of Web databases
and Mobile databases. (16) (JUNE 2010)
Mobile databases:
Recent advances in portable and wireless technology led to mobile
computing, a new
dimension in data communication and processing.
Portable computing devices coupled with wireless communications
allow clients to access
data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be
resolved before the
capabilities of mobile computing can be fully
utilized.
Some of the software problems – which may involve data management,
transaction
management, and database recovery – have their
origins in distributed database systems.
In mobile computing, the problems are more difficult, mainly:
The limited and intermittent connectivity afforded by wireless
communications.
The limited life of the power supply (battery).
The changing topology of the network.
In addition, mobile computing introduces new architectural
possibilities and
challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in
Fig 30.1.
It is distributed architecture where a number of computers,
generally referred to as Fixed
Hosts and Base Stations are interconnected
through a high-speed wired network.
Fixed hosts are general purpose computers configured to manage
mobile units.
Base stations function as gateways to the fixed network for the
Mobile Units.
Wireless Communications –
The wireless medium has bandwidth significantly lower than that of a wired network.
The current generation of wireless technology has data rates ranging from tens of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the
order of
hundreds of megabits per second.
Other characteristics that distinguish wireless connectivity options include:
interference,
locality of access,
range,
support for packet switching,
seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed
areas of the
frequency spectrum, which may cause interference
with other appliances, such as
cordless telephones.
Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.
Client/Network Relationships –
Mobile units can move freely in a geographic mobility domain, an area that is
circumscribed by wireless network coverage.
To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
The communication architecture described earlier is designed to
give the mobile
unit the impression that it is attached to a
fixed network, emulating a traditional
client-server architecture.
Wireless communications, however, make other architectures
possible. One
alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
In a MANET, co-located mobile units do not need to
communicate via a fixed
network, but instead, form their own using
cost-effective technologies such as
Bluetooth.
In a MANET, mobile units are responsible for routing their
own data, effectively
acting as base stations as well as clients.
Moreover, they must be robust enough to handle changes in the
network
topology, such as the arrival or departure of
other mobile units.
MANET applications can be considered as peer-to-peer, meaning that
a mobile unit
is simultaneously a client and a server.
Transaction processing and data consistency control become more
difficult
since there is no central control in this
architecture.
Resource discovery and data routing by mobile units make computing
in a
MANET even more complicated.
Sample MANET applications are multi-user games, shared whiteboard,
distributed calendars, and battle information
sharing.
Characteristics of Mobile Environments
The characteristics of mobile computing include:
Communication latency
Intermittent connectivity
Limited battery life
Changing client location
The server may not be able to reach a client.
A client may be unreachable because it is dozing – in an
energy-conserving state in
which many subsystems are shut down – or because
it is out of range of a base
station.
In either case, neither client nor server can reach the other, and
modifications must be made to the architecture
in order to compensate for
this case.
Proxies for unreachable components are added to the architecture.
For a client (and symmetrically for a server), the proxy can cache
updates
intended for the server.
Mobile computing poses challenges for servers as well as clients.
The latency involved in wireless communication makes scalability a
problem.
Since latency due to wireless communications increases the time to
service
each client request, the server can handle fewer
clients.
One way servers relieve this problem is by broadcasting data
whenever possible.
A server can simply broadcast data periodically.
Broadcast also reduces the load on the server, as clients do not
have to maintain
active connections to it.
Client mobility also poses many data management challenges.
Servers must keep track of client locations in order to
efficiently route messages to
them.
Client data should be stored in the network location that
minimizes the traffic
necessary to access it.
The act of moving between cells must be transparent to the client.
The server must be able to gracefully divert the shipment of data
from one base to
another, without the client noticing.
Client mobility also allows new applications that are
location-based.
Data Management Issues
From a data management standpoint, mobile computing may be
considered a variation of
distributed computing. Mobile databases can be
distributed under two possible scenarios:
The entire database is distributed mainly among the wired
components, possibly
with full or partial replication.
A base station or fixed host manages its own database with a
DBMS-like functionality, with additional functionality for locating mobile
units and additional query and transaction management features to meet the requirements
of mobile Environments.
The database is distributed among wired and wireless components.
Data management responsibility is shared among base stations or
fixed hosts and mobile units.
Data management issues as it is applied to mobile databases:
Data distribution and replication
Transactions models
Query processing
Recovery and fault tolerance
Mobile database design
Location-based service
Division of labor
Security
Application: Intermittently Synchronized
Databases
Whenever clients connect – through a process known in industry as
synchronization of a client with a server – they receive a batch of updates to
be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and
client-server databases, and some from mobile databases.
This environment is referred to as Intermittently Synchronized
Database Environment (ISDBE).
The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
A client connects to the server when it wants to exchange updates.
The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
A server cannot connect to a client at will.
The characteristics of ISDBs (contd.):
Issues of wireless versus wired client connections and power conservation are generally immaterial.
A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to
manage lists of information. The information can come from a variety of
sources. For example, it can represent research data, business records, customer
requests, sports statistics, sales reports, personal hobby information,
personnel records, bug reports or student grades. The power of a database system comes in when the information you want to organize and manage becomes so voluminous or complex that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations
processing millions of transactions a day, of course, but even small-scale
operations involving a single person maintaining information of personal interest
may require a database. It's not difficult to think of situations in which the
use of a database can be beneficial because you needn't have huge amounts of
information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide
services in ways that were not possible until relatively recently. The manner
in which many organizations use a database in conjunction with a Web site is a
good example.
Suppose your company has an inventory database that is used by the
service desk staff when customers call to find out whether or not you have an
item in stock and how much it costs. That's a relatively traditional use for a
database. However, if your company puts up a Web site for customers to visit,
you can provide an additional service: a search engine that allows customers to
determine item pricing and availability themselves.
This gives your customers the information they want, and the way
you provide it is by
supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put
on hold listening to annoying canned music or being limited by the hours your
service desk is open. And for every customer who uses your Web site, that's one
less phone call that needs to be handled by a person on the service desk payroll.
(Perhaps the Web site pays for itself this way.)
But you can put the database to even better use than that.
Web-based inventory search requests can provide information not only to your
customers, but to you as well. The queries tell you what your customers are
looking for, and the query results tell you whether or not you're able to
satisfy their requests. To the extent you don't have what they want, you're
probably losing business. So it makes sense to record information about
inventory searches: what customers were looking for, and whether or not you had
it in stock. Then you can use this information to adjust your inventory and
provide better service to your customers.
Another recent application for databases is to serve up banner
advertisements on Web pages. We don't like them any better than you do, but the
fact remains that they are a popular application for Web databases, which can
be used to store advertisements, and retrieve them for display by a Web server.
Three Tier Architecture
4. (a)With an example
explain E-R Model in detail. (JUNE 2010)
Or
(b) (i) Explain E-R model with an example.
(8) (NOV/DEC 2010)
Entity-Relationship Model: The entity-relationship (E-R) data model perceives the real world as
consisting of basic objects, called entities, and relationships among these
objects.
Entity Sets
A database can be
modeled as:
a collection of entities,
Relationships among entities.
An entity is an object that exists
and is distinguishable from other objects.
Example: specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees,
holidays
Attributes
An entity is represented by a set of attributes,
that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security,
customer-street, customer-city)
account = (account-number, balance)
Domain -- the set of permitted values for each attribute
Attribute types:
Simple and composite attributes.
Single-valued and multi-valued attributes.
Null attributes
Derived attributes
Relationship Sets
A relationship is an association among several
entities
Example: Hayes (customer entity) is related to A-102 (account entity) via the depositor relationship set.
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship.
Example: (Hayes, A-102) ∈ depositor
An attribute can also
be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date
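The formal definition can be illustrated with a small Python sketch (the entity and relationship names follow the banking example above; the particular members are illustrative):

```python
# A relationship set is a set of tuples drawn from the entity sets.
customers = {"Hayes", "Jones", "Smith"}        # entity set E1
accounts = {"A-101", "A-102", "A-215"}         # entity set E2

# depositor is a binary relationship set: a subset of E1 x E2.
depositor = {("Hayes", "A-102"), ("Jones", "A-101")}

# Every relationship must draw its members from the entity sets.
assert all(c in customers and a in accounts for c, a in depositor)

# A descriptive attribute such as access-date can be attached
# to each individual relationship instance.
access_date = {("Hayes", "A-102"): "2010-05-24"}
```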
Degree of a Relationship Set
§ Refers to number of entity sets that participate in a relationship
set.
§ Relationship sets that
involve two entity sets are binary (or degree two). Generally,
most relationship sets in a database system are binary.
§ Relationship sets may
involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the
ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be
distinct
§ The labels “manager” and “worker” are
called roles; they specify how employee entities
interact via the works-for relationship set.
§ Roles are indicated in E-R diagrams by labeling the lines that
connect diamonds to rectangles.
§ Role labels are optional, and are used to clarify semantics of the
relationship
Design Issues
§ Use of entity sets vs. attributes
Choice mainly depends on the structure of the
enterprise being modeled, and on the
semantics associated with the attribute in
question.
§ Use of entity sets vs. relationship sets
Possible guideline is to designate a
relationship set to describe an action that occurs
between entities
§ Binary versus n-ary relationship sets
Although it is possible to replace a nonbinary (n-ary , for n > 2) relationship set by
a
number of distinct binary relationship sets, a n-ary relationship set shows more clearly that several entities
participate in a single relationship.
Mapping Cardinalities
§ Express the number of entities to which another entity can be
associated via a relationship set.
§ Most useful in describing binary relationship sets.
§ For a binary relationship set the mapping cardinality must be one
of the following types:
– One to one
– One to many
– Many to one
– Many to many
§ We distinguish among these types by drawing either a directed
line (→), signifying “one,” or an undirected line (—), signifying “many,” between
the relationship set and the entity set.
One-To-One Relationship
§ A customer is associated
with at most one loan via the relationship borrower
§ A loan is associated with
at most one customer via borrower
One-To-Many and Many-To-One Relationship
§ In the one-to-many relationship (a), a loan is associated with at
most one customer via
borrower; a customer is associated with several
(including 0) loans via borrower
§ In the many-to-one relationship (b), a loan is associated with
several (including 0)
customers via borrower; a customer is associated
with at most one loan via borrower
Many-To-Many Relationship
§ A customer is associated with several (possibly 0) loans via
borrower
§ A loan is associated with several (possibly 0) customers via
borrower
Existence Dependencies
§ If the existence of entity
x depends on the existence of entity y, then x is said to be
existence dependent on y.
– y is a dominant entity (in example below, loan)
– x is a subordinate entity
(in example below, payment)
§ If a loan entity is deleted, then all its associated payment entities must be deleted also.
E-R Diagram Components
§ Rectangles represent entity
sets.
§ Ellipses represent
attributes.
§ Diamonds represent
relationship sets.
§ Lines link attributes to
entity sets and entity sets to relationship sets.
§ Double ellipses represent
multivalued attributes.
§ Dashed ellipses denote
derived attributes.
§ Primary key attributes are underlined.
Weak Entity Sets
§ An entity set that does not
have a primary key is referred to as a weak entity set.
§ The existence of a weak entity set depends on the existence of a
strong entity set; it must
relate to the strong set via a one-to-many relationship
set.
§ The discriminator (or partial key) of a
weak entity set is the set of attributes that
distinguishes among all the entities of a weak
entity set.
§ The primary key of a weak entity set is formed by the primary key
of the strong entity set on which the weak entity set is existence dependent,
plus the weak entity set's
discriminator.
§ We depict a weak entity set by double rectangles.
§ We underline the discriminator of a weak entity set with a dashed
line.
§ payment-number -- discriminator of the payment entity set
§ Primary key for payment -- (loan-number, payment-number)
Specialization
§ Top-down design process; we designate subgroupings within an
entity set that are
distinctive from other entities in the set.
§ These subgroupings become lower-level entity sets that have
attributes or participate in relationships that do not apply to the
higher-level entity set.
§ Depicted by a triangle component labeled ISA (e.g., savings-account “is an” account)
Generalization
§ A bottom-up design process -- combine a number of entity sets that
share the same features into a higher-level entity set
§ Specialization and generalization are simple inversions of each
other; they are represented in an E-R diagram in the same way.
§ Attribute Inheritance -- a lower-level entity set inherits all the
attributes and relationship participation of the higher-level entity set to
which it is linked.
Design Constraints on a Generalization
§ Constraint on which entities can be members of a given lower-level
entity set.
condition-defined user-defined
§ Constraint on whether or not entities may belong to more than one
lower-level entity set within a single generalization.
Disjoint overlapping
§ Completeness constraint -- specifies whether or not an entity in
the higher-level entity set must belong to at least one of the lower-level
entity sets within a generalization.
Total Partial
Aggregation
§ Relationship sets borrower and loan-officer represent
the same information
§ Eliminate this redundancy
via aggregation
Treat relationship as an abstract entity
Allows relationships between relationships
Abstraction of relationship into new entity
§ Without introducing redundancy , the following diagram represents
that:
A customer takes out a loan
An employee may be a loan officer for a customer-loan pair
E-R Design Decisions
§ The use of an attribute or entity set to represent an object
§ Whether a real-world concept is best expressed by an entity set or
a relationship set.
§ The use of a ternary relationship versus a pair of binary relationships.
§ The use of a strong or weak entity set.
§ The use of generalization -- contributes to modularity in the
design.
§ The use of aggregation -- can treat the aggregate entity set as a
single unit without concern
for the details of its internal structure.
Reduction of an E-R schema to Tables
§ Primary keys allow entity
sets and relationship sets to be expressed uniformly as tables
which represent the contents of the database.
§ A database which conforms to an E-R diagram can be represented by
a collection of tables.
§ For each entity set and relationship set there is a unique table
which is assigned the name of
the corresponding entity set or relationship
set.
§ Each table has a number of columns (generally corresponding to
attributes), which have
unique names.
§ Converting an E-R diagram to a table format is the basis for
deriving a relational database
design from an E-R diagram.
Representing Entity Sets as Tables
§ A strong entity set reduces to a table with the same attributes.
§ A weak entity set becomes a table that includes a column for the
primary key of the
identifying strong entity set.
Representing Relationship Sets as Tables
§ A many-to-many relationship set is represented as a table with
columns for the primary
keys of the two participating entity sets, and
any descriptive attributes of the relationship
set.
§ The table corresponding to
a relationship set linking a weak entity set to its identifying
strong entity set is redundant. The payment table already contains the information that
would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
E-R Diagram for Banking Enterprise
Representing Generalization as Tables
§ Method 1: Form a table for the generalized entity account Form a
table for each entity set
that is generalized (include primary key of
generalized entity set)
§ Method 2: Form a table for each entity set that is generalized
table
(b) Explain the features of Temporal &
Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of
Temporal and Spatial Databases.
(DECEMBER 2010)
Temporal Database
Time Representation, Calendars, and Time
Dimensions
Time is considered an ordered sequence of points in
some granularity.
• Use the term chronon instead of point to
describe the minimum granularity.
A calendar organizes time into different time
units for convenience.
• Accommodates various calendars
Gregorian (western), Chinese, Islamic, Etc.
Point events
• Single time point event
E.g., bank deposit
• Series of point events can form a time series
data
Duration events
• Associated with specific time period
Time period is represented by start time and end
time
Valid time
• The time period during which the information is
true in the real world
Transaction time
• The time when the information was actually
stored in the database
Bitemporal database
• Databases dealing with two time dimensions
Incorporating Time in Relational Databases Using
Tuple Versioning
Add to every tuple
• Valid start time
• Valid end time
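Tuple versioning can be sketched in Python as follows; the employee/salary fields and dates are illustrative assumptions, and an end time of None marks the still-current version:

```python
CURRENT = None  # open-ended valid end time for the current version

# Each logical row becomes a list of versions with [valid start, valid end).
salary_history = [
    {"emp": "Smith", "salary": 25000, "start": "2008-01-01", "end": "2009-06-30"},
    {"emp": "Smith", "salary": 28000, "start": "2009-07-01", "end": CURRENT},
]

def update(history, emp, new_salary, as_of):
    """Close the current version and open a new one (tuple versioning)."""
    for row in history:
        if row["emp"] == emp and row["end"] is CURRENT:
            row["end"] = as_of          # end the old version's valid time
    history.append({"emp": emp, "salary": new_salary,
                    "start": as_of, "end": CURRENT})

update(salary_history, "Smith", 30000, "2010-01-01")
# The history now holds three versions; only the last one is current.
```

Nothing is overwritten: an update closes the old version's valid interval and appends a new tuple, so past states remain queryable.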
Incorporating Time in Object-Oriented Databases
Using Attribute Versioning
A single complex object stores all temporal
changes of the object
Time varying attribute
• An attribute that changes over time
E.g., age
Non-time-varying attribute
• An attribute that does not change over time
E.g., date of birth
Spatial Database
Types of Spatial Data
Point Data
Points in a multidimensional space
E.g., Raster data such as satellite imagery,
where each
pixel stores a measured value
E.g., Feature vectors extracted from text
Region Data
Objects have spatial extent with location and boundary.
DB typically uses geometric approximations
constructed using line segments, polygons,
etc., called vector data.
Types of Spatial Queries
Spatial Range Queries
Find all cities within 50 miles of Madison
Query has associated region (location, boundary)
Answer includes overlapping or contained data
regions
Nearest-Neighbor Queries
Find the 10 cities nearest to Madison
Results must be ordered by proximity
Spatial Join Queries
Find all cities near a lake
Expensive, join condition involves regions and
proximity
Applications of Spatial Data
Geographic Information Systems (GIS)
E.g., ESRI’s ArcInfo; OpenGIS Consortium
Geospatial information
All classes of spatial queries and data are
common
Computer-Aided Design/Manufacturing
Store spatial objects such as surface of
airplane fuselage
Range queries and spatial join queries are
common
Multimedia Databases
Images, video, text, etc. stored and retrieved
by content
First converted to feature vector form; high
dimensionality
Nearest-neighbor queries are the most common
Single-Dimensional Indexes
B+ trees are fundamentally single-dimensional
indexes.
When we create a composite search key B+ tree,
e.g., an index on <age, sal>, we effectively
linearize the 2-dimensional space since we sort
entries first by age and then by sal.
Consider entries:
<11, 80>, <12, 10>
<12, 20>, <13, 75>
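The linearization can be seen by sorting the four example entries exactly as a composite-key B+ tree orders them (by age first, then by sal):

```python
entries = [(11, 80), (12, 10), (12, 20), (13, 75)]

# A B+ tree on <age, sal> sorts entries first by age, then by sal,
# flattening the 2-dimensional space into one linear order.
linear = sorted(entries)
print(linear)  # [(11, 80), (12, 10), (12, 20), (13, 75)]

# <11, 80> and <13, 75> are close in sal but far apart in the
# linear order -- a 2-D query like "sal > 70" scatters across the
# index, which motivates multidimensional indexes.
```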
Multi-dimensional Indexes
A multidimensional index clusters entries so as
to exploit “nearness” in multidimensional space.
Keeping track of entries and maintaining a
balanced index structure presents a challenge!
Consider entries:
<11, 80>, <12, 10>
<12, 20>, <13, 75>
Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD).
Find all hotels within a radius of 5 miles from the
conference venue.
Find the city with population 500,000 or more
that is nearest to Kalamazoo, MI.
Find all cities that lie on the Nile in Egypt.
Find all parts that touch the fuselage (in a
plane design).
Similarity queries (content-based retrieval).
Given a face, find the five most similar faces.
Multidimensional range queries.
50 < age < 55 AND 80K < sal < 90K
Drawbacks
An index based on spatial location needed.
One-dimensional indexes don’t support
multidimensional searching efficiently.
Hash indexes only support point queries; want to
support range queries as well.
Must support inserts and deletes gracefully.
Ideally, want to support non-point data as well (e.g., lines,
shapes).
The R-tree meets these requirements, and variants are widely used
today.
R-Tree
R-Tree Properties
Leaf entry = < n-dimensional box, rid >
This is Alternative (2), with key value being a
box.
Box is the tightest bounding box for a data
object.
Non-leaf entry = < n-dim box, ptr to child node >
Box covers all boxes in child node (in fact,
subtree).
All leaves at same distance from root.
Nodes can be kept 50% full (except root).
Can choose a parameter m that is <= 50%, and
ensure that every node is at least m% full.
Example of R-Tree
Search for Objects Overlapping Box Q
Start at root.
1. If current node is non-leaf, for each entry
<E, ptr>, if box E overlaps Q, search subtree identified
by ptr.
2. If current node is leaf, for each entry
<E, rid>, if E overlaps Q, rid identifies an object that might
overlap Q.
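The two-step search can be sketched in Python; the representation below (a box as an (xlo, ylo, xhi, yhi) tuple, nodes as dicts) is an illustrative assumption, not a prescribed layout:

```python
def overlaps(a, b):
    """True if boxes a and b intersect; each box is (xlo, ylo, xhi, yhi)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, result):
    """Collect rids of leaf entries whose bounding box overlaps query box q."""
    if node["leaf"]:
        for box, rid in node["entries"]:
            if overlaps(box, q):
                result.append(rid)      # candidate: object *might* overlap Q
    else:
        for box, child in node["entries"]:
            if overlaps(box, q):        # descend only overlapping subtrees
                search(child, q, result)
    return result

# A tiny two-level R-tree for illustration.
leaf1 = {"leaf": True, "entries": [((0, 0, 2, 2), "r1"), ((3, 3, 4, 4), "r2")]}
leaf2 = {"leaf": True, "entries": [((5, 5, 7, 7), "r3")]}
root = {"leaf": False, "entries": [((0, 0, 4, 4), leaf1), ((5, 5, 7, 7), leaf2)]}

print(search(root, (1, 1, 3, 3), []))  # ['r1', 'r2']
```

Because leaf boxes are only tight *approximations* of the objects, the rids returned are candidates that may still need an exact geometric test.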
Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of
arbitrary regions, because
boxes can be represented compactly.
But why not use convex polygons to approximate query regions more
accurately?
Will reduce overlap with nodes in tree, and
reduce the number of nodes fetched by
avoiding some branches altogether.
Cost of overlap test is higher than bounding box
intersection, but it is a main-memory
cost, and can actually be done quite
efficiently. Generally a win.
Insert Entry <B, ptr>
Start at root and go down to “best-fit” leaf L.
Go to child whose box needs least enlargement to
cover B; resolve ties by going to
smallest area child.
If best-fit leaf L has space, insert entry and stop. Otherwise,
split L into L1 and L2.
Adjust entry for L in its parent so that the box
now covers (only) L1.
Add an entry (in the parent node of L) for L2.
(This could cause the parent node to
recursively split.)
Splitting a Node during Insertion
The entries in node L plus the newly inserted entry must be
distributed between L1 and L2.
Goal is to reduce likelihood of both L1 and L2 being searched on
subsequent queries.
Idea: Redistribute so as to
minimize area of L1 plus area of L2.
R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap
in tree nodes. When a node overflows, instead of splitting:
Remove some (say, 30% of the) entries and
reinsert them into the tree.
Could result in all reinserted entries fitting
on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters
rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an
object into multiple leaves if necessary.
Searches now take a single path to a leaf, at
cost of redundancy.
GiST
The Generalized Search Tree (GiST) abstracts the “tree” nature of
a class of indexes
including B+ trees and R-tree variants.
Striking similarities in insert/delete/search
and even concurrency control algorithms
make it possible to provide “templates” for
these algorithms that can be customized to
obtain the many different tree index structures.
B+ trees are so important (and simple enough to
allow further specialization) that they are implemented specially in all DBMSs.
GiST provides an alternative for implementing
other tree indexes in an ORDBS.
Indexing High-Dimensional Data
Typically, high-dimensional datasets are collections of points,
not regions.
E.g., Feature vectors in multimedia
applications.
Very sparse
Nearest neighbor queries are common.
R-tree becomes worse than sequential scan for
most datasets with more than a dozen dimensions.
As dimensionality increases contrast (ratio of distances between
nearest and farthest points) usually decreases; “nearest neighbor” is not
meaningful.
In any given data set, advisable to empirically
test contrast.
5. (a) Explain the
features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and
affordable
· Prices of microprocessors,
memory and disks have dropped sharply
Databases are growing increasingly large
· Large volumes of
transaction data are collected and stored for later analysis.
· multimedia objects like
images are increasingly stored in databases
Large-scale parallel database systems
increasingly used for:
· storing large volumes of
data
· processing time-consuming
decision-support queries
· providing high throughput for transaction
processing
Parallelism in Databases
Data can be partitioned across multiple disks
for parallel I/O.
Individual relational operations (e.g., sort,
join, aggregation) can be executed in parallel
· Data can be partitioned and each processor can
work independently on its own partition.
Queries are expressed in high level language
(SQL, translated to relational algebra)
· Makes parallelization
easier.
Different queries can be run in parallel with
each other.
Concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to
parallelism.
I/O Parallelism
Reduce the time required to retrieve relations
from disk by partitioning the relations on multiple disks.
Horizontal partitioning – tuples of a relation
are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin:
Send the ith tuple inserted in the relation to
disk i mod n.
Hash partitioning:
Choose one or more attributes as the
partitioning attributes.
Choose hash function h with range 0…n – 1
Let i denote the result of hash function h applied
to the partitioning attribute value of a tuple. Send the
tuple to disk i.
Partitioning techniques (cont.):
Range partitioning:
Choose an attribute as the partitioning
attribute.
A partitioning vector [v0, v1, ..., vn-2] is chosen.
Let v be the partitioning
attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples
with v < v0 go to disk 0 and tuples with v ≥ vn-2 go to disk n-1.
E.g., with a partitioning vector [5,11], a tuple
with partitioning attribute value of 2 will go to disk
0, a tuple with value 8 will go to disk 1, while
a tuple with value 20 will go to disk 2.
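The three partitioning techniques can be sketched as Python functions; n is the number of disks, and the range scheme reuses the [5, 11] vector from the example (`bisect_right` implements exactly the "vi ≤ v < vi+1 goes to disk i + 1" rule):

```python
import bisect

def round_robin(i, n):
    """Disk for the i-th tuple inserted into the relation."""
    return i % n

def hash_part(value, n):
    """Disk chosen by hashing the partitioning-attribute value."""
    return hash(value) % n

def range_part(value, vector):
    """Disk chosen by partitioning vector [v0, ..., vn-2]:
    v < v0 -> disk 0; vi <= v < vi+1 -> disk i+1; v >= vn-2 -> disk n-1."""
    return bisect.bisect_right(vector, value)

vector = [5, 11]
print(range_part(2, vector))   # 0
print(range_part(8, vector))   # 1
print(range_part(20, vector))  # 2
```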
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques
support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries.
E.g., r.A = 25.
3. Locating all tuples such that the value of a
given attribute lies within a specified range – range
queries.
E.g., 10 ≤ r.A < 25.
Round robin:
Advantages
· Best suited for sequential
scan of entire relation on each query.
· All disks have almost an
equal number of tuples; retrieval work is thus well balanced
between disks.
Range queries are difficult to process
· No clustering -- tuples are
scattered across all disks
Hash partitioning:
Good for sequential access
· Assuming hash function is good, and partitioning
attributes form a key, tuples will be
equally distributed between disks
· Retrieval work is then well
balanced between disks.
Good for point queries on partitioning attribute
Can lookup single disk, leaving others available for answering
other queries.
Index on partitioning attribute can be local to disk, making
lookup and update more
efficient
No clustering, so difficult to answer range
queries
Range partitioning:
Provides data clustering by partitioning
attribute value.
Good for sequential access
Good for point queries on partitioning
attribute: only one disk needs to be accessed.
For range queries on partitioning attribute, one
to a few disks may need to be accessed
− Remaining disks are available for other queries.
− Good if result tuples are from one to a few blocks.
− If many blocks are to be fetched, they are still fetched from one
to a few disks, and
potential parallelism in disk access is wasted
Example of execution skew.
Partitioning a Relation across Disks
· If a relation contains only a few tuples which will fit into a
single disk block, then assign the relation to a single disk.
· Large relations are preferably partitioned across all the
available disks.
· If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be
allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed — that is, some disks have many tuples, while others may have
fewer tuples.
Types of skew:
Attribute-value skew: Some values appear in the partitioning attributes of many tuples;
all the tuples with the same value for the partitioning attribute end up in the
same partition. Can occur
with range-partitioning and hash-partitioning.
Partition skew: With range-partitioning, a
badly chosen partition vector may assign too many tuples to some partitions and
too few to others. Less likely with hash-partitioning if a good
hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector
(assuming partitioning attribute forms a key of the relation):
· Sort the relation on the
partitioning attribute.
· Construct the partition vector by scanning the
relation in sorted order as follows.
· After every 1/n-th of the relation has been read, the value of the partitioning attribute of
the next tuple is added to the partition vector.
· n denotes the number of partitions to be constructed.
· Duplicate entries or imbalances can result if duplicates
are present in the partitioning attributes.
Alternative technique based on histograms used in practice
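The scan-based construction can be sketched as below, assuming the relation is already sorted on a partitioning attribute that forms a key (the sample values and n are illustrative):

```python
def build_partition_vector(sorted_values, n):
    """Take a cut point after every 1/n-th of the sorted relation,
    yielding an (n-1)-entry vector that gives n balanced partitions."""
    size = len(sorted_values)
    return [sorted_values[(i * size) // n] for i in range(1, n)]

values = sorted(range(100))              # key attribute, already sorted
vector = build_partition_vector(values, 4)
print(vector)  # [25, 50, 75]
```

With duplicate attribute values the same cut point can repeat, producing the imbalances noted above, which is why histogram-based construction is preferred in practice.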
Handling Skew using Histograms
Balanced partitioning vector can be constructed
from histogram in a relatively straightforward fashion.
Assume uniform distribution within each range of the histogram
Histogram can be constructed by scanning
relation, or sampling (blocks containing) tuples of the relation
Interquery Parallelism
Queries/transactions execute in parallel with
one another.
Increases transaction throughput; used primarily
to scale up a transaction processing system to support a larger number of
transactions per second.
Easiest form of parallelism to support,
particularly in a shared memory parallel database, because even sequential
database systems support concurrent processing.
More complicated to implement on shared-disk or
shared-nothing architectures
Locking and logging must be coordinated by passing messages
between processors.
Data in a local buffer may have been updated at another processor.
Cache-coherency has to be maintained — reads and writes of data in buffer must
find latest version of data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared
disk systems:
Before reading/writing to a page, the page must be locked in
shared/exclusive mode.
On locking a page, the page must be read from disk
Before unlocking a page, the page must be written to disk if it
was modified. More complex protocols with fewer disk reads/writes exist. Cache
coherency protocols for shared-nothing systems are similar. Each database page
is assigned a home processor. Requests to fetch the page or write
it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on
multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery
parallelism:
Intraoperation Parallelism – parallelize the execution of each individual
operation in the query.
Interoperation Parallelism – execute the different operations in a query
expression in parallel.
The first form scales better with increasing
parallelism because the number of tuples processed by each operation is
typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m ≤ n − 1, to do the
sorting. Create a range-partition vector with m entries on the sorting
attributes. Redistribute the relation using range partitioning:
all tuples that lie in the ith range are sent to processor Pi;
Pi stores the tuples it received temporarily on disk Di.
This step requires I/O and communication overhead.
Each processor Pi sorts its partition of the
relation locally.
Each processor executes the same operation (sort)
in parallel with the other processors, without any interaction with the others (data parallelism).
Final merge operation is trivial:
range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are
all less than the key values in Pj.
Parallel External Sort-Merge
Assume the relation has already been partitioned
among disks D0, ..., Dn-1
Each processor Pi locally sorts the data on
disk Di.
The sorted runs on each processor are then
merged to get the final sorted output.
Parallelize the merging of sorted runs as
follows:
The sorted partitions at each processor Pi are range-partitioned across the processors P0...Pm-1.
Each processor Pi performs a merge on the
streams as they are received, to get a single sorted run.
The sorted runs on processors P0,..., Pm-1 are concatenated to get the final result.
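The sort-merge scheme can be simulated in a single process; each list stands in for one processor's disk partition, and `heapq.merge` plays the role of merging the received sorted runs:

```python
import heapq

# Data as initially partitioned across disks D0..D2 (no global order).
partitions = [[7, 1, 9], [4, 8, 2], [6, 3, 5]]

# Step 1: each processor Pi sorts its local partition in parallel.
sorted_runs = [sorted(p) for p in partitions]

# Step 2: merge the sorted runs into the final sorted output.
result = list(heapq.merge(*sorted_runs))
print(result)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```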
Parallel Join
The join operation requires pairs of tuples to
be tested to see if they satisfy the join condition, and if they do, the pair
is added to the join output.
Parallel join algorithms attempt to split the
pairs to be tested over several processors. Each processor then computes part
of the join locally.
In a final step, the results from each processor
can be collected together to produce the final result.
Partitioned Join
For equi-joins and natural joins, it is possible
to partition the two input relations across the processors,
and compute the join locally at each processor.
Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned
into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Either range partitioning or hash partitioning can be used. r and s must be partitioned on
their join attributes (r.A and s.B), using the same
range-partitioning vector or hash function.
Partitions ri and si are sent to processor Pi. Each processor Pi locally computes ri ⋈ si; any of the standard join methods can be used.
Fragment-and-Replicate Join
Partitioning not possible for some join
conditions,
e.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable,
parallelization can be accomplished by the fragment-and-
replicate technique.
Special case – asymmetric fragment-and-replicate:
One of the relations, say r, is partitioned; any
partitioning technique can be used.
The other relation, s, is replicated across all
the processors.
Processor Pi then locally computes the
join of ri with all of s using any join technique.
Both versions of fragment-and-replicate work
with any join condition, since every tuple in r can be tested with every
tuple in s. Usually has a higher cost than partitioning,
since one of the relations (for asymmetric fragment-and-replicate) or both
relations have to be replicated. Sometimes asymmetric fragment-and-replicate is
preferable even though partitioning could be used.
E.g., say s is small and r is large, and already partitioned. It may be
cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing partitioned hash join:
Assume s is smaller than r and therefore s is chosen as the build
relation.
A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to
the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
As tuples of relation s are received at the destination processors, they are partitioned
further using another hash function, h2, which is used to compute
the hash-join locally.
Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash
function h1.
· Let ri denote the tuples of relation r that are sent to processor Pi.
As the r tuples are received at the
destination processors, they are repartitioned using the function h2
Each processor Pi executes the build and
probe phases of the hash-join algorithm on the local partitions ri and si of r and s to produce a partition of the final result of the hash-join.
Hash-join optimizations can be applied to the
parallel case
· e.g., the hybrid hash-join algorithm can be used
to cache some of the incoming tuples in memory and avoid the cost of writing
them and reading them back in.
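A single-process sketch of the partitioned hash-join: h1 routes tuples to "processors", and a per-processor dictionary stands in for the local hash table that h2 would build (relation contents are illustrative):

```python
from collections import defaultdict

def h1(key, n):
    """First-level hash: route a tuple to one of n processors."""
    return hash(key) % n

def partitioned_hash_join(r, s, n):
    """r and s are lists of (join_key, payload) tuples; s is the build relation."""
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in s:                        # distribute the smaller relation s
        s_parts[h1(t[0], n)].append(t)
    for t in r:                        # redistribute r with the same h1
        r_parts[h1(t[0], n)].append(t)

    result = []
    for i in range(n):                 # each processor Pi joins locally
        build = defaultdict(list)      # stands in for the local h2 hash table
        for key, sval in s_parts[i]:
            build[key].append(sval)
        for key, rval in r_parts[i]:   # probe phase
            for sval in build[key]:
                result.append((key, rval, sval))
    return result

r = [(1, "r1"), (2, "r2"), (3, "r3")]
s = [(2, "s2"), (3, "s3"), (4, "s4")]
print(sorted(partitioned_hash_join(r, s, 3)))  # [(2, 'r2', 's2'), (3, 'r3', 's3')]
```

Because both relations are routed by the same h1 on the join key, matching tuples always land at the same processor, so the local joins cover the full result.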
Parallel Nested-Loop Join
Assume that
relation s is much smaller than relation r, that r is stored by partitioning, and that
there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with
relation s being replicated, and using the existing partitioning
of relation r.
Each processor Pj where a partition of
relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples
to every other processor Pi.
At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
Each processor Pi performs an indexed
nested-loop join of relation s with the ith partition of relation
r.
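The asymmetric scheme can be sketched as follows: s is replicated to every processor, and a dictionary on the join attribute stands in for the existing index on each fragment of r (the data shown is illustrative):

```python
def fragment_and_replicate_join(r_fragments, s):
    """r_fragments: existing partitions of r, one per processor;
    s: the small relation, replicated at every processor."""
    result = []
    for ri in r_fragments:             # each processor Pi works on ri
        index = {}                     # stands in for the index on r's join attribute
        for key, rval in ri:
            index.setdefault(key, []).append(rval)
        for key, sval in s:            # indexed nested-loop over replicated s
            for rval in index.get(key, []):
                result.append((key, rval, sval))
    return result

r_fragments = [[(1, "r1"), (2, "r2")], [(3, "r3")]]
s = [(2, "s2"), (3, "s3")]
print(sorted(fragment_and_replicate_join(r_fragments, s)))
# [(2, 'r2', 's2'), (3, 'r3', 's3')]
```

Since every r tuple is compared (via the index) against every s tuple, this works for any join condition, at the cost of replicating s.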
Interoperator Parallelism
Pipelined parallelism
· Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4
· Set up a pipeline that
computes the three joins in parallel:
Let P1 be assigned the computation of temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = temp1 ⋈ r3
And P3 be assigned the computation of temp2 ⋈ r4
· Each of these operations can execute in parallel,
sending result tuples it computes to the next operation even as it is computing
further results
· Provided a pipelineable join evaluation
algorithm is used.
Independent Parallelism
Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4
Let P1 be assigned the computation of temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = r3 ⋈ r4
And P3 be assigned the computation of temp1 ⋈ temp2
P1 and P2 can work independently in parallel
P3 has to wait for input from P1 and P2
Can pipeline output of P1 and P2 to P3, combining independent
parallelism and
pipelined parallelism
Does not provide a high degree of parallelism
useful with a lower degree of parallelism.
less useful in a highly parallel system.
Query Optimization
Query optimization in parallel databases is
significantly more complex than query optimization in sequential databases.
Cost models are more complicated, since we must
take into account partitioning costs and issues such as skew and resource
contention.
When scheduling execution
tree in parallel system, must decide:
How to parallelize each operation and how many processors to use
for it.
What operations to pipeline, what operations to execute
independently in parallel, and what operations to execute sequentially, one
after the other.
Determining the amount of resources to allocate
for each operation is a problem.
E.g., allocating more processors than optimal can result in high
communication overhead.
Long pipelines should be avoided as the final
operation may wait a lot for inputs, while holding precious resources
Text Databases
A text database is a
compilation of documents or other information in the form of a database in which
the complete text of each referenced document is available for online viewing,
printing, or downloading. In addition to text documents, images are often
included, such as graphs, maps, photos, and diagrams. A text database is
searchable by keyword, phrase, or both. When an item in a text database is
viewed, it may appear in ASCII format (as a text file with the .txt extension),
as a word-processed file (requiring a program such as Microsoft Word), or as a
PDF file. When a document appears as a PDF file, it is usually a scanned
hardcopy of the original article, chapter, or book.
Text databases are used
by college and university libraries as a convenience to their students and staff.
Full-text databases are ideally suited to online courses of study, where the
student remains at home and obtains course materials by downloading them from
the Internet. Access to these databases is normally restricted to registered
personnel or to people who pay a specified fee per viewed item. Full-text
databases are also used by some corporations, law offices, and government agencies.
(b) Discuss the Rules, Knowledge Bases and
Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate
knowledge to interpret information in a useful way. They are often used in
artificial intelligence applications and research. Rule-based systems are
specialized software that encapsulate human expert knowledge, thereby making
intelligent decisions quickly and in a repeatable form. They are also known as
expert systems and belong to the field of artificial intelligence.
Rule based systems are:
– Knowledge based systems
– Part of the Artificial Intelligence field
– Computer programs that contain some
subject-specific knowledge of one or more
human experts
– Made up of a set of rules that analyze user
supplied information about a specific class of problems.
– Systems that utilize reasoning capabilities
and draw conclusions.
Knowledge Engineering – building an expert system
Knowledge Engineers – the people who build the system
Knowledge Representation – the symbols used to represent the
knowledge
Factual Knowledge – knowledge of a particular task domain that is
widely shared
Heuristic Knowledge – more judgmental knowledge of performance in
a task domain.
Uses of Rule based Systems
Very useful to companies with a high-level of experience and
expertise that cannot easily be transferred to other members.
Solves problems that would normally be tackled by a medical or
other professional.
Currently used in fields such as accounting, medicine, process
control, financial service, production, and human resources
Applications
A classic example of a
rule-based system is the domain-specific expert system that uses rules to make
deductions or choices. For example, an expert system might help a doctor choose
the correct diagnosis based on a cluster of symptoms, or select tactical moves
to play a game. Rule-based systems can be used to perform lexical analysis to
compile or interpret computer programs, or in natural language processing. Rule-based
programming attempts to derive execution instructions from a starting set of
data and rules. This is a more indirect method than that employed by an
imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic
components:
A list of rules or rule base, which
is a specific type of knowledge base.
An inference engine or semantic reasoner, which
infers information or takes action based on the interaction of input and the
rule base. The interpreter executes a production system program by performing
the following match-resolve-act cycle:
· Match: In this first phase,
the left-hand sides of all productions are matched against the contents of
working memory. As a result a conflict set is obtained, which consists of
instantiations of all satisfied productions. An
instantiation of a production is an ordered
list of working memory elements that satisfies
the left-hand side of the production.
· Conflict-Resolution: In
this second phase, one of the production instantiations in the
conflict set is chosen for execution. If no
productions are satisfied, the interpreter halts.
· Act: In this third phase,
the actions of the production selected in the conflict-resolution
phase are executed. These actions may change the
contents of working memory. At the
end of this phase, execution returns to the
first phase.
Temporary working memory.
A user interface or other connection to the outside world through
which input and output signals are received and sent.
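The match-resolve-act cycle above can be sketched as a tiny interpreter. This is a minimal illustration with hypothetical rules and working-memory facts, not any particular production-system engine:

```python
# Minimal production-system interpreter illustrating the
# match-resolve-act cycle. Rules and facts are hypothetical.

def run(rules, memory):
    """Repeatedly match, resolve conflicts, and act until quiescent."""
    while True:
        # Match: collect all productions whose left-hand sides are
        # satisfied by working memory (and that would add something new).
        conflict_set = [r for r in rules
                        if r["if"] <= memory and not r["then"] <= memory]
        if not conflict_set:          # no productions satisfied: halt
            return memory
        # Conflict-resolution: choose one instantiation (here: the first).
        rule = conflict_set[0]
        # Act: fire the rule, changing the contents of working memory.
        memory = memory | rule["then"]

rules = [
    {"if": {"fever", "cough"}, "then": {"flu-suspected"}},
    {"if": {"flu-suspected"}, "then": {"order-flu-test"}},
]
facts = run(rules, {"fever", "cough"})
print(sorted(facts))
```

Real engines use more sophisticated conflict-resolution strategies (recency, specificity); picking the first match is the simplest possible choice.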
Components of a Rule-Based System
Set of Rules – derived from the knowledge base and used by the
interpreter to evaluate the inputted data
Knowledge Engineer – decides how to represent the expert's knowledge and how
to build the inference engine appropriately for the domain
Interpreter – interprets the input data and draws a conclusion based on the
user's responses.
Problem-solving Models
Forward-chaining – starts from a set of conditions and moves towards some
conclusion
Backward-chaining – starts with a list of goals and then works backwards to see
if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or
inference procedures
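Backward chaining can be sketched in a few lines; the rules and facts below are hypothetical, chosen only to show the goal-directed direction of reasoning:

```python
# Backward-chaining sketch: start from a goal and work backwards,
# checking whether known facts can establish it. Rules map a conclusion
# to the alternative sets of premises that would prove it (hypothetical).

RULES = {
    "flu": [{"fever", "cough"}],        # flu holds if fever and cough hold
    "fever": [{"high-temp"}],
}

def prove(goal, facts, rules=RULES):
    if goal in facts:
        return True
    # Try every rule that concludes the goal; all premises must be proven.
    return any(all(prove(p, facts, rules) for p in premises)
               for premises in rules.get(goal, []))

print(prove("flu", {"high-temp", "cough"}))
```

Note this naive recursion assumes the rule set is acyclic; a production-quality engine would track goals already under consideration to avoid looping.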
Advantages
Provide consistent answers for repetitive decisions, processes and
tasks.
Hold and maintain significant levels of information.
Reduce employee training costs
Centralize the decision making process.
Create efficiencies and reduce the time needed to solve problems.
Combine multiple human expert intelligences
Reduce the amount of human errors.
Give strategic and comparative advantages creating entry barriers
to competitors
Review transactions that human experts may overlook.
Disadvantages
Lack human common sense needed in some decision making.
Will not be able to give the creative responses that human experts
can give in unusual circumstances.
Domain experts cannot always clearly explain their logic and
reasoning.
Challenges of automating complex processes.
Lack of flexibility and ability to adapt to changing environments.
Not being able to recognize when no answer is available.
Knowledge Bases
Knowledge-based Systems: Definition
A system that draws upon the knowledge of human experts captured
in a knowledge-base to solve problems that normally require human expertise.
Heuristic rather than algorithmic
Heuristics in search vs. in KBS general vs. domain-specific
Highly specific domain knowledge
Knowledge is separated from how it is used
KBS = knowledge-base + inference engine
KBS Architecture
The inference engine and knowledge base are
separated because:
· the reasoning mechanism
needs to be as stable as possible;
· the knowledge base must be
able to grow and change, as knowledge is added;
· this arrangement enables
the system to be built from, or converted to, a shell.
· It is reasonable to produce a richer, more elaborate description of the
typical expert system, one which still includes the components that are to be
found in almost any real-world system.
Knowledge representation formalisms & Inference
KR formalism – Inference method
Logic – Resolution principle
Production rules – backward (top-down, goal-directed) or forward (bottom-up,
data-driven) chaining
Semantic nets & Frames – Inheritance & advanced reasoning
Case-based Reasoning – Similarity-based reasoning
KBS tools – Shells
- Consist of KA Tool, Database & Development
Interface
- Inductive Shells
- simplest
- example cases represented as a matrix of known data (premises) and resulting
effects
- matrix converted into decision tree or IF-THEN
statements
- examples selected for the tool
Rule-based shells
- simple to complex
- IF-THEN rules
Hybrid shells
- sophisticated & powerful
- support multiple KR paradigms & reasoning
schemes
- generic tool applicable to a wide range
Special purpose shells
- specifically designed for particular types of
problems
- restricted to specialised problems
Building from scratch
- requires more time and effort
- no constraints, unlike shells
- shells should be investigated first
Some example KBSs
DENDRAL (chemical)
MYCIN (medicine)
XCON/RI (computer)
Typical tasks of KBS
(1) Diagnosis - To identify a problem given a set of symptoms
or malfunctions.
e.g. diagnose reasons for engine failure
(2) Interpretation - To provide an understanding of a situation
from available information. e.g.
DENDRAL
(3) Prediction - To predict a future state from a set of data
or observations. e.g. Drilling Advisor, PLANT
(4) Design - To develop configurations that satisfy
constraints of a design problem. e.g. XCON
(5) Planning - Both short term & long term in areas like
project management, product
development or financial planning. e.g. HRM
(6) Monitoring - To check performance & flag exceptions.
e.g., KBS monitors radar data and estimates the
position of the space shuttle
(7) Control - To collect and evaluate evidence and form
opinions on that evidence.
e.g. control patient’s treatment
(8) Instruction - To train students and correct their
performance. e.g. give medical students experience diagnosing illness
(9) Debugging - To identify and prescribe remedies for
malfunctions.
e.g. identify errors in an automated teller
machine network and ways to correct the errors
Advantages
- Increase availability of expert knowledge (expertise otherwise not
accessible; training of future experts)
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty
Limitations
- Lack of common sense
- Inflexible, Difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable
6. (a) Compare
Distributed databases and conventional databases. (16)
(NOV/DEC 2010)
DISTRIBUTED DATABASES VS CONVENTIONAL
DATABASES
Advantages of distributed databases:
mimics organisational structure with data
local access and autonomy without exclusion
cheaper to create and easier to expand
improved availability/reliability/performance by removing reliance on a central site
reduced communication overhead: most data access is local, less expensive, and performs better
improved processing power: many machines handle the database rather than a single server
Disadvantages:
more complex to implement
more costly to maintain
harder security and integrity control
standards and experience are lacking
design issues are more complex
7. (a) Explain the Multi-Version Locks and
Recovery in Query Languages.
(NOV/DEC 2010)
Multi-Version Locks: Multiversion
concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is
a concurrency control method commonly used by database management systems to
provide concurrent access to the database and in programming languages to
implement transactional memory.
For instance, a database
will implement updates not by deleting an old piece of data and overwriting it
with a new one, but instead by marking the old data as obsolete and adding the newer
"version." Thus there are multiple versions stored, but only one is
the latest. This allows the database to avoid overhead of filling in holes in
memory or disk structures but requires (generally) the system to periodically
sweep through and delete the old, obsolete data objects. For a document-oriented
database such as CouchDB, Riak or MarkLogic Server it also allows the system to
optimize documents by writing entire documents onto contiguous sections of
disk—when updated, the entire document can be re-written rather than bits and
pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides
potential "point in time" consistent views. In fact read transactions
under MVCC typically use a timestamp or transaction ID to determine what state
of the DB to read, and read these "versions" of the data.
This avoids managing locks
for read transactions because writes can be isolated by virtue of the old
versions being maintained, rather than through a process of locks or mutexes.
Writes affect future "version" but at the transaction ID that the
read is working at, everything is guaranteed to be consistent because the
writes are occurring at a later transaction ID.
In other words, MVCC
provides each user connected to the database with a "snapshot" of the
database for that person to work with. Any changes made will not be seen by
other users of the database until the transaction has been committed.
MVCC uses timestamps or
increasing transaction IDs to achieve transactional consistency. MVCC ensures a
transaction never has to wait for a database object by maintaining several
versions of an object. Each version would have a write timestamp and it would
let a transaction (Ti) read the most recent version of an object
which precedes the transaction timestamp (TS (Ti)).
If a transaction (Ti) wants to write to an
object, and if there is another transaction (Tk), the timestamp of Ti must precede the timestamp
of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation
to succeed, which is to say a write cannot complete if there are outstanding
transactions with an earlier timestamp.
Every object would also
have a read timestamp, and if a transaction Ti wanted to write to object
P, and the timestamp of that transaction is earlier than the object's read
timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted.
Otherwise, Ti creates a new version of P and sets the read/write
timestamps of P to the timestamp of the transaction TS (Ti).
The obvious drawback to
this system is the cost of storing multiple versions of objects in the database.
On the other hand reads are never blocked, which can be important for workloads
mostly involving reading values from the database. MVCC is particularly adept
at implementing true snapshot isolation, something which other methods of
concurrency control frequently do either incompletely or with high performance
costs.
At time t2 the state of a database could be:
Time Object1 Object2
t2 “Hello” “Bar”
t1 “Foo” “Bar”
This indicates that the current state of this database (perhaps a key-value
store) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but
that value has been superseded. It is not deleted, because the database holds
“multiple versions”, but it will be deleted later. If a long-running
transaction starts a read operation, it will operate at transaction t2 and see
this state. If there is a concurrent update (during that long-running read
transaction) which deletes Object2 and adds Object3 = “Foo-Bar”, the database
will look like:
Time Object1 Object2 Object3
t3 “Hello” (deleted) “Foo-Bar”
t2 “Hello” “Bar”
t1 “Foo” “Bar”
Now there is a new version as of transaction ID t3. Note critically that the
long-running read transaction still has access to a coherent snapshot of the
system at t2, even though the write transaction added data as of t3, so the
read transaction is able to run in isolation from the update transaction that
created the t3 values. This is how MVCC allows isolated, ACID reads without any
locks.
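The versioning scheme above can be sketched with a toy in-memory store. All names and data here are hypothetical illustrations, not a real MVCC engine:

```python
# Sketch of the multi-version scheme described above: a write appends a
# new version instead of overwriting, and a reader pinned to a
# transaction ID sees the newest version at or before that ID.

class MVCCStore:
    def __init__(self):
        self.versions = {}            # key -> list of (txn_id, value)

    def write(self, txn_id, key, value):
        # Append a new version; older versions are kept, not overwritten.
        self.versions.setdefault(key, []).append((txn_id, value))

    def read(self, txn_id, key):
        """Return the latest version whose txn_id <= the reader's txn_id."""
        candidates = [(t, v) for t, v in self.versions.get(key, [])
                      if t <= txn_id]
        return max(candidates)[1] if candidates else None

db = MVCCStore()
db.write(1, "Object1", "Foo")
db.write(1, "Object2", "Bar")
db.write(2, "Object1", "Hello")       # new version; the old one survives

# A long-running reader pinned at txn 1 still sees its original snapshot:
print(db.read(1, "Object1"))
print(db.read(2, "Object1"))
```

A real system would additionally garbage-collect obsolete versions and enforce the read/write timestamp rules described in the text.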
Recovery
(b)Discuss
client/server model and mobile databases. (16)
(NOV/DEC 2010)
Mobile Databases
Recent advances in portable and wireless technology led to mobile
computing, a new
dimension in data communication and processing.
Portable computing devices coupled with wireless communications
allow clients to access data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be
resolved before the capabilities of mobile computing can be fully utilized.
Some of the software problems – which may involve data management,
transaction
management, and database recovery – have their
origins in distributed database systems.
In mobile computing, the problems are more difficult, mainly:
The limited and intermittent connectivity afforded by wireless
communications.
The limited life of the power supply(battery).
The changing topology of the network.
In addition, mobile computing introduces new architectural
possibilities and challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in
Fig 30.1.
It is a distributed architecture in which a number of computers, generally
referred to as Fixed Hosts and Base Stations, are interconnected through a
high-speed wired network.
Fixed hosts are general purpose computers configured to manage
mobile units.
Base stations function as gateways to the fixed network for the
Mobile Units.
Wireless Communications –
The wireless medium has bandwidth significantly lower than that of a wired
network.
The current generation of wireless technology has data rates ranging from tens
to hundreds of kilobits per second (2G cellular telephony) to tens of megabits
per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the
order of
hundreds of megabits per second.
Other characteristics that distinguish wireless connectivity options:
interference,
locality of access,
range,
support for packet switching,
seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed
areas of the frequency spectrum, which may cause interference with other
appliances, such as cordless telephones.
Modern wireless networks can transfer data in units called packets, as is done
in wired networks, in order to conserve bandwidth.
Client/Network Relationships –
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless
network coverage.
To manage it, the entire mobility domain is divided into one or more smaller
domains, called cells, each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain while
maintaining information access contiguity.
The communication architecture described earlier is designed to
give the mobile unit the impression that it is attached to a fixed network,
emulating a traditional client-server architecture.
Wireless communications, however, make other architectures
possible. One alternative is a mobile ad-hoc network (MANET), illustrated in 29.2.
In a MANET, co-located mobile units do not need to
communicate via a fixed network, but instead, form their own using
cost-effective technologies such as Bluetooth.
In a MANET, mobile units are responsible for routing their
own data, effectively acting as base stations as well as clients.
Moreover, they must be robust enough to handle changes in the
network topology, such as the arrival or departure of other mobile units.
MANET applications can be considered as peer-to-peer, meaning that
a mobile unit
is simultaneously a client and a server.
Transaction processing and data consistency control become more
difficult since there is no central control in this architecture.
Resource discovery and data routing by mobile units make computing
in a MANET even more complicated.
Sample MANET applications are multi-user games, shared whiteboard,
distributed calendars, and battle information sharing.
Characteristics of Mobile Environments
The characteristics of mobile computing include:
Communication latency
Intermittent connectivity
Limited battery life
Changing client location
The server may not be able to reach a client.
A client may be unreachable because it is dozing – in an
energy-conserving state in
which many subsystems are shut down – or because
it is out of range of a base station.
In either case, neither client nor server can reach the other, and
modifications must be made to the architecture in order to compensate for this
case.
Proxies for unreachable components are added to the architecture.
For a client (and symmetrically for a server), the proxy can cache
updates intended for the server.
Mobile computing poses challenges for servers as well as clients.
The latency involved in wireless communication makes scalability a
problem.
Since latency due to wireless communications increases the time to
service each client request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data
whenever possible.
A server can simply broadcast data periodically.
Broadcast also reduces the load on the server, as clients do not
have to maintain
active connections to it.
Client mobility also poses many data management challenges.
Servers must keep track of client locations in order to
efficiently route messages to them.
Client data should be stored in the network location that
minimizes the traffic necessary to access it.
The act of moving between cells must be transparent to the client.
The server must be able to gracefully divert the shipment of data
from one base to another, without the client noticing.
Client mobility also allows new applications that are
location-based.
Data Management Issues
From a data management standpoint, mobile computing may be
considered a variation of distributed computing. Mobile databases can be
distributed under two possible scenarios:
The entire database is distributed mainly among the wired
components, possibly with full or partial replication.
A base station or fixed host manages its own database with a
DBMS-like functionality, with additional functionality for locating mobile
units and additional query and transaction management features to meet the requirements
of mobile environments.
The database is distributed among wired and wireless components.
Data management responsibility is shared among base stations or
fixed hosts and mobile units.
Data management issues as it is applied to mobile databases:
Data distribution and replication
Transactions models
Query processing
Recovery and fault tolerance
Mobile database design
Location-based service
Division of labor
Security
Application: Intermittently Synchronized
Databases
Whenever clients connect – through a process known in industry as
synchronization of a client with a server – they receive a batch of updates to
be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly
disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and
client-server databases, and some from mobile databases.
This environment is referred to as Intermittently Synchronized
Database Environment
(ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make
them distinct from mobile databases are:
A client connects to the server when it wants to exchange updates.
The communication can be unicast (one-on-one communication between the server
and the client) or multicast (one sender or server may periodically communicate
to a set of receivers or update a group of clients).
A server cannot connect to a client at will.
The characteristics of ISDBs (contd.) :
Issues of wireless versus wired client connections and power
conservation are generally immaterial.
A client is free to manage its own data and transactions while it
is disconnected. It
can also perform its own recovery to some
extent.
A client has multiple ways connecting to a server and, in case of
many servers, may
choose a particular server to connect to based
on proximity, communication nodes
available, resources available, etc.
(b)(ii) Discuss optimization and research
issues. (8) (NOV/DEC 2010)
Optimization
Query optimization is an important task in a relational DBMS.
Must understand optimization in order to understand the
performance impact of a given database design (relations, indexes) on a
workload (set of queries).
Two parts to optimizing a query:
Consider a set of alternative plans.
• Must prune search space; typically, left-deep
plans only.
Must estimate cost of each plan that is
considered.
• Must estimate size of result and cost for each
plan node.
• Key issues: Statistics, indexes, operator
implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for
each operator.
Each operator typically implemented using a
`pull’ interface: when an operator is
`pulled’ for the next output tuples, it `pulls’
on its inputs and computes them.
Two main issues:
For a given query, what plans are considered?
• Algorithm to search plan space for cheapest
(estimated) plan.
How is the cost of a plan estimated?
Ideally: Want to find best plan. Practically: Avoid worst plans!
We will study the System R approach.
Schema for Examples
Sailors (sid: integer, sname: string, rating:
integer, age: real)
Reserves (sid: integer, bid: integer, day:
dates, rname: string)
Similar to old schema; rname added for variations.
Reserves:
Each tuple is 40 bytes long, 100 tuples per
page, 1000 pages.
Sailors:
Each tuple is 50 bytes long, 80 tuples per page,
500 pages.
Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and
these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made
once per outer tuple.
For each block, the plans considered are:
All available access methods, for each reln in
FROM clause.
All left-deep join trees (i.e., all ways to join the relations one at a time,
with the inner reln in the FROM clause, considering all reln permutations and
join methods).
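Left-deep enumeration can be sketched as follows. The cardinalities and the cost formula here are toy placeholders, not System R's actual cost model:

```python
from itertools import permutations

# Sketch of left-deep plan enumeration: each permutation of the relations
# yields one left-deep tree, with every later relation joined as the inner
# input. Cardinalities and cost numbers are hypothetical placeholders.
CARD = {"Sailors": 40000, "Reserves": 100000, "Boats": 100}

def est_join_cost(outer_card, inner):
    # Toy cost model: roughly outer cardinality times inner size.
    return outer_card * CARD[inner] // 1000

def left_deep_plans(relations):
    for order in permutations(relations):
        cost, card = 0, CARD[order[0]]
        for inner in order[1:]:
            cost += est_join_cost(card, inner)
            card = card * CARD[inner] // 10000    # crude result-size estimate
        yield order, cost

plans = list(left_deep_plans(["Sailors", "Reserves", "Boats"]))
best = min(plans, key=lambda p: p[1])
print(best)
```

Even this toy model shows why pruning matters: the search space grows as n! in the number of relations, and a real optimizer additionally multiplies it by the available join methods and access paths.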
8. (a)Discuss multimedia
databases in detail. (8) (NOV/DEC 2010)
Multimedia databases
To provide such database functions as indexing and consistency, it
is desirable to store multimedia data in a database
Rather than storing them outside the database,
in a file system
The database must handle large object representation.
Similarity-based retrieval must be provided by special index
structures.
Must provide guaranteed steady retrieval rates for
continuous-media data.
Multimedia Data Formats
Store and transmit multimedia data in compressed form
JPEG and GIF are the most widely used formats for image data.
The MPEG standard for video data uses commonalities among a sequence of frames
to achieve a greater degree of compression.
MPEG-1 quality comparable to VHS video tape.
Stores a minute of 30-frame-per-second video and
audio in approximately 12.5 MB
MPEG-2 designed for digital broadcast systems and digital video
disks; negligible loss of video quality.
Compresses 1 minute of audio-video to
approximately 17 MB.
Several alternatives of audio encoding
MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia
format, etc.
Continuous-Media Data
Most important types are video and audio data.
Characterized by high data volumes and real-time
information-delivery requirements.
Data must be delivered sufficiently fast that
there are no gaps in the audio or video.
Data must be delivered at a rate that does not
cause overflow of system buffers.
Synchronization among distinct data streams must
be maintained
e.g., a video of a person speaking must show lips moving synchronously with
the audio
Video Servers
Video-on-demand systems deliver video from central video servers, across a
network, to terminals
must guarantee end-to-end delivery rates
Current video-on-demand servers are based on file systems;
existing database systems do not meet real-time response requirements.
Multimedia data are stored on several disks (RAID configuration),
or on tertiary storage for less frequently accessed data.
Head-end terminals - used to view multimedia data
PCs or TVs attached to a small, inexpensive
computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity based retrieval
Pictorial data: Two pictures or images that are slightly different
as represented in the
database may be considered the same by a user.
e.g., identify similar designs for registering a
new trademark.
Audio data: Speech-based user interfaces allow the user to give a
command or identify a data item by speaking.
e.g., test user input against stored commands.
Handwritten data: Identify a handwritten data item or command
stored in the database
(b) Explain the features of active and
deductive databases in detail. (8)
(NOV/DEC 2010)
15 Deductive Databases
SQL-92 cannot express some queries:
Are we running low on any parts needed to build
a ZX600 sports car?
What is the total component and assembly cost to
build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the
SQL-92 standard. However, the revised version of the SQL standard, SQL:1999,
includes support for recursive queries and IBM's DB2 system already supports
recursive queries as required in SQL:1999.
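An SQL:1999-style recursive query can be tried directly. A sketch using Python's bundled sqlite3 driver (whose engine supports WITH RECURSIVE) and a toy Assembly table; the table contents are made up:

```python
import sqlite3

# Sketch: a recursive query over a toy Assembly(Part, Subpart, Qty)
# table, in the SQL:1999 WITH RECURSIVE style, via sqlite3.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Assembly (Part TEXT, Subpart TEXT, Qty INT)")
con.executemany("INSERT INTO Assembly VALUES (?, ?, ?)",
                [("trike", "wheel", 3), ("trike", "frame", 1),
                 ("wheel", "spoke", 2), ("wheel", "tire", 1),
                 ("tire", "rim", 1)])

rows = con.execute("""
    WITH RECURSIVE Components(Part, Subpart) AS (
        SELECT Part, Subpart FROM Assembly
        UNION
        SELECT C.Part, A.Subpart
        FROM Components C JOIN Assembly A ON C.Subpart = A.Part
    )
    SELECT Subpart FROM Components WHERE Part = 'trike'
""").fetchall()
print(sorted(r[0] for r in rows))     # every component, at every level
```

The UNION (rather than UNION ALL) deduplicates tuples, which also guarantees the recursion terminates on a finite instance.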
15.1 Introduction to recursive queries
15.1.1 Datalog:
SQL queries can be read as follows: “If some tuples exist in the
From tables that satisfy the Where conditions, then the Select tuple is in the
answer.”
Datalog is a query language that has the same if-then flavor:
New: The answer table can appear in the From
clause, i.e., be defined recursively.
Prolog style syntax is commonly used.
The Problem with R.A. and SQL-92:
Intuitively, we must join Assembly with itself to deduce that
trike contains spoke and tire.
A single join takes us one level down the Assembly hierarchy.
To find components that are one level deeper
(e.g., rim), need another join.
To find all components, need as many joins as
there are levels in the given instance!
For any relational algebra expression, we can create an Assembly
instance for which some answers are not computed by including more levels than
the number of joins in the expression.
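The fixed-number-of-joins limitation can be seen by computing the answer iteratively instead: repeat the join until no new tuples appear. A sketch over a toy Assembly instance (the data is illustrative):

```python
# Sketch of naive fixpoint evaluation: repeatedly join the answer set
# with Assembly until no new (part, subpart) pairs appear, however many
# levels the hierarchy has. Toy data, ignoring the Qty column.

assembly = {("trike", "wheel"), ("wheel", "spoke"), ("wheel", "tire")}

def transitive_components(edges):
    closure = set(edges)
    while True:
        # One more join of the current answer with the base relation.
        new = {(p, s2) for (p, s1) in closure
               for (s1b, s2) in edges if s1 == s1b} - closure
        if not new:                   # least fixpoint reached
            return closure
        closure |= new

print(sorted(transitive_components(assembly)))
```

The loop runs as many times as the hierarchy is deep, which is exactly what a fixed relational algebra expression cannot do.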
15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is
called the least model semantics and gives users a way to understand the
program without thinking about how the program is to be executed. That is, the
semantics is declarative, like the semantics of relational calculus, and not
operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a
conceptual evaluation strategy to compute the desired relation instances. This
serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational and plays a role
analogous to that of relational algebra semantics for nonrecursive queries.
i. Least Model Semantics
The least fixpoint of a function f is a fixpoint
v of f such that every other fixpoint of f is
smaller than or equal to v.
In general, there may be no least fixpoint (we
could have two minimal fixpoints, neither of which is smaller than the other).
If we think of a Datalog program as a function
that is applied to a set of tuples and returns another set of tuples, this
function (fortunately) always has a least fixpoint.
ii. Safe Datalog Programs
Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more
than two copies of some subpart. For each part mentioned in the Assembly
relation, we can easily check whether it is a complex part. Database systems
disallow unsafe programs by requiring that every variable in the head of a rule
must also appear in the body. Such programs are said to be range-restricted,
and every range-restricted Datalog program has a finite least model if the
input relation instances are finite.
iii. The Fixpoint Operator
Let f be a function that takes values from
domain D and returns values from D. A value v in D is a fixpoint of f if
f(v)=v.
Consider the function double+, which is applied to a set of integers and
returns a set of integers (i.e., D is the set of all sets of integers).
E.g., double+ ({1, 2, 5}) = {2, 4, 10} Union {1,
2, 5}
the set of all integers is a fixpoint of
double+.
the set of all even integers is another fixpoint
of double+; it is smaller than the first
fixpoint.
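The double+ example can be checked directly. A small sketch; the singleton {0} stands in for the infinite fixpoints mentioned above, which cannot be materialized:

```python
# Sketch of the double+ function from the text: applied to a set of
# integers, it returns their doubles unioned with the original set.

def double_plus(s):
    return {2 * x for x in s} | s

print(sorted(double_plus({1, 2, 5})))

# A set S is a fixpoint when double_plus(S) == S. The sets of all
# integers and of all even integers are infinite fixpoints; {0} is a
# small finite one, since doubling 0 adds nothing new.
assert double_plus({0}) == {0}
```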
iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to
have a least model, and the least model is equal to the least fixpoint of the
program! These results provide the basis for Datalog query processing. Users
can understand a program in terms of `If the body is true, the head is also true,'
thanks to the least model semantics. The DBMS can compute the answer by
repeatedly applying the program rules, thanks to the least fixpoint semantics
and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty
>2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not
Big(Part).
If rules contain not there may not be a least fixpoint. Consider the Assembly instance;
trike is the only part that has 3 or more copies of some subpart. Intuitively,
it should be in Big, and it will be if we apply Rule 1 first.
But we have Small(trike) if Rule 2 is applied
first!
There are two minimal fixpoints for this program: Big is empty in one and
contains trike in the other (and all other parts are in Small in both
fixpoints).
15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in
order to ensure that all range-restricted programs are safe. If a relation appears
in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not
negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive
relation occurrence in the body.
15.3.2 Stratification
T depends on S if some rule with T in the head
contains S or (recursively) some predicate that depends on S, in the body.
Stratified program: If T depends on not S, then S cannot depend on T (or not T).
If a program is stratified, the tables in the
program can be partitioned into strata:
Stratum 0: All database tables.
Stratum i: Tables defined in terms of tables in
Stratum i and lower strata.
If T depends on not S, S is in a lower stratum than T.
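A sketch of stratum assignment, assuming the dependency information is given
as edges (T, S, negated), meaning a rule with T in the head mentions S in the
body, possibly under not (the function and variable names are illustrative):

```python
# Assign each table a stratum; raise if the program is not stratified.
def strata(tables, deps):
    stratum = {t: 0 for t in tables}
    n = len(tables)
    for _ in range(n * n):  # enough passes to stabilize, or to detect a cycle
        changed = False
        for (t, s, negated) in deps:
            # "not S" forces T strictly above S; a positive use allows equality.
            need = stratum[s] + 1 if negated else stratum[s]
            if stratum[t] < need:
                stratum[t] = need
                changed = True
        if not changed:
            return stratum
    raise ValueError("not stratified")  # negation inside a dependency cycle

# Big depends on not Small AND Small depends on not Big: not stratified.
try:
    strata({"Big", "Small"},
           [("Big", "Small", True), ("Small", "Big", True)])
except ValueError:
    print("Big/Small is not stratified")
```

The Big2/Small2 rewrite shown below in SQL:1999 breaks this cycle, which is
why it is stratified and has a well-defined answer.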
Relational Algebra and Stratified Datalog
Selection: Result(Y):- R(X, Y), X=c.
Projection: Result(Y):- R(X, Y).
Cross-product: Result(X, Y, U, V):- R(X, Y), S (U, V).
Set-difference: Result(X, Y):- R(X, Y), not S(X, Y).
Union: Result(X, Y):- R(X, Y).
Result(X, Y):- S(X, Y).
The stratified Big/Small program is shown below
in SQL:1999 notation, with a final additional
selection on Big2:
WITH
Big2 (Part) AS
(SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2 (Part) AS
((SELECT A2.Part FROM Assembly A2)
EXCEPT
(SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
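This query can be run as-is on any engine with WITH (common table expression)
support; the sketch below uses Python's sqlite3 on the trike Assembly instance
(SQLite does not accept parentheses around EXCEPT operands, so they are
dropped here):

```python
# Running the stratified Big2/Small2 query on an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Assembly (Part TEXT, Subpt TEXT, Qty INT)")
conn.executemany("INSERT INTO Assembly VALUES (?, ?, ?)", [
    ("trike", "wheel", 3), ("trike", "frame", 1),
    ("wheel", "spoke", 2), ("wheel", "tire", 1),
    ("frame", "seat", 1), ("frame", "pedal", 1),
])
rows = conn.execute("""
    WITH
    Big2 (Part) AS
      (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
    Small2 (Part) AS
      (SELECT A2.Part FROM Assembly A2
       EXCEPT
       SELECT B1.Part FROM Big2 B1)
    SELECT * FROM Big2 B2
""").fetchall()
print(rows)  # only trike has a subpart with Qty > 2
```

Small2 is computed from Big2, never the other way around, so the evaluation
proceeds stratum by stratum.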
15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts (Part, SUM (<Qty>)):- Assembly
(Part, Subpt, Qty).
The < … > notation
in the head indicates grouping; the remaining arguments are the GROUP BY
fields.
In order to apply such a rule, the entire
Assembly relation must be available.
Stratification with respect to the use of < ... > is the usual restriction to deal with this
problem; it is similar to stratification for negation.
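The NumParts rule groups Assembly tuples by Part and sums Qty over each group;
a minimal dict-based sketch of the same aggregation:

```python
# NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty), as a Python sketch.
from collections import defaultdict

assembly = [("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1)]

num_parts = defaultdict(int)
for part, subpt, qty in assembly:
    num_parts[part] += qty   # SUM(Qty), grouped by the Part field

print(dict(num_parts))  # {'trike': 4, 'wheel': 3}
```

Note that the loop reads every Assembly tuple before any NumParts tuple is
final, which is precisely why the whole relation must be available first.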
15.4 Efficient evaluation of recursive
queries
Repeated inferences: When recursive rules are
repeatedly applied in the naïve way, we make the same inferences in several
iterations.
Unnecessary inferences: Also, if we just want to
find the components of a particular part, say wheel, computing the fixpoint of
the Comp program and then selecting tuples with wheel in the first column is
wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated
Inferences
Avoiding Repeated Inferences:
Seminaive Fixpoint Evaluation: Avoid repeated
inferences by ensuring that when a rule is applied, at least one of the body
facts was generated in the most recent iteration. (Which means this inference
could not have been carried out in earlier iterations.)
For each recursive table P, use a table delta_P
to store the P tuples generated in the previous iteration.
Rewrite the program to use the delta tables, and
update the delta tables between iterations. For example:
Comp (Part, Subpt):- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).
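The seminaive strategy can be sketched for the Comp program: each iteration
joins Assembly only against delta_Comp, the tuples that were new in the
previous round, and the delta for the next round keeps only genuinely new
tuples (variable names are illustrative):

```python
# Seminaive fixpoint evaluation of Comp using a delta table.
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1),
            ("tire", "rim", 1)}

# Nonrecursive rule, applied once: Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
comp = {(p, s) for (p, s, q) in assembly}
delta = set(comp)
while delta:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
    derived = {(p, s) for (p, p2, q) in assembly
                      for (p2b, s) in delta if p2 == p2b}
    delta = derived - comp   # only new tuples feed the next iteration
    comp |= delta

print(("trike", "rim") in comp)
```

Because at least one body fact in every join comes from the previous
iteration's delta, no inference is repeated across iterations.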
15.4.2 Pushing Selections to Avoid
Irrelevant Inferences
SameLev (S1, S2):- Assembly (P1, S1, Q1),
Assembly (P1, S2, Q2).
SameLev (S1,S2) :- Assembly(P1,S1,Q1),
SameLev(P1,P2), Assembly(P2,S2,Q2).
There is a tuple (S1,S2) in SameLev if there is
a path up from S1 to some node and down to S2 with the same number of up and
down edges.
Suppose that we want to find all SameLev tuples
with spoke in the first column. We should “push” this selection into the fixpoint
computation to avoid unnecessary inferences.
But we can’t just compute SameLev tuples with
spoke in the first column, because some other SameLev tuples are needed to
compute all such tuples:
SameLev (spoke, seat):- Assembly (wheel, spoke,
2), SameLev (wheel, frame), Assembly (frame, seat, 1).
The Magic Sets Algorithm
Idea: Define a “filter” table that computes all
relevant values, and restrict the computation
of SameLev to infer only tuples with a
relevant value in the first column.
Magic_SL (P1):- Magic_SL (S1), Assembly (P1, S1,
Q1).
Magic_SL (spoke).
SameLev(S1,S2) :- Magic_SL(S1),
Assembly(P1,S1,Q1), Assembly(P1,S2,Q2).
SameLev(S1,S2):- Magic_SL(S1),
Assembly(P1,S1,Q1),SameLev(P1,P2), Assembly(P2,S2,Q2).
The Magic Sets program rewriting algorithm can
be summarized as follows:
Add `Magic' Filters: Modify each rule in the program by adding a
`Magic' condition to the body that acts as a filter on the set of tuples
generated by this rule.
Define the `Magic' relations: We must create new rules to define the `Magic'
relations.
Intuitively, from each occurrence of an output
relation R in the body of a program rule, we obtain a rule defining the
relation Magic_R.
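The whole rewriting can be sketched end to end: first the Magic_SL filter is
computed to a fixpoint (all first-column values relevant to the query
SameLev(spoke, ?)), then the filtered SameLev rules are evaluated naively.
The Assembly instance is the trike example; set names are illustrative.

```python
# Magic Sets evaluation of SameLev with the selection "spoke" pushed in.
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1),
            ("frame", "seat", 1), ("frame", "pedal", 1)}

# Magic_SL(spoke).  Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
magic = {"spoke"}
while True:
    new = magic | {p for (p, s, q) in assembly if s in magic}
    if new == magic:
        break
    magic = new

# Filtered rules: every SameLev tuple must have a Magic_SL value in column 1.
samelev = set()
while True:
    # SameLev(S1,S2) :- Magic_SL(S1), Assembly(P1,S1,Q1), Assembly(P1,S2,Q2).
    new = {(s1, s2) for (p1, s1, q1) in assembly if s1 in magic
                    for (p1b, s2, q2) in assembly if p1b == p1}
    # SameLev(S1,S2) :- Magic_SL(S1), Assembly(P1,S1,Q1),
    #                   SameLev(P1,P2), Assembly(P2,S2,Q2).
    new |= {(s1, s2) for (p1, s1, q1) in assembly if s1 in magic
                     for (p1b, p2) in samelev if p1b == p1
                     for (p2b, s2, q2) in assembly if p2b == p2}
    if new == samelev:
        break
    samelev = new

print({t for t in samelev if t[0] == "spoke"})
```

Only tuples whose first column is spoke, wheel, or trike are ever inferred;
irrelevant SameLev facts (e.g., those starting at seat) are never computed.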