Wednesday 15 May 2013

CS9211 – COMPUTER ARCHITECTURE ANNA UNIVERSITY QUESTION PAPERS



CS2354 Advanced Computer Architecture Anna University Question Paper for M.E. (Master of Engineering)

M.E/M.Tech DEGREE EXAMINATION, JUNE 2010
First Semester
Computer Science and Engineering
CS9211 – COMPUTER ARCHITECTURE
(Common to M.Tech - Information Technology)
        (Regulation 2009)
Time: Three hours                                                                               Maximum: 100 Marks
                                                     Answer all the questions
                                                    Part A – (10*2=20 Marks)
1.      State the principle of locality and its types.
2.      What are the choices for encoding an instruction set?
3.      What is speculation? Give an example.
4.      Mention the effects of imperfect alias analysis.
5.      What is loop unrolling?
6.      Give the uses of sentinel.
7.      Define multiprocessor cache coherence.
8.      What are the approaches used for multithreading?
9.      Which block should be replaced on a cache miss?
10.  How is cache performance improved?
    Part B – (5*16=80 Marks)
11.  (a)(i)Explain the operations designed for media and signal processing. (10)
    (ii)Explain the ways in which a computer architect can help the compiler writer. (6)
                                                            (Or)
(b)(i)Discuss the addressing modes used for signal processing instructions. (7)
     (ii)Describe the addressing modes and instructions designed for control flow. (9)

12.  (a)Explain the techniques to overcome data hazards with dynamic scheduling. (16)
(Or)
(b)Describe the limitations of Instruction level Parallelism. (16)

13.  (a)(i)Explain the basic VLIW approach used for static multiple issues. (8)
    (ii)Enumerate the crosscutting issues in hardware and s/w speculation mechanisms. (8)
                                                            (Or)
(b)(i)Explain the hardware support for exposing more parallelism at compile time. (8)
    (ii)Describe the basic compiler techniques for exposing ILP. (8)
14.  (a)(i)Describe the design challenges in SMT processors. (8)
    (ii)Discuss the performance of shared memory multiprocessors. (8)
                                                            (Or)
(b)(i)Explain synchronization mechanisms designed for large scale multiprocessors. (9)
    (ii)Discuss the details of memory consistency models. (7)

15.  (a)(i)Explain the concept of miss penalty and out of order execution in processors. (6)
    (ii)Discuss the methods of interface between CPU and memory. (10)
                                                            (Or)
(b)Discuss in detail the different levels of RAID. (16)
CS2354 Advanced Computer Architecture Anna University Model Question Paper

M.E/M.Tech DEGREE EXAMINATION, JANUARY 2010
First Semester
Computer Science and Engineering
CS9211 – COMPUTER ARCHITECTURE
(Common to M.Tech - Information Technology)
        (Regulation 2009)
Time: Three hours                                                                               Maximum: 100 Marks
                                                     Answer all the questions
                                                    Part A – (10*2=20 Marks)
1.      What is a hazard? State its types.
2.      Mention the techniques available to measure the performance.
3.      What is dynamic scheduling?
4.      Give the limitation of ILP.
5.      Distinguish between hardware and software speculation mechanisms.
6.      What is static branch prediction?
7.      What are the synchronization issues?
8.      What is multithreading?
9.      Define cache miss penalty.
10.  What is RAID?
                                        Part B – (5*16 = 80 Marks)
11.  (a)How does one classify ISA? Discuss their design issues. (16)
(Or)
(b)What is pipelining? Explain various hazards involved in implementing pipelining. (16)

12.  (a)Explain the instruction level parallelism with dynamic approaches. (16)
(Or)
(b)What is dynamic hardware prediction? Explain it in detail. (16)

13.  (a)Explain the different hardware support for exposing ILP. (16)
(Or)
(b)Explain the different hardware support for more parallelism. (16)

14.  (a)Explain distributed shared memory architecture with necessary life cycle diagram. (16)
(Or)
(b)(i)Differentiate software and hardware multithreading approaches. (8)
   (ii)Explain the models of memory consistency. (8)

15.  (a)How does one reduce cache miss penalty and miss rate? Explain. (16)
(Or)
(b)What are the ways available to measure the I/O performance? Explain each of them in detail.

AP9222 COMPUTER ARCHITECTURE AND PARALLEL PROCESSING ANNA UNIVERSITY QUESTION PAPERS

IMPORTANT QUESTIONS

AP9222 COMPUTER ARCHITECTURE AND PARALLEL PROCESSING


ANNA UNIVERSITY QUESTION PAPERS



Reg. No. :
M.E. DEGREE EXAMINATION, NOVEMBER/DECEMBER 2010
Second Semester
Applied Electronics
AP 9222 — COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
(Common to M.E - Computer and Communication, M.E. - VLSI Design and
M.E.- Embedded System Technologies)
(Regulation 2009)
Time : Three hours Maximum : 100 Marks
Answer ALL questions
PART A — (10 × 2 = 20 Marks)
1. Compare temporal parallelism and data parallelism.
2. What is the major difference between multiprocessor and multicomputer?
3. Define anti dependence and output dependence with respect to parallelism and
dependence relations.
4. Compare control flow and data flow computers in terms of flow mechanism.
5. Why is memory hierarchy needed?
6. Define any four scalability merits for an application.
7. What is a symbolic processor?
8. State the instruction format used in VLIW processors.
9. Name any two languages supporting parallelism.
10. Name the phases of parallel compiler.

PART B — (5 × 16 = 80 Marks)
11. (a) (i) Describe Flynn's classification of computers. (10)
(ii) Explain the super computer architecture for vector processing. (6)
Or
(b) (i) Explain the various theoretical models of parallel computers. How
are they useful in scalability and programmability analysis? (8)
(ii) Distinguish between multiprocessors and multicomputers based on
their structures, resource sharing and interprocess
communications. (8)
12. (a) (i) Explain the functional structure of a crossbar network and how I/O
connections are achieved? (8)
(ii) Explain the static data flow computer architecture with a diagram.
(8)
Or
(b) (i) Compare the features of static networks with dynamic networks.
Explain how the interconnection is established for parallel
processing. (10)
(ii) Define the various speed up performance laws. (6)
13. (a) A two-level memory system has five virtual pages on a disk to be mapped into
(i) two page frames
(ii) three page frames
in the main memory. A certain program generated the following page trace:
0, 1, 2, 3, 4, 4, 4, 2, 1, 0, 2, 2, 2, 3, 0.
(1) Calculate the hit ratios using FIFO, LRU, circular FIFO and
LFU replacement policies in both cases.
(2) Indicate which page replacement policy is optimal in both
cases.
(3) Calculate the improvement or reduction in hit ratio on
increasing the page frames in main memory. (16)
Or
(b) (i) Describe a four level memory hierarchy technology. (8)
(ii) Explain temporal, spatial and sequential locality associated with
program/data access in a memory hierarchy. (8)
14. (a) (i) Design a pipeline multiplier to multiply a stream of input numbers
X0, X1, X2, ... by a fixed number Y. Assume all X's and Y are 16-bit
numbers. Draw a neat schematic diagram of your design. (12)
(ii) Compare linear and non linear pipelines. (4)
Or
(b) (i) Explain any one scalable multithreaded architecture with a neat
diagram. (8)
(ii) Write about the features of data flow architectures. (8)
15. (a) (i) Explain the message based communications in MACH and OSF/1
systems. (8)
(ii) Write a program for matrix multiplication in FORTRAN using its
parallel constructs. (8)
Or
(b) Draw and explain the phases of parallel language compilers. (16)




Reg. No. :
M.E. DEGREE EXAMINATION, APRIL/MAY 2011
Second Semester
Applied Electronics
AP 9222 — COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
(Common to M.E. Computer and Communication, M.E. VLSI Design and
M.E. Embedded System and Technologies)
(Regulation 2009)
Time : Three hours Maximum : 100 marks
Answer ALL questions
PART A — (10 × 2 = 20 marks)
1. In parallel computing, what are shared memory-multiprocessors and
distributed-memory multicomputers?
2. Mention an approach to solve the mismatch problem between software
parallelism and hardware parallelism.
3. What is grain packing and scheduling in parallel programming that can help
achieve a short schedule for fast program execution?
4. What scalability metrics affect the scalability of a computer system for a given
application?
5. List the characteristics of CISC, RISC and Superscalar processors.
6. How are sequential consistency and weak consistency memory models
characterized?
7. Distinguish between asynchronous and synchronous pipeline models.
8. Distinguish between static dataflow and dynamic dataflow computers.
9. What are macrotasking, microtasking and autotasking levels of multitasking
employed for parallel execution on Cray X-MP or Y-MP multiprocessors?
10. Differentiate time-sharing and space-sharing operating systems.

PART B — (5 × 16 = 80 marks)
11. (a) Discuss how the instruction set, compiler technology, CPU
implementation and control, cache and memory hierarchy affect CPU
performance. Justify their effects in terms of program length, clock rate
and effective cycles per instruction (CPI). (16)
Or
(b) Compare the PRAM and VLSI models of parallel computers and mention
how these models facilitate the study of asymptotic behavior of
algorithms implementable on parallel computers. (16)
12. (a) Discuss the different network architectures used for interconnecting
multiprocessor or multicomputer systems highlighting their merits and
demerits. (16)
Or
(b) (i) Discuss the three speedup performance models used under different
computing objectives and resource constraints. (10)
(ii) A web server application running on a single processor is enhanced
by a 10 processor parallel computer. The parallel processor is
20 times faster than the original processor, but is available only for
40% of the time. What is the overall speedup gained by
incorporating the enhancement? (6)
13. (a) Discuss the design of a memory hierarchy for a computer system
satisfying the three properties of inclusion, coherence and locality with
due consideration to capacity planning and cost optimization. (16)
Or
(b) Discuss briefly the different shared-memory organization techniques
employed in computer systems to meet the memory design goal of
matching memory bandwidth to processor bandwidth. (16)
14. (a) (i) Briefly discuss the instruction pipeline design with reference to
instruction processing, data forwarding, hazard avoidance,
scheduling and branch handling. (10)
(ii) A non-pipelined processor X has a clock rate of 25 MHz and an
average CPI (Cycles Per Instruction) of 4. Processor Y, an improved
version of X, is designed with a five-stage linear instruction
pipeline. However, due to latch delay and clock skew effects, the
clock rate of Y is only 20 MHz. If a program containing 100
instructions is executed on both processors, what is the speed up of
processor Y compared with that of processor X? (6)
Or
(b) Compare the four context switching policies employed in different
multithreaded architectures: switch on cache miss, switch on every load,
switch on instruction (cycle by cycle) and switch on block of instructions.
(i) What are the advantages and shortcomings of each policy?
(ii) How can an optimal choice be made among the four policies? (16)
15. (a) Parallel programming models are specifically designed for
multiprocessors, multicomputers, or vector/SIMD computers. Briefly
discuss these models that exploit parallelism with different execution
paradigms for these computers. (16)
Or
(b) Discuss the use of spin locks, suspend locks, binary and counting
semaphores, deadlock prevention, avoidance, detection and recovery
schemes and monitors for shared-variable programming to implement
various synchronization methods among concurrent processes. (16)




CS9221 DATABASE TECHNOLOGY


ANNA UNIVERSITY- CHENNAI-JUNE 2010 & DECEMBER 2010

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SUB CODE/SUB NAME: CS9221 / DATABASE TECHNOLOGY

Part A – (10*2=20 Marks)



1. What is fragmentation? (JUNE 2010)

Fragmentation is a database server feature that allows you to control where data is stored at the table level. Fragmentation enables you to define groups of rows or index keys within a table according to some algorithm or scheme. You use SQL statements to create the fragments and assign them to dbspaces.



2. What is Concurrency control? (JUNE 2010) (NOV/ DEC 2010)

Concurrency control is the activity of coordinating concurrent accesses to a database in a multiuser system. Concurrency control allows users to access a database in a multiprogrammed fashion while preserving the consistency of the data.



3. What is Persistence? (JUNE 2010) (NOV/ DEC 2010)

Persistence is the property of an object through which its existence transcends time (i.e. the object continues to exist after its creator ceases to exist) and/or space (i.e. the object's location moves from the address space in which it was created).



4. What is Transaction Processing? (JUNE 2010)

A transaction processing system (TPS) is an information system that processes data transactions in a database system and monitors transaction programs (a special kind of program). For example, when an electronic payment is made, the amount must be both withdrawn from one account and added to the other; it cannot complete only one of those steps. Either both must occur, or neither. In case of a failure preventing transaction completion, the partially executed transaction must be 'rolled back' by the TPS.



5. What is Client/Server model? (JUNE 2010)



·         The server in a client/server model is simply the DBMS, whereas the client is the database application serviced by the DBMS.

·         The client/server model of a database system is classified into the basic and the distributed client/server model.



6. What is the difference between data warehousing and data mining? (JUNE 2010)

Data warehousing: It is the process used to integrate and combine data from multiple sources and format them into a single unified schema. It provides the enterprise with a storage mechanism for its huge amount of data.

Data mining: It is the process of extracting interesting patterns and knowledge from huge amounts of data. We can apply data mining techniques on the data warehouse of an enterprise to discover useful patterns.



7. Why do we need Normalization? (JUNE 2010)

Normalization is a process followed for eliminating redundant data and establishing meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and also for performance tuning.



8. What is Integrity? (JUNE 2010) (NOV/ DEC 2010)

Integrity refers to the process of ensuring that a database remains an accurate reflection of the universe of discourse it is modeling or representing. In other words, there is a close correspondence between the facts stored in the database and the real world it models.



9. Give two features of Multimedia Databases (JUNE 2010) (NOV/ DEC 2010)

·         Multimedia database systems are used when it is required to administer huge amounts of multimedia data objects of different media types (optical storage, video, tapes, audio records, etc.) so that they can be used (that is, efficiently accessed and searched) by as many applications as needed.

·         The objects of multimedia data are: text, images, graphics, sound recordings, video recordings, signals, etc., that are digitized and stored.



10. What are Deductive Databases? (JUNE 2010) (NOV/ DEC 2010)

A Deductive Database is the combination of a conventional database containing facts, a knowledge base containing rules, and an inference engine which allows the derivation of information implied by the facts and rules.



A deductive database system specifies rules through a declarative language - a language in which we specify what to achieve rather than how to achieve it. An inference engine within the system can deduce new facts from the database by interpreting these rules. The model used for deductive databases is related to the relational data model and also to the field of logic programming and the Prolog language.



11. What is query processing? (NOV/ DEC 2010)

Query processing is the set of activities involved in getting the result of a query expressed in a high-level language. These activities include parsing the queries and translating them into expressions that can be implemented at the physical level of the file system, optimizing the internal form of the query to obtain suitable execution strategies, and then actually executing the queries to get the results.



12. Give two features of object-oriented databases. (NOV/ DEC 2010)

The features of Object Oriented Databases are,

1. It provides persistent storage for objects.

2. They may provide one or more of the following: a query language; indexing; transaction support with rollback and commit; the possibility of distributing objects transparently over many servers.



13. What is Data warehousing? (NOV/ DEC 2010)

Data warehousing: It is the process used to integrate and combine data from multiple sources and format them into a single unified schema. It provides the enterprise with a storage mechanism for its huge amount of data.



14. What is Normalization? (NOV/ DEC 2010)

Normalization is a process followed for eliminating redundant data and establishing meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and also for performance tuning.



15. Mention two features of parallel Databases. (NOV/ DEC 2010)

1. It is used to provide speedup, where queries are executed faster because more resources, such as processors and disks, are provided.
2. It is also used to provide scaleup, where increasing workloads are handled without increased response time, via an increase in the degree of parallelism.




Part B – (5*16=80 Marks) (JUNE 2010) & (DECEMBER 2010)
1. (a)Explain the architecture of Distributed Databases. (16) (JUNE 2010)
Or
(b) Discuss in detail the architecture of distributed database. (16)
(NOV/DEC 2010)

(b)Write notes on the following:
(i) Query processing. (8) (JUNE 2010)
(ii) Transaction processing. (8) (JUNE 2010)

2. (a)Discuss the Modeling and design approaches for Object Oriented Databases
(JUNE 2010)
Or
(b) Describe modeling and design approaches for object oriented database. (16)
(NOV/DEC 2010)
(b) Explain the Multi-Version Locks and Recovery in Query Languages. (16) (JUNE 2010)
Or
(a) Explain the Multi-Version Locks and Recovery in Query Languages.
(DECEMBER 2010)

3. (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)
Or
(a) Explain the features of data warehousing and data mining. (16)
(NOV/DEC 2010)
Data Warehouse:
Large organizations have complex internal organizations, and have data stored at different
locations, on different operational (transaction processing) systems, under different
schemas
Data sources often store only current data, not historical data
Corporate decision making requires a unified view of all organizational data, including
historical data
A data warehouse is a repository (archive) of information gathered from multiple sources,
stored under a unified schema, at a single site
· Greatly simplifies querying, permits study of historical trends
· Shifts decision support query load away from transaction processing systems
When and how to gather data
· Source driven architecture: data sources transmit new information to warehouse, either
continuously or periodically (e.g. at night)
· Destination driven architecture: warehouse periodically requests new information from
data sources
· Keeping warehouse exactly synchronized with data sources (e.g. using two-phase commit)
is too expensive
· Usually OK to have slightly out-of-date data at warehouse
· Data/updates are periodically downloaded from online transaction processing (OLTP)
systems.
What schema to use
· Schema integration
Data cleansing
· E.g. correct mistakes in addresses
· E.g. misspellings, zip code errors
Merge address lists from different sources and purge duplicates
· Keep only one address record per household (“householding”)
How to propagate updates
· Warehouse schema may be a (materialized) view of schema from data sources
· Efficient techniques for update of materialized views
What data to summarize
· Raw data may be too large to store on-line
· Aggregate values (totals/subtotals) often suffice
· Queries on raw data can often be transformed by query optimizer to use aggregate values.
Typically warehouse data is multidimensional, with very large fact tables
· Examples of dimensions: item-id, date/time of sale, store where sale was made, customer
identifier
· Examples of measures: number of items sold, price of items
Dimension values are usually encoded using small integers and mapped to full values via
dimension tables
Resultant schema is called a star schema
More complicated schema structures
· Snowflake schema: multiple levels of dimension tables
· Constellation: multiple fact tables
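To make the star-schema idea above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (item_dim, store_dim, sales_fact) and the data are made up for illustration only; the point is that the large fact table stores small integer dimension keys plus measures, while the dimension tables map those keys to full values.

# Minimal star-schema sketch (illustrative only; names are hypothetical).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: small integer keys mapped to full values.
cur.execute("CREATE TABLE item_dim  (item_id  INTEGER PRIMARY KEY, item_name TEXT)")
cur.execute("CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, city TEXT)")

# Fact table: one row per sale, referencing the dimension tables,
# plus the measures (number of items sold, price).
cur.execute("""CREATE TABLE sales_fact (
                   item_id   INTEGER REFERENCES item_dim(item_id),
                   store_id  INTEGER REFERENCES store_dim(store_id),
                   sale_date TEXT,
                   quantity  INTEGER,
                   price     REAL)""")

cur.executemany("INSERT INTO item_dim VALUES (?, ?)", [(1, "pen"), (2, "notebook")])
cur.executemany("INSERT INTO store_dim VALUES (?, ?)", [(1, "Chennai")])
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                [(1, 1, "2010-06-01", 10, 5.0), (2, 1, "2010-06-01", 3, 40.0)])

# A typical decision-support query: total sales value per item.
for row in cur.execute("""SELECT i.item_name, SUM(f.quantity * f.price)
                          FROM sales_fact f JOIN item_dim i ON f.item_id = i.item_id
                          GROUP BY i.item_name"""):
    print(row)
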
Data Mining
Broadly speaking, data mining is the process of semi-automatically analyzing large
databases to find useful patterns.
Like knowledge discovery in artificial intelligence, data mining discovers statistical rules
and patterns
Differs from machine learning in that it deals with large volumes of data stored primarily
on disk.
Some types of knowledge discovered from a database can be represented by a set of rules.
e.g.,: “Young women with annual incomes greater than $50,000 are most likely to buy
sports cars”.
Other types of knowledge represented by equations, or by prediction functions.
Some manual intervention is usually required
· Pre-processing of data, choice of which type of pattern to find, postprocessing to
find novel patterns
Applications of Data Mining
Prediction based on past history
· Predict if a credit card applicant poses a good credit risk, based on some attributes
(income, job type, age, ..) and past history
· Predict if a customer is likely to switch brand loyalty
· Predict if a customer is likely to respond to “junk mail”
· Predict if a pattern of phone calling card usage is likely to be fraudulent
Some examples of prediction mechanisms:
Classification
· Given a training set consisting of items belonging to different classes, and a new
item whose class is unknown, predict which class it belongs to.
Regression formulae
· Given a set of parameter-value to function-result mappings for an unknown
function, predict the function-result for a new parameter-value
Descriptive Patterns
Associations
· Find books that are often bought by the same customers. If a new customer buys
one such book, suggest that he buys the others too.
· Other similar applications: camera accessories, clothes, etc.
Associations may also be used as a first step in detecting causation
· E.g. association between exposure to chemical X and cancer, or new medicine and
cardiac problems
Clusters
· E.g. typhoid cases were clustered in an area surrounding a contaminated well
· Detection of clusters remains important in detecting epidemics
Classification Rules
Classification rules help assign new objects to a set of classes. E.g., given a new
automobile insurance applicant, should he or she be classified as low risk, medium risk or
high risk?
Classification rules for above example could use a variety of knowledge, such as
educational level of applicant, salary of applicant, age of applicant, etc.
∀ person P, P.degree = masters and P.income > 75,000
⇒ P.credit = excellent
∀ person P, P.degree = bachelors and
(P.income ≥ 25,000 and P.income ≤ 75,000)
⇒ P.credit = good
Rules are not necessarily exact: there may be some misclassifications
Classification rules can be compactly shown as a decision tree.
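The two rules above can be read directly as a small piece of code. A minimal sketch follows; the attribute names and thresholds come from the example rules, while the function name and the "unknown" fallback are hypothetical additions:

# Illustrative encoding of the two classification rules above.
def credit_class(degree, income):
    if degree == "masters" and income > 75000:
        return "excellent"
    if degree == "bachelors" and 25000 <= income <= 75000:
        return "good"
    return "unknown"   # the rules are not exhaustive; misclassifications are possible

print(credit_class("masters", 90000))    # excellent
print(credit_class("bachelors", 30000))  # good
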
Decision Tree
· Training set: a data sample in which the grouping for each tuple is already known.
· Consider credit risk example: Suppose degree is chosen to partition the data at the root.
Since degree has a small number of possible values, one child is created for each
value.
· At each child node of the root, further classification is done if required. Here, partitions are
defined by income.
Since income is a continuous attribute, some number of intervals are chosen, and
one child created for each interval.
· Different classification algorithms use different ways of choosing which attribute to
partition on at each node, and what the intervals, if any, are.
· In general
Different branches of the tree could grow to different levels.
Different nodes at the same level may use different partitioning attributes.
· Greedy top down generation of decision trees.
Each internal node of the tree partitions the data into groups based on a partitioning
attribute, and a partitioning condition for the node
More on choosing partitioning attribute/condition shortly
Algorithm is greedy: the choice is made once and not revisited as more of the tree is
constructed
The data at a node is not partitioned further if either
All (or most) of the items at the node belong to the same class, or
All attributes have been considered, and no further partitioning is possible.
Such a node is a leaf node.
Otherwise the data at the node is partitioned further by picking an attribute for
partitioning data at the node.
Decision-Tree Construction Algorithm
Procedure Grow.Tree(S)
    Partition(S);

Procedure Partition(S)
    if (purity(S) > dp or |S| < ds) then
        return;
    for each attribute A
        evaluate splits on attribute A;
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr;
    for i = 1, 2, ..., r
        Partition(Si);
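A minimal runnable interpretation of the pseudocode above, for illustration only: purity is taken as the fraction of the majority class, splits are made on discrete attribute values, the split quality measure (a crude sum of child purities rather than information gain) and the thresholds DP and DS are assumptions, not part of the notes:

# Sketch of greedy decision-tree growth; each row is (attribute dict, class label).
from collections import Counter

DP, DS = 0.9, 2   # purity and size thresholds (illustrative values)

def purity(rows):
    counts = Counter(label for _, label in rows)
    return max(counts.values()) / len(rows)

def groups(rows, attribute):
    out = {}
    for attrs, label in rows:
        out.setdefault(attrs[attribute], []).append((attrs, label))
    return out

def partition(rows, attributes):
    if purity(rows) > DP or len(rows) < DS or not attributes:
        return Counter(label for _, label in rows).most_common(1)[0][0]   # leaf node
    best = max(attributes,
               key=lambda a: sum(purity(g) for g in groups(rows, a).values()))
    return {f"{best}={v}": partition(g, [a for a in attributes if a != best])
            for v, g in groups(rows, best).items()}

rows = [({"degree": "masters", "income": "high"}, "excellent"),
        ({"degree": "bachelors", "income": "mid"}, "good"),
        ({"degree": "bachelors", "income": "low"}, "fair")]
print(partition(rows, ["degree", "income"]))
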
Other Types of Classifiers
Further types of classifiers
Neural net classifiers
Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets
· Widely studied in AI, won’t cover here
Bayesian classifiers use Bayes' theorem, which says
p(cj | d) = p(d | cj) * p(cj) / p(d)
where
p(cj | d) = probability of instance d being in class cj,
p(d | cj) = probability of generating instance d given class cj,
p(cj) = probability of occurrence of class cj, and
p(d) = probability of instance d occurring
Naive Bayesian Classifiers
Bayesian classifiers require
computation of p(d | cj)
precomputation of p(cj)
p(d) can be ignored since it is the same for all classes
To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions,
and thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * … * p(dn|cj)
Each of the p(di|cj) can be estimated from a histogram on di values for each class cj
· the histogram is computed from the training instances
Histograms on multiple attributes are more expensive to compute and store.
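A minimal sketch of a naïve Bayesian classifier over discrete attribute values, estimating each p(di | cj) from simple counts on the training instances; the small pseudo-count added to avoid zero probabilities and the toy data are assumptions for illustration:

# Naive Bayes over discrete attribute values, estimated from counts.
from collections import Counter, defaultdict

def train(rows):                                 # rows: list of (value tuple, class)
    class_counts = Counter(c for _, c in rows)
    value_counts = defaultdict(Counter)          # (class, position) -> value counts
    for values, c in rows:
        for i, v in enumerate(values):
            value_counts[(c, i)][v] += 1
    return class_counts, value_counts, len(rows)

def classify(model, values):
    class_counts, value_counts, n = model
    def score(c):
        p = class_counts[c] / n                  # p(cj)
        for i, v in enumerate(values):           # p(d|cj) = product of p(di|cj)
            p *= (value_counts[(c, i)][v] + 1) / (class_counts[c] + 1)
        return p
    return max(class_counts, key=score)

rows = [(("masters", "high"), "excellent"), (("bachelors", "mid"), "good"),
        (("bachelors", "low"), "good")]
model = train(rows)
print(classify(model, ("bachelors", "mid")))     # expected: "good"
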
Regression
Regression deals with the prediction of a value, rather than a class.
Given values for a set of variables, X1, X2, …, Xn, we wish to predict the value of a variable
Y.
One way is to infer coefficients a0, a1, a2, …, an such that
Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn
Finding such a linear polynomial is called linear regression.
In general, the process of finding a curve that fits the data is also called curve fitting.
The fit may only be approximate
· because of noise in the data, or
· because the relationship is not exactly a polynomial
Regression aims to find coefficients that give the best possible fit.
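A minimal least-squares fit sketch for the linear form above. NumPy is assumed to be available, and the data values are made up; as noted, the fit found is the best (possibly approximate) one:

# Fit Y = a0 + a1*X1 + a2*X2 by ordinary least squares.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # rows of (X1, X2)
Y = np.array([8.0, 7.0, 17.0, 16.0])

A = np.column_stack([np.ones(len(X)), X])        # prepend a column of 1s for a0
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)   # [a0, a1, a2], the least-squares fit
print(coeffs)
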
Association Rules
Retail shops are often interested in associations between different items that people buy.
Someone who buys bread is quite likely also to buy milk
A person who bought the book Database System Concepts is quite likely also to buy the
book Operating System Concepts.
Associations information can be used in several ways.
E.g. when a customer buys a particular book, an online shop may suggest associated books.
Association rules:
bread ⇒ milk          DB-Concepts, OS-Concepts ⇒ Networks
Left hand side: antecedent, right hand side: consequent
An association rule must have an associated population; the population consists of a set of
instances. E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the
population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the
consequent of the rule.
E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for
the rule milk ⇒ screwdrivers is low.
We usually want rules with a reasonably high support
Rules with low support are usually not very useful
Confidence is a measure of how often the consequent is true when the antecedent is true.
E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that
include bread also include milk.
Usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g. support of
2% or greater)
Naïve algorithm
1. Consider all possible sets of relevant items.
2. For each set find its support (i.e. count how many transactions purchase all items in
the set).
Large itemsets: sets with sufficiently high support
3. Use large itemsets to generate association rules.
From itemset A generate the rule A - {b} ⇒ b for each b ∈ A.
Support of rule = support(A).
Confidence of rule = support(A) / support(A - {b})
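A minimal sketch of the support and confidence computation for a single candidate rule; the transactions are made-up Python sets used only to show the counting:

# Support and confidence of the rule {bread} => {milk} over a set of transactions.
transactions = [{"bread", "milk"}, {"bread", "milk", "cereal"},
                {"bread"}, {"milk", "screwdriver"}, {"bread", "milk"}]

antecedent, consequent = {"bread"}, {"milk"}

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

rule_support = support(antecedent | consequent)           # fraction satisfying both sides
confidence = support(antecedent | consequent) / support(antecedent)

print(rule_support)   # 3/5 = 0.6
print(confidence)     # 3/4 = 0.75
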
Other Types of Associations
Basic association rules have several limitations
Deviations from the expected probability are more interesting
E.g. if many people purchase bread, and many people purchase cereal, quite a few would be
expected to purchase both (prob1 * prob2)
We are interested in positive as well as negative correlations between sets of items
Positive correlation: co-occurrence is higher than predicted
Negative correlation: co-occurrence is lower than predicted
Sequence associations/correlations
E.g. whenever bonds go up, stock prices go down in 2 days
Deviations from temporal patterns
E.g. deviation from a steady growth
E.g. sales of winter wear go down in summer
Not surprising, part of a known pattern.
Look for deviation from value predicted using past patterns
Clustering
· Clustering: Intuitively, finding clusters of points in the given data such that similar points
lie in the same cluster
· Can be formalized using distance metrics in several ways
E.g. Group points into k sets (for a given k) such that the average distance of points from
the centroid of their assigned group is minimized
· Centroid: point defined by taking average of coordinates in each dimension.
Another metric: minimize average distance between every pair of points in a cluster
· Has been studied extensively in statistics, but on small data sets
Data mining systems aim at clustering techniques that can handle very large data sets
E.g. the Birch clustering algorithm (more shortly)
Hierarchical Clustering
Example from biological classification
Other examples: Internet directory systems (e.g. Yahoo, more on this later)
Agglomerative clustering algorithms
Build small clusters, then cluster small clusters into bigger clusters, and so on
Divisive clustering algorithms
Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones
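A minimal sketch of the centroid-based formulation described above, written as a plain k-means iteration on 2-D points; this is an illustrative toy, not the BIRCH algorithm mentioned in the notes, and the point data are invented:

# Plain k-means: assign points to the nearest centroid, recompute centroids, repeat.
import random

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(pts):
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

def kmeans(points, k, iterations=10):
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

pts = [(1, 1), (1, 2), (0, 1), (8, 8), (9, 8), (9, 9)]
centroids, clusters = kmeans(pts, 2)
print(centroids)
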
(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)
Mobile databases:
Recent advances in portable and wireless technology led to mobile computing, a new
dimension in data communication and processing.
Portable computing devices coupled with wireless communications allow clients to access
data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be resolved before the
capabilities of mobile computing can be fully utilized.
Some of the software problems – which may involve data management, transaction
management, and database recovery – have their origins in distributed database systems.
In mobile computing, the problems are more difficult, mainly:
The limited and intermittent connectivity afforded by wireless communications.
The limited life of the power supply (battery).
The changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and
challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig 30.1.
It is a distributed architecture where a number of computers, generally referred to as Fixed
Hosts and Base Stations, are interconnected through a high-speed wired network.
Fixed hosts are general purpose computers configured to manage mobile units.
Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications
The wireless medium has bandwidth significantly lower than that of a wired
network.
The current generation of wireless technology has data rates ranging from
tens to hundreds of kilobits per second (2G cellular telephony) to tens of
megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of
hundreds of megabits per second.
Other characteristics that distinguish wireless connectivity options:
interference,
locality of access,
range,
support for packet switching,
seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the
frequency spectrum, which may cause interference with other appliances, such as
cordless telephones.
Modern wireless networks can transfer data in units called packets, as in
wired networks, in order to conserve bandwidth.
Client/Network Relationships –
Mobile units can move freely in a geographic mobility domain, an area that is
circumscribed by wireless network coverage.
To manage it, the entire mobility domain is divided into one or more smaller
domains, called cells, each of which is supported by at least one base
station.
Mobile units can move unrestricted throughout the cells of a domain, while
maintaining information access contiguity.
The communication architecture described earlier is designed to give the mobile
unit the impression that it is attached to a fixed network, emulating a traditional
client-server architecture.
Wireless communications, however, make other architectures possible. One
alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
In a MANET, co-located mobile units do not need to communicate via a fixed
network; instead, they form their own network using cost-effective technologies such as
Bluetooth.
In a MANET, mobile units are responsible for routing their own data, effectively
acting as base stations as well as clients.
Moreover, they must be robust enough to handle changes in the network
topology, such as the arrival or departure of other mobile units.
MANET applications can be considered as peer-to-peer, meaning that a mobile unit
is simultaneously a client and a server.
Transaction processing and data consistency control become more difficult
since there is no central control in this architecture.
Resource discovery and data routing by mobile units make computing in a
MANET even more complicated.
Sample MANET applications are multi-user games, shared whiteboard,
distributed calendars, and battle information sharing.
Characteristics of Mobile Environments
The characteristics of mobile computing include:
Communication latency
Intermittent connectivity
Limited battery life
Changing client location
The server may not be able to reach a client.
A client may be unreachable because it is dozing – in an energy-conserving state in
which many subsystems are shut down – or because it is out of range of a base
station.
In either case, neither client nor server can reach the other, and
modifications must be made to the architecture in order to compensate for
this case.
Proxies for unreachable components are added to the architecture.
For a client (and symmetrically for a server), the proxy can cache updates
intended for the server.
Mobile computing poses challenges for servers as well as clients.
The latency involved in wireless communication makes scalability a problem.
Since latency due to wireless communications increases the time to service
each client request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data whenever possible.
A server can simply broadcast data periodically.
Broadcast also reduces the load on the server, as clients do not have to maintain
active connections to it.
Client mobility also poses many data management challenges.
Servers must keep track of client locations in order to efficiently route messages to
them.
Client data should be stored in the network location that minimizes the traffic
necessary to access it.
The act of moving between cells must be transparent to the client.
The server must be able to gracefully divert the shipment of data from one base to
another, without the client noticing.
Client mobility also allows new applications that are location-based.
Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of
distributed computing. Mobile databases can be distributed under two possible scenarios:
The entire database is distributed mainly among the wired components, possibly
with full or partial replication.
A base station or fixed host manages its own database with a DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
The database is distributed among wired and wireless components.
Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as it is applied to mobile databases:
Data distribution and replication
Transactions models
Query processing
Recovery and fault tolerance
Mobile database design
Location-based service
Division of labor
Security
Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) make them
distinct from the mobile databases are:
A client connects to the server when it wants to exchange updates.
The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
A server cannot connect to a client at will.
The characteristics of ISDBs (contd.) :
Issues of wireless versus wired client connections and power conservation are
generally immaterial.
A client is free to manage its own data and transactions while it is disconnected. It
can also perform its own recovery to some extent. A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports or student grades. The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.
Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.
This gives your customers the information they want, and the way you provide it is by
supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)
But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements, and retrieve them for display by a Web server.
Three Tier Architecture
4. (a)With an example explain E-R Model in detail. (JUNE 2010)
Or
(b) (i) Explain E-R model with an example. (8) (NOV/DEC 2010)
Entity-Relationship Model: The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
a collection of entities,
Relationships among entities.
An entity is an object that exists and is distinguishable from other objects.
Example: specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees, holidays
Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain -- the set of permitted values for each attribute
Attribute types:
Simple and composite attributes.
Single-valued and multi-valued attributes.
Null attributes
Derived attributes
Relationship Sets
A relationship is an association among several entities
Examples:
Hayes (customer entity) -- depositor (relationship set) -- A-102 (account entity)
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship.
Example: (Hayes, A-102) ∈ depositor
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date
Degree of a Relationship Set
§ Refers to number of entity sets that participate in a relationship set.
§ Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary
§ Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct
§ The labels "manager" and "worker" are called roles; they specify how employee entities
interact via the works-for relationship set.
§ Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
§ Role labels are optional, and are used to clarify semantics of the relationship
Design Issues
§ Use of entity sets vs. attributes
Choice mainly depends on the structure of the enterprise being modeled, and on the
semantics associated with the attribute in question.
§ Use of entity sets vs. relationship sets
Possible guideline is to designate a relationship set to describe an action that occurs
between entities
§ Binary versus n-ary relationship sets
Although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a
number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
§ Express the number of entities to which another entity can be associated via a relationship set.
§ Most useful in describing binary relationship sets.
§ For a binary relationship set the mapping cardinality must be one of the following types:
– One to one
– One to many
– Many to one
– Many to many
§ We distinguish among these types by drawing either a directed line, signifying "one," or an undirected line, signifying "many," between the relationship set and the entity set.
One-To-One Relationship
§ A customer is associated with at most one loan via the relationship borrower
§ A loan is associated with at most one customer via borrower
One-To-Many and Many-To-One Relationship
§ In the one-to-many relationship (a), a loan is associated with at most one customer via
borrower; a customer is associated with several (including 0) loans via borrower
§ In the many-to-one relationship (b), a loan is associated with several (including 0)
customers via borrower; a customer is associated with at most one loan via borrower
Many-To-Many Relationship
§ A customer is associated with several (possibly 0) loans via borrower
§ A loan is associated with several (possibly 0) customers via borrower
Existence Dependencies
§ If the existence of entity x depends on the existence of entity y, then x is said to be
existence dependent on y.
y is a dominant entity (in example below, loan)
x is a subordinate entity (in example below, payment)
§ If a loan entity is deleted, then all its associated payment entities must be deleted also.
E-R Diagram Components
§ Rectangles represent entity sets.
§ Ellipses represent attributes.
§ Diamonds represent relationship sets.
§ Lines link attributes to entity sets and entity sets to relationship sets.
§ Double ellipses represent multivalued attributes.
§ Dashed ellipses denote derived attributes.
§ Primary key attributes are underlined.
Weak Entity Sets
§ An entity set that does not have a primary key is referred to as a weak entity set.
§ The existence of a weak entity set depends on the existence of a strong entity set; it must
relate to the strong set via a one-to-many relationship set.
§ The discriminator (or partial key) of a weak entity set is the set of attributes that
distinguishes among all the entities of a weak entity set.
§ The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's
discriminator.
§ We depict a weak entity set by double rectangles.
§ We underline the discriminator of a weak entity set with a dashed line.
§ payment-number -- discriminator of the payment entity set
§ Primary key for payment -- (loan-number, payment-number)
Specialization
§ Top-down design process; we designate subgroupings within an entity set that are
distinctive from other entities in the set.
§ These subgroupings become lower-level entity sets that have attributes or participate in relationships that do not apply to the higher-level entity set.
§ Depicted by a triangle component labeled ISA (e.g., savings-account "is an" account)
Generalization
§ A bottom-up design process -- combine a number of entity sets that share the same features into a higher-level entity set
§ Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
§ Attribute Inheritance -- a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.
Design Constraints on a Generalization
§ Constraint on which entities can be members of a given lower-level entity set:
condition-defined or user-defined.
§ Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
disjoint or overlapping.
§ Completeness constraint -- specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
total or partial.
Aggregation
§ Relationship sets borrower and loan-officer represent the same information
§ Eliminate this redundancy via aggregation
Treat relationship as an abstract entity
Allows relationships between relationships
Abstraction of relationship into new entity
§ Without introducing redundancy , the following diagram represents that:
A customer takes out a loan
An employee may be a loan officer for a customer-loan pair
E-R Design Decisions
§ The use of an attribute or entity set to represent an object
§ Whether a real-world concept is best expressed by an entity set or a relationship set.
§ The use of a ternary relationship versus a pair of binary relationships.
§ The use of a strong or weak entity set.
§ The use of generalization -- contributes to modularity in the design.
§ The use of aggregation -- can treat the aggregate entity set as a single unit without concern
for the details of its internal structure.
Reduction of an E-R schema to Tables
§ Primary keys allow entity sets and relationship sets to be expressed uniformly as tables
which represent the contents of the database.
§ A database which conforms to an E-R diagram can be represented by a collection of tables.
§ For each entity set and relationship set there is a unique table which is assigned the name of
the corresponding entity set or relationship set.
§ Each table has a number of columns (generally corresponding to attributes), which have
unique names.
§ Converting an E-R diagram to a table format is the basis for deriving a relational database
design from an E-R diagram.
Representing Entity Sets as Tables
§ A strong entity set reduces to a table with the same attributes.
§ A weak entity set becomes a table that includes a column for the primary key of the
identifying strong entity set.
Representing Relationship Sets as Tables
§ A many-to-many relationship set is represented as a table with columns for the primary
keys of the two participating entity sets, and any descriptive attributes of the relationship
set.
§ The table corresponding to a relationship set linking a weak entity set to its identifying
strong entity set is redundant. The payment table already contains the information that
would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
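A minimal sketch of this reduction for the customer, account and depositor example used earlier, using Python's sqlite3 module; the attribute values are made up for illustration:

# Entity sets become tables; the many-to-many relationship set depositor becomes
# a table holding the primary keys of both participating entity sets.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_name TEXT PRIMARY KEY, customer_city TEXT)")
cur.execute("CREATE TABLE account  (account_number TEXT PRIMARY KEY, balance REAL)")
cur.execute("""CREATE TABLE depositor (
                   customer_name  TEXT REFERENCES customer(customer_name),
                   account_number TEXT REFERENCES account(account_number),
                   access_date    TEXT,          -- descriptive attribute of the relationship
                   PRIMARY KEY (customer_name, account_number))""")

cur.execute("INSERT INTO customer VALUES ('Hayes', 'Harrison')")
cur.execute("INSERT INTO account  VALUES ('A-102', 400.0)")
cur.execute("INSERT INTO depositor VALUES ('Hayes', 'A-102', '2010-06-01')")
print(cur.execute("SELECT * FROM depositor").fetchall())
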
E-R Diagram for Banking Enterprise
Representing Generalization as Tables
§ Method 1: Form a table for the generalized entity (account). Form a table for each entity set
that is generalized (include the primary key of the generalized entity set).
§ Method 2: Form a table for each entity set that is generalized.
(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases.
(DECEMBER 2010)
Temporal Database
Time Representation, Calendars, and Time Dimensions
Time is considered an ordered sequence of points at some granularity
• Use the term chronon instead of point to describe the minimum granularity
A calendar organizes time into different time units for convenience.
• Accommodates various calendars
Gregorian (western), Chinese, Islamic, Etc.
Point events
• Single time point event
E.g., bank deposit
• Series of point events can form a time series data
Duration events
• Associated with specific time period
Time period is represented by start time and end time
Transaction time
• The time when the information from a certain transaction becomes valid
Bitemporal database
• Databases dealing with two time dimensions
Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple
• Valid start time
• Valid end time
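A minimal sketch of tuple versioning, assuming each tuple carries a valid start time and a valid end time (None meaning "still valid") and a simple "as of" lookup; the relation name and data are illustrative, not from the notes:

# Tuple-versioned relation: every change appends a new version with its valid period.
emp_salary = [
    # (emp_id, salary, valid_start, valid_end); None = currently valid
    ("E1", 30000, "2008-01-01", "2009-12-31"),
    ("E1", 35000, "2010-01-01", None),
]

def salary_as_of(emp_id, date):
    for eid, salary, start, end in emp_salary:
        if eid == emp_id and start <= date and (end is None or date <= end):
            return salary
    return None

print(salary_as_of("E1", "2009-06-15"))   # 30000
print(salary_as_of("E1", "2010-06-15"))   # 35000
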
Incorporating Time in Object-Oriented Databases Using Attribute Versioning
A single complex object stores all temporal changes of the object
Time varying attribute
• An attribute that changes over time
E.g., age
Non-Time varying attribute
An attribute that does not change over time
E.g., date of birth
Spatial Database
Types of Spatial Data
Point Data
Points in a multidimensional space
E.g., Raster data such as satellite imagery, where each
pixel stores a measured value
E.g., Feature vectors extracted from text
Region Data
Objects have spatial extent with location and boundary.
DB typically uses geometric approximations constructed using line segments, polygons,
etc., called vector data.
Types of Spatial Queries
Spatial Range Queries
Find all cities within 50 miles of Madison
Query has associated region (location, boundary)
Answer includes overlapping or contained data regions
Nearest-Neighbor Queries
Find the 10 cities nearest to Madison
Results must be ordered by proximity
Spatial Join Queries
Find all cities near a lake
Expensive, join condition involves regions and proximity
Applications of Spatial Data
Geographic Information Systems (GIS)
E.g., ESRI’s ArcInfo; OpenGIS Consortium
Geospatial information
All classes of spatial queries and data are common
Computer-Aided Design/Manufacturing
Store spatial objects such as surface of airplane fuselage
Range queries and spatial join queries are common
Multimedia Databases
Images, video, text, etc. stored and retrieved by content
First converted to feature vector form; high dimensionality
Nearest-neighbor queries are the most common
Single-Dimensional Indexes
B+ trees are fundamentally single-dimensional indexes.
When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively
linearize the 2-dimensional space since we sort entries first by age and then by sal.
Consider entries:
<11, 80>, <12, 10>
<12, 20>, <13, 75>
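A tiny illustration of the linearization point above: sorting the four example entries by the composite key <age, sal> orders them first by age, so a range query on sal alone gets no help from the index. Python is used here only to show the ordering:

entries = [(11, 80), (12, 10), (12, 20), (13, 75)]   # (age, sal) pairs from the example
print(sorted(entries))   # [(11, 80), (12, 10), (12, 20), (13, 75)] -- ordered by age first
# Entries with sal in [70, 90], namely (11, 80) and (13, 75), are not adjacent in this
# order, so a one-dimensional index on <age, sal> cannot answer a sal-only range
# query efficiently.
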
Multi-dimensional Indexes
A multidimensional index clusters entries so as to exploit “nearness” in multidimensional space.
Keeping track of entries and maintaining a balanced index structure presents a challenge!
Consider entries:
<11, 80>, <12, 10>
<12, 20>, <13, 75>
Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD).
Find all hotels within a radius of 5 miles from the conference venue.
Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
Find all cities that lie on the Nile in Egypt.
Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval).
Given a face, find the five most similar faces.
Multidimensional range queries.
50 < age < 55 AND 80K < sal < 90K
Drawbacks
An index based on spatial location needed.
One-dimensional indexes don’t support multidimensional searching efficiently.
Hash indexes only support point queries; want to support range queries as well.
Must support inserts and deletes gracefully.
Ideally, want to support non-point data as well (e.g., lines, shapes).
The R-tree meets these requirements, and variants are widely used today.
R-Tree
R-Tree Properties
Leaf entry = < n-dimensional box, rid >
This is Alternative (2), with key value being a box.
Box is the tightest bounding box for a data object.
Non-leaf entry = < n-dim box, ptr to child node >
Box covers all boxes in child node (in fact, subtree).
All leaves at same distance from root.
Nodes can be kept at least 50% full (except the root).
Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.
Example of R-Tree
Search for Objects Overlapping Box Q
Start at root.
1. If current node is non-leaf, for each entry <E, ptr>, if box E overlaps Q, search subtree identified
by ptr.
2. If current node is leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might
overlap Q.
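As a rough illustration of these two steps, the following Python sketch searches a toy two-level tree of bounding boxes; the node layout, helper names, and data are assumptions made for this example, not the internals of any particular DBMS.

# Sketch of R-tree search for objects overlapping a query box Q.
# A box is ((xlo, ylo), (xhi, yhi)); the nodes and rids below are toy data.

def overlaps(a, b):
    # True if boxes a and b intersect in both dimensions
    (alx, aly), (ahx, ahy) = a
    (blx, bly), (bhx, bhy) = b
    return alx <= bhx and blx <= ahx and aly <= bhy and bly <= ahy

def search(node, q, results):
    if node["leaf"]:
        # Leaf entries <box, rid>: rid identifies an object that might overlap Q
        for box, rid in node["entries"]:
            if overlaps(box, q):
                results.append(rid)
    else:
        # Non-leaf entries <box, child>: descend only into overlapping subtrees
        for box, child in node["entries"]:
            if overlaps(box, q):
                search(child, q, results)
    return results

leaf1 = {"leaf": True, "entries": [(((0, 0), (2, 2)), "r1"), (((3, 3), (4, 4)), "r2")]}
leaf2 = {"leaf": True, "entries": [(((8, 8), (9, 9)), "r3")]}
root = {"leaf": False, "entries": [(((0, 0), (4, 4)), leaf1), (((8, 8), (9, 9)), leaf2)]}
print(search(root, ((1, 1), (3, 3)), []))          # ['r1', 'r2']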
Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because
boxes can be represented compactly.
But why not use convex polygons to approximate query regions more accurately?
Will reduce overlap with nodes in tree, and reduce the number of nodes fetched by
avoiding some branches altogether.
Cost of overlap test is higher than bounding box intersection, but it is a main-memory
cost, and can actually be done quite efficiently. Generally a win.
Insert Entry <B, ptr>
Start at root and go down to “best-fit” leaf L.
Go to child whose box needs least enlargement to cover B; resolve ties by going to
smallest area child.
If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2.
Adjust entry for L in its parent so that the box now covers (only) L1.
Add an entry (in the parent node of L) for L2. (This could cause the parent node to
recursively split.)
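The "best-fit" choice of child described above can be sketched as follows; this is only an illustration (box representation, function names, and data are assumed), picking the child whose box needs least enlargement to cover B, with ties broken by smallest area.

# Sketch: choose the child whose box needs least enlargement to cover B.
def area(box):
    (xlo, ylo), (xhi, yhi) = box
    return (xhi - xlo) * (yhi - ylo)

def enlarge(box, b):
    # smallest box covering both box and b
    (xlo, ylo), (xhi, yhi) = box
    (bxlo, bylo), (bxhi, byhi) = b
    return ((min(xlo, bxlo), min(ylo, bylo)), (max(xhi, bxhi), max(yhi, byhi)))

def choose_child(children, b):
    # children: list of (box, child); least enlargement first, then smallest area
    def cost(entry):
        box, _ = entry
        return (area(enlarge(box, b)) - area(box), area(box))
    return min(children, key=cost)

children = [(((0, 0), (5, 5)), "child A"), (((4, 4), (6, 6)), "child B")]
# Both boxes already cover B, so the tie is broken by smaller area: child B.
print(choose_child(children, ((4, 4), (5, 5))))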
Splitting a Node during Insertion
The entries in node L plus the newly inserted entry must be distributed between L1 and L2.
Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries.
Idea: Redistribute so as to minimize area of L1 plus area of L2.
R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting immediately:
Remove some (say, 30% of the) entries and reinsert them into the tree.
Could result in all reinserted entries fitting on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
Searches now take a single path to a leaf, at cost of redundancy.
GiST
The Generalized Search Tree (GiST) abstracts the “tree” nature of a class of indexes
including B+ trees and R-tree variants.
Striking similarities in insert/delete/search and even concurrency control algorithms
make it possible to provide “templates” for these algorithms that can be customized to
obtain the many different tree index structures.
B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
GiST provides an alternative for implementing other tree indexes in an ORDBS.
Indexing High-Dimensional Data
Typically, high-dimensional datasets are collections of points, not regions.
E.g., Feature vectors in multimedia applications.
Very sparse
Nearest neighbor queries are common.
R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
As dimensionality increases contrast (ratio of distances between nearest and farthest points) usually decreases; “nearest neighbor” is not meaningful.
In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable
· Prices of microprocessors, memory and disks have dropped sharply
Databases are growing increasingly large
· Large volumes of transaction data are collected and stored for later analysis.
· multimedia objects like images are increasingly stored in databases
Large-scale parallel database systems increasingly used for:
· storing large volumes of data
· processing time-consuming decision-support queries
· providing high throughput for transaction processing
Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel
· Data can be partitioned and each processor can work independently on its own partition.
Queries are expressed in high level language (SQL, translated to relational algebra)
· Makes parallelization easier.
Different queries can be run in parallel with each other.
Concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin:
Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
Choose one or more attributes as the partitioning attributes.
Choose hash function h with range 0…n – 1
Let i denote the result of hash function h applied to the partitioning attribute value of a tuple. Send the tuple to disk i.
Partitioning techniques (cont.):
Range partitioning:
Choose an attribute as the partitioning attribute.
A partitioning vector [v0, v1, ..., vn-2] is chosen.
Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1.
Tuples with v < v0 go to disk 0 and tuples with v ≥ vn-2 go to disk n-1.
E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
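The three partitioning rules can be sketched in a few lines of Python; the function names and the data format below are assumptions made only for illustration.

# Sketch of the three I/O-parallelism partitioning rules (n = number of disks).
from bisect import bisect_right

n = 3

def round_robin_disk(i):
    # send the i-th tuple inserted in the relation to disk i mod n
    return i % n

def hash_disk(value):
    # hash partitioning: map the partitioning attribute value into 0..n-1
    return hash(value) % n

partition_vector = [5, 11]            # n - 1 boundary values, as in the example

def range_disk(value):
    # range partitioning: v < 5 -> disk 0, 5 <= v < 11 -> disk 1, v >= 11 -> disk 2
    return bisect_right(partition_vector, value)

print([range_disk(v) for v in (2, 8, 20)])    # [0, 1, 2], matching the example above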
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries.
E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range
queries.
E.g., 10 ≤ r.A < 25.
Round robin:
Advantages
· Best suited for sequential scan of entire relation on each query.
· All disks have almost an equal number of tuples; retrieval work is thus well balanced
between disks.
Disadvantages
· Range queries are difficult to process
· No clustering -- tuples are scattered across all disks
Hash partitioning:
Good for sequential access
· Assuming hash function is good, and partitioning attributes form a key, tuples will be
equally distributed between disks
· Retrieval work is then well balanced between disks.
Good for point queries on partitioning attribute
Can lookup single disk, leaving others available for answering other queries.
Index on partitioning attribute can be local to disk, making lookup and update more
efficient
No clustering, so difficult to answer range queries
Range partitioning:
Provides data clustering by partitioning attribute value.
Good for sequential access
Good for point queries on partitioning attribute: only one disk needs to be accessed.
For range queries on partitioning attribute, one to a few disks may need to be accessed
Remaining disks are available for other queries.
Good if result tuples are from one to a few blocks.
If many blocks are to be fetched, they are still fetched from one to a few disks, and
potential parallelism in disk access is wasted
Example of execution skew.
Partitioning a Relation across Disks
If a relation contains only a few tuples which will fit into a single disk block, then assign the relation to a single disk.
Large relations are preferably partitioned across all the available disks.
If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed — that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
Attribute-value skew: Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
Partition skew: With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming partitioning attribute forms a key of the relation):
· Sort the relation on the partitioning attribute.
· Construct the partition vector by scanning the relation in sorted order as follows.
" After every 1/nth of the relation has been read, the value of the partitioning attribute of
the next tuple is added to the partition vector.
· n denotes the number of partitions to be constructed.
· Duplicate entries or imbalances can result if duplicates are present in partitioning attributes.
Alternative technique based on histograms used in practice
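The sorted-scan technique just described might look like the following sketch (the function name and sample data are illustrative only): scan the sorted relation and record a boundary value after every 1/n-th of the tuples.

# Sketch: build a balanced range-partitioning vector from a sorted relation.
def balanced_partition_vector(sorted_values, n):
    # return n - 1 boundary values so that each range holds about 1/n of the tuples
    size = len(sorted_values)
    return [sorted_values[(i * size) // n] for i in range(1, n)]

values = sorted([3, 7, 7, 9, 12, 15, 21, 22, 30])   # relation sorted on the attribute
print(balanced_partition_vector(values, 3))          # [9, 21]: three ranges of 3 tuples each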
Handling Skew using Histograms
Balanced partitioning vector can be constructed from histogram in a relatively straightforward fashion.
Assume uniform distribution within each range of the histogram
Histogram can be constructed by scanning relation, or sampling (blocks containing) tuples of the relation
Interquery Parallelism
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared memory parallel database, because even sequential database systems support concurrent processing.
More complicated to implement on shared-disk or shared-nothing architectures
Locking and logging must be coordinated by passing messages between processors.
Data in a local buffer may have been updated at another processor.
Cache-coherency has to be maintained — reads and writes of data in buffer must find latest version of data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared disk systems:
Before reading/writing to a page, the page must be locked in shared/exclusive mode.
On locking a page, the page must be read from disk
Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols with fewer disk reads/writes exist.
Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
Intraoperation Parallelism – parallelize the execution of each individual operation in the query.
Interoperation Parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m ≤ n - 1, to do the sorting.
Create a range-partition vector with m entries, on the sorting attributes.
Redistribute the relation using range partitioning:
all tuples that lie in the ith range are sent to processor Pi
Pi stores the tuples it received temporarily on disk Di.
This step requires I/O and communication overhead.
Each processor Pi sorts its partition of the relation locally.
Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
Final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
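A minimal single-process simulation of the range-partitioning sort (each "processor" is modelled as a plain Python list; all names and data are invented for illustration):

# Sketch: range-partitioning sort, with one list standing in for each processor.
from bisect import bisect_right

def parallel_range_sort(tuples, partition_vector):
    m = len(partition_vector) + 1                 # number of sorting processors
    partitions = [[] for _ in range(m)]
    for t in tuples:                              # redistribute by range
        partitions[bisect_right(partition_vector, t)].append(t)
    for p in partitions:                          # each processor sorts locally
        p.sort()
    return [t for p in partitions for t in p]     # final merge = concatenation

print(parallel_range_sort([14, 3, 27, 9, 21, 5], [10, 20]))   # [3, 5, 9, 14, 21, 27]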
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1
Each processor Pi locally sorts the data on disk Di.
The sorted runs on each processor are then merged to get the final sorted output.
Parallelize the merging of sorted runs as follows:
The sorted partitions at each processor Pi are range-partitioned across the processors P0...Pm-1.
Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
The sorted runs on processors P0,..., Pm-1 are concatenated to get the final result.
Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition, and if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
Let r and s be the input relations, and suppose we want to compute the join r ⋈r.A=s.B s.
r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Can use either range partitioning or hash partitioning. r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
Partitions ri and si are sent to processor Pi. Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.
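A sketch of a partitioned equi-join using hash partitioning is shown below; the relations, attribute positions, and helper names are made up for this illustration, and any standard join method could replace the simple local hash join.

# Sketch: partitioned parallel equi-join r.A = s.B using hash partitioning.
n = 3   # number of processors

def partition(rel, attr_index):
    # split a relation into n partitions by hashing its join attribute
    parts = [[] for _ in range(n)]
    for t in rel:
        parts[hash(t[attr_index]) % n].append(t)
    return parts

def local_join(ri, si):
    # a simple local hash join: build on s (join attribute at position 0),
    # probe with r (join attribute at position 1)
    build = {}
    for t in si:
        build.setdefault(t[0], []).append(t)
    return [(rt, st) for rt in ri for st in build.get(rt[1], [])]

r = [("r1", 10), ("r2", 20), ("r3", 10)]
s = [(10, "x"), (20, "y"), (30, "z")]
r_parts, s_parts = partition(r, 1), partition(s, 0)
print([pair for i in range(n) for pair in local_join(r_parts[i], s_parts[i])])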
Fragment-and-Replicate Join
Partitioning not possible for some join conditions,
e.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case – asymmetric fragment-and-replicate:
One of the relations, say r, is partitioned; any partitioning technique can be used.
The other relation, s, is replicated across all the processors.
Processor Pi then locally computes the join of ri with all of s using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used.
E.g., say s is small and r is large, and already partitioned. It may be
cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing partitioned hash join:
Assume s is smaller than r and therefore s is chosen as the build relation.
A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.
· Let ri denote the tuples of relation r that are sent to processor Pi.
As the r tuples are received at the destination processors, they are repartitioned using the function h2
Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s to produce a partition of the final result of the hash-join.
Hash-join optimizations can be applied to the parallel case
· e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory and avoid the cost of writing them and reading them back in.
Parallel Nested-Loop Join
Assume that
relation s is much smaller than relation r and that r is stored by partitioning.
there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi.
At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.
Interoperator Parallelism
Pipelined parallelism
· Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4
· Set up a pipeline that computes the three joins in parallel:
Let P1 be assigned the computation of temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = temp1 ⋈ r3
And P3 be assigned the computation of the result, temp2 ⋈ r4
· Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results
· Provided a pipelineable join evaluation algorithm is used.
Independent Parallelism
Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4
Let P1 be assigned the computation of temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = r3 ⋈ r4
And P3 be assigned the computation of temp1 ⋈ temp2
P1 and P2 can work independently in parallel
P3 has to wait for input from P1 and P2
Can pipeline output of P1 and P2 to P3, combining independent parallelism and
pipelined parallelism
Does not provide a high degree of parallelism
useful with a lower degree of parallelism.
less useful in a highly parallel system.
Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.
When scheduling execution tree in parallel system, must decide:
How to parallelize each operation and how many processors to use for it.
What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem.
E.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided as the final operation may wait a lot for inputs, while holding precious resources
Text Databases
A text database is a compilation of documents or other information in the form of a database in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both. When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF (Portable Document Format) file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates human expert knowledge, thereby making intelligent decisions quickly and in a repeatable form. They are also known as expert systems.
Rule based systems are:
– Knowledge based systems
– Part of the Artificial Intelligence field
– Computer programs that contain some subject-specific knowledge of one or more
human experts
– Made up of a set of rules that analyze user supplied information about a specific class of problems.
– Systems that utilize reasoning capabilities and draw conclusions.
Knowledge Engineering – building an expert system
Knowledge Engineers – the people who build the system
Knowledge Representation – the symbols used to represent the knowledge
Factual Knowledge – knowledge of a particular task domain that is widely shared
Heuristic Knowledge – more judgmental knowledge of performance in a task domain.
Uses of Rule based Systems
Very useful to companies with a high-level of experience and expertise that cannot easily be transferred to other members.
Solves problems that would normally be tackled by a medical or other professional.
Currently used in fields such as accounting, medicine, process control, financial service, production, and human resources
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:
A list of rules or rule base, which is a specific type of knowledge base.
An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
· Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result a conflict set is obtained, which consists of
instantiations of all satisfied productions. An instantiation of a production is an ordered
list of working memory elements that satisfies the left-hand side of the production.
· Conflict-Resolution: In this second phase, one of the production instantiations in the
conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
· Act: In this third phase, the actions of the production selected in the conflict-resolution
phase are executed. These actions may change the contents of working memory. At the
end of this phase, execution returns to the first phase. (A short sketch of this cycle follows this list.)
Temporary working memory.
A user interface or other connection to the outside world through which input and output signals are received and sent.
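The match-resolve-act cycle described above can be sketched in a few lines of Python; the rules, facts, and conflict-resolution policy below are invented purely for illustration.

# Sketch of a production-system interpreter: match -> conflict-resolution -> act.
working_memory = {("temperature", "high")}

# each rule: (name, set of condition facts, fact to add when the rule fires)
rules = [
    ("r1", {("temperature", "high")}, ("alarm", "on")),
    ("r2", {("alarm", "on")},         ("operator", "notified")),
]

fired = set()
while True:
    # Match: instantiations whose conditions are all present in working memory
    conflict_set = [r for r in rules if r[1] <= working_memory and r[0] not in fired]
    if not conflict_set:                 # no satisfied productions -> halt
        break
    # Conflict resolution: here, simply pick the first satisfied rule
    name, _, new_fact = conflict_set[0]
    # Act: execute the action, which may change working memory
    working_memory.add(new_fact)
    fired.add(name)

print(working_memory)   # temperature high, alarm on, operator notified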
Components of a Rule-Based System
Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the input data
Knowledge Engineer – decides how to represent the expert's knowledge and how to build the inference engine appropriately for the domain
Interpreter – interprets the input data and draws a conclusion based on the user's responses.
Problem-solving Models
Forward-chaining – starts from a set of conditions and moves towards some conclusion
Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures
Advantages
Provide consistent answers for repetitive decisions, processes and tasks.
Hold and maintain significant levels of information.
Reduce employee training costs
Centralize the decision making process.
Create efficiencies and reduce the time needed to solve problems.
Combine multiple human expert intelligences
Reduce the amount of human errors.
Give strategic and comparative advantages creating entry barriers to competitors
Review transactions that human experts may overlook.
Disadvantages
Lack human common sense needed in some decision making.
Will not be able to give the creative responses that human experts can give in unusual circumstances.
Domain experts cannot always clearly explain their logic and reasoning.
Challenges of automating complex processes.
Lack of flexibility and ability to adapt to changing environments.
Not being able to recognize when no answer is available.
Knowledge Bases
Knowledge-based Systems: Definition
A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise.
Heuristic rather than algorithmic
Heuristics in search vs. in KBS: general vs. domain-specific
Highly specific domain knowledge
Knowledge is separated from how it is used
KBS = knowledge-base + inference engine
KBS Architecture
The inference engine and knowledge base are separated because:
· the reasoning mechanism needs to be as stable as possible;
· the knowledge base must be able to grow and change, as knowledge is added;
· this arrangement enables the system to be built from, or converted to, a shell.
· It is reasonable to produce a richer, more elaborate, description of the typical expert
system.
· A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
KR formalism – Inference method
Logic – Resolution principle
Production rules – backward chaining (top-down, goal-directed) or forward chaining (bottom-up, data-driven)
Semantic nets & Frames – Inheritance & advanced reasoning
Case-based Reasoning – Similarity-based
KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
- Inductive Shells
- simplest
- example cases represented as a matrix of known data (premises) and resulting effects
- matrix converted into decision tree or IF-THEN statements
- examples selected for the tool
Rule-based shells
- simple to complex
- IF-THEN rules
Hybrid shells
- sophisticated & powerful
- support multiple KR paradigms & reasoning schemes
- generic tool applicable to a wide range
Special purpose shells
- specifically designed for particular types of problems
- restricted to specialised problems
Building from scratch
- requires more time and effort
- no constraints, unlike shells
- shells should be investigated first
Some example KBSs
DENDRAL (chemical)
MYCIN (medicine)
XCON/RI (computer)
Typical tasks of KBS
(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions.
e.g. diagnose reasons for engine failure
(2) Interpretation - To provide an understanding of a situation from available information. e.g.
DENDRAL
(3) Prediction - To predict a future state from a set of data or observations. e.g. Drilling Advisor, PLANT
(4) Design - To develop configurations that satisfy constraints of a design problem. e.g. XCON
(5) Planning - Both short term & long term in areas like project management, product
development or financial planning. e.g. HRM
(6) Monitoring - To check performance & flag exceptions.
e.g., KBS monitors radar data and estimates the position of the space shuttle
(7) Control - To collect and evaluate evidence and form opinions on that evidence.
e.g. control patient’s treatment
(8) Instruction - To train students and correct their performance. e.g. give medical students experience diagnosing illness
(9) Debugging - To identify and prescribe remedies for malfunctions.
e.g. identify errors in an automated teller machine network and ways to correct the errors
Advantages
- Increase availability of expert knowledge: expertise becomes accessible where experts are not available, and can be used for training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty
Limitations
- Lack of common sense
- Inflexible, Difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable
6. (a) Compare Distributed databases and conventional databases. (16)
(NOV/DEC 2010)
DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES
Advantages of distributed databases over conventional (centralised) databases:
mimics organisational structure with data
local access and autonomy without exclusion
cheaper to create and easier to expand
improved availability/reliability/performance by removing reliance on a central site
Reduced communication overhead
Most data access is local, less expensive and performs better
Improved processing power
Many machines handling the database rather than a single server
Disadvantages:
more complex to implement
more costly to maintain
security and integrity control is harder
standards and experience are lacking
Design issues are more complex

7. (a) Explain the Multi-Version Locks and Recovery in Query Languages.
(NOV/DEC 2010)
Multi-Version Locks: Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer "version." Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk – when updated, the entire document can be re-written rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential "point in time" consistent views. In fact read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these "versions" of the data.
This avoids managing locks for read transactions because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect future "version" but at the transaction ID that the read is working at, everything is guaranteed to be consistent because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a "snapshot" of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version would have a write timestamp and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS (Ti)).
If a transaction (Ti) wants to write to an object, and if there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed, which is to say a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object would also have a read timestamp, and if a transaction Ti wanted to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction TS (Ti).
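These timestamp rules can be illustrated with a toy in-memory model; the class and field names below are assumptions for the sketch (covering version selection on reads and the read-timestamp abort rule on writes), not the implementation of any real DBMS.

# Toy sketch of MVCC timestamp rules for a single object.
class Version:
    def __init__(self, value, wts, rts):
        self.value = value      # data value of this version
        self.wts = wts          # write timestamp of the creating transaction
        self.rts = rts          # largest timestamp of any transaction that read it

class MVObject:
    def __init__(self, value):
        self.versions = [Version(value, 0, 0)]

    def read(self, ts):
        # Ti reads the most recent version whose write timestamp precedes TS(Ti)
        v = max((v for v in self.versions if v.wts <= ts), key=lambda v: v.wts)
        v.rts = max(v.rts, ts)
        return v.value

    def write(self, ts, value):
        # abort if a later transaction already read the version Ti would supersede;
        # otherwise create a new version stamped with ts
        v = max((v for v in self.versions if v.wts <= ts), key=lambda v: v.wts)
        if v.rts > ts:
            return False        # Ti is aborted and restarted
        self.versions.append(Version(value, ts, ts))
        return True

p = MVObject("Foo")
print(p.read(1))            # 'Foo'  -- reader at ts=1 sees the old version
print(p.write(2, "Hello"))  # True   -- writer at ts=2 creates a new version
print(p.read(1))            # 'Foo'  -- the old reader's snapshot is unchanged
print(p.read(3))            # 'Hello'
print(p.write(2, "X"))      # False  -- a later read (ts=3) blocks this write, abort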

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At time t1 the state of the DB could be:
Time Object1 Object2
t1 “Hello” “Bar”
t0 “Foo” “Bar”

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo" but that value has been superseded. It is not deleted because the database holds “multiple versions” but will be deleted later. If a long running transaction starts a read operation, it will operate at transaction "t1" and see this state. If there is a concurrent update (during that long-running read transaction) which deletes
Object 2 and adds Object 3 = “foo-bar” the database will look like
Time Object1 Object2 Object3
t2 “Hello” (deleted) “Foo-Bar”
t1 “Hello” “Bar”
t0 “Foo” “Bar”
Now there is a new version as of transaction ID t2. Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2, so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID, reads without any locks.
Recovery
 (b)Discuss client/server model and mobile databases. (16)
(NOV/DEC 2010)
Mobile Databases
Recent advances in portable and wireless technology led to mobile computing, a new
dimension in data communication and processing.
Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
Some of the software problems – which may involve data management, transaction
management, and database recovery – have their origins in distributed database systems.
In mobile computing, the problems are more difficult, mainly:
The limited and intermittent connectivity afforded by wireless communications.
The limited life of the power supply(battery).
The changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig 30.1.
It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
Fixed hosts are general purpose computers configured to manage mobile units.
Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications
The wireless medium has bandwidth significantly lower than that of a wired network.
The current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of
hundreds of megabits per second.
The other characteristics distinguish wireless connectivity options:
interference,
locality of access,
range,
support for packet switching,
seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
Modern wireless networks can transfer data in units called packets, as wired networks do, in order to conserve bandwidth.
Client/Network Relationships –
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain, while maintaining information access contiguity.
The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.
In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
MANET applications can be considered as peer-to-peer, meaning that a mobile unit
is simultaneously a client and a server.
Transaction processing and data consistency control become more difficult since there is no central control in this architecture.
Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
Sample MANET applications are multi-user games, shared whiteboard, distributed calendars, and battle information sharing.
Characteristics of Mobile Environments
The characteristics of mobile computing include:
Communication latency
Intermittent connectivity
Limited battery life
Changing client location
The server may not be able to reach a client.
A client may be unreachable because it is dozing – in an energy-conserving state in
which many subsystems are shut down – or because it is out of range of a base station.
In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
Proxies for unreachable components are added to the architecture.
For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients.
The latency involved in wireless communication makes scalability a problem.
Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data whenever possible.
A server can simply broadcast data periodically.
Broadcast also reduces the load on the server, as clients do not have to maintain
active connections to it.
Client mobility also poses many data management challenges.
Servers must keep track of client locations in order to efficiently route messages to them.
Client data should be stored in the network location that minimizes the traffic necessary to access it.
The act of moving between cells must be transparent to the client.
The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
Client mobility also allows new applications that are location-based.
Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
The entire database is distributed mainly among the wired components, possibly with full or partial replication.
A base station or fixed host manages its own database with a DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
The database is distributed among wired and wireless components.
Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as it is applied to mobile databases:
Data distribution and replication
Transactions models
Query processing
Recovery and fault tolerance
Mobile database design
Location-based service
Division of labor
Security
Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as Intermittently Synchronized Database Environment
(ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
A client connects to the server when it wants to exchange updates.
The communication can be unicast –one-on-one communication between the server and the client– or multicast– one sender or server may periodically communicate to a set of receivers or update a group of clients.
A server cannot connect to a client at will.
The characteristics of ISDBs (contd.) :
Issues of wireless versus wired client connections and power conservation are generally immaterial.
A client is free to manage its own data and transactions while it is disconnected. It
can also perform its own recovery to some extent.
A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)
Optimization
Query optimization is an important task in a relational DBMS.
Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
Must estimate cost of each plan that is considered.
• Must estimate size of result and cost for each plan node.
• Key issues: Statistics, indexes, operator implementations.
Plan: Tree of R.A. ops, with choice of alg for each op.
Each operator typically implemented using a `pull’ interface: when an operator is
`pulled’ for the next output tuples, it `pulls’ on its inputs and computes them.
Two main issues:
For a given query, what plans are considered?
• Algorithm to search plan space for cheapest (estimated) plan.
How is the cost of a plan estimated?
Ideally: Want to find best plan. Practically: Avoid worst plans!
We will study the System R approach.
Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to old schema; rname added for variations.
Reserves:
Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
All available access methods, for each reln in FROM clause.
All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner reln in the FROM clause, considering all reln permutations and join methods.)
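To make "all relation permutations" concrete, a tiny enumeration sketch follows; Boats is added here only to have a third relation, and real optimizers prune this space (e.g., with dynamic programming) rather than listing it exhaustively.

# Sketch: enumerate the left-deep join orders for the relations in a FROM clause.
from itertools import permutations

relations = ["Sailors", "Reserves", "Boats"]

for perm in permutations(relations):
    # a left-deep tree joins one relation at a time, the new relation as inner
    plan = perm[0]
    for inner in perm[1:]:
        plan = f"({plan} JOIN {inner})"
    print(plan)
# 3! = 6 left-deep orders; each join could additionally use any available join method.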
8. (a)Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia databases
To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database
Rather than storing them outside the database, in a file system
The database must handle large object representation.
Similarity-based retrieval must be provided by special index structures.
Must provide guaranteed steady retrieval rates for continuous-media data.
Multimedia Data Formats
Store and transmit multimedia data in compressed form
JPEG and GIF the most widely used formats for image data.
MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
MPEG-1 quality comparable to VHS video tape.
Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB
MPEG-2 designed for digital broadcast systems and digital video disks; negligible loss of video quality.
Compresses 1 minute of audio-video to approximately 17 MB.
Several alternatives of audio encoding
MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.
Continuous-Media Data
Most important types are video and audio data.
Characterized by high data volumes and real-time information-delivery requirements.
Data must be delivered sufficiently fast that there are no gaps in the audio or video.
Data must be delivered at a rate that does not cause overflow of system buffers.
Synchronization among distinct data streams must be maintained
E.g., video of a person speaking must show lips moving synchronously with the audio
Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals
must guarantee end-to-end delivery rates
Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
Head-end terminals - used to view multimedia data
PCs or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity based retrieval
Pictorial data: Two pictures or images that are slightly different as represented in the
database may be considered the same by a user.
e.g., identify similar designs for registering a new trademark.
Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking.
e.g., test user input against stored commands.
Handwritten data: Identify a handwritten data item or command stored in the database
(b) Explain the features of active and deductive databases in detail. (8)
(NOV/DEC 2010)
15 Deductive Databases
SQL-92 cannot express some queries:
Are we running low on any parts needed to build a ZX600 sports car?
What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries and IBM's DB2 system already supports recursive queries as required in SQL:1999.
15.1 Introduction to recursive queries
15.1.1 Datalog:
SQL queries can be read as follows: “If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer.”
Datalog is a query language that has the same if-then flavor:
New: The answer table can appear in the From clause, i.e., be defined recursively.
Prolog style syntax is commonly used.
The Problem with R.A. and SQL-92:
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire; a single join takes us one level down the Assembly hierarchy.
To find components that are one level deeper (e.g., rim), need another join.
To find all components, need as many joins as there are levels in the given instance!
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression.
15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries.
i. Least Model Semantics
The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is
smaller than or equal to v.
In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.
ii. Safe Datalog Programs
Consider the following program:
Complex Parts (Part) :- Assembly (Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.
iii. The Fixpoint Operator
Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v)=v.
Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5} = {1, 2, 4, 5, 10}
the set of all integers is a fixpoint of double+.
the set of all even integers is another fixpoint of double+; it is smaller than the first
fixpoint.
iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program! These results provide the basis for Datalog query processing. Users can understand a program in terms of `If the body is true, the head is also true,' thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
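A naive evaluation of this idea can be sketched as follows for the Comp/Assembly example; the toy Assembly facts and function names are assumed for illustration, and the loop simply reapplies both rules until nothing new is produced (the least fixpoint).

# Sketch: naive least-fixpoint evaluation of
#   Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
#   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1)}

def apply_rules(comp):
    # one application of both program rules to the current Comp instance
    new = {(p, s) for (p, s, q) in assembly}                           # rule 1
    new |= {(p, s2) for (p, p2, q) in assembly
            for (c2, s2) in comp if c2 == p2}                          # rule 2
    return comp | new

comp = set()
while True:
    nxt = apply_rules(comp)
    if nxt == comp:              # reached the (least) fixpoint
        break
    comp = nxt

print(sorted(comp))   # includes ('trike', 'spoke') and ('trike', 'tire')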
b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty >2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
If rules contain not there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
But we have Small(trike) if Rule 2 is applied first!
There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).
15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification
T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S, in the body.
Stratified program: If T depends on not S, then S cannot depend on T (or not T).
If a program is stratified, the tables in the program can be partitioned into strata:
Stratum 0: All database tables.
Stratum i: Tables defined in terms of tables in Stratum i and lower strata.
If T depends on not S, S is in lower stratum than T.
Relational Algebra and Stratified Datalog
Selection: Result(Y):- R(X, Y), X=c.
Projection: Result(Y):- R(X, Y).
Cross-product: Result(X, Y, U, V):- R(X, Y), S (U, V).
Set-difference: Result(X, Y):- R(X, Y), not S(X, Y).
Union: Result(X, Y):- R(X, Y).
Result(X, Y):- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional
selection on Big2:
WITH
Big2 (Part) AS
(SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
Small2 (Part) AS
((SELECT A2.Part FROM Assembly A2)
EXCEPT
(SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts (Part, SUM (<Qty>)):- Assembly (Part, Subpt, Qty).
The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
In order to apply such a rule, must have all of Assembly relation available.
Stratification with respect to use of < … > is the usual restriction to deal with this problem; similar to negation.
15.4 Efficient evaluation of recursive queries
Repeated inferences: When recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding Repeated Interferences:
Seminaive Fixpoint Evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (Which means this inference could not have been carried out in earlier iterations.)
for each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
Rewrite the program to use the delta tables, and update the delta tables between iterations.
Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).
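A seminaive version of the same computation might look like the sketch below (toy data again); only facts derived in the previous iteration, held in delta_Comp, are joined with Assembly, so no inference is repeated.

# Sketch: seminaive fixpoint evaluation of Comp using a delta table.
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1)}

comp = {(p, s) for (p, s, q) in assembly}      # rule 1: base facts
delta_comp = set(comp)                         # tuples generated in the last iteration

while delta_comp:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
    derived = {(p, s2) for (p, p2, q) in assembly
               for (c2, s2) in delta_comp if c2 == p2}
    delta_comp = derived - comp                # keep only genuinely new tuples
    comp |= delta_comp

print(sorted(comp))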
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1,S2) :- Assembly(P1,S1,Q1), SameLev(P1,P2), Assembly(P2,S2,Q2).
There is a tuple (S1,S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.
Suppose that we want to find all SameLev tuples with spoke in the first column. We should “push” this selection into the fixpoint computation to avoid unnecessary inferences.
But we can’t just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev (spoke, seat):- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).
The Magic Sets Algorithm
Idea: Define a “filter” table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL (P1):- Magic_SL (S1), Assembly (P1, S1, Q1).
Magic_SL (spoke).
SameLev(S1,S2) :- Magic_SL(S1), Assembly(P1,S1,Q1), Assembly(P2,S2,Q2).
SameLev(S1,S2):- Magic_SL(S1), Assembly(P1,S1,Q1),SameLev(P1,P2), Assembly(P2,S2,Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
Add `Magic' Filters: Modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
Define the `Magic' relations: We must create new rules to define the `Magic' relations.
Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic R.