Reference

Complete guide to all relational algebra operations

Selection (σ - sigma)

Filters tuples based on a condition. Returns only the rows that satisfy the condition.

Syntax:

select condition (Relation)

Examples:

select salary > 70000 (Employee)select dept_id = 10 (Employee)select salary > 70000 and dept_id = 10 (Employee)

SQL Equivalent:

SELECT * FROM Employee WHERE salary > 70000

PySpark Equivalent:

df.filter(df.salary > 70000)

Projection (π - pi)

Selects specific columns from a relation. Returns a new relation with only the specified attributes. Automatically removes duplicates.

Syntax:

project col1, col2, ... (Relation)

Examples:

project name (Employee)project name, salary (Employee)

SQL Equivalent:

SELECT DISTINCT name, salary FROM Employee

PySpark Equivalent:

df.select("name", "salary").distinct()

Natural Join (⨝)

Combines two relations based on common attributes. Automatically matches columns with the same name.

Syntax:

Relation1 join Relation2

Example:

Employee join Department

SQL Equivalent:

SELECT * FROM Employee NATURAL JOIN Department

PySpark Equivalent:

df1.join(df2, on=common_cols)

Theta Join (⨝θ)

Combines two relations based on a specified condition, rather than matching on common column names. Use this when you need to join on an arbitrary comparison.

Syntax:

join condition (Relation1, Relation2)

Examples:

join dept_id = Department.id (Employee, Department)join salary > budget (Employee, Department)

SQL Equivalent:

SELECT * FROM Employee JOIN Department ON Employee.dept_id = Department.id

PySpark Equivalent:

df1.join(df2, df1.dept_id == df2.id)

Cartesian Product (×)

Combines every tuple from one relation with every tuple from another. Produces all possible combinations. The result size is |R| × |S| rows.

Syntax:

Relation1 product Relation2

Example:

Employee product Department

SQL Equivalent:

SELECT * FROM Employee CROSS JOIN Department

PySpark Equivalent:

df1.crossJoin(df2)

Union (∪)

Combines tuples from two union-compatible relations (same columns). Returns all tuples that appear in either relation, removing duplicates.

Syntax:

Relation1 union Relation2

Example:

project name (select dept_id = 10 (Employee)) union project name (select dept_id = 20 (Employee))

SQL Equivalent:

SELECT name FROM Employee WHERE dept_id = 10 UNION SELECT name FROM Employee WHERE dept_id = 20

PySpark Equivalent:

df1.union(df2)

Difference (−)

Returns tuples that exist in the first relation but not in the second. Both relations must be union-compatible (same columns).

Syntax:

Relation1 difference Relation2

Example:

project name (Employee) difference project name (select salary > 80000 (Employee))

SQL Equivalent:

SELECT name FROM Employee EXCEPT SELECT name FROM Employee WHERE salary > 80000

PySpark Equivalent:

df1.subtract(df2)

Rename (ρ - rho)

Renames one or more attributes in a relation. Useful for resolving naming conflicts before joins or for clarity.

Syntax:

rename oldName->newName (Relation)

Examples:

rename id->emp_id (Employee)rename name->dept_name (Department)

SQL Equivalent:

SELECT id AS emp_id, name, salary, dept_id, hire_date FROM Employee

PySpark Equivalent:

df.withColumnRenamed("id", "emp_id")

All Operations

OperationSyntaxDescription
Selection (σ)select cond (R)Filter rows
Projection (π)project cols (R)Select columns
Product (×)R product SCartesian product
Natural Join (⨝)R join SJoin on common attributes
Theta Joinjoin cond (R, S)Join with condition
Union (∪)R union SCombine relations
Difference (-)R difference STuples in R but not S
Rename (ρ)rename old->new (R)Rename attributes

Comparison Operators

=Equal to
!=Not equal to
<Less than
>Greater than
<=Less than or equal
>=Greater than or equal

Logical Operators

and- Both conditions must be true
or- At least one condition must be true
not- Negates a condition

Example:

select (salary > 70000 and dept_id = 10) or salary > 90000 (Employee)