Reference
Complete guide to all relational algebra operations
Selection (σ - sigma)
Filters tuples based on a condition. Returns only the rows that satisfy the condition.
Syntax:
select condition (Relation)Examples:
select salary > 70000 (Employee)select dept_id = 10 (Employee)select salary > 70000 and dept_id = 10 (Employee)SQL Equivalent:
SELECT * FROM Employee WHERE salary > 70000PySpark Equivalent:
df.filter(df.salary > 70000)Projection (π - pi)
Selects specific columns from a relation. Returns a new relation with only the specified attributes. Automatically removes duplicates.
Syntax:
project col1, col2, ... (Relation)Examples:
project name (Employee)project name, salary (Employee)SQL Equivalent:
SELECT DISTINCT name, salary FROM EmployeePySpark Equivalent:
df.select("name", "salary").distinct()Natural Join (⨝)
Combines two relations based on common attributes. Automatically matches columns with the same name.
Syntax:
Relation1 join Relation2Example:
Employee join DepartmentSQL Equivalent:
SELECT * FROM Employee NATURAL JOIN DepartmentPySpark Equivalent:
df1.join(df2, on=common_cols)Theta Join (⨝θ)
Combines two relations based on a specified condition, rather than matching on common column names. Use this when you need to join on an arbitrary comparison.
Syntax:
join condition (Relation1, Relation2)Examples:
join dept_id = Department.id (Employee, Department)join salary > budget (Employee, Department)SQL Equivalent:
SELECT * FROM Employee JOIN Department ON Employee.dept_id = Department.idPySpark Equivalent:
df1.join(df2, df1.dept_id == df2.id)Cartesian Product (×)
Combines every tuple from one relation with every tuple from another. Produces all possible combinations. The result size is |R| × |S| rows.
Syntax:
Relation1 product Relation2Example:
Employee product DepartmentSQL Equivalent:
SELECT * FROM Employee CROSS JOIN DepartmentPySpark Equivalent:
df1.crossJoin(df2)Union (∪)
Combines tuples from two union-compatible relations (same columns). Returns all tuples that appear in either relation, removing duplicates.
Syntax:
Relation1 union Relation2Example:
project name (select dept_id = 10 (Employee)) union project name (select dept_id = 20 (Employee))SQL Equivalent:
SELECT name FROM Employee WHERE dept_id = 10 UNION SELECT name FROM Employee WHERE dept_id = 20PySpark Equivalent:
df1.union(df2)Difference (−)
Returns tuples that exist in the first relation but not in the second. Both relations must be union-compatible (same columns).
Syntax:
Relation1 difference Relation2Example:
project name (Employee) difference project name (select salary > 80000 (Employee))SQL Equivalent:
SELECT name FROM Employee EXCEPT SELECT name FROM Employee WHERE salary > 80000PySpark Equivalent:
df1.subtract(df2)Rename (ρ - rho)
Renames one or more attributes in a relation. Useful for resolving naming conflicts before joins or for clarity.
Syntax:
rename oldName->newName (Relation)Examples:
rename id->emp_id (Employee)rename name->dept_name (Department)SQL Equivalent:
SELECT id AS emp_id, name, salary, dept_id, hire_date FROM EmployeePySpark Equivalent:
df.withColumnRenamed("id", "emp_id")All Operations
| Operation | Syntax | Description |
|---|---|---|
| Selection (σ) | select cond (R) | Filter rows |
| Projection (π) | project cols (R) | Select columns |
| Product (×) | R product S | Cartesian product |
| Natural Join (⨝) | R join S | Join on common attributes |
| Theta Join | join cond (R, S) | Join with condition |
| Union (∪) | R union S | Combine relations |
| Difference (-) | R difference S | Tuples in R but not S |
| Rename (ρ) | rename old->new (R) | Rename attributes |
Comparison Operators
=Equal to!=Not equal to<Less than>Greater than<=Less than or equal>=Greater than or equalLogical Operators
and- Both conditions must be trueor- At least one condition must be truenot- Negates a conditionExample:
select (salary > 70000 and dept_id = 10) or salary > 90000 (Employee)