Core Concepts¶

Understanding these fundamental concepts is essential for effectively using sortition algorithms.

What is Sortition?¶

Sortition is the random selection of representatives from a larger population, designed to create panels that reflect the demographic composition of the whole group. Unlike simple random sampling (which could accidentally select all men or all young people), sortition uses stratified random selection to ensure demographic balance.

Historical Context¶

Sortition has ancient roots in Athenian democracy, where citizens were chosen by lot to serve in government. Modern applications include:

Citizens' Assemblies: Groups that deliberate on policy issues
Deliberative Polls: Representative samples for public opinion research
Jury Selection: Court juries selected from voter rolls
Participatory Budgeting: Community members deciding budget priorities

Key Components¶

Features and Feature Values¶

Features are demographic characteristics used for stratification:

Gender, Age, Education, Income, Location, etc.

Feature Values are the specific categories within each feature:

Gender: Male, Female, Non-binary
Age: 18-30, 31-50, 51-65, 66+
Location: Urban, Suburban, Rural

Quotas and Targets¶

Each feature value has minimum and maximum quotas that define the acceptable range for selection:

feature,value,min,max
Gender,Male,45,55
Gender,Female,45,55
Age,18-30,20,30
Age,31-50,30,40
Age,51+,25,35

With these quotas, a panel of 100 people must include 45-55 men, 45-55 women, 20-30 people aged 18-30, and so on.
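As a rough illustration of how min/max quotas behave, the sketch below parses the quota table above with the standard library and checks whether a count falls inside a range. The `Quota` class and `parse_quotas` helper are hypothetical names for this example only; they are not part of the library's API.

```python
import csv
import io
from dataclasses import dataclass

# The quota table from above, inlined so the example is self-contained.
QUOTAS_CSV = """feature,value,min,max
Gender,Male,45,55
Gender,Female,45,55
Age,18-30,20,30
Age,31-50,30,40
Age,51+,25,35
"""


@dataclass(frozen=True)
class Quota:
    """Hypothetical container for one feature value's acceptable range."""
    minimum: int
    maximum: int

    def allows(self, count: int) -> bool:
        # Both bounds are inclusive.
        return self.minimum <= count <= self.maximum


def parse_quotas(text: str) -> dict[tuple[str, str], Quota]:
    """Map (feature, value) pairs to their Quota."""
    quotas = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["feature"], row["value"])
        quotas[key] = Quota(int(row["min"]), int(row["max"]))
    return quotas


quotas = parse_quotas(QUOTAS_CSV)
print(quotas[("Gender", "Male")].allows(50))  # 50 men is inside 45-55
print(quotas[("Age", "18-30")].allows(35))    # 35 exceeds the maximum of 30
```

Keying the dictionary by `(feature, value)` keeps each range addressable on its own, which is convenient when tallying a panel one feature at a time.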

People Pool¶

The candidate pool contains all eligible individuals with their demographic data:

id,Name,Gender,Age,Location,Email
p001,Alice Smith,Female,18-30,Urban,alice@example.com
p002,Bob Jones,Male,31-50,Rural,bob@example.com
...
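Before running a selection it can help to see how the pool breaks down by feature value. The following sketch tallies a small pool with the standard library; the third row and the `tally_pool` helper are made up for illustration and are not taken from the library.

```python
import csv
import io
from collections import Counter

# Two rows from the example above plus one invented row (p003).
PEOPLE_CSV = """id,Name,Gender,Age,Location,Email
p001,Alice Smith,Female,18-30,Urban,alice@example.com
p002,Bob Jones,Male,31-50,Rural,bob@example.com
p003,Carol Diaz,Female,31-50,Urban,carol@example.com
"""

FEATURES = ["Gender", "Age", "Location"]


def tally_pool(text, features):
    """Count how many candidates hold each value of each feature."""
    counts = {feature: Counter() for feature in features}
    for row in csv.DictReader(io.StringIO(text)):
        for feature in features:
            counts[feature][row[feature]] += 1
    return counts


pool_counts = tally_pool(PEOPLE_CSV, FEATURES)
for feature, counter in pool_counts.items():
    print(feature, dict(counter))
```

Comparing these tallies with the quota minimums is a quick way to spot categories where the pool is too thin.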

Address Checking and Household Diversity¶

A critical feature for ensuring true representativeness is address checking: preventing multiple people from the same household from being selected.

Why Address Checking Matters¶

Without address checking, you might accidentally select:

Multiple family members with similar views
Several housemates from a shared address
People who influence each other's opinions

This reduces the independence and diversity of your panel.

How It Works¶

Configure address checking in your settings:

settings = Settings(
    check_same_address=True,
    check_same_address_columns=["Address", "Postcode"]
)

When someone is selected:

The algorithm identifies anyone else with matching values in the specified columns
Those people are removed from the remaining pool
This ensures that at most one person per matching address ends up on the panel
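The removal step can be sketched in a few lines. This is a simplified illustration that picks people uniformly at random; it ignores quotas and is not the library's selection algorithm. The helper names and the sample pool are hypothetical.

```python
import random


def household_key(person, columns):
    """Build the exact-match key from the configured address columns."""
    return tuple(person[column] for column in columns)


def pick_distinct_households(pool, columns, k, rng):
    """Pick up to k people, dropping housemates of each pick from the pool."""
    remaining = list(pool)
    selected = []
    while remaining and len(selected) < k:
        chosen = rng.choice(remaining)
        selected.append(chosen)
        key = household_key(chosen, columns)
        # This also removes the chosen person, whose own key matches.
        remaining = [p for p in remaining if household_key(p, columns) != key]
    return selected


# Invented pool: p001 and p002 share a household.
pool = [
    {"id": "p001", "Address": "1 High St", "Postcode": "AB1 2CD"},
    {"id": "p002", "Address": "1 High St", "Postcode": "AB1 2CD"},
    {"id": "p003", "Address": "2 High St", "Postcode": "AB1 2CD"},
    {"id": "p004", "Address": "9 Low Rd", "Postcode": "EF3 4GH"},
]
columns = ["Address", "Postcode"]

panel = pick_distinct_households(pool, columns, k=3, rng=random.Random(0))
print([person["id"] for person in panel])
```

Because the pool contains only three distinct households, a panel of three always contains exactly one of p001 and p002.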

Address Column Strategies¶

Single column approach:

check_same_address_columns = ["Full_Address"]

Multi-column approach (more flexible):

check_same_address_columns = ["Street", "City", "Postcode"]

Exact vs. fuzzy matching: The current implementation requires exact string matches, so "12 High St" and "12 High Street" count as different addresses. If you need fuzzy matching, clean and standardize your address data before import.
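One way to prepare data for exact matching is to normalize each address field before import. The sketch below is a minimal, assumed cleaning step (the `normalize_address` function is hypothetical); it handles case, punctuation, and whitespace, but not abbreviations such as "St" versus "Street".

```python
import re


def normalize_address(value: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", value)         # drop punctuation
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse whitespace
    return cleaned.casefold()


raw_addresses = ["12 High St.", "  12  high st ", "12 High Street"]
normalized = [normalize_address(address) for address in raw_addresses]
print(normalized)
```

The first two inputs normalize to the same string and would now be treated as one household; the third still differs and would need an abbreviation table or manual review.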

Selection Algorithms¶

Different algorithms optimize for different fairness criteria:

Maximin (Default)¶

Goal: Maximize the minimum selection probability
Good for: Ensuring no one in the pool is left with a very low chance of selection
Trade-off: Only the lowest probability is optimized, so the rest of the distribution may be uneven

Nash¶

Goal: Maximize the product of all selection probabilities
Good for: Balanced representation across all groups
Trade-off: Complex optimization, harder to interpret

Leximin¶

Goal: Lexicographic maximin, i.e. maximize the lowest selection probability, then the second lowest, and so on (requires Gurobi license)
Good for: Strict fairness guarantees
Trade-off: Requires commercial solver
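To make the three objectives above concrete, the sketch below scores some made-up selection probabilities for four candidates. The numbers are purely illustrative (each list sums to 2, as it would for a panel of two) and the functions only evaluate the objectives; they do not perform the optimization the library carries out.

```python
import math


def maximin_score(probs):
    """Maximin looks only at the lowest selection probability."""
    return min(probs)


def nash_score(probs):
    """Nash welfare is the product of all selection probabilities."""
    return math.prod(probs)


def leximin_key(probs):
    """Leximin compares sorted probabilities from the lowest upward."""
    return sorted(probs)


# Invented per-person selection probabilities for three allocations.
allocations = {
    "a": [0.1, 0.5, 0.6, 0.8],
    "b": [0.3, 0.3, 0.7, 0.7],
    "c": [0.3, 0.5, 0.5, 0.7],
}

for name, probs in allocations.items():
    print(name, maximin_score(probs), round(nash_score(probs), 4), leximin_key(probs))

best_by_leximin = max(allocations, key=lambda name: leximin_key(allocations[name]))
best_by_nash = max(allocations, key=lambda name: nash_score(allocations[name]))
print(best_by_leximin, best_by_nash)
```

Allocations "b" and "c" tie under maximin because both have a lowest probability of 0.3. Leximin breaks the tie by comparing the second-lowest values, and Nash also prefers "c" because its product is larger.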

Legacy¶

Goal: Backwards compatibility with older implementations
Good for: Reproducing historical selections
Trade-off: Less sophisticated than modern algorithms

The Selection Process¶

1. Feasibility Checking¶

Before selection begins, the algorithm verifies that quotas are achievable:

features.check_desired(number_people_wanted=100)
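For intuition, a feasibility check can be approximated with simple arithmetic. The sketch below is an assumption about what such a check might look for, not a description of how `check_desired` works internally; `find_quota_problems` and its data layout are hypothetical. The conditions are necessary but not sufficient, since interactions between features can still make quotas unsatisfiable.

```python
def find_quota_problems(quotas, pool_counts, panel_size):
    """Return human-readable reasons why the quotas cannot be met."""
    problems = []
    for feature, values in quotas.items():
        total_min = sum(low for low, _ in values.values())
        total_max = sum(high for _, high in values.values())
        if total_min > panel_size:
            problems.append(
                f"{feature}: minimums sum to {total_min}, above panel size {panel_size}"
            )
        if total_max < panel_size:
            problems.append(
                f"{feature}: maximums sum to {total_max}, below panel size {panel_size}"
            )
        for value, (low, _) in values.items():
            available = pool_counts.get(feature, {}).get(value, 0)
            if available < low:
                problems.append(
                    f"{feature}={value}: need at least {low}, pool has {available}"
                )
    return problems


# Invented example data: quotas as {feature: {value: (min, max)}}.
good_quotas = {"Gender": {"Male": (45, 55), "Female": (45, 55)}}
bad_quotas = {"Gender": {"Male": (90, 100), "Female": (90, 100)}}
pool_counts = {"Gender": {"Male": 300, "Female": 300}}

print(find_quota_problems(good_quotas, pool_counts, panel_size=100))
print(find_quota_problems(bad_quotas, pool_counts, panel_size=100))
```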

2. Algorithm Execution¶

The chosen algorithm finds an optimal probability distribution over possible committees.

3. Lottery Rounding¶

The probability distribution is converted to concrete selections using randomized rounding.
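In its simplest form, turning a distribution over committees into one concrete panel means drawing a single committee with the given weights. The sketch below assumes that reading; the committees and weights are invented, and the library's actual rounding procedure may differ.

```python
import random
from collections import Counter

# Invented distribution over four possible two-person committees.
committees = [
    ("p001", "p002"),
    ("p001", "p003"),
    ("p002", "p004"),
    ("p003", "p004"),
]
weights = [0.4, 0.1, 0.1, 0.4]


def marginal_probabilities(committees, weights):
    """Each person's selection probability implied by the distribution."""
    marginals = Counter()
    for committee, weight in zip(committees, weights):
        for person in committee:
            marginals[person] += weight
    return dict(marginals)


def draw_committee(committees, weights, rng):
    """Draw one committee with probability proportional to its weight."""
    return rng.choices(committees, weights=weights, k=1)[0]


marginals = marginal_probabilities(committees, weights)
print(marginals)
print(draw_committee(committees, weights, random.Random(7)))
```

Every person in this toy distribution has the same marginal probability of 0.5, even though the committees themselves are not equally likely.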

4. Validation¶

Selected committees are checked against quotas to ensure targets were met.
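A validation pass of this kind can be sketched as a tally followed by a range check. The `quota_violations` helper and the small panel below are hypothetical, intended only to show the shape of the check.

```python
from collections import Counter


def quota_violations(selected, quotas):
    """List (feature, value, count, min, max) for every quota that is missed."""
    violations = []
    for feature, values in quotas.items():
        counts = Counter(person[feature] for person in selected)
        for value, (low, high) in values.items():
            count = counts.get(value, 0)
            if not low <= count <= high:
                violations.append((feature, value, count, low, high))
    return violations


# Invented quotas for a panel of four: {feature: {value: (min, max)}}.
quotas = {"Gender": {"Male": (1, 2), "Female": (2, 3)}}

balanced = [{"Gender": "Male"}, {"Gender": "Male"},
            {"Gender": "Female"}, {"Gender": "Female"}]
skewed = [{"Gender": "Male"}, {"Gender": "Male"},
          {"Gender": "Male"}, {"Gender": "Female"}]

print(quota_violations(balanced, quotas))
print(quota_violations(skewed, quotas))
```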

Randomness and Reproducibility¶

Random Seeds¶

For reproducible results (e.g., for auditing), set a random seed:

settings = Settings(random_number_seed=42)

Security Considerations¶

For production use, avoid fixed seeds: anyone who knows the seed and the input data can reproduce, and therefore predict, the outcome. When no seed is specified, the library uses Python's secrets module.
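The difference between seeded and unseeded draws can be shown with the standard library alone. This sketch does not reproduce the library's internals; the `draw` helper is hypothetical and simply switches between a seeded `random.Random` and `secrets.SystemRandom`.

```python
import random
import secrets

pool = [f"p{i:03d}" for i in range(1, 21)]


def draw(pool, k, seed=None):
    """Sample k people; a seed gives a repeatable draw, None uses OS entropy."""
    rng = random.Random(seed) if seed is not None else secrets.SystemRandom()
    return rng.sample(pool, k)


first = draw(pool, 5, seed=42)
second = draw(pool, 5, seed=42)
unseeded = draw(pool, 5)

print(first == second)  # the same seed repeats the same draw
print(unseeded)         # varies from run to run
```

A seeded draw is useful when an auditor needs to rerun a selection and obtain the same panel, while the unseeded draw cannot be predicted in advance.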

Data Quality Considerations¶

Feature Consistency¶

Ensure feature values are consistent between your quotas file and candidate data:

# demographics.csv
Gender,Male,45,55
Gender,Female,45,55

# candidates.csv - values must match exactly
person1,Male,...    # ✅ Matches
person2,male,...    # ❌ Case mismatch
person3,M,...       # ❌ Abbreviation mismatch
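A quick pre-flight check can catch these mismatches before a selection run. The helper below is a hypothetical sketch using plain dictionaries; it is not the library's own validation code.

```python
def find_unknown_values(candidates, allowed):
    """Return (id, feature, value) for values absent from the quotas file."""
    issues = []
    for person in candidates:
        for feature, values in allowed.items():
            if person[feature] not in values:
                issues.append((person["id"], feature, person[feature]))
    return issues


# Allowed values as they would appear in the quotas file.
allowed = {"Gender": {"Male", "Female"}}

candidates = [
    {"id": "person1", "Gender": "Male"},
    {"id": "person2", "Gender": "male"},  # case mismatch
    {"id": "person3", "Gender": "M"},     # abbreviation mismatch
]

for issue in find_unknown_values(candidates, allowed):
    print(issue)
```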

Missing Data¶

The library requires complete demographic data. Handle missing values before import:

Impute missing values
Create "Unknown" categories
Exclude incomplete records
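Two of the options above, an "Unknown" category and exclusion, can be sketched with plain dictionaries. The helpers and sample rows are hypothetical. Note that an "Unknown" category only works if the quotas file also contains a matching row for it.

```python
def is_missing(value):
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is None or not value.strip()


def fill_unknown(rows, features, label="Unknown"):
    """Return copies of rows with missing feature values replaced by label."""
    filled = []
    for row in rows:
        copy = dict(row)
        for feature in features:
            if is_missing(copy.get(feature)):
                copy[feature] = label
        filled.append(copy)
    return filled


def drop_incomplete(rows, features):
    """Keep only rows with a value for every feature."""
    return [
        row for row in rows
        if not any(is_missing(row.get(feature)) for feature in features)
    ]


# Invented rows with gaps in the demographic data.
rows = [
    {"id": "p001", "Gender": "Female", "Age": "18-30"},
    {"id": "p002", "Gender": "", "Age": "31-50"},
    {"id": "p003", "Gender": "Male", "Age": None},
]
features = ["Gender", "Age"]

filled = fill_unknown(rows, features)
complete = drop_incomplete(rows, features)
print(filled)
print([row["id"] for row in complete])
```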

Data Validation¶

The library validates its inputs before selection. It:

Checks for unknown feature values
Verifies quota feasibility
Validates candidate pool size

Error Handling¶

Common Errors¶

InfeasibleQuotasError: Your quotas cannot be satisfied

# Too restrictive - minimums of 90 males and 90 females sum to 180 for a panel of 100
Gender,Male,90,100
Gender,Female,90,100

SelectionError: General selection failures

Insufficient candidates in a category
Conflicting constraints

ValueError: Invalid parameters

Negative quotas
Invalid algorithm names

Debugging Tips¶

Check quota feasibility: For each feature, sum of minimums ≤ panel size ≤ sum of maximums
Verify data consistency: Feature values match between files
Review messages: The algorithm provides detailed feedback
Test with relaxed quotas: Temporarily widen ranges to isolate issues

Best Practices¶

Quota Design¶

Start conservative: Use wider ranges initially, then narrow if needed
Consider interactions: Age and education might be correlated
Plan for edge cases: What if you have few candidates in a category?

Data Preparation¶

Standardize values: Consistent capitalization and spelling
Validate completeness: No missing demographic data
Test with samples: Verify your setup with small test runs

Address Checking¶

Clean addresses first: Standardize formatting before using address checking
Consider geography: In dense urban areas, include flat or unit numbers so that separate households in one building are not treated as a single address
Balance household diversity vs. other constraints: Address checking reduces your effective pool size

Next Steps¶

Now that you understand the core concepts:

Quick Start - Try your first selection
API Reference - Detailed function documentation
CLI Usage - Command line examples
Data Adapters - Working with different data sources
Advanced Usage - Complex scenarios and optimization