Core Concepts¶
Understanding these fundamental concepts is essential for effectively using sortition algorithms.
What is Sortition?¶
Sortition is the random selection of representatives from a larger population, designed to create panels that reflect the demographic composition of the whole group. Unlike simple random sampling (which could accidentally select all men or all young people), sortition uses stratified random selection to ensure demographic balance.
Historical Context¶
Sortition has ancient roots in Athenian democracy, where citizens were chosen by lot to serve in government. Modern applications include:
- Citizens' Assemblies: Groups that deliberate on policy issues
- Deliberative Polls: Representative samples for public opinion research
- Jury Selection: Court juries selected from voter rolls
- Participatory Budgeting: Community members deciding budget priorities
Key Components¶
Features and Feature Values¶
Features are demographic characteristics used for stratification:
- Gender, Age, Education, Income, Location, etc.
Feature Values are the specific categories within each feature:
- Gender: Male, Female, Non-binary
- Age: 18-30, 31-50, 51-65, 66+
- Location: Urban, Suburban, Rural
Quotas and Targets¶
Each feature value has minimum and maximum quotas that define the acceptable range for selection:
feature,value,min,max
Gender,Male,45,55
Gender,Female,45,55
Age,18-30,20,30
Age,31-50,30,40
Age,51+,25,35
This ensures your panel of 100 people includes 45-55 men, 45-55 women, 20-30 young adults, etc.
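As a rough illustration, a quota table in this shape can be read into a nested dictionary with Python's standard `csv` module. This is only a sketch: `parse_quotas` and `QUOTAS_CSV` are hypothetical names, and the library ships its own loaders that may behave differently.

```python
import csv
import io

# Hypothetical in-memory copy of the quota table shown above.
QUOTAS_CSV = """feature,value,min,max
Gender,Male,45,55
Gender,Female,45,55
Age,18-30,20,30
Age,31-50,30,40
Age,51+,25,35
"""


def parse_quotas(text):
    """Return {feature: {value: (min, max)}} from quota CSV text."""
    quotas = {}
    for row in csv.DictReader(io.StringIO(text)):
        quotas.setdefault(row["feature"], {})[row["value"]] = (
            int(row["min"]),
            int(row["max"]),
        )
    return quotas


quotas = parse_quotas(QUOTAS_CSV)
print(quotas["Gender"]["Male"])  # (45, 55)
```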
People Pool¶
The candidate pool contains all eligible individuals with their demographic data:
id,Name,Gender,Age,Location,Email
p001,Alice Smith,Female,18-30,Urban,alice@example.com
p002,Bob Jones,Male,31-50,Rural,bob@example.com
...
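Before setting quotas it can help to see how many candidates fall into each category. The sketch below tallies feature values from a pool in the format above using only the standard library; `count_feature_values` is a hypothetical helper, not part of the library.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row pool in the same column layout as the example above.
POOL_CSV = """id,Name,Gender,Age,Location,Email
p001,Alice Smith,Female,18-30,Urban,alice@example.com
p002,Bob Jones,Male,31-50,Rural,bob@example.com
"""


def count_feature_values(text, features):
    """Return {feature: Counter({value: number_of_candidates})}."""
    counts = {feature: Counter() for feature in features}
    for row in csv.DictReader(io.StringIO(text)):
        for feature in features:
            counts[feature][row[feature]] += 1
    return counts


counts = count_feature_values(POOL_CSV, ["Gender", "Age", "Location"])
print(counts["Gender"]["Female"])  # 1
```

Comparing these counts with the quota minimums gives an early warning when a category has too few candidates.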
Address Checking and Household Diversity¶
A critical feature for ensuring true representativeness is address checking: preventing multiple people from the same household from being selected.
Why Address Checking Matters¶
Without address checking, you might accidentally select:
- Multiple family members with similar views
- Several housemates from a shared address
- People who influence each other's opinions
This reduces the independence and diversity of your panel.
How It Works¶
Configure address checking in your settings:
settings = Settings(
    check_same_address=True,
    check_same_address_columns=["Address", "Postcode"],
)
When someone is selected:
- The algorithm identifies anyone else with matching values in the specified columns
- Those people are removed from the remaining pool
- This ensures geographic and household diversity
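The removal step described above can be sketched in plain Python. This is an illustration of the idea under the exact-match assumption, not the library's internal code; `remove_household` and the sample pool are hypothetical.

```python
def remove_household(pool, selected_id, address_columns):
    """Return the pool without the selected person and anyone sharing their address.

    Two people count as the same household only if every listed column matches exactly.
    """
    selected = pool[selected_id]
    key = tuple(selected[column] for column in address_columns)
    return {
        person_id: person
        for person_id, person in pool.items()
        if tuple(person[column] for column in address_columns) != key
    }


pool = {
    "p001": {"Address": "1 High St", "Postcode": "AB1 2CD"},
    "p002": {"Address": "1 High St", "Postcode": "AB1 2CD"},  # same household as p001
    "p003": {"Address": "1 High St", "Postcode": "ZZ9 9ZZ"},  # same street name, different postcode
}

remaining = remove_household(pool, "p001", ["Address", "Postcode"])
print(sorted(remaining))  # ['p003']
```

Because the key combines all listed columns, `p003` stays in the pool even though its street address matches.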
Address Column Strategies¶
Single column approach:
check_same_address_columns = ["Full_Address"]
Multi-column approach (more flexible):
check_same_address_columns = ["Street", "City", "Postcode"]
Exact vs. fuzzy matching: The current implementation requires exact string matches. For fuzzy address matching, you'd need to clean your data first.
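One possible cleaning step, run on your data before import, is to normalize case, punctuation, and whitespace so that trivially different spellings become identical strings. The function below is a hypothetical sketch; real address data usually needs more rules (abbreviations, unit numbers, and so on).

```python
import re


def normalise_address(value):
    """Lower-case, drop commas and full stops, and collapse runs of whitespace."""
    value = value.strip().lower()
    value = re.sub(r"[.,]", "", value)
    return re.sub(r"\s+", " ", value).strip()


print(normalise_address("  12  High St., "))  # 12 high st
print(normalise_address("12 high st"))        # 12 high st
```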
Selection Algorithms¶
Different algorithms optimize for different fairness criteria:
Maximin (Default)¶
- Goal: Maximize the lowest selection probability of any candidate in the pool
- Good for: Ensuring no group is severely underrepresented
- Trade-off: Only the lowest probability is optimized, so the remaining probabilities may still be spread unevenly
Nash¶
- Goal: Maximize the product of all selection probabilities
- Good for: Balanced representation across all groups
- Trade-off: Complex optimization, harder to interpret
Leximin¶
- Goal: Lexicographic maximin: maximize the lowest selection probability, then the next lowest, and so on (requires Gurobi license)
- Good for: Strict fairness guarantees
- Trade-off: Requires commercial solver
Legacy¶
- Goal: Backwards compatibility with older implementations
- Good for: Reproducing historical selections
- Trade-off: Less sophisticated than modern algorithms
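To make the difference between the objectives concrete, the sketch below scores two hypothetical vectors of per-person selection probabilities under each criterion. It only compares given vectors; the real algorithms search for a distribution over committees with an optimization solver, which is not shown here.

```python
import math


def maximin_score(probabilities):
    """Maximin looks only at the lowest selection probability."""
    return min(probabilities)


def nash_score(probabilities):
    """Nash welfare is the product of all selection probabilities."""
    return math.prod(probabilities)


def leximin_key(probabilities):
    """Leximin compares the lowest value first, then the next lowest, and so on."""
    return sorted(probabilities)


# Two hypothetical outcomes for three candidates; both sum to 1.2.
a = [0.2, 0.5, 0.5]
b = [0.2, 0.3, 0.7]

print(maximin_score(a) == maximin_score(b))  # True: maximin cannot tell them apart
print(nash_score(a) > nash_score(b))         # True: Nash prefers the more even spread
print(leximin_key(a) > leximin_key(b))       # True: leximin breaks the tie on the second-lowest value
```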
The Selection Process¶
1. Feasibility Checking¶
Before selection begins, the algorithm verifies that quotas are achievable:
features.check_desired(number_people_wanted=100)
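A simple necessary condition can also be checked by hand: for each feature, the minimums must not add up to more than the panel size and the maximums must not add up to less. The sketch below is a hypothetical stand-alone check, not the logic inside `check_desired`, and passing it does not guarantee that the pool contains enough candidates.

```python
def quota_range_problems(quotas, number_people_wanted):
    """List features whose min/max totals cannot bracket the panel size.

    `quotas` maps feature -> {value: (min, max)}.
    """
    problems = []
    for feature, values in quotas.items():
        total_min = sum(low for low, _ in values.values())
        total_max = sum(high for _, high in values.values())
        if total_min > number_people_wanted:
            problems.append(f"{feature}: minimums sum to {total_min}")
        if total_max < number_people_wanted:
            problems.append(f"{feature}: maximums sum to {total_max}")
    return problems


ok = {"Gender": {"Male": (45, 55), "Female": (45, 55)}}
bad = {"Gender": {"Male": (90, 100), "Female": (90, 100)}}

print(quota_range_problems(ok, 100))   # []
print(quota_range_problems(bad, 100))  # ['Gender: minimums sum to 180']
```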
2. Algorithm Execution¶
The chosen algorithm finds an optimal probability distribution over possible committees.
3. Lottery Rounding¶
The probability distribution is converted to concrete selections using randomized rounding.
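In its simplest form, turning a distribution into a concrete panel means drawing one committee with probability equal to its weight. The sketch below shows that idea with the standard `random` module on a hypothetical three-committee distribution; the library's rounding procedure may be more elaborate.

```python
import random


def selection_probabilities(distribution):
    """Per-person probability: the total weight of the committees that include them."""
    probabilities = {}
    for committee, weight in distribution.items():
        for person in committee:
            probabilities[person] = probabilities.get(person, 0.0) + weight
    return probabilities


def draw_committee(distribution, rng):
    """Pick one committee at random, weighted by its probability."""
    committees = list(distribution)
    weights = [distribution[committee] for committee in committees]
    return rng.choices(committees, weights=weights, k=1)[0]


# Hypothetical distribution over committees of size two.
distribution = {
    frozenset({"p001", "p002"}): 0.5,
    frozenset({"p001", "p003"}): 0.25,
    frozenset({"p002", "p003"}): 0.25,
}

probabilities = selection_probabilities(distribution)
print(probabilities["p003"])  # 0.5

committee = draw_committee(distribution, random.Random(42))
print(sorted(committee))
```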
4. Validation¶
Selected committees are checked against quotas to ensure targets were met.
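A check of this kind can be sketched as a count per feature value compared against its range. The helper below is hypothetical and uses a deliberately tiny panel; it is not the library's validation code.

```python
from collections import Counter


def quota_violations(committee, quotas):
    """Return (feature, value, count) for every value outside its min/max range.

    `committee` is a list of dicts; `quotas` maps feature -> {value: (min, max)}.
    """
    violations = []
    for feature, values in quotas.items():
        counts = Counter(person[feature] for person in committee)
        for value, (low, high) in values.items():
            if not low <= counts[value] <= high:
                violations.append((feature, value, counts[value]))
    return violations


quotas = {"Gender": {"Male": (1, 2), "Female": (1, 2)}}
committee = [{"Gender": "Male"}, {"Gender": "Male"}, {"Gender": "Male"}]

print(quota_violations(committee, quotas))
# [('Gender', 'Male', 3), ('Gender', 'Female', 0)]
```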
Randomness and Reproducibility¶
Random Seeds¶
For reproducible results (e.g., for auditing), set a random seed:
settings = Settings(random_number_seed=42)
Security Considerations¶
For production use, avoid fixed seeds. The library uses Python's secrets module when no seed is specified.
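The difference between the two modes can be illustrated with the standard library alone. This sketch does not use the library's `Settings`; it only shows that a seeded generator repeats its draws, while `secrets.SystemRandom` draws from the operating system's entropy source and cannot be replayed.

```python
import random
import secrets

# Hypothetical pool of 20 candidate ids.
pool = [f"p{i:03d}" for i in range(1, 21)]

# Same seed, same draw: useful when a selection has to be re-run for an audit.
first = random.Random(42).sample(pool, 5)
second = random.Random(42).sample(pool, 5)
print(first == second)  # True

# No seed: the draw comes from OS entropy and is not reproducible.
unseeded = secrets.SystemRandom().sample(pool, 5)
print(len(unseeded))  # 5
```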
Data Quality Considerations¶
Feature Consistency¶
Ensure feature values are consistent between your quotas file and candidate data:
# demographics.csv
Gender,Male,45,55
Gender,Female,45,55
# candidates.csv - values must match exactly
person1,Male,... # ✅ Matches
person2,male,... # ❌ Case mismatch
person3,M,... # ❌ Abbreviation mismatch
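A quick pre-import check for this kind of mismatch is a set difference between the values in the candidate data and the values named in the quotas. The helper below is a hypothetical sketch using exact, case-sensitive comparison, which mirrors the matching rule described above.

```python
def unknown_values(quota_values, candidate_values):
    """Return candidate values that do not appear in the quotas (exact match)."""
    return sorted(set(candidate_values) - set(quota_values))


quota_genders = {"Male", "Female"}
candidate_genders = ["Male", "male", "M", "Female"]

print(unknown_values(quota_genders, candidate_genders))  # ['M', 'male']
```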
Missing Data¶
The library requires complete demographic data. Handle missing values before import:
- Impute missing values
- Create "Unknown" categories
- Exclude incomplete records
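Two of these options can be sketched as small pre-processing helpers on rows loaded from your candidate file. Both functions are hypothetical; if you add an "Unknown" category, the quotas file presumably needs a matching row for it.

```python
def is_blank(value):
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is None or not str(value).strip()


def fill_unknown(rows, features, placeholder="Unknown"):
    """Return copies of the rows with missing feature values replaced by a placeholder."""
    filled = []
    for row in rows:
        copy = dict(row)
        for feature in features:
            if is_blank(copy.get(feature)):
                copy[feature] = placeholder
        filled.append(copy)
    return filled


def drop_incomplete(rows, features):
    """Return only the rows that have a value for every feature."""
    return [row for row in rows if not any(is_blank(row.get(f)) for f in features)]


rows = [
    {"id": "p001", "Gender": "Female", "Age": "18-30"},
    {"id": "p002", "Gender": "", "Age": "31-50"},
]

print(fill_unknown(rows, ["Gender", "Age"])[1]["Gender"])         # Unknown
print([row["id"] for row in drop_incomplete(rows, ["Gender", "Age"])])  # ['p001']
```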
Data Validation¶
The library performs extensive validation:
- Checks for unknown feature values
- Verifies quota feasibility
- Validates candidate pool size
Error Handling¶
Common Errors¶
InfeasibleQuotasError: Your quotas cannot be satisfied
# Too restrictive - requires at least 90 males and at least 90 females on a panel of 100
Gender,Male,90,100
Gender,Female,90,100
SelectionError: General selection failures
- Insufficient candidates in a category
- Conflicting constraints
ValueError: Invalid parameters
- Negative quotas
- Invalid algorithm names
Debugging Tips¶
- Check quota feasibility: For each feature, sum of minimums ≤ panel size ≤ sum of maximums
- Verify data consistency: Feature values match between files
- Review messages: The algorithm provides detailed feedback
- Test with relaxed quotas: Temporarily widen ranges to isolate issues
Best Practices¶
Quota Design¶
- Start conservative: Use wider ranges initially, then narrow if needed
- Consider interactions: Age and education might be correlated
- Plan for edge cases: What if you have few candidates in a category?
Data Preparation¶
- Standardize values: Consistent capitalization and spelling
- Validate completeness: No missing demographic data
- Test with samples: Verify your setup with small test runs
Address Checking¶
- Clean addresses first: Standardize formatting before using address checking
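- Consider geography: Urban areas might need tighter address matching (for example, including a flat or unit number) so that separate households in one building are not treated as the same address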
- Balance household diversity vs. other constraints: Address checking reduces your effective pool size
Next Steps¶
Now that you understand the core concepts:
- Quick Start - Try your first selection
- API Reference - Detailed function documentation
- CLI Usage - Command line examples
- Data Adapters - Working with different data sources
- Advanced Usage - Complex scenarios and optimization