Core Concepts¶
Understanding these fundamental concepts is essential for effectively using sortition algorithms.
What is Sortition?¶
Sortition is the random selection of representatives from a larger population, designed to create panels that reflect the demographic composition of the whole group. Unlike simple random sampling (which could accidentally select all men or all young people), sortition uses stratified random selection to ensure demographic balance.
Historical Context¶
Sortition has ancient roots in Athenian democracy, where citizens were chosen by lot to serve in government. Modern applications include:
- Citizens' Assemblies: Groups that deliberate on policy issues
- Deliberative Polls: Representative samples for public opinion research
- Jury Selection: Court juries selected from voter rolls
- Participatory Budgeting: Community members deciding budget priorities
Key Components¶
Features and Feature Values¶
Features are demographic characteristics used for stratification:
- Gender, Age, Education, Income, Location, etc.
Feature Values are the specific categories within each feature:
- Gender: Male, Female, Non-binary
- Age: 18-30, 31-50, 51-65, 66+
- Location: Urban, Suburban, Rural
Note that sometimes Features are called "categories" and Feature Values are called "category values".
Quotas and Targets¶
Each feature value has minimum and maximum quotas that define the acceptable range for selection:
feature,value,min,max
Gender,Male,45,55
Gender,Female,45,55
Age,18-30,20,30
Age,31-50,30,40
Age,51+,25,35
This ensures your panel of 100 people includes 45-55 men, 45-55 women, 20-30 young adults, etc.
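As a minimal sketch of how such a table can be read and sanity-checked, the snippet below uses only Python's standard `csv` module. It is not the library's own loader: `load_quotas` and `panel_size_fits` are hypothetical helper names chosen for illustration.

```python
import csv
import io

QUOTAS_CSV = """feature,value,min,max
Gender,Male,45,55
Gender,Female,45,55
Age,18-30,20,30
Age,31-50,30,40
Age,51+,25,35
"""

def load_quotas(text):
    """Return {feature: {value: (min, max)}} from quota CSV text."""
    quotas = {}
    for row in csv.DictReader(io.StringIO(text)):
        quotas.setdefault(row["feature"], {})[row["value"]] = (
            int(row["min"]),
            int(row["max"]),
        )
    return quotas

def panel_size_fits(quotas, panel_size):
    """For each feature, check sum(min) <= panel_size <= sum(max)."""
    result = {}
    for feature, values in quotas.items():
        low = sum(lo for lo, _ in values.values())
        high = sum(hi for _, hi in values.values())
        result[feature] = low <= panel_size <= high
    return result

quotas = load_quotas(QUOTAS_CSV)
fits = panel_size_fits(quotas, 100)  # one True/False entry per feature
```

Every selected person has exactly one value for each feature, so the range check is applied to each feature separately rather than to the table as a whole.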
People Pool¶
The candidate pool contains all eligible individuals with their demographic data:
id,Name,Gender,Age,Location,Email
p001,Alice Smith,Female,18-30,Urban,alice@example.com
p002,Bob Jones,Male,31-50,Rural,bob@example.com
...
Address Checking and Household Diversity¶
A critical feature for ensuring true representativeness is address checking: preventing multiple people from the same household from being selected.
Why Address Checking Matters¶
Without address checking, you might accidentally select:
- Multiple family members with similar views
- Several housemates from a shared address
- People who influence each other's opinions
This reduces the independence and diversity of your panel.
How It Works¶
Configure address checking in your settings:
When someone is selected:
- The algorithm identifies anyone else with matching values in the specified columns
- Those people are removed from the remaining pool
- This ensures geographic and household diversity
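The removal step above can be illustrated in plain Python. This is a sketch of the idea only, not the library's code: `remove_household` is a hypothetical helper, and the `Address` and `Postcode` column names are assumptions made for the example.

```python
def remove_household(pool, selected_id, address_columns):
    """Return the pool without the selected person or anyone sharing their address.

    Two people share an address when their values match exactly in every
    column listed in address_columns.
    """
    selected = pool[selected_id]
    key = tuple(selected[col] for col in address_columns)
    return {
        pid: person
        for pid, person in pool.items()
        if tuple(person[col] for col in address_columns) != key
    }

pool = {
    "p001": {"Name": "Alice", "Address": "1 High St", "Postcode": "AB1 2CD"},
    "p002": {"Name": "Bob", "Address": "1 High St", "Postcode": "AB1 2CD"},
    "p003": {"Name": "Cara", "Address": "1 High St", "Postcode": "ZZ9 9ZZ"},
}
# Selecting p001 also removes p002, who matches on both columns.
remaining = remove_household(pool, "p001", ["Address", "Postcode"])
```

Cara stays in the pool because only one of the two columns matches, which is why checking several columns together can be more precise than checking a street address alone.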
Address Column Strategies¶
Single column approach:
Multi-column approach (more flexible):
Exact vs. fuzzy matching: The current implementation requires exact string matches. For fuzzy address matching, you'd need to clean your data first.
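Because matching is exact, a small clean-up pass before import can catch the most common near-duplicates. The function below is one possible normalisation, written as an assumption about what "cleaning" might involve; `normalise_address` is a hypothetical name and real address data usually needs more care (abbreviations, flat numbers, and so on).

```python
import re

def normalise_address(raw):
    """Lowercase, drop punctuation and collapse runs of whitespace."""
    text = raw.lower()
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

a = normalise_address("12  High St.")
b = normalise_address("12 high st")
same_household = a == b
```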
Selection Algorithms¶
Different algorithms optimize for different fairness criteria:
Maximin (Default)¶
Objective: Maximize the minimum selection probability across all groups.
When to use:
- Default choice for most applications
- Ensures no group is severely underrepresented
- Good for citizen assemblies and deliberative panels
Trade-offs:
- May not optimize overall fairness
- Can be conservative in selection choices
Example scenario: A panel where ensuring minimum representation for small minorities is crucial.
Nash¶
Objective: Maximize the product of all selection probabilities.
When to use:
- Large, diverse candidate pools
- When you want balanced representation across all groups
- Academic research requiring mathematical optimality
Trade-offs:
- More complex optimization
- May be harder to explain to stakeholders
Example scenario: Research study requiring theoretically optimal fairness across all demographic groups.
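To make the Maximin and Nash objectives concrete, the toy example below computes both quantities for a hand-written probability distribution over committees. It only evaluates the objectives and does not perform the optimisation itself; the committees and probabilities are invented for illustration.

```python
import math

def marginals(distribution):
    """Selection probability of each person, given {committee: probability}.

    People who appear in no committee are not listed (their probability is 0).
    """
    probs = {}
    for committee, p in distribution.items():
        for person in committee:
            probs[person] = probs.get(person, 0.0) + p
    return probs

# Toy distribution over three possible two-person committees.
distribution = {
    frozenset({"a", "b"}): 0.5,
    frozenset({"a", "c"}): 0.25,
    frozenset({"b", "c"}): 0.25,
}
probs = marginals(distribution)
maximin_value = min(probs.values())     # the quantity Maximin tries to raise
nash_value = math.prod(probs.values())  # the quantity Nash tries to raise
```

Here person `c` has the lowest chance of selection, so a Maximin-style optimiser would look for a distribution that raises `c`'s probability, while a Nash-style optimiser would weigh that gain against the effect on the product of all three probabilities.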
Leximin¶
Objective: Lexicographic maximin optimization (requires Gurobi license).
When to use:
- Academic research requiring strongest fairness guarantees
- When you have access to Gurobi (commercial/academic license)
- High-stakes selections where maximum fairness is essential
Trade-offs:
- Requires commercial solver (Gurobi)
- More computationally intensive
- May be overkill for routine selections
Example scenario: Government-sponsored citizen assembly where mathematical proof of fairness is required.
Diversimax¶
Objective: Maximize the diversity of the panel, preferring as many unique profiles as possible.
When to use:
- Ensures no group is severely underrepresented
- Tries to make all groups AND all intersections of groups (e.g., Black, Woman, High income) as well represented as possible
- Enhances deliberative quality by maximizing the number of unique perspectives
Trade-offs:
- Does not focus on individual fairness, only on panel composition
- Does not generate multiple committees with different probabilities, only one optimal committee
Example scenario: An assembly focused on social or community issues where diverse perspectives are critical.
Legacy¶
Objective: Backwards compatibility with earlier implementations.
When to use:
- Reproducing historical selections
- Comparison studies
- Specific compatibility requirements
Trade-offs:
- Less sophisticated than modern algorithms
- May not provide optimal fairness
Details:
- It makes multiple attempts at the core algorithm.
- In each attempt, it goes through the target values in turn, starting with the one that will be hardest to meet.
- For each target value, it randomly chooses a sample of people who, combined with the people already selected, meet the target for that value, and adds them to the selected group.
- At the end, the attempt has either found a sample that meets all the targets or failed, in which case another attempt is made from the start.
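The steps above can be sketched as a retry loop around a greedy pass. This is a loose illustration of the idea and not the library's legacy implementation: the function names, the "hardest first" ordering rule (minimum quota relative to available people) and the final top-up step are all assumptions made for the example.

```python
import random

def count(pool, ids, feature, value):
    """Number of people among ids who have the given feature value."""
    return sum(1 for i in ids if pool[i][feature] == value)

def attempt(pool, quotas, panel_size, rng):
    """One greedy attempt; returns a set of ids, or None if it fails."""
    selected = set()
    # Assumed ordering: highest minimum relative to available people first.
    targets = sorted(
        quotas,
        key=lambda t: quotas[t][0] / max(1, count(pool, pool, *t)),
        reverse=True,
    )
    for feature, value in targets:
        needed = quotas[(feature, value)][0] - count(pool, selected, feature, value)
        candidates = [i for i in pool
                      if i not in selected and pool[i][feature] == value]
        if needed > len(candidates):
            return None
        if needed > 0:
            selected.update(rng.sample(candidates, needed))
    # Top up to the panel size at random, then check every quota.
    spare = [i for i in pool if i not in selected]
    extra = panel_size - len(selected)
    if extra < 0 or extra > len(spare):
        return None
    selected.update(rng.sample(spare, extra))
    ok = all(lo <= count(pool, selected, f, v) <= hi
             for (f, v), (lo, hi) in quotas.items())
    return selected if ok else None

def legacy_style_select(pool, quotas, panel_size, seed=None, max_attempts=100):
    """Repeat attempts from scratch until one meets all targets."""
    rng = random.Random(seed)
    for _ in range(max_attempts):
        panel = attempt(pool, quotas, panel_size, rng)
        if panel is not None:
            return panel
    raise RuntimeError("no panel found within max_attempts")

# quotas here is a flat {(feature, value): (min, max)} mapping.
pool = {f"p{i}": {"Gender": "Male" if i % 2 else "Female"} for i in range(6)}
quotas = {("Gender", "Male"): (1, 2), ("Gender", "Female"): (1, 2)}
panel = legacy_style_select(pool, quotas, panel_size=3, seed=1)
```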
Algorithms not in this library¶
(Though they may be added at some point in the future.)
Goldilocks¶
From the same family as Maximin and Leximin. It tries to ensure no one in the pool gets a really high or low chance of being selected.
The code is not yet open, but you can read a bit more on the lotterylab algorithms page.
Max Entropy¶
This just tries to select from every possible sample of the right size in the data, finding one at random that meets the targets. Because the number of samples is so large, there are some mathematical tricks used to create samples in a way that increases the chance of finding one that will match the targets, while keeping the randomness.
The code is not yet open, but you can read a bit more on the lotterylab algorithms page.
Swedish Selection¶
This takes random samples, calculates the "distance" of each from the ideal targets, and keeps a sample whenever its distance is lower than the best found so far. After a while, the sample with the lowest distance is used. It is computationally intense: the README says "after running it for a few hours ...". For more complex targets it would take a long time. But it does have the advantage of being easy to explain to people, and the code is fairly easy to read.
A note on "distance" - there is a weight applied to each target category, and the distance for each target category is multiplied by that weight. This means the most important target categories will be more closely matched.
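The effect of the weights can be shown with a small example. The exact distance formula is not given here, so the sketch assumes the per-category distance is the absolute difference between the count in the sample and the target; `weighted_distance` and the numbers are hypothetical.

```python
def weighted_distance(counts, targets, weights):
    """Sum over target categories of weight * |count - target| (assumed form)."""
    return sum(
        weights[cat] * abs(counts.get(cat, 0) - targets[cat])
        for cat in targets
    )

targets = {"Female": 5, "Under 30": 3}
weights = {"Female": 2.0, "Under 30": 1.0}

sample_a = {"Female": 4, "Under 30": 3}  # misses the higher-weight target by one
sample_b = {"Female": 5, "Under 30": 2}  # misses the lower-weight target by one

distance_a = weighted_distance(sample_a, targets, weights)
distance_b = weighted_distance(sample_b, targets, weights)
```

Both samples miss one target by one person, but `sample_b` scores a lower distance because the category it misses carries less weight.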
The code can be seen on GitHub and is developed by Digidem Lab.
Automatic Selection¶
This is inspired by Swedish Selection and uses the same distance measure. But instead of drawing a fresh random sample each time, it swaps people whose values are over target out of the sample and replaces them with people from the pool whose values are below target. By doing this it can often find a sample that hits the targets exactly.
The code can be seen on GitHub and is developed by Analyse og Tal.
Research Background¶
The algorithms are described in this paper (open access).
Diversimax is described here.
Other relevant papers:
- Procaccia et al. Is Sortition Both Representative and Fair?
- Tiago C. Peixoto, Reflections on the representativeness of citizens’ assemblies and similar innovations
- Tiago C. Peixoto, How representative is it really? A correspondence on sortition
The Selection Process¶
1. Feasibility Checking¶
Before selection begins, the algorithm verifies that quotas are achievable:
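The library's own check is not reproduced here. As a rough illustration, the sketch below tests two necessary conditions in plain Python: the panel size must fit each feature's quota range, and the pool must contain enough people to meet every minimum. Passing both does not prove feasibility, since quotas on different features can still conflict. `feasibility_problems` is a hypothetical helper name.

```python
from collections import Counter

def feasibility_problems(people, quotas, panel_size):
    """Return a list of problems; an empty list means no obvious conflict.

    people: list of dicts, one per candidate.
    quotas: {feature: {value: (min, max)}}.
    """
    problems = []
    for feature, values in quotas.items():
        available = Counter(person[feature] for person in people)
        low = sum(lo for lo, _ in values.values())
        high = sum(hi for _, hi in values.values())
        if not low <= panel_size <= high:
            problems.append(
                f"{feature}: panel size {panel_size} outside {low}-{high}"
            )
        for value, (lo, _) in values.items():
            if available[value] < lo:
                problems.append(
                    f"{feature}={value}: need {lo}, pool has {available[value]}"
                )
    return problems

people = [{"Gender": "Male"}, {"Gender": "Female"}, {"Gender": "Female"}]
quotas = {"Gender": {"Male": (2, 2), "Female": (1, 1)}}
problems = feasibility_problems(people, quotas, panel_size=3)
```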
2. Algorithm Execution¶
The chosen algorithm finds an optimal probability distribution over possible committees.
3. Lottery Rounding¶
The probability distribution is converted to concrete selections using randomized rounding. (This step is skipped for Diversimax, which produces a single optimal committee directly.)
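In its simplest form, turning a distribution into a concrete panel is a single weighted random draw. The sketch below shows only that simplest case, using the standard library's `random.choices`; the library's rounding step may be more elaborate, and `draw_committee` is a hypothetical name.

```python
import random

def draw_committee(distribution, seed=None):
    """Pick one committee with probability proportional to its weight."""
    rng = random.Random(seed)
    committees = list(distribution)
    weights = [distribution[c] for c in committees]
    return rng.choices(committees, weights=weights, k=1)[0]

distribution = {
    ("p001", "p002"): 0.5,
    ("p001", "p003"): 0.25,
    ("p002", "p003"): 0.25,
}
committee = draw_committee(distribution, seed=7)
```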
4. Validation¶
Selected committees are checked against quotas to ensure targets were met.
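A check of this kind can be written in a few lines. The version below is an illustrative sketch, not the library's validator; `quota_violations` is a hypothetical helper that reports every feature value whose count falls outside its range.

```python
from collections import Counter

def quota_violations(committee, quotas):
    """Return {(feature, value): (count, min, max)} for every missed quota.

    committee: list of dicts, one per selected person.
    quotas: {feature: {value: (min, max)}}.
    """
    violations = {}
    for feature, values in quotas.items():
        counts = Counter(person[feature] for person in committee)
        for value, (lo, hi) in values.items():
            if not lo <= counts[value] <= hi:
                violations[(feature, value)] = (counts[value], lo, hi)
    return violations

committee = [{"Gender": "Male"}, {"Gender": "Male"}, {"Gender": "Female"}]
quotas = {"Gender": {"Male": (1, 1), "Female": (1, 2)}}
violations = quota_violations(committee, quotas)
```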
Randomness and Reproducibility¶
Random Seeds¶
For reproducible results (e.g., for auditing), set a random seed:
Security Considerations¶
For production use, avoid fixed seeds. The library uses Python's secrets module when no seed is specified.
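The two modes can be illustrated with the standard library alone. This sketch shows the general pattern (a seeded generator for audits, an operating-system entropy source otherwise) and is not a description of the library's internals; `make_rng` is a hypothetical helper.

```python
import random
import secrets

def make_rng(seed=None):
    """Seeded generator for reproducible audits, OS entropy otherwise."""
    if seed is not None:
        return random.Random(seed)
    return secrets.SystemRandom()

pool = ["p001", "p002", "p003", "p004", "p005"]

# Same seed, same draw: useful when a selection must be re-run for an audit.
first = make_rng(seed=42).sample(pool, 2)
second = make_rng(seed=42).sample(pool, 2)

# No seed: the draw cannot be reproduced or predicted from a known seed.
live = make_rng().sample(pool, 2)
```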
Data Quality Considerations¶
Feature Consistency¶
Ensure feature values are consistent between your quotas file and candidate data:
# demographics.csv
Gender,Male,45,55
Gender,Female,45,55
# candidates.csv - values must match exactly
person1,Male,... # ✅ Matches
person2,male,... # ❌ Case mismatch
person3,M,... # ❌ Abbreviation mismatch
Missing Data¶
The library requires complete demographic data. Handle missing values before import:
- Impute missing values
- Create "Unknown" categories
- Exclude incomplete records
Data Validation¶
The library performs extensive validation:
- Checks for unknown feature values
- Verifies quota feasibility
- Validates candidate pool size
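It can also help to run a quick check of your own before import. The sketch below finds feature values that appear in the candidate data but have no quota row, which catches the case and abbreviation mismatches shown earlier. It is plain Python rather than the library's validation code, and `unknown_values` is a hypothetical name.

```python
def unknown_values(people, quotas):
    """Return {feature: set of pool values that have no quota row}.

    people: list of dicts, one per candidate.
    quotas: {feature: {value: (min, max)}}.
    """
    unknown = {}
    for feature, values in quotas.items():
        extra = {person[feature] for person in people} - set(values)
        if extra:
            unknown[feature] = extra
    return unknown

quotas = {"Gender": {"Male": (45, 55), "Female": (45, 55)}}
people = [
    {"id": "person1", "Gender": "Male"},
    {"id": "person2", "Gender": "male"},  # case mismatch
    {"id": "person3", "Gender": "M"},     # abbreviation mismatch
]
mismatches = unknown_values(people, quotas)
```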
Error Handling¶
Common Errors¶
InfeasibleQuotasError: Your quotas cannot be satisfied
SelectionError: General selection failures
- Insufficient candidates in a category
- Conflicting constraints
ValueError: Invalid parameters
- Negative quotas
- Invalid algorithm names
Debugging Tips¶
- Check quota feasibility: For each feature, sum of minimums ≤ panel size ≤ sum of maximums
- Verify data consistency: Feature values match between files
- Review messages: The algorithm provides detailed feedback
- Test with relaxed quotas: Temporarily widen ranges to isolate issues
Best Practices¶
Quota Design¶
- Start conservative: Use wider ranges initially, then narrow if needed
- Consider interactions: Age and education might be correlated
- Plan for edge cases: What if you have few candidates in a category?
Data Preparation¶
- Standardize values: Consistent capitalization and spelling
- Validate completeness: No missing demographic data
- Test with samples: Verify your setup with small test runs
Address Checking¶
- Clean addresses first: Standardize formatting before using address checking
- Consider geography: Urban areas might need tighter address matching
- Balance household diversity vs. other constraints: Address checking reduces your effective pool size
Next Steps¶
Now that you understand the core concepts:
- Quick Start - Try your first selection
- API Reference - Detailed function documentation
- CLI Usage - Command line examples
- Data Adapters - Working with different data sources
- Advanced Usage - Complex scenarios and optimization