Virtual DI Summit 2020 Summary


How to implement fair credit scoring?

Summary

Fair lending is lending that is free of any prejudice or favoritism toward an individual or a group based on their inherent or acquired characteristics.

Fair lending requires fair credit scores. This is challenging because credit scores can be unintentionally unfair (unintentional discrimination) due to data bias and model bias.

The first step is awareness, which means consciously defining fairness and detecting fairness issues.

Fairness is contextual.

Fairness needs to be a design constraint.

First steps:

Define group fairness matching your context

Conceptually define how a fair lending system should behave before selecting any group-fairness measure. This includes choosing which ethical fairness definition you want to use (cf. Prof. dr. Lode Lauwaert) and specifying the protected variables and protected groups.

Detect group unfairness issues (cf. Nathalie Smuha: duty of care)

Select a mathematical group-fairness measure that matches your context and your conceptual definition of fairness (cf. Prof. dr. Lode Lauwaert).

Martin identified disparate impact, equal opportunity and conditional demographic parity as three useful group-fairness measures for fair credit lending. A poll amongst participants showed that 50% prefer equal opportunity as their group-fairness definition, 33% prefer conditional demographic parity, and 17% prefer disparate impact.

This shows that the definition of group fairness is not only contextual but also subjective. Hence the importance of consciously reflecting on your group-fairness definition and of arguing why you choose a particular one. Martin also showed that, depending on the chosen group-fairness measure, different conclusions can be drawn as to whether the AI solution is fair toward particular protected groups.
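To make these measures concrete, the sketch below computes the three group-fairness measures on toy data, assuming binary decisions (1 = loan approved), a binary protected attribute (1 = unprivileged group) and an illustrative stratification variable; all variable names and the toy data are assumptions for illustration, not part of the talk.

```python
import numpy as np

def disparate_impact(y_pred, protected):
    """Ratio of approval rates: P(approved | unprivileged) / P(approved | privileged)."""
    return y_pred[protected == 1].mean() / y_pred[protected == 0].mean()

def equal_opportunity_ratio(y_true, y_pred, protected):
    """Ratio of true positive rates between the unprivileged and privileged groups."""
    tpr_unpriv = y_pred[(protected == 1) & (y_true == 1)].mean()
    tpr_priv = y_pred[(protected == 0) & (y_true == 1)].mean()
    return tpr_unpriv / tpr_priv

def conditional_demographic_parity(y_pred, protected, stratum):
    """Per-stratum difference in approval rates, e.g. stratified by income band."""
    return {s: (y_pred[(stratum == s) & (protected == 1)].mean()
                - y_pred[(stratum == s) & (protected == 0)].mean())
            for s in np.unique(stratum)}

# Illustrative toy data: 1 = loan approved, protected = 1 marks the unprivileged group.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_pred = rng.integers(0, 2, 1000)
protected = rng.integers(0, 2, 1000)
stratum = rng.integers(0, 3, 1000)  # e.g. three income bands

print("Disparate impact:        ", disparate_impact(y_pred, protected))
print("Equal opportunity ratio: ", equal_opportunity_ratio(y_true, y_pred, protected))
print("Conditional dem. parity: ", conditional_demographic_parity(y_pred, protected, stratum))
```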

If group unfairness is detected, the next step in the AI pipeline is to add bias mitigation methods.

Poll on relevant group fairness definitions for lending


AI, ethics and law: How to implement fair credit scoring?

Q & A

Suppose you want to make a data model for internal mobility (HR) to counter the bias in favor of white males in senior positions, but you only have your internal data, where seniors are mostly white males. How can you combat this without having to add external data to the model? And is training a bot in this way making it more or less fair?

To formalize the context of the question and the answer, one particularly important assumption is that the available internal and historical data contains a strong representation bias towards white males. Although a representation bias in the available dataset does not necessarily lead to a biased model, we assume here that it does: a model trained on the available dataset without additional tweaking gives an unfair advantage to white males. A second assumption is that it is impossible to collect additional data to mitigate the representation bias in the dataset.

 

Starting from this situation, one has three main options:

  • Artificially tweak the available data.
  • Apply in-processing mitigation methods.
  • Use post-processing mitigation methods.

 

For the first option, pre-processing methods are considered. As historical bias or discrimination is most likely the cause of the data bias, possible solutions are relabeling, data transformation, or generating synthetic data to rebalance the dataset.
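As a minimal sketch of the rebalancing idea, the code below oversamples the under-represented groups until all groups are equally represented; the column names and toy data are illustrative assumptions, and synthetic-data generation or relabeling would be the more refined alternatives mentioned above.

```python
import numpy as np
import pandas as pd

def oversample_minority_groups(df, group_col="gender"):
    """Duplicate random rows of the under-represented groups until every group
    has as many rows as the largest group. A crude rebalancing baseline."""
    counts = df[group_col].value_counts()
    target = counts.max()
    parts = []
    for group, n in counts.items():
        rows = df[df[group_col] == group]
        extra = rows.sample(target - n, replace=True, random_state=0) if n < target else rows.iloc[0:0]
        parts.append(pd.concat([rows, extra]))
    return pd.concat(parts).sample(frac=1, random_state=0).reset_index(drop=True)

# Illustrative toy dataset with a strong representation bias.
df = pd.DataFrame({
    "gender": ["male"] * 90 + ["female"] * 10,
    "promoted": np.random.default_rng(0).integers(0, 2, 100),
})
balanced = oversample_minority_groups(df)
print(balanced["gender"].value_counts())
```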

 

Knowledge of the data is very important and has to be explored with the HR domain experts. This knowledge can be leveraged in the form of business rules, applying techniques that can encode them, such as logical systems, including probabilistic programming and fuzzy logic programming.

 

For instance, even if one creates a female counterpart for every male record, one still runs the risk of having the model reason on proxies of gender. Suppose that in the rebalanced dataset males are still more likely to play football. If activities are listed as part of the information about the candidates, the model could use those activities to discriminate between candidates, even though gender itself should not matter.
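One way to check for such proxies, sketched below under illustrative assumptions (the feature names and toy data are invented), is to test how well the protected attribute can be predicted from the remaining features: a score well above chance signals proxy risk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data: 'plays_football' is deliberately correlated with gender to act as a proxy.
rng = np.random.default_rng(0)
gender = rng.integers(0, 2, 1000)                        # protected attribute (1 = male)
plays_football = (rng.random(1000) < 0.2 + 0.6 * gender).astype(int)
years_experience = rng.integers(0, 20, 1000)

X = np.column_stack([plays_football, years_experience])  # non-protected features only
proxy_model = LogisticRegression()

# If this score is far above 0.5, the non-protected features leak the protected attribute.
score = cross_val_score(proxy_model, X, gender, cv=5, scoring="roc_auc").mean()
print(f"Protected attribute predictable from other features (AUC): {score:.2f}")
```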

 

A second possibility is to apply an in-processing bias mitigation method that enforces constraints during the learning process of the algorithm: adversarial debiasing, classification with fairness constraints or a prejudice remover regularizer. These act during the training of the model to mitigate bias.
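As a minimal sketch of the in-processing idea (a simple demographic-parity penalty added to a logistic-regression loss, not an exact implementation of any of the named methods), assuming toy data and illustrative parameter choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_fair_logreg(X, y, protected, lam=2.0, lr=0.1, epochs=500):
    """Logistic regression trained with an extra penalty on the squared gap in
    mean predicted probability between the two protected groups (a crude
    demographic-parity regularizer, in the spirit of in-processing methods)."""
    w = np.zeros(X.shape[1])
    g1, g0 = protected == 1, protected == 0
    for _ in range(epochs):
        p = sigmoid(X @ w)
        grad_bce = X.T @ (p - y) / len(y)                 # gradient of the prediction loss
        gap = p[g1].mean() - p[g0].mean()                 # group gap in mean score
        dp_dw = (p * (1 - p))[:, None] * X
        grad_gap = dp_dw[g1].mean(axis=0) - dp_dw[g0].mean(axis=0)
        w -= lr * (grad_bce + lam * 2 * gap * grad_gap)   # penalized gradient step
    return w

# Illustrative biased toy data: the protected attribute leaks into a feature.
rng = np.random.default_rng(0)
n = 2000
protected = rng.integers(0, 2, n)
X = np.column_stack([rng.normal(size=n), protected + rng.normal(scale=0.5, size=n), np.ones(n)])
y = (rng.random(n) < sigmoid(X[:, 0] + 0.8 * protected)).astype(int)

w = train_fair_logreg(X, y, protected)
p = sigmoid(X @ w)
print("Score gap between groups after training:", p[protected == 1].mean() - p[protected == 0].mean())
```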

 

A third option would be to use post-processing methods. Supposing that the bias of the model cannot be sufficiently mitigated, one could use different approval thresholds for candidates depending on whether they belong to the advantaged group (here, white males) or not. This way, the model's predictions are adjusted to compensate for the model's unfairness.
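A minimal sketch of such group-specific thresholds, assuming scores from an already-trained model; the threshold values and toy data are illustrative assumptions.

```python
import numpy as np

def group_threshold_decisions(scores, protected, thresholds):
    """Apply a different approval threshold per protected group to compensate
    for a model that systematically under-scores one group."""
    cutoffs = np.where(protected == 1, thresholds["unprivileged"], thresholds["privileged"])
    return (scores >= cutoffs).astype(int)

# Illustrative scores from a biased model that under-scores the unprivileged group.
rng = np.random.default_rng(0)
protected = rng.integers(0, 2, 1000)
scores = rng.random(1000) - 0.1 * protected

decisions = group_threshold_decisions(scores, protected,
                                       thresholds={"privileged": 0.6, "unprivileged": 0.5})
print("Approval rate privileged:  ", decisions[protected == 0].mean())
print("Approval rate unprivileged:", decisions[protected == 1].mean())
```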

 

Finally, and most importantly, the type of bias (and hence, in this context, of unfairness) needs to be defined. Is the objective to obtain an equal representation of the different groups in senior positions? Is the objective to build a model that provides equal opportunities to applicants who should get promoted to that senior position (as defined during the DI Summit presentation)? Do you measure the size of the groups relative to the overall population of the country you operate in, or relative to the actual applicants for those senior positions?

 

The context in which these questions are answered, the type of result you wish to achieve, and the way the model is used also have an impact. As experience has shown in the past, a fully AI-driven fair recruitment system is very difficult to achieve. A better option is to use the model as support in a human-driven process: the model only pre-selects candidates for a position and provides a shortlist to the HR department for a second round of interviews. In this case, splitting the dataset into different groups and training a separate model for each group (see the sketch below) could prove a fairer solution and leave the decision responsibility with humans.
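A minimal sketch of the split-and-train-per-group idea, assuming a tabular dataset and scikit-learn as the modelling library; all column names, the group labels and the shortlist size are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_per_group_models(df, group_col, feature_cols, target_col):
    """Train one classifier per protected group so that candidates are only
    ranked against peers from the same group; final decisions stay with HR."""
    models = {}
    for group, rows in df.groupby(group_col):
        models[group] = LogisticRegression().fit(rows[feature_cols], rows[target_col])
    return models

def preselect(df, models, group_col, feature_cols, top_k=5):
    """Return the top-k candidates of each group by predicted suitability."""
    shortlists = []
    for group, rows in df.groupby(group_col):
        scores = models[group].predict_proba(rows[feature_cols])[:, 1]
        shortlists.append(rows.assign(score=scores).nlargest(top_k, "score"))
    return pd.concat(shortlists)

# Illustrative toy data with a strong representation bias.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.choice(["white_male", "other"], 200, p=[0.8, 0.2]),
    "years_experience": rng.integers(0, 20, 200),
    "performance_score": rng.random(200),
    "promoted": rng.integers(0, 2, 200),
})
features = ["years_experience", "performance_score"]
models = train_per_group_models(df, "group", features, "promoted")
print(preselect(df, models, "group", features)[["group", "score"]])
```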

AI outcomes can be so complex; humans fail to understand them (ex. extremely advanced Go strategies). How will we be able to differentiate bad from good if an extremely advanced ethical AI develops strategies and results we cannot comprehend?

As this is a very broad question, we will explore different aspects to provide answers. 

 

First of all, even if we don’t understand why a decision is made, we can still analyze the decision as being fair or unfair. This is simple for classical machine learning. In the context of extremely advanced Go strategies and very complex AI systems, even if it is not immediately understandable, the impact of the decisions that are taken on the outcome of the game can always be considered. 

 

In the end, fairness only looks at the impact, the consequence, and difference of treatment between people that should be treated equally in a specific context. If the impact of an extremely advanced AI strategy cannot be measured on people, one could wonder whether it is actually affecting people, and why it cannot be measured. Answering why an AI produced a certain outcome is an entirely different discussion.

 

Secondly, machines do not set their own targets and problems; they simply optimize the targets humans give them. The target and the means to achieve it are also defined by a human. This human (or these humans) also deploys the models and observes how they perform. Therefore, there should not be any problem with the analysis of the outcome, since humans are the ones defining what the AI should produce as an outcome.

Can you elaborate on "how" humans should interfere in the algorithms so that "past injustices are corrected"? To what degree should they interfere, and to what extent would you consider that interference to be fair / justified? How is this quantified?

Humans can mitigate model bias by applying in-processing methods which enforce constraints during the learning process of the algorithm: adversarial debiasing, classification with fairness constraints or prejudice remover regularizer. 

 

The degree to which the human needs to interfere can be determined by repeatedly detecting and evaluating group-fairness issues after model training and adding or tuning an in-processing bias mitigation method, until the model is considered fair.
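A minimal sketch of that detect-and-mitigate loop, using instance re-weighting of the disadvantaged group's positive cases as a simple stand-in for an in-processing knob and an illustrative equal-opportunity objective of 0.95; the data and all parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def equal_opportunity_ratio(y_true, y_pred, protected):
    """Ratio of true positive rates between the disadvantaged and advantaged groups."""
    tpr_disadv = y_pred[(protected == 1) & (y_true == 1)].mean()
    tpr_adv = y_pred[(protected == 0) & (y_true == 1)].mean()
    return tpr_disadv / tpr_adv

# Illustrative biased toy data: protected = 1 marks the disadvantaged group.
rng = np.random.default_rng(0)
n = 2000
protected = rng.integers(0, 2, n)
X = np.column_stack([rng.normal(size=n), protected + rng.normal(scale=0.5, size=n)])
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] - 0.8 * protected)))).astype(int)

# Detect, mitigate, re-evaluate: increase the mitigation strength (here, the weight
# of the disadvantaged group's positive examples) until the agreed objective is met.
target_ratio, weight, ratio = 0.95, 1.0, 0.0
for _ in range(15):
    sample_weight = np.where((protected == 1) & (y == 1), weight, 1.0)
    model = LogisticRegression().fit(X, y, sample_weight=sample_weight)
    ratio = equal_opportunity_ratio(y, model.predict(X), protected)
    if ratio >= target_ratio:
        break
    weight += 1.0
print(f"Final weight {weight:.1f}, equal-opportunity ratio {ratio:.2f}")
```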

 

In adversarial debiasing, an adversarial model steers the original model's parameters and weights so that the sensitive attribute can no longer be predicted well from its outputs. As the actual data values are not changed, the method is truthful, and it is justified to change the model properties when group unfairness has been detected (using group-fairness measures).

Classification with fairness constraints and a prejudice remover regularizer go beyond adversarial debiasing because they explicitly optimize the model taking the bias into account. Classification with fairness constraints optimizes the algorithm for predictive performance subject to a fairness constraint. A prejudice remover regularizer optimizes the algorithm for predictive performance and fairness simultaneously, rather than for predictive performance only. Depending on how the fairness constraint or the fairness regularizer term is implemented, this human intervention will be justified or not.

 

In the end, the degree to which humans should interfere, and to what extent intervention is justified, depends on the fairness objective as defined by the different stakeholders during the use-case definition. As an example, if the use-case context defines fairness as "equal opportunity", in the sense of similar true positive rates between groups, and sets a ratio of 95% as the numerical objective, one could argue that any intervention is justified until that objective is reached.

How do you manage trade-offs between individual and group fairness definitions?

This question can be answered from two points of view.

 

First, if the meaning is to define the balance or trade-off between the objectives of individual fairness and group fairness, then this is a business and legal requirement within the context of the use case, and it is the task of the different stakeholders to formulate the trade-off. They are the ones who should decide which objective is more important and what kind of trade-off they find justifiable.

 

Secondly, if the meaning is to optimize the model with respect to both objectives, then one could use models whose loss function can be customized to include both terms and weight them according to the objectives defined for the use case (see the sketch below).
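As a minimal sketch of such a customized loss, assuming a binary classifier's probability scores, a demographic-parity gap as the group term and a nearest-neighbour consistency penalty as the individual term; all names, weights and toy data are illustrative assumptions.

```python
import numpy as np

def combined_fairness_loss(p, y, protected, X_nonprot, w_group=1.0, w_indiv=1.0):
    """Prediction loss plus a weighted group-fairness term (demographic-parity gap)
    and a weighted individual-fairness term (similar individuals, ignoring the
    protected attribute, should receive similar scores)."""
    eps = 1e-9
    bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    group_term = (p[protected == 1].mean() - p[protected == 0].mean()) ** 2
    # Individual term: compare each person with their nearest neighbour
    # in the non-protected feature space.
    dists = np.linalg.norm(X_nonprot[:, None, :] - X_nonprot[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)
    indiv_term = np.mean((p - p[nearest]) ** 2)
    return bce + w_group * group_term + w_indiv * indiv_term

# Illustrative toy inputs: scores p from any model, weights set per the use case.
rng = np.random.default_rng(0)
n = 200
X_nonprot = rng.normal(size=(n, 3))
protected = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)
p = rng.random(n)
print(combined_fairness_loss(p, y, protected, X_nonprot, w_group=2.0, w_indiv=0.5))
```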

How do you make objective what properties are relevant to use in an "equal treatment" approach? E.g. is salary so relevant that as a bank I am justified not to have the same success rate for men and women in credit card applications?

Actually, that is precisely the issue that underlies this complex problem. As was explained by Nathalie Smuha during her talk on the legal point of view, and by Prof. Lode Lauwaert during his presentation from a philosophical perspective, equality and fairness are subjective human notions. They depend on the context in which they are evaluated and usually on societal norms. As those norms constantly evolve over time, so does the notion of fairness.

 

In Anglo-Saxon jurisdictions there is evidence that you can discriminate based on factors that are essential to conducting your business. Supreme Court decisions in the US (Baum & Stute, 2015) and UK (Lowenthal, 2017) allow for disparate impact if it is crucial to a legitimate business requirement. One such requirement may be the need to accurately predict default to allow for greater financial inclusion, which, given discriminatory history, may come at the expense of some minority groups. In European jurisdictions, however, the right to conduct a business is balanced against the right to non-discrimination. In Europe the best thing to do is to provide statistical evidence and business logic for why certain properties are chosen in the equal-treatment approach.

 

There is an opportunity for lawmakers to define, together with industry, which properties it is justified to discriminate on for particular use cases. To some extent, current legislation already contains such rules.

 

For example, for consumer credit, financial organizations are required to collect the applicant's credit history, the number of existing credits at the bank, and the number of people the applicant is liable to provide maintenance for. This required background information provides natural conditioning properties for an equal-treatment approach.

 

Another example is the use of gender as a differentiating criterion. In health care, the differences in anatomy and biological response, and their impact on any decision process, are well understood. In this context, providing different results or treatments to different groups based on gender would be accepted by society and deemed fair. Another scenario involves life insurance. It is statistically established that women tend to live longer than men, so it would be logical from a financial perspective to offer cheaper life insurance policies to women than to men. However, even though the financial and statistical evidence push towards this conclusion, legal jurisprudence has punished this kind of discrimination, forcing insurers to provide similar policies and disregard the gender criterion.

 

Hence, for this specific question regarding salary in credit card applications, we strongly suggest consulting the bank's domain experts and legal teams to decide what is relevant and what is not. Even though we are not lawyers or bank employees, we are fairly sure that salary plays a major role in getting a loan. That is exactly where the conditional metrics play a role: a man and a woman with similar relevant attributes should have similar chances of getting a loan.


Are the applications for predicting crime (like COMPAS) fair?

The question basically boils down to defining how you think the model should behave to be qualified as fair. The answer can only be provided by a combined effort of all the stakeholders of this use case. Hence, representing only one perspective of this problem, Omina Technologies can certainly not judge if crime predicting applications are fair or unfair.

 

What Omina would suggest, however, is that since this kind of application has huge consequences and a drastic impact on people's lives, special efforts need to be made to ensure fairness as defined by the stakeholders. The fairness criteria need to be regularly re-evaluated and closely monitored as the understanding of the problem and society's perception of the application evolve.

How do you measure fairness concretely in your models, and who decides about changing or not changing the models based on the fairness metrics?

Three concrete ways of measuring fairness were explained in the presentation. Other fairness metrics use different calculation methods and provide different results, but rely on the same basic principle: they are functions of the results of the model. The complexity of those functions varies amongst the different fairness definitions. Multiple papers and articles provide extensive summaries of these fairness metrics.

 

The role of the stakeholders in the use-case is to define which fairness metric is relevant. Once this is done, it is the data scientist’s role to build the AI solution, and provide a fairness assessment of the solution’s output. Should the obtained results not correspond to the fairness objective, different mitigation strategies are available for data scientists to cure the unfairness of the model.

 

Two outcomes are then possible. On one hand, it could be that the fairness objective cannot be met. Then, the stakeholders should look for other options. On the other hand, if the fairness objective is met, the stakeholders should validate the mitigation strategies. 

 

To summarize: Stakeholders are the ones that should define the fairness metric, acceptance criteria, fairness objectives, possible changes in case of impossible targets, and final pipeline validation.

What concepts in these talks would apply to AI and not to human-based decision making?

One must identify the differences between AI and human-based decision making. A first caveat is to compare only systems that are comparable. In the last (practical) presentation, an example was given of a loan application system that can be either human-driven or AI-driven. Humans base their decisions on their personal experiences; AI bases its decisions on a model that learns patterns from data. However, to match the processing power of a single AI model making decisions, one needs a multitude of humans. The concepts in the talks are relevant for analyzing a single system making decisions: with a fairness definition applied to a use case, one can detect unfairness and attempt to mitigate it. The same approach cannot be used for human-driven systems, as every single human will behave differently and might be subject to a different form of bias. Analyzing a human-driven system from a global point of view will not ensure that every human uses the same reasoning and produces the same kind of behavior.

 

Hence, the concepts of fairness and the methodology presented in those talks are mostly relevant for AI-driven systems, or any other system that shares the same characteristics. In the end, what really matters are the type of input, the type of bias and fairness to measure, the method used for the decision-making process, and the type of output.

 

As specified during the talk, the concepts presented are only a small part of solving the fairness problems in an AI system. The tools presented are required to define and measure fairness. Once this is done, other methods can be implemented to mitigate the problems that have been identified. Those can be specific to the type of problem, such as historical bias or representation bias, or algorithm-specific, such as including fairness in the cost function of the algorithm; the latter are then AI-specific.

Is it useful to promote specific legislation and guidance dedicated to AI, rather than insist that all decision-making processes comply with the same standards, regardless of how they are implemented (human reasoning, AI, ...)?

The possibility of unintentional discrimination in automated AI-based decision making is probably what necessitates AI-specific regulation. This is why the European Commission's White Paper on Artificial Intelligence (February 19th, 2020) suggests adding specific regulation for fully automated, high-risk AI-based decision making. Legal institutions are now examining to what extent existing legislation is insufficient, identifying the gaps and the need for AI-specific vertical legislation (probably for those high-risk use cases in specific industries).

Much of the need for AI-specific legislation stems from the absence of the concept that a system itself can be held accountable. This is why most demands for extra AI legislation concern fully automated systems where no specific person can be identified as liable. In fully automated systems, the absence of a human means there is less contextual interpretation and a bigger risk of unfair or unsafe decisions than with human decision makers, so some rules might apply only to fully automated AI-based decisions, because those specific harms are very unlikely when humans are involved in the decision-making process.

On-Demand Video Recording

Get the on-demand videos including the talks by Prof. dr. Lode Lauwaert and Nathalie Smuha here!
