Early in the scenario, participants discover algorithmic limitations in an artificial intelligence (AI)-based deepfake detection tool. As the scenario unfolds, they see first-hand how this problem degrades the tool’s performance and reliability, introduces uncertainty into the assessment of information, and sows distrust in intelligence products. Over the course of the exercise, participants confront a series of decisionmaking challenges about how to respond.
To our knowledge, this is the first exercise of its kind. We hope this article, which incorporates perspectives from the exercise designer and lead scientist, faculty, and participating students, will prompt further discussion of best practices to help future leaders explore the ramifications of new technologies before they encounter problems in an operational context.
Why Run a Tabletop Exercise About Algorithmic Performance?
AI tools are quickly becoming part of the decisionmaking process for national security and defense, but we have yet to fully understand and prepare for how these tools—and their vulnerabilities and limitations—could affect critical and military ethical decisionmaking for the warfighter. One of the best ways to learn about and prepare for these challenges is to observe them under the conditions of an exercise in a joint professional military education (JPME) classroom, with participants whose real-world expertise turns the pressure up to a maximum.
In September 2023, the College of Information and Cyberspace (CIC) at the National Defense University ran a unique tabletop exercise designed by IQT and Luminos.Law that explored issues related to algorithmic performance in the context of JPME.1 As far as the authors of this article are aware, this was the first time an exercise of this type was conducted in a classroom setting for Department of Defense (DOD) participants. In this article, we summarize key takeaways and lessons from that experience, including important issues related to algorithmic performance raised during the exercise; observations about the exercise itself, including reactions from two students who participated; and suggested next steps and further areas of study.
In general, this experience illustrated a vital need to continue to be creative in how we adapt our teaching methods for senior decisionmakers who will use, evaluate, direct, resource, and be held responsible for outcomes linked to AI systems. While DOD has previously articulated frameworks on the development and use of AI tools, these have not (yet) incorporated training and education for future leaders about how AI technologies ought to inform critical and military ethical decisionmaking. As emerging AI tools become enmeshed in our decisionmaking support systems, are we prepared for the types of uncertainty and vulnerability these systems introduce? As researchers, educators, and future leaders (students), we wanted to find out.
Algorithmic Performance: Understanding Potential Vulnerability
All software tools are vulnerable to failure. As machine learning and AI are introduced into software products, new modes of failure become possible. For example, many research efforts have documented novel “adversarial” attacks on AI models and tools.2 These types of intentional attacks, however, are not the only cause for concern. Today, many AI systems fail because of unintended causes such as engineering mistakes, insufficient training data, or algorithmic limitations.
AI models are imperfect summaries of the data on which they are trained. This means that models reflect whatever limitations are present in their training data and feature extraction methods. As the statistician George Box famously put it, “All models are wrong,” the implication being that some are nonetheless useful.3 Not all limitations in models are cause for concern, but some can lead to serious negative outcomes.
For example, many computer vision classification algorithms are based on an implicit assumption that training data include an equal number of examples of each of the classes (or categories) the model is trained to identify. If this is not the case, models can have worse predictive performance for some classes than for others. A growing body of academic literature documents this phenomenon in facial recognition models, which perform consistently worse for subjects from certain demographics because their training data include insufficient examples of faces with darker skin tones.4
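One practical consequence is that an aggregate score can mask exactly this kind of per-class disparity. A minimal sketch, using entirely fabricated numbers for a hypothetical detector (no real model or dataset is involved), shows how a respectable overall accuracy can coexist with much weaker performance on an underrepresented class:

```python
# Illustrative sketch with fabricated data: overall accuracy can hide
# much worse performance on an underrepresented class.

def per_class_recall(labels, correct):
    """Fraction of examples of each class that the model got right."""
    totals, hits = {}, {}
    for cls, ok in zip(labels, correct):
        totals[cls] = totals.get(cls, 0) + 1
        hits[cls] = hits.get(cls, 0) + (1 if ok else 0)
    return {cls: hits[cls] / totals[cls] for cls in totals}

# Hypothetical evaluation set: class A is well represented, class B is not.
labels = ["A"] * 900 + ["B"] * 100
# Suppose the model is right on 95% of class A but only 60% of class B.
correct = [True] * 855 + [False] * 45 + [True] * 60 + [False] * 40

overall = sum(correct) / len(correct)
print(f"overall accuracy: {overall:.1%}")   # 91.5% -- looks acceptable
print(per_class_recall(labels, correct))    # per-class: A 0.95, B 0.6
```

Here the headline 91.5 percent figure is driven almost entirely by the well-represented class; only the per-class breakdown reveals the disparity an analyst would need to know about.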
In the United States, algorithmic limitations are often viewed as an issue for private-sector companies that are subject to domestic antidiscrimination laws.5 In this framing, the limitations are commonly conceived of as an issue that is distinct from model performance, and the resulting harms are understood to be related to the rights of the subjects of analysis (that is, consumers). Training a model on biometric data—like human faces—means that legally protected characteristics, such as skin color or features associated with biological sex, are encoded into the data themselves and, by extension, the model. Unless this is explicitly addressed by model developers, the performance of the resulting model may be dependent on features that relate to protected class categories.
To make matters worse, through a problem called “shortcut learning,” models can find spurious relationships between data features.6 A model trained on biometric data could, for example, erroneously associate a protected characteristic (such as skin tone) with specific outcomes that have no causal link to that characteristic, particularly if there are not enough counterexamples present in the training data. Depending on how a model is used, this type of algorithmic deficiency could lead to harmful and discriminatory outcomes. In a national security context, using deficient AI systems to examine data about populations outside the United States raises significant concerns, as these populations are most frequently the targets of interest in military scenarios. The subjects of such analyses do not have the same legal recourse as U.S. citizens, whose complaints more readily drive product changes.
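To make the shortcut idea concrete, the following sketch uses fabricated data for a hypothetical deepfake detector. A deliberately simple one-rule learner picks whichever single feature value best predicts the training labels; because a spurious background feature happens to correlate perfectly with the label in the training set, the learner adopts it and then fails completely once that correlation is broken. The feature names and values are invented purely for illustration:

```python
# Illustrative sketch of "shortcut learning" with fabricated data: a
# one-rule learner latches onto a spurious feature that correlates with
# the label only by dataset construction, then fails when that
# correlation is broken at test time.

def train_one_rule(examples):
    """Pick the (feature, value) pair most predictive of label == 1."""
    best, best_acc = None, 0.0
    features = examples[0][0].keys()
    for f in features:
        for v in {ex[0][f] for ex in examples}:
            acc = sum((ex[0][f] == v) == (ex[1] == 1) for ex in examples) / len(examples)
            if acc > best_acc:
                best, best_acc = (f, v), acc
    return best

# Training set: a "studio" background co-occurs with deepfakes purely by
# dataset construction; the genuine signal ("artifact") is noisier.
train = (
      [({"background": "studio",  "artifact": True},  1)] * 45
    + [({"background": "studio",  "artifact": False}, 1)] * 5
    + [({"background": "outdoor", "artifact": False}, 0)] * 45
    + [({"background": "outdoor", "artifact": True},  0)] * 5
)
rule = train_one_rule(train)
print(rule)  # ('background', 'studio') -- the shortcut wins

# Counterexamples where the correlation is reversed: the shortcut misfires.
test = (
      [({"background": "outdoor", "artifact": True},  1)] * 10
    + [({"background": "studio",  "artifact": False}, 0)] * 10
)
f, v = rule
acc = sum((ex[0][f] == v) == (ex[1] == 1) for ex in test) / len(test)
print(f"shortcut accuracy on counterexamples: {acc:.0%}")  # 0%
```

The shortcut looks perfect during training precisely because the training data contained no counterexamples, which is why such deficiencies are easy to miss before deployment.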
While this framing of algorithmic limitations highlights fairness and justice concerns, it overlooks an important national security implication—that algorithmic limitations are also a critical performance concern. Even if the use of a deficient AI model does not lead to discriminatory or unfair outcomes, a model that significantly underperforms in certain situations (that is, when analyzing specific groups of subjects) is not reliable. If analysts using a deficient model are unaware of its limitations, this could lead to analytical errors that affect decisionmaking not only in how intelligence products are produced but also for senior decisionmakers informed by the analysis. Furthermore, when algorithms are based on U.S.-centric ethnic and racial categories, these models may obfuscate important differences within and across foreign populations, exacerbating performance and fairness concerns.
The authors of this article collaborated to run a tabletop exercise about algorithmic performance to explore how it could negatively impact decisionmaking in the context of military intelligence. By running this exercise in a JPME classroom setting, we hoped to help future leaders anticipate problems with AI technologies and rehearse response strategies before they encountered similar problems in an operational context.
What Happened During the Exercise?
Deepfake Incident Response, a tabletop exercise developed by IQT and Luminos.Law, involves a fictional AI tool called FakeFinder, which uses AI to predict whether a video is a deepfake.7 For over 90 minutes, participants role-play members of an AI incident response team at a U.S. intelligence agency. The exercise is set in the context of a highly charged geopolitical situation, where the authenticity of a particular video has grave implications and where analysts have used FakeFinder to assess the veracity of that video. A moderator facilitates discussion among the participants, first in response to an “initial incident report,” which describes an allegation of deficient performance within the FakeFinder tool, and later to a series of “injects,” or plot twists, that unfold during the exercise, complicating the situation and raising the stakes.
During the exercise, each participant is assigned a specific role on the AI incident response team. The roles—data scientist, senior analyst, legal/privacy stakeholder, communications stakeholder, information security stakeholder, foreign liaison officer, and deputy director—intentionally draw on various skill sets and types of expertise to encourage participants to anticipate and explore types of risks and concerns that emerge during the exercise.
While the scenario is fiction, it is grounded in the results of a red-teaming exercise that IQT ran in 2021,8 which examined several open-source deepfake detection models that were produced during a competition called the Deepfake Detection Challenge.9 IQT tested the top-performing models from that competition, and its testing revealed troubling signs of performance limitations. For example, one model was six times more likely to flag a video as a false positive—incorrectly identifying a real video as a deepfake—if the video showed an East Asian face than if it showed a White face.10 IQT previously ran the Deepfake Incident Response tabletop exercise with groups of individuals from the Intelligence Community, but this was the first time the exercise was piloted in a DOD setting.

The College of Information and Cyberspace, which has a strong focus on the use of technology in strategic decisionmaking for the warfighter as well as the role of ethics in leadership, invited IQT to share the exercise as part of the active learning series on AI and military ethics woven throughout the master’s curriculum. CIC educates joint warfighters, national security leaders, and the cyber workforce on the cyber domain and information environment to lead, advise, and advance national and global security. The students come from military, civilian, and international roles, typically at the O-5/O-6 or GS-14/15 level, and are expected to advance into more senior roles as their careers continue. The changing character of war requires new approaches to JPME, such as exploring AI and military ethics through exercises like this one.

What Did We Learn?

Two students share observations from their participation in the tabletop exercise, in the roles of the senior analyst and deputy director, respectively.

Col Lisa Pagano-Wallace: Decisionmaking at the Speed of Technology. “For the exercise, I was assigned to the senior analyst role.
It was not a stretch role for me; I have been practicing intelligence in the Air Force for the last 20 years and felt comfortable with the role as it was outlined. I was also interested in the idea of using AI to do predictive analysis and monitor indications and warnings, but I knew that the practice of creating a solid, reliable data repository relies heavily on analysts and can be incredibly labor-intensive, painstaking work. This experience, combined with my (admittedly shallow and mostly from podcasts11) understanding of the problems with facial recognition technologies, set me up to be deeply suspicious of FakeFinder in the exercise. I was also inclined to believe the character in the scenario, who claimed that the tool in question was deficient based on tests he performed running his own videos through the tool.

“First and foremost, as a student, any opportunity to learn actively is embraced, so I walked into the ethics and AI tabletop exercise with an abundance of enthusiasm. CIC students received a primer on the dangers and potential deficiencies of AI data models earlier in the semester, in our Strategic Thinking and Communications course, by way of a groundbreaking article titled ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?’12 I had yet to practically connect the insights of this work to my work as an intelligence professional, however, and this is where the exercise pushed me into real-world takeaways from an academic moment.

“This initial impression formed my strategy for the rest of the simulation—first, to make certain that the assessment of the video’s authenticity remained at medium to low confidence, and second, to ensure that this was seen as a tool failure and not an intelligence failure. This strategy put me diametrically opposite the data scientist because I was questioning the validity of his tool and attempting to deflect responsibility from my team to his.
It also gave me two critical tasks: buying time for human analysts to verify FakeFinder’s results and convincing the deputy director to hedge her certainty that the video was real.

“The scenario put pressure on my goal of buying time by heightening the stakes—with an imminent Chinese invasion of Taiwan—and increasing the risk of ignoring FakeFinder’s findings. This classic not-enough-time-for-the-intelligence-to-be-perfect twist sparked a heated debate, in which participants focused on the risks of not reporting the video, given the stakes, and discussed how to appropriately qualify the confidence of our assessment. Ultimately, the group decided to push forward with the intelligence, was wrong, and was sharply rebuked by the President’s senior advisor—solidifying the lesson that making decisions at the speed of technology can be a fraught venture. Failure provides more opportunity to learn, right?

“The tabletop exercise provided me with a few observations for consideration as I move forward. First, understanding how tools are developed is critical for intelligence professionals, especially in the move toward predictive capabilities. Second, relationship-building is critical in a crisis environment; by design, the participants were from different seminars, and I think our lack of insight into one another hindered our speed of decisionmaking. Third, a lesson relearned (for me) is that intelligence assessments need to be clear and their sources well vetted or caveated, whether they come from traditional systems or those powered by AI. This can be more complicated when an analyst is not familiar with the data sets or algorithms that are supplying the conclusions.”

CDR Hermie Mendoza: Acknowledge Our Trust in Machines.
“Based on the Everett Rogers adoption curve, I am in the ‘early majority.’13 Like other pragmatic early majority users, I consciously integrate newer technologies into my arsenal of ‘digital tools’ once ‘early adopters’ have validated the technologies’ effectiveness in their intended use cases. As a CIC student, my early majority preferences are challenged by the faculty, especially considering Great Power competition with China and Russia. Achieving overmatch from a national security perspective demands that the U.S. Government holistically adopt an ‘innovator’ or early adopter approach to AI, quantum computing, and other emerging technologies. Therefore, current and future national security leaders should encourage information and communication technology companies to cross the ‘chasm’ between early adopters and early majority users to incorporate new technologies in government business models.

“To expose students to the challenges of AI while developing their critical thinking skills, the faculty sponsored a tabletop exercise on a simulated intelligence-driven national security issue. My role in the tabletop exercise as the deputy director centered on synchronizing the efforts of an interdisciplinary AI incident response team, communicating with internal and external stakeholders, and adapting organizational operations when an AI tool malfunctions. On the surface, I found the deputy director’s responsibilities straightforward. Yet as the exercise evolved, I deliberately added another responsibility—developing a candid work culture in which speaking truth to power and innovation were embraced. From my perspective, this leadership practice would create collaborative conditions for the team to swiftly identify root causes, investigate dependencies, remediate the deficient AI tool in the production environment, and modify intelligence procedures and processes.
Focusing on accountability at that moment would have detracted from supporting higher-level decisionmaking processes. To me, it was more important to find a path forward that yielded higher levels of confidence and accuracy in intelligence products while still retaining the organization’s credibility.

“Throughout the deliberations during the exercise, I found myself focusing on the intersection of ethics, my own trust in technology, and the intelligence agency’s role as an early adopter of AI. The simulated incident was unfortunate, but I saw it as a great learning opportunity to benefit the organization and the Intelligence Community. It showcased the organization’s strong technical knowledge of the underlying AI system components while demonstrating exceptional commitment to transparency and the remediation of issues. Transparency, despite a perceived organizational failure and mounting pressures from internal and external stakeholders, was the right thing to do. Transparency, over time, would hopefully lead to regaining trust in the AI tool, achieving higher confidence levels in future intelligence assessments, and minimizing the unconscious absorption of AI model limitations into AI-to-human interactions.

“The most concerning outcomes for me centered on the impact of skewed data and the development of human and machine decisionmaking skills. AI tools do not contextualize results, so humans must assess their confidence in those results. Repeated use, especially with limited training data, may result in humans incorporating misleading results into other interactions and increasing residual risks in their decisions. The exercise brought this issue to light and highlighted how humans inherently come to trust machine-generated results over time. It is this built-up trust in an AI tool that could make confronting AI incidents challenging for organizations that rely heavily on AI.
Here, the timeless Russian proverb of ‘trust but verify’ remains an important leadership principle when using AI tools, regardless of one’s technological adoption preference.”
Conclusion
The complex issue of algorithmic performance and reliability could lead to undesirable downstream impacts if not well understood. Testing can help reveal certain types of vulnerabilities, and various mitigation strategies, such as training and fine-tuning models on representative datasets, can help stave off some of the most egregious forms of harm.14 However, these limitations are not a bug that can be definitively fixed; technical solutions such as retraining cannot eliminate them all, nor can they protect against all negative outcomes.
According to Filippo Santoni de Sio and Giulio Mecacci, “The notion of ‘responsibility gap’ with [AI] was originally introduced in the philosophical debate to indicate the concern that ‘learning automata’ may make [it] more difficult or impossible to attribute moral culpability to persons for untoward events.”15 The models and data contribute to this uncertainty and ethical gray space for military leaders.
An essential element in the evaluation of AI tools’ performance concerns undesirable outcomes for people. We cannot uncouple the concept of harm from the technical considerations of AI. If we treat performance failure like a software bug that must be patched or fixed, we minimize the complex human impacts and risk overlooking social, cultural, and political ramifications. We may, at our peril, also fail to see how poor tools introduce more subtle, emergent vulnerabilities into our decisionmaking.
The Deepfake Incident Response tabletop exercise illuminated several ways in which algorithmic performance cannot be addressed via technical means alone but instead demands a more holistic approach. For example, the output of AI tools is probabilistic and—as model testing often reveals—probabilities that the output is accurate may vary across categories, classes, or demographic groups of individuals. How should this inform intelligence analysts’ assessment of confidence in intelligence products informed by these tools? Or, if a tool produces incorrect information that leads to a bad decision, even after that specific deficiency is mitigated, how will the history of bad decisions affect future users’ trust in the tool? In the exercise, participants saw how a technical issue could be a manifestation of deeper, more complex problems and raise difficult questions about who is responsible for undesired outcomes. This framing helps to shift assumptions away from the idea that current models should be the norm and any deviations are glitches; instead, it approaches performance as a systemic challenge.
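The first of these questions can be made concrete with a short sketch. Using Bayes' rule and assumed numbers (a hypothetical base rate and detection rates loosely inspired by the six-fold false-positive disparity described earlier, not measurements from any real tool), the same "deepfake" flag justifies very different levels of confidence depending on the group-specific false-positive rate:

```python
# Illustrative sketch with assumed numbers: the same "deepfake" flag
# warrants different confidence when the detector's false-positive rate
# varies across groups. All rates here are hypothetical.

def posterior_fake(prior, tpr, fpr):
    """P(video is fake | detector flags it), by Bayes' rule."""
    flagged = tpr * prior + fpr * (1 - prior)
    return tpr * prior / flagged

prior = 0.10  # assumed base rate of deepfakes in the review queue
tpr = 0.90    # assumed true-positive rate, taken as equal for both groups
for group, fpr in [("group 1", 0.05), ("group 2", 0.30)]:  # 6x disparity
    print(group, round(posterior_fake(prior, tpr, fpr), 2))
```

Under these assumptions, a flag on a video from the low-false-positive group makes a deepfake about two-thirds likely, while the identical flag for the other group leaves a deepfake less likely than not, a gap an analyst's confidence language would need to reflect.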
In JPME, students learn about ethical decisionmaking as strategic leaders. This approach relies on many perspectives to ensure critical decisions are robustly challenged at every turn. The high value placed on the skills to identify and communicate a broad range of ideas will only become more important as the national security community embraces further use of AI technologies. Exploring potential ramifications of these emerging technologies in the classroom allows students not only to practice response strategies but also to reflect on their effectiveness through critical discussion with their peers. JFQ
Notes
1 See Luminos.Law, https://www.luminos.law/, now part of ZwillGen, https://www.zwillgen.com.
2 See, for example, the MITRE ATLAS, n.d., https://atlas.mitre.org/matrices/ATLAS; or the AI Incident Database, n.d., https://incidentdatabase.ai/.
3 George E.P. Box, “Science and Statistics,” Journal of the American Statistical Association 71, no. 356 (December 1976), 792.
4 Joy Adowaa Buolamwini, “Gender Shades: Intersectional Phenotypic and Demographic Evaluation of Face Datasets and Gender Classifiers” (master’s thesis, Massachusetts Institute of Technology, 2017), https://dspace.mit.edu/handle/1721.1/114068.
5 Such antidiscrimination laws are, among other areas, focused on discrimination in finance, housing, employment, and the use of government funds by contractors and other parties. See Equal Credit Opportunity Act, 15 U.S.C. § 1691; Consumer Financial Protection Act, 12 U.S.C. § 5481; Home Mortgage Disclosure Act, 12 U.S.C. § 2801; Fair Housing Act, 42 U.S.C. § 3604; Title VII of the Civil Rights Act of 1964, 42 U.S.C. § 2000d; Americans With Disabilities Act, 42 U.S.C. § 12101; Pregnancy Discrimination Act, 42 U.S.C. § 2000e; Age Discrimination in Employment Act, 29 U.S.C. § 623; Equal Pay Act, 29 U.S.C. § 206; Immigration Reform and Control Act, 8 U.S.C. § 1101; Civil Rights Act of 1866, 42 U.S.C. § 1981; Genetic Information Nondiscrimination Act, 42 U.S.C. § 2000ff. A comprehensive list of Federal nondiscrimination laws is available at https://fpf.org/blog/fpf-list-federal-antidiscrimination-laws/.
6 For a technical discussion of some of these issues, see, for example, Robert Geirhos et al., “Shortcut Learning in Deep Neural Networks,” Nature Machine Intelligence 2, no. 11 (2020), 665–73, https://doi.org/10.1038/s42256-020-00257-z; Alexander D’Amour et al., “Underspecification Presents Challenges for Credibility in Modern Machine Learning,” arXiv, November 24, 2020, https://doi.org/10.48550/arXiv.2011.03395.
7 When this tabletop exercise was developed, Luminos.Law was BNH.AI.
8 Andrea Brennen and Ryan Ashley, AI Assurance Audit of FakeFinder, an Open-Source Deepfake Detection Tool (Tysons, VA: IQT Labs, October 2021), https://assets.iqt.org/pdfs/IQTLabs_AiA_FakeFinderAudit_DISTRO__1_.pdf/web/viewer.html. This work is also summarized in Andrea Brennen and Ryan Ashley, “AI Assurance: What Happened When We Audited a Deepfake Detection Tool Called FakeFinder,” IQT Labs, January 4, 2022, https://www.iqt.org/library/ai-assurance-what-happened-when-we-audited-a-deepfake-detection-tool-called-fakefinder. For technical details on FakeFinder, see “IQTLabs/FakeFinder,” GitHub, https://github.com/IQTLabs/FakeFinder.
9 “Deepfake Detection Challenge,” Kaggle, April 23, 2020, https://www.kaggle.com/c/deepfake-detection-challenge/.
10 See Andrea Brennen and Ryan Ashley, “AI Assurance: Do Deepfakes Discriminate?” IQT Labs, February 1, 2022, https://www.iqt.org/ai-assurance-do-deepfakes-discriminate/. Full results of the testing conducted on FakeFinder are also available in the audit report by Brennen and Ashley, AI Assurance Audit of FakeFinder.
11 Michael Barbaro, “The End of Privacy as We Know It?” The Daily (New York Times), produced by Annie Brown and Daniel Guillemette, podcast audio, 31:18, with transcript, February 10, 2020, https://www.nytimes.com/2020/02/10/podcasts/the-daily/facial-recognition-surveillance.html; Annie Brown, “Wrongfully Accused by an Algorithm,” The Daily (New York Times), produced by Daniel Guillemette et al., podcast audio, 28:36, with transcript, August 3, 2020, https://www.nytimes.com/2020/08/03/podcasts/the-daily/algorithmic-justice-racism.html; Michael Barbaro, “The ‘Enemies List’ at Madison Square Garden,” The Daily (New York Times), produced by Asthaa Chaturvedi et al., podcast audio, 25:05, with transcript, January 18, 2023, https://www.nytimes.com/2023/01/18/podcasts/the-daily/facial-recognition-madison-square-garden.html.
12 Emily M. Bender et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” FAccT ’21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (2021), 610–23, https://doi.org/10.1145/3442188.3445922.
13 Everett M. Rogers, Diffusion of Innovations, 4th ed. (New York: Free Press, 1995).
14 In prior work, IQT Labs consulted with Luminos.Law, a law firm that specialized in AI liability assessment, to understand best practices for performance testing. Luminos.Law’s founders advised us on a legally defensible methodology, which is based on how existing case law informs a determination of disparate treatment. This approach relies on three metrics: 1) adverse impact ratio, the rates at which an outcome occurs for members of certain groups (that is, defined by protected characteristics such as race or gender) compared to a control group; 2) differential validity, a comparison of model performance across groups/classes; and 3) statistical significance, t-tests on differences in outcomes (true positives) across groups.
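As a rough illustration of how these three metrics might be computed, the following sketch uses fabricated per-group rates and outcome indicators (not figures from any real audit); the function names and numbers are our own:

```python
# Sketch of the three metrics described above, computed on fabricated
# per-group outcomes. All rates and indicator data are hypothetical.
from math import sqrt

def adverse_impact_ratio(rate_group, rate_control):
    """Rate at which an outcome occurs for a group vs. a control group."""
    return rate_group / rate_control

def differential_validity(tpr_group, tpr_control):
    """Gap in model performance (here, true-positive rate) across groups."""
    return tpr_group - tpr_control

def welch_t(xs, ys):
    """Welch's t statistic on two samples of binary outcome indicators."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / sqrt(vx / nx + vy / ny)

# Fabricated example: flag rates and per-item true-positive indicators.
print(round(adverse_impact_ratio(0.30, 0.05), 2))   # 6.0
print(round(differential_validity(0.60, 0.95), 2))  # -0.35
group = [1] * 60 + [0] * 40     # 60% true positives in the group
control = [1] * 95 + [0] * 5    # 95% true positives in the control
print(round(welch_t(group, control), 2))  # strongly negative t statistic
```

In practice, a statistics library would be used for the significance test; the hand-rolled version here only shows what quantity is being compared.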
15 Filippo Santoni de Sio and Giulio Mecacci, “Four Responsibility Gaps With Artificial Intelligence: Why They Matter and How to Address Them,” Philosophy & Technology 34, no. 4 (2021), 1057, https://doi.org/10.1007/s13347-021-00450-x.