The Nation as Dataset


When Citizens Are Strategic Infrastructure

In November 2025, France and Germany convened a Summit on European Digital Sovereignty in Paris. The final declaration committed EU member states to strengthening digital sovereignty as "a cornerstone of economic resilience, social prosperity, competitiveness and security." A joint task force was launched to report in 2026.

The language was measured, but the subtext was not. What European leaders were acknowledging—without quite saying it in these terms—is that the data generated by European citizens and institutions has become a strategic resource that Europe does not control. Over 80% of the EU's digital infrastructure is currently supplied by non-European providers. The populations of member states generate enormous volumes of data every day—through health systems, financial transactions, education platforms, public administration—and the AI capabilities built on that data accrue overwhelmingly to firms headquartered elsewhere.

This is not a uniquely European problem. From Lagos to Jakarta to Riyadh, governments are reaching the same conclusion through different paths: when citizens become, in effect, a national dataset, the relationship between the state and the individual begins to change in ways that existing governance frameworks are not designed to address.


The Population as Training Signal

Large-scale AI systems require data. Not just any data—linguistically rich, culturally specific, behaviorally diverse data that reflects the patterns of real human activity. Health records. Financial transactions. Language use. Mobility patterns. Administrative interactions. Educational histories.

Every population produces these signals continuously. What has changed is that this ambient output now has direct strategic value. A country with a large, digitally active, linguistically coherent population possesses something that cannot easily be replicated or purchased: a living training corpus.

India understood this early. A December 2025 white paper from the Office of the Principal Scientific Adviser explicitly framed the country's data output as a national asset, arguing that curated, interoperable datasets would be India's competitive advantage in building inclusive AI solutions. India generates nearly one-fifth of the world's data, yet hosts only a small share of global data center capacity. The white paper proposed treating AI compute, datasets, and models as Digital Public Goods—shared public utilities rather than proprietary assets.

This reframes familiar policy categories. India's Aadhaar biometric identity system, which covers over a billion people, is not only administrative modernization—it is data infrastructure. The Unified Payments Interface that processes billions of transactions is not only financial inclusion—it is behavioral telemetry at national scale. The IndiaAI Mission's deployment of over 34,000 GPUs and the creation of platforms like IndiaAIKosh and Bhashini—repositories hosting thousands of datasets across healthcare, agriculture, and Indian languages—represent a deliberate strategy to retain the value of citizen-generated data within national borders.

None of this requires conspiratorial framing. It requires only the recognition that data, once generated, enters supply chains—and that those supply chains increasingly determine who can build capable AI systems and who cannot.


Datafication as Industrial Policy

Several governments have already begun treating their populations explicitly as data assets. The approaches differ, but the structural logic converges.

Saudi Arabia declared 2026 its official "Year of Artificial Intelligence." The Saudi Data and Artificial Intelligence Authority, established by royal decree in 2019, now oversees a national strategy built around six pillars: ambition, competencies, policies, investment, innovation, and ecosystem. Government spending on emerging technologies rose by more than 56% in 2024. AI companies operating in the Kingdom have secured $9.1 billion in funding. The country has launched the Shaheen III supercomputer and the Hexagon data center—described as the world's largest government data facility at 480 megawatts. And through HUMAIN, a government-owned company under the Public Investment Fund, Saudi Arabia is developing large Arabic-language models trained on domestically generated data.

The message is unambiguous: the Kingdom's population and its digital activity are national strategic assets to be developed, processed, and retained.

Indonesia has reached a similar conclusion from a different starting point. With over 270 million people, more than 700 local languages, and a vast archipelago geography, Indonesia faces a particular challenge: foreign AI models trained on English-language datasets do not adequately reflect local contexts. In 2025, the government finalized a National AI Roadmap and began building what officials call "sovereign AI"—domestic computing clusters and localized language models tailored to Indonesian law, languages, and public services. Government Regulation 71/2019 mandates local data storage. Healthcare records cannot leave the country. Financial data has been localized since 2021.

The question of who benefits from a population's data output is becoming a question of political economy. When Indonesia digitizes its public services, who gains access to the resulting datasets? When Saudi Arabia's health system adopts AI-driven diagnostics, whose models are being trained? When India's education platforms scale across a national school system, where does the behavioral data go?

These are not hypothetical concerns. They describe existing arrangements in which the data generated by citizens flows—often by default rather than design—into private AI development pipelines, frequently controlled by firms headquartered in other jurisdictions.


The Sovereignty Paradox

This creates what might be called the sovereignty paradox of datafication: the more a state digitizes its institutions to modernize and compete, the more data it generates—and the more exposed it becomes to extraction by actors with superior AI infrastructure.

Nigeria illustrates this bind with unusual clarity. In April 2025, the government told Google, Microsoft, and Amazon to set concrete deadlines for building data centers in the country—a demand it had been making for four years without result. A working group was formed to ensure that citizen data would be stored within Nigerian territory. As the country's technology regulator put it: no more waivers.

Yet the structural asymmetry is stark. Nigeria has 17 data centers. South Africa has 56. The United States has thousands. Amazon, Microsoft, and Google together control over 60% of global cloud spending. Nigeria's entire installed data center capacity is roughly 56 megawatts—less than a single hyperscale facility in Virginia. The country's unreliable power grid, averaging about four hours of stable supply per day, forces data centers to run on diesel generators, raising costs and limiting scale.

This means that for Nigeria, digitizing government services, expanding fintech, and building an AI-capable economy requires relying on infrastructure it does not own, located in jurisdictions it does not govern, operated by firms whose interests do not necessarily align with its own. The data leaves. The capability accrues elsewhere. The structural dependency deepens.

The dynamic mirrors older patterns of resource extraction, but with a difference. Unlike oil or minerals, data can be copied, aggregated, and processed remotely. It does not deplete. And the value it generates—in the form of model capability—compounds over time, widening the gap between those who extract and those who are extracted from. A New America Foundation analysis published in mid-2025 warned directly that individual African countries negotiating alone with powerful multinationals enables regulatory arbitrage that weakens the continent's collective position.


Citizens as Infrastructure, Not Just Rights-Holders

When a population is reconceived as a dataset, citizens occupy a dual position. They remain political subjects with rights, expectations, and democratic agency. But they also become, functionally, nodes in a national data infrastructure: generators of training signal whose collective output has measurable economic and strategic value.

At the India AI Impact Summit in February 2026, a panel on "Data, People, and Pre-Empting Mass Exclusion" made this tension vivid. Osama Manzar of the Digital Empowerment Foundation pointed out that rural Indians often trek five kilometers and pay significant fees for digital authentication in an unfamiliar language—just to claim basic food subsidies. His question cut to the heart of the issue: is this data of the people, or data for the people? Another panelist cited the case of a Peruvian farmworker whose worn fingerprints caused an algorithm to deny her a food subsidy, and the Dutch welfare scandal in which migrant families were falsely flagged as fraud risks by biased datasets.

These cases reveal the gap. Privacy law addresses individual data rights. National security frameworks address critical infrastructure. Industrial policy addresses competitiveness. But no existing framework adequately addresses the condition in which the population itself is the infrastructure—in which the aggregate behavior of citizens constitutes a strategic resource while the individual citizens whose data feeds the system face exclusion, misidentification, or harm.


Three Emerging Tensions

This framing surfaces tensions that will shape AI governance in the coming years.

Modernization versus exposure. States that digitize rapidly generate more data—and more vulnerability to extraction. The European Commission's forthcoming Cloud and AI Development Act, expected in early 2026, attempts to square this circle by requiring cloud providers and AI developers to use European infrastructure when serving the EU market. But even this ambitious approach raises questions: can sovereignty requirements be enforced without fragmenting the global AI ecosystem? Gartner projects that by 2027, over a third of European enterprises will use localized AI platforms—up from just 5% today. The shift reflects real concern, but also real cost. A McKinsey analysis of European sovereign AI scenarios found that enforcing local solutions without matching capability could reduce AI adoption, leaving Europe behind on productivity.

Individual rights versus collective value. Privacy frameworks protect individual data. But the strategic value of a national dataset is collective—it emerges from aggregation, not from any single record. The EU's November 2025 Digital Omnibus package proposed simplifying the GDPR, including expanding the legitimate interests basis to cover AI model training. Critics argued this would limit sovereignty over personal data in the name of competitiveness. The tension is structural: consent-based models were not designed for a world in which the aggregate, not the individual data point, is the asset.

Openness versus sovereignty. Indonesia's experience makes this tension concrete. The country has signed the Regional Comprehensive Economic Partnership, which prohibits data localization requirements—yet simultaneously enforces sector-specific data localization in finance, healthcare, and strategic sectors. The contradiction is not accidental. It reflects the impossible position of nations that need open data flows to attract investment and build AI capability while simultaneously needing to prevent their data from becoming raw material for someone else's models.


What This Means for Governance

If citizens are becoming strategic infrastructure, several governance questions become urgent.

First, data provenance and flow. Governments need clearer maps of where citizen-generated data goes, who processes it, and what capabilities it feeds. The EU Parliament's March 2026 call for a European register disclosing every work used to train AI models points in this direction. But the principle extends beyond copyright to all citizen-generated data.

Second, public AI infrastructure. India's decision to treat AI compute and datasets as Digital Public Goods—subsidized GPU access at rates under 100 rupees per hour, open model hubs, shared data repositories—represents one model. Saudi Arabia's state-owned HUMAIN company building Arabic-language models represents another. Indonesia's sovereign computing clusters a third. The common thread: without domestic capacity to process data, digitization becomes strategic exposure.

Third, new institutional frameworks. The concept of a population as a dataset does not fit neatly into existing categories. It is not adequately addressed by privacy regulation, trade policy, or national security doctrine alone. It may require new institutional forms—data trusts, national AI infrastructure authorities, or multilateral frameworks that prevent the one-by-one negotiation pattern that currently allows technology firms to exploit regulatory fragmentation across the Global South.

None of these are simple. But the absence of frameworks does not mean the absence of consequences. The structural shift is already underway. The question is whether governance catches up before the terms are set.


The Unanswered Question

There is a deeper question beneath the policy layer, one that is easy to defer but difficult to avoid: what is the relationship between a state and its citizens when the state's strategic position depends not on what citizens do, but on what they emit?

Democratic theory has answers for when the state needs citizens to work, to fight, to vote, to pay taxes. It is less prepared for a condition in which the most strategically valuable thing a citizen does is simply exist within a digitized system—generating data through the ordinary activity of living.

India's 1.4 billion people producing one-fifth of the world's data. Indonesia's 270 million citizens, half under 30, fueling one of the fastest-growing digital economies on earth. Nigeria's 240 million, with a median age of 18, generating unprecedented volumes of mobile data. Saudi Arabia's population feeding a $9 billion AI ecosystem. Europe's 450 million residents whose data trains models they do not control.

These are not abstractions. They are the raw inputs of a new geopolitical economy. How that economy is governed will depend on whether institutions develop the conceptual tools to see it clearly.

That work is only beginning.


Aletheia is a research lab studying AI, digital sovereignty, and institutional change. We publish frameworks, essays, and analysis at the intersection of technology and governance.