Data Thinking — How AI Sees the World

25 min

What You'll Learn

  • Distinguish between structured and unstructured data
  • Understand at a high level how machine learning models learn from data
  • Evaluate data quality and recognise common data pitfalls
  • Read AI outputs critically, including hallucinations and confidence levels

Structured vs Unstructured Data

All AI systems fundamentally operate on data, and understanding the types of data is crucial for any professional working with AI. Structured data is organised in predefined formats — think Excel spreadsheets, SQL databases, CSV files. Each piece of data sits in a specific row and column with a clear label. An employee database with columns for Name, Employee ID, Department, and Salary is structured data. It is clean, searchable, and easy for both humans and machines to process.

Unstructured data, on the other hand, does not follow a fixed format. Emails, social media posts, customer support chat logs, images, audio recordings, PDF documents, and video files are all unstructured. An estimated 80-90% of the world's data is unstructured, and this is where modern AI — particularly LLMs and computer vision models — has made the biggest breakthroughs. Before LLMs, extracting useful insights from 10,000 customer emails required weeks of manual analysis. Now, Claude can read and summarise them in minutes.

For your career, the practical implication is this: much of the value you can create with AI lies in converting unstructured data into structured insights. A marketing analyst who can use AI to analyse thousands of social media mentions and produce a structured sentiment report is far more valuable than one who can only work with pre-formatted spreadsheet data. This skill — structuring the unstructured — is one of the most in-demand capabilities in the Indian job market today.
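The "structuring the unstructured" idea can be sketched in a few lines. The keyword lists below are purely illustrative (a real pipeline would use an LLM or a trained sentiment classifier, not keyword matching), but the shape is the same: free text in, labelled rows and a structured summary out.

```python
# Minimal sketch: turning unstructured mentions into a structured sentiment summary.
# The keyword sets are illustrative placeholders, not a real sentiment lexicon.
from collections import Counter

POSITIVE = {"love", "great", "excellent", "fast"}
NEGATIVE = {"broken", "slow", "refund", "terrible"}

def label(mention: str) -> str:
    words = set(mention.lower().split())
    if words & NEGATIVE:
        return "negative"
    if words & POSITIVE:
        return "positive"
    return "neutral"

mentions = [
    "Love the new app, checkout is fast",
    "App is slow and support is terrible",
    "Delivery arrived on Tuesday",
]

# Unstructured text in, structured (mention, label) rows out.
rows = [(m, label(m)) for m in mentions]
report = Counter(lbl for _, lbl in rows)
print(report)
```

Even this crude version demonstrates the value shift: three free-text mentions become a table any spreadsheet user can filter and count.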

Did You Know?

India generates over 20 exabytes of data annually — that is 20 billion gigabytes. Most of this is unstructured: WhatsApp messages, UPI transactions, call centre recordings, and social media posts in dozens of languages. Companies that can make sense of this data using AI have a massive competitive advantage.

How Models Learn and Data Quality

At a high level, machine learning models learn by finding statistical patterns in training data. During training, a model is shown millions of examples and repeatedly adjusts its internal parameters to minimise prediction errors. Think of it as a student taking thousands of practice tests — over time, the student learns which patterns lead to correct answers. The key difference is scale: modern LLMs are trained on trillions of words from the internet, books, code repositories, and academic papers.
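At toy scale, that training loop looks like this: a one-parameter model learns the rule y = 2x by repeatedly nudging its parameter to shrink its prediction error. The numbers here are illustrative, not a real training setup, but real models do exactly this with billions of parameters.

```python
# Toy training loop: learn the slope w in y = w * x from examples.
data = [(1, 2), (2, 4), (3, 6)]  # examples of the rule y = 2x

w = 0.0     # initial guess for the parameter
lr = 0.05   # learning rate: how big each adjustment is

for epoch in range(200):
    for x, y in data:
        pred = w * x
        error = pred - y      # how wrong the current prediction is
        w -= lr * error * x   # nudge w in the direction that reduces the error

print(round(w, 3))  # converges close to 2.0
```

The model never "understands" multiplication; it simply drifts towards whatever parameter value makes its errors smallest on the examples it was shown.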

Data quality directly determines model quality — the principle of "garbage in, garbage out" (GIGO) is absolute in AI. If training data contains errors, biases, or gaps, the model will reproduce and amplify them. For example, if a hiring AI is trained on historical data from a company that predominantly hired men for engineering roles, the model will learn to favour male candidates — not because men are better engineers, but because the data reflects past bias. Indian companies deploying AI must be particularly conscious of caste, gender, and regional biases that may be embedded in historical data.
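A toy illustration of the hiring example (entirely synthetic data): a naive "model" that memorises base rates from historical records will reproduce whatever skew those records contain, with no malice required.

```python
from collections import Counter

# Synthetic historical records: 90% of past engineering hires were men.
history = [("male", "hired")] * 90 + [("female", "hired")] * 10

# A naive "model" that just memorises base rates from the training data.
hire_counts = Counter(gender for gender, outcome in history if outcome == "hired")
total = sum(hire_counts.values())

def score(gender: str) -> float:
    # The model's "confidence" that a similar candidate gets hired is
    # nothing but the historical frequency, bias included.
    return hire_counts[gender] / total

print(score("male"), score("female"))  # prints 0.9 0.1
```

Real hiring models are far more sophisticated, but the failure mode is the same: if the pattern in the data is biased, optimising for that pattern bakes the bias in.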

As a professional, you may never train a model yourself, but you will constantly evaluate whether AI outputs are reliable. The key questions to ask are: What data was this model trained on? Is the training data representative of my use case? How recent is the data? A model trained on pre-2023 data will not know about recent policy changes, market shifts, or new regulations. Always assess the data behind the AI before trusting its outputs for important decisions.

Reading AI Outputs Critically

AI hallucination — when a model generates confident but factually incorrect information — is arguably the single biggest risk in professional AI use. LLMs do not "know" facts; they predict the most probable next words based on patterns in their training data. This means they can seamlessly fabricate research papers that do not exist, attribute quotes to people who never said them, or generate plausible-sounding statistics that are completely made up.

Developing a critical reading habit for AI outputs is essential. First, verify any specific claims — numbers, dates, names, citations — through independent sources. Second, watch for hedging language: if the AI says "it is widely believed that..." or "according to various sources..." without naming specific sources, treat the claim as unverified. Third, cross-check important outputs by asking a different AI model the same question and comparing answers. If ChatGPT and Claude give different answers, dig deeper before relying on either.

Confidence calibration is another critical skill. AI models often present uncertain information with the same confident tone as well-established facts. A model might say "The Indian AI market is worth $12 billion" with the same certainty as "Water boils at 100°C." Learn to distinguish between facts the AI likely knows well (widely documented information) and claims it might be fabricating (niche statistics, recent events, specific figures). When stakes are high — job interviews, client presentations, financial reports — always verify independently.

Real-World Example

In a well-known 2023 case, a US lawyer submitted a legal brief containing six case citations generated by ChatGPT — all six were completely fabricated. The lawyer was fined and publicly reprimanded. This case is now studied in law and business schools as a cautionary tale about unverified AI outputs. Always verify before you present.

Key Takeaway

AI sees the world through data — structured and unstructured. Data quality determines output quality (garbage in, garbage out). The most critical professional skill is reading AI outputs with healthy scepticism: verify claims, watch for hallucinations, cross-check with multiple sources, and never present unverified AI-generated information as fact.