Mastering Data Formats: A Quick Guide to Reading CSV, JSON, XML, HTML, SQL, and Parquet Files | by Ankit Gupta | Jul, 2024

In data analysis, information comes in a variety of shapes and sizes, each requiring specialized processing to ensure efficient storage, transmission, and analysis. Fortunately, there is a range of standardized file formats designed to meet these goals. Whether you're working with tabular data, hierarchical structures, web content, relational databases, or big data, understanding these formats and how to read them is essential for any data professional.

1. CSV (Comma-Separated Values)

CSV files are perhaps the simplest and most widely used format for storing tabular data. Each line in a CSV file represents a row, and fields within each row are separated by commas (or other delimiters).
Example:

---typical structure of CSV data
Name,Age,City
John,28,New York
Alice,25,Los Angeles
Bob,30,Chicago

import csv

# newline='' lets the csv module handle line endings itself
with open('data.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # each row is a list of strings
        print(', '.join(row))
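Since fields can be separated by delimiters other than commas, here is a minimal sketch of reading a semicolon-delimited file with csv.DictReader, which keys each row by the header line. The inline sample and the read_people helper are illustrative assumptions, not part of the original example; io.StringIO stands in for a file on disk.

```python
import csv
import io

# Hypothetical inline sample standing in for a file; note the ';' delimiter.
sample = "Name;Age;City\nJohn;28;New York\nAlice;25;Los Angeles\nBob;30;Chicago\n"


def read_people(text, delimiter=";"):
    """Return each data row as a dict keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    people = []
    for row in reader:
        row["Age"] = int(row["Age"])  # csv yields strings, so convert explicitly
        people.append(row)
    return people


people = read_people(sample)
for person in people:
    print(f"{person['Name']} ({person['Age']}) lives in {person['City']}")
```

With a real file, the same reader works on the object returned by open('data.csv', newline='').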

2. JSON (JavaScript Object Notation)

JSON is a lightweight data interchange format that structures data in a human-readable way. It supports nested data structures, making it ideal for representing complex hierarchical data.

JSON is prevalent in web development for transmitting data between a server and a web application. It's also used for configuration files, APIs, and NoSQL databases.
Example:

---typical structure of JSON data
{
  "employees": [
    {"firstName": "John", "lastName": "Doe", "age": 30},
    {"firstName": "Anna", "lastName": "Smith", "age": 25},
    {"firstName": "Peter", "lastName": "Jones", "age": 45}
  ]
}

import json

with open('data.json') as jsonfile:
    # json.load turns the file into Python dicts and lists
    data = json.load(jsonfile)
    print(data)
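Printing the whole object is rarely the end goal; usually you walk the nested structure. The sketch below assumes a document shaped like the sample above, with a top-level "employees" list, and uses json.loads on an inline string so that it runs without a file. The variable names are illustrative.

```python
import json

# Inline sample with the same shape as the structure shown above.
sample = """
{
  "employees": [
    {"firstName": "John", "lastName": "Doe", "age": 30},
    {"firstName": "Anna", "lastName": "Smith", "age": 25},
    {"firstName": "Peter", "lastName": "Jones", "age": 45}
  ]
}
"""

data = json.loads(sample)  # dict at the top level, list of dicts inside

# Walk the nested list to build derived values.
full_names = [f"{e['firstName']} {e['lastName']}" for e in data["employees"]]
average_age = sum(e["age"] for e in data["employees"]) / len(data["employees"])

print(full_names)
print(round(average_age, 1))
```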

3. XML (eXtensible Markup Language)

XML is a markup language designed to store and transport data with a focus on simplicity and readability. It uses tags to define data elements and attributes.

XML is widely used in web services, configuration files, data interchange between different systems, and as a format for storing semi-structured data.
Example:

---typical structure of XML data
<data>
  <person>
    <name>John</name>
    <age>28</age>
    <city>New York</city>
  </person>
  <person>
    <name>Alice</name>
    <age>25</age>
    <city>Los Angeles</city>
  </person>
  <person>
    <name>Bob</name>
    <age>30</age>
    <city>Chicago</city>
  </person>
</data>

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
# iterate over every <person> element directly under the root
for person in root.findall('person'):
    name = person.find('name').text
    age = person.find('age').text
    city = person.find('city').text
    print(f"Name: {name}, Age: {age}, City: {city}")
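XML elements can also carry attributes, and real files often have missing child elements. The following sketch assumes a hypothetical id attribute on each person element (the sample above has none) and shows Element.get for attributes and findtext with a default for optional children. ET.fromstring parses an inline string so the example runs without a file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the id attribute and the missing <city> are assumptions.
sample = """
<data>
  <person id="1"><name>John</name><age>28</age><city>New York</city></person>
  <person id="2"><name>Alice</name><age>25</age></person>
</data>
"""

root = ET.fromstring(sample)

records = []
for person in root.findall("person"):
    records.append({
        "id": person.get("id"),                # attribute lookup
        "name": person.findtext("name"),       # text of a child element
        "age": int(person.findtext("age")),    # element text is always a string
        "city": person.findtext("city", default="unknown"),  # optional child
    })

for record in records:
    print(record)
```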

4. HTML (HyperText Markup Language)

HTML is the standard markup language for creating web pages and web applications. It structures content using tags, defining elements such as headings, paragraphs, and lists.
Example:

---typical structure of HTML data
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Sample HTML Page</title>
</head>
<body>
  <h1>Hello, World!</h1>
  <p>This is a sample HTML page.</p>
</body>
</html>

from bs4 import BeautifulSoup

with open('index.html') as html_file:
    # 'html.parser' is Python's built-in parser, so no extra parser is needed
    soup = BeautifulSoup(html_file, 'html.parser')
    print(soup.prettify())
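If BeautifulSoup is not installed, the standard library's html.parser module can extract text from selected tags on its own. This is a minimal sketch rather than a replacement for a full scraping library: TextCollector is a hypothetical helper written for this example, and it assumes the wanted tags are not nested inside one another.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect (tag, text) pairs for a chosen set of tags (hypothetical helper)."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # tag we are currently inside, if any
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # keep only non-empty text that appears inside a wanted tag
        if self.current and data.strip():
            self.found.append((self.current, data.strip()))


sample = """<!DOCTYPE html>
<html lang="en">
<body>
  <h1>Hello, World!</h1>
  <p>This is a sample HTML page.</p>
</body>
</html>"""

collector = TextCollector(["h1", "p"])
collector.feed(sample)
collector.close()
print(collector.found)
```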

5. SQL (Structured Query Language)

SQL is a domain-specific language designed for managing data held in a relational database management system (RDBMS).
Example:

---typical structure of SQL data
CREATE TABLE employees (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  age INTEGER,
  city TEXT
);
INSERT INTO employees (name, age, city) VALUES ('John', 28, 'New York');
INSERT INTO employees (name, age, city) VALUES ('Alice', 25, 'Los Angeles');
INSERT INTO employees (name, age, city) VALUES ('Bob', 30, 'Chicago');

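import sqlite3

# assumes example.db already contains the employees table
conn = sqlite3.connect('example.db')
cursor = conn.cursor()
cursor.execute('SELECT * FROM employees')
rows = cursor.fetchall()
for row in rows:
    # each row is a tuple of column values
    print(row)
conn.close()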
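The snippet above only works if the database file already holds the table. As a self-contained sketch, the following builds the same table in an in-memory SQLite database, inserts the three sample rows with parameter placeholders, and runs a filtered query. The age threshold and the choice of ':memory:' are assumptions made for illustration.

```python
import sqlite3

# ':memory:' creates a throwaway database, so no file is needed.
conn = sqlite3.connect(":memory:")
try:
    conn.execute(
        """
        CREATE TABLE employees (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            age INTEGER,
            city TEXT
        )
        """
    )
    # '?' placeholders let sqlite3 handle quoting and avoid SQL injection.
    conn.executemany(
        "INSERT INTO employees (name, age, city) VALUES (?, ?, ?)",
        [
            ("John", 28, "New York"),
            ("Alice", 25, "Los Angeles"),
            ("Bob", 30, "Chicago"),
        ],
    )
    conn.commit()

    # Select only the columns and rows of interest.
    rows = conn.execute(
        "SELECT name, city FROM employees WHERE age >= ? ORDER BY name",
        (28,),
    ).fetchall()
finally:
    conn.close()

for row in rows:
    print(row)
```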

6. Parquet

Parquet is a columnar storage file format optimized for use with big data processing frameworks. It is efficient for both storing and processing large datasets.

Parquet files are commonly used in big data environments such as Apache Hadoop, Apache Spark, and other data processing frameworks. They are well suited to analytical queries and data warehousing because of their efficient storage and retrieval.
Example:

'''Parquet data itself isn't human-readable like CSV or JSON. It's a binary
format designed for efficient storage and retrieval by computers. Since we
can't directly view the binary data, the following is just a representation
for understanding.'''

data.parquet
| Name  | Age | City       |
|-------|-----|------------|
| John  | 28  | New York   |
| Alice | 25  | Los Angeles|
| Bob   | 30  | Chicago    |

import pandas as pd

# read_parquet needs a Parquet engine such as pyarrow or fastparquet installed
df = pd.read_parquet('data.parquet')
print(df)

Finally, learning the nuances of these basic file formats (CSV, JSON, XML, HTML, SQL, and Parquet) enables data professionals to navigate the diverse world of data analysis with confidence. Knowing how to read and use these formats is essential when working with structured tabular data, parsing hierarchical information, managing web content, querying relational databases, or processing large datasets.

Thanks for your time!
