Description

URL canonicalization library for semantic link identity.

A Haskell library that converts arbitrary URLs into canonical, semantically stable identifiers. Handles redirect resolution, tracking parameter removal, and domain-specific normalization for platforms like YouTube, Amazon, Twitter/X, and more.

link-canonical

A Haskell library for converting arbitrary URLs into canonical, semantically stable identifiers. Enables URL deduplication and stable link identity in systems that ingest URLs from multiple sources.

Features

  • Tracking parameter removal - Strips UTM parameters, ad click IDs (gclid, fbclid, msclkid), and other marketing trackers
  • Redirect chain resolution - Follows URL shorteners and redirects to find the final destination
  • Domain-specific normalization - Intelligent rules for YouTube, Amazon, Twitter/X, GitHub, Instagram, and Reddit
  • RFC 3986 compliance - Proper dot segment normalization, percent-encoding, and path handling
  • Security features - Private IP blocking (SSRF prevention), HTTPS downgrade protection, redirect loop detection
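As an illustration of the dot segment normalization mentioned above, here is a minimal sketch in the spirit of RFC 3986 section 5.2.4. It is not the library's implementation: `splitOnSlash` and `removeDotSegments` are hypothetical names, and the sketch assumes an absolute path (one that starts with `/`).

```haskell
import Data.List (intercalate)

-- | Split a path on '/' (hypothetical helper, not part of link-canonical).
splitOnSlash :: String -> [String]
splitOnSlash s = case break (== '/') s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOnSlash rest

-- | Remove "." and ".." segments from an absolute path.
-- Assumption: the input starts with '/'; relative paths are out of scope.
removeDotSegments :: String -> String
removeDotSegments path = intercalate "/" ("" : reverse kept ++ trailing)
  where
    -- Drop the empty segment that precedes the leading '/'.
    segments = drop 1 (splitOnSlash path)
    -- Walk the segments, using the accumulator as a stack.
    kept = foldl step [] segments
    step acc "."  = acc          -- "." refers to the current directory
    step acc ".." = drop 1 acc   -- ".." pops the previous segment, if any
    step acc seg  = seg : acc
    -- A path ending in "." or ".." still names a directory, so keep a '/'.
    trailing
      | null segments                    = []
      | last segments `elem` [".", ".."] = [""]
      | otherwise                        = []
```

With this sketch, `removeDotSegments "/a/b/c/./../../g"` evaluates to `"/a/g"`, which matches the worked example in the RFC.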

Installation

Add to your cabal file:

build-depends:
    link-canonical

Or with Nix flakes:

{
  inputs.link-canonical.url = "github:shinzui/link-canonical";
}

Usage

Quick Start

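{-# LANGUAGE OverloadedStrings #-}

import Link.Canonical
import Text.URI (mkURI)

main :: IO ()
main = do
  -- mkURI takes Text, so the string literal needs OverloadedStrings.
  -- Run in IO, it throws an exception if the URL fails to parse.
  uri <- mkURI "https://youtu.be/dQw4w9WgXcQ?utm_source=twitter"
  result <- normalizeWithDefaults uri
  case result of
    Left err -> print err
    Right canonical -> print canonical
    -- Canonical URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ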

API Layers

The library provides three layers of functionality:

-- Pure normalization (no IO, no redirects)
normalizeUri :: NormConfig -> [DomainRule] -> URI -> Either NormError URI

-- With redirect resolution (IO)
normalizeLink :: NormConfig -> URI -> IO (Either NormError NormResult)

-- Convenient defaults
normalizeWithDefaults :: URI -> IO (Either NormError NormResult)

Configuration

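{-# LANGUAGE OverloadedLabels #-}

import Data.Function ((&))
import qualified Data.Set as Set
import Link.Canonical
import Link.Canonical.Types
import Link.Canonical.Config

-- The #label optics and (.~) below need a lens library in scope
-- (for example lens with generic-lens). Which one is an assumption,
-- since this snippet does not show it.

-- Customize configuration
customConfig :: NormConfig
customConfig = defaultConfig
  & #redirects . #maxRedirects .~ 5   -- default is 10
  & #redirects . #timeout .~ 30       -- default is 10s
  & #tracking . #allowlist .~ Set.fromList ["ref"]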

Configuration Options

| Option | Default | Description |
|---|---|---|
| redirects.maxRedirects | 10 | Maximum redirect hops |
| redirects.timeout | 10s | Request timeout |
| redirects.allowDowngrade | False | Allow HTTPS to HTTP redirects |
| redirects.blockPrivateIPs | True | Block private/local IPs (SSRF protection) |
| tracking.denyPatterns | See below | Patterns for tracking parameters |
| stripFragment | True | Remove URL fragments |
| sortParams | True | Sort query parameters alphabetically |
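
To give a feel for what the `redirects.blockPrivateIPs` option guards against, the following sketch classifies IPv4 addresses as private or local. It is an illustration only: the `IPv4` type and `isPrivateIPv4` are hypothetical names, the range list is a common baseline rather than the library's actual list, and a real SSRF check would also need to cover IPv6 and inspect the address a hostname resolves to.

```haskell
import Data.Word (Word8)

-- | An IPv4 address as four octets (illustrative type, not the library's).
type IPv4 = (Word8, Word8, Word8, Word8)

-- | True for addresses that should not be reachable via a followed redirect.
isPrivateIPv4 :: IPv4 -> Bool
isPrivateIPv4 (a, b, _, _) =
     a == 10                           -- 10.0.0.0/8     (RFC 1918)
  || (a == 172 && b >= 16 && b <= 31)  -- 172.16.0.0/12  (RFC 1918)
  || (a == 192 && b == 168)            -- 192.168.0.0/16 (RFC 1918)
  || a == 127                          -- 127.0.0.0/8    loopback
  || (a == 169 && b == 254)            -- 169.254.0.0/16 link-local
  || a == 0                            -- 0.0.0.0/8      "this network"
```

A redirect resolver would run a check like this on every hop, not only on the initial URL, since a public shortener can redirect to an internal address.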

Domain Rules

YouTube

All YouTube URL formats normalize to a consistent watch URL:

| Input | Output |
|---|---|
| youtu.be/dQw4w9WgXcQ | youtube.com/watch?v=dQw4w9WgXcQ |
| youtube.com/embed/dQw4w9WgXcQ | youtube.com/watch?v=dQw4w9WgXcQ |
| youtube.com/shorts/dQw4w9WgXcQ | youtube.com/watch?v=dQw4w9WgXcQ |
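
The mapping in the table can be sketched as a pure function over an already-parsed URL. This is not the code in `Rules/YouTube.hs`: `youtubeVideoId` and `canonicalWatchUrl` are hypothetical names, the URL is modeled as a host, a list of path segments, and query pairs, and the `m.youtube.com` host is an assumption added for illustration.

```haskell
-- | Extract the video ID from the URL shapes listed in the table.
-- Arguments: host, path segments, query parameters.
youtubeVideoId :: String -> [String] -> [(String, String)] -> Maybe String
youtubeVideoId host segs query = case (host, segs) of
  ("youtu.be", [vid])                -> Just vid
  (h, ["embed", vid])  | isYouTube h -> Just vid
  (h, ["shorts", vid]) | isYouTube h -> Just vid
  (h, ["watch"])       | isYouTube h -> lookup "v" query
  _                                  -> Nothing
  where
    -- Assumed host list; only youtube.com appears in the table.
    isYouTube h = h `elem` ["youtube.com", "www.youtube.com", "m.youtube.com"]

-- | Build the watch URL in the form shown in the Quick Start output.
canonicalWatchUrl :: String -> String
canonicalWatchUrl vid = "https://www.youtube.com/watch?v=" ++ vid
```

For example, `fmap canonicalWatchUrl (youtubeVideoId "youtu.be" ["dQw4w9WgXcQ"] [])` evaluates to `Just "https://www.youtube.com/watch?v=dQw4w9WgXcQ"`.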

Amazon

Amazon product URLs preserve the regional TLD and normalize to the canonical /dp/{ASIN} format:

| Input | Output |
|---|---|
| amazon.com/Some-Product/dp/B08N5WRWNW/ref=sr_1_1 | amazon.com/dp/B08N5WRWNW |
| amazon.co.uk/dp/B08N5WRWNW | amazon.co.uk/dp/B08N5WRWNW |
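
One way to picture the /dp/{ASIN} rewrite is to scan the path segments for `dp` followed by something that looks like an ASIN. The sketch below is an assumption about how such a rule could work, not the contents of `Rules/Amazon.hs`; `looksLikeAsin`, `findAsin`, and `canonicalAmazonPath` are hypothetical names, and the ASIN shape (10 uppercase letters or digits) is a simplification.

```haskell
import Data.Char (isAsciiUpper, isDigit)

-- | Simplified ASIN check: exactly 10 uppercase ASCII letters or digits.
looksLikeAsin :: String -> Bool
looksLikeAsin s = length s == 10 && all (\c -> isDigit c || isAsciiUpper c) s

-- | Find the first path segment that follows "dp" and looks like an ASIN.
findAsin :: [String] -> Maybe String
findAsin ("dp" : code : _) | looksLikeAsin code = Just code
findAsin (_ : rest) = findAsin rest
findAsin [] = Nothing

-- | Canonical path for a product URL, or Nothing for non-product pages.
-- The host (and so the regional TLD) is left untouched by this step.
canonicalAmazonPath :: [String] -> Maybe String
canonicalAmazonPath segs = fmap ("/dp/" ++) (findAsin segs)
```

Applied to the first row of the table, `canonicalAmazonPath ["Some-Product", "dp", "B08N5WRWNW", "ref=sr_1_1"]` evaluates to `Just "/dp/B08N5WRWNW"`.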

Twitter/X

Twitter URLs normalize to X.com:

| Input | Output |
|---|---|
| twitter.com/user/status/123 | x.com/user/status/123 |
| x.com/user/status/123?s=20 | x.com/user/status/123 |
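
Both rows of the table can be reproduced with a host rewrite plus removal of the `s` share parameter. The following is a sketch under assumptions, not the code in `Rules/Twitter.hs`: `twitterAliases`, `canonicalHost`, `dropShareParam`, and `render` are hypothetical names, and only `twitter.com` is confirmed by the table as an alias.

```haskell
import Data.List (intercalate)

-- | Hosts treated as aliases of x.com. Only twitter.com appears in the
-- table; the other entries are assumptions for illustration.
twitterAliases :: [String]
twitterAliases = ["twitter.com", "www.twitter.com", "mobile.twitter.com", "www.x.com"]

-- | Rewrite known aliases to x.com and leave other hosts alone.
canonicalHost :: String -> String
canonicalHost host
  | host `elem` twitterAliases = "x.com"
  | otherwise                  = host

-- | Drop the "s" share parameter, keeping everything else in order.
dropShareParam :: [(String, String)] -> [(String, String)]
dropShareParam = filter (\(name, _) -> name /= "s")

-- | Rebuild a scheme-less URL string from host, path, and query pairs.
render :: String -> String -> [(String, String)] -> String
render host path query =
  canonicalHost host ++ path ++ renderQuery (dropShareParam query)
  where
    renderQuery [] = ""
    renderQuery ps = "?" ++ intercalate "&" [k ++ "=" ++ v | (k, v) <- ps]
```

With these definitions, `render "x.com" "/user/status/123" [("s", "20")]` evaluates to `"x.com/user/status/123"`.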

GitHub

GitHub URLs preserve meaningful fragments (line numbers):

| Input | Output |
|---|---|
| github.com/owner/repo/blob/main/file.hs#L10-L20 | github.com/owner/repo/blob/main/file.hs#L10-L20 |
| github.com/owner/repo?tab=readme | github.com/owner/repo |
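
Since fragments are stripped by default (`stripFragment` is True), a GitHub rule has to decide which fragments are meaningful. The sketch below shows one possible test for line-number anchors such as `L10` or `L10-L20`. It is an assumption about the approach, not the code in `Rules/GitHub.hs`; `isLineAnchor` and `keepFragment` are hypothetical names.

```haskell
import Data.Char (isDigit)

-- | True for fragments of the form "L<digits>" or "L<digits>-L<digits>".
isLineAnchor :: String -> Bool
isLineAnchor frag = case break (== '-') frag of
  (start, "")      -> isLineRef start
  (start, _ : end) -> isLineRef start && isLineRef end
  where
    isLineRef ('L' : ds) = not (null ds) && all isDigit ds
    isLineRef _          = False

-- | Keep a fragment only when it is a line anchor; drop anything else.
keepFragment :: Maybe String -> Maybe String
keepFragment (Just f) | isLineAnchor f = Just f
keepFragment _ = Nothing
```

Here `keepFragment (Just "L10-L20")` evaluates to `Just "L10-L20"`, while `keepFragment (Just "readme")` evaluates to `Nothing`.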

Instagram

Instagram URLs normalize subdomains:

| Input | Output |
|---|---|
| instagram.com/p/ABC123 | www.instagram.com/p/ABC123 |

Reddit

Reddit URLs normalize to the main domain:

| Input | Output |
|---|---|
| old.reddit.com/r/haskell/comments/abc | www.reddit.com/r/haskell/comments/abc |

Tracking Parameters

The following tracking parameters are removed by default:

  • Google Analytics: utm_source, utm_medium, utm_campaign, utm_term, utm_content, _ga, _gl, _gid
  • Ad Platforms: gclid, fbclid, msclkid, dclid
  • Marketing: mc_*, oly_*, _hsenc, _hsmi, mkt_tok
  • Social: igshid, si, ref, source
  • Other: zanpid
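
The list above mixes exact names with prefix patterns (`mc_*`, `oly_*`), so a filter needs both kinds of match. The following sketch shows one way to express that over query pairs. It is not the code in `Tracking.hs`: all names are hypothetical, and the allowlist semantics (an allowlisted name is always kept) is an assumption based on the Configuration example.

```haskell
import Data.List (isPrefixOf)

-- | Parameter names removed on exact match (copied from the list above).
exactNames :: [String]
exactNames =
  [ "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"
  , "_ga", "_gl", "_gid"
  , "gclid", "fbclid", "msclkid", "dclid"
  , "_hsenc", "_hsmi", "mkt_tok"
  , "igshid", "si", "ref", "source"
  , "zanpid"
  ]

-- | Prefixes standing in for the mc_* and oly_* patterns.
prefixes :: [String]
prefixes = ["mc_", "oly_"]

-- | True when a parameter name matches an exact name or a prefix pattern.
isTracking :: String -> Bool
isTracking name = name `elem` exactNames || any (`isPrefixOf` name) prefixes

-- | Remove tracking parameters, except those named in the allowlist.
stripTracking :: [String] -> [(String, String)] -> [(String, String)]
stripTracking allowlist = filter keep
  where
    keep (name, _) = name `elem` allowlist || not (isTracking name)
```

For instance, `stripTracking ["ref"] [("ref", "a"), ("fbclid", "b")]` evaluates to `[("ref", "a")]`: the allowlisted `ref` survives and `fbclid` is dropped.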

Development

Prerequisites

  • GHC 9.12+
  • Cabal 3.0+
  • Or Nix (recommended)

Setup with Nix

# Enter development shell
nix develop

# Build
cabal build

# Run tests
cabal test

# Format code
treefmt

Setup without Nix

# Build
cabal build

# Run tests
cabal test

Project Structure

src/Link/Canonical/
├── Canonical.hs      # Main entry point
├── Types.hs          # Core types
├── Config.hs         # Default configuration
├── Normalize.hs      # Generic URL normalization
├── Redirect.hs       # Redirect resolution
├── Tracking.hs       # Tracking parameter stripping
└── Rules/            # Domain-specific rules
    ├── YouTube.hs
    ├── Amazon.hs
    ├── Twitter.hs
    ├── GitHub.hs
    ├── Instagram.hs
    └── Reddit.hs

Testing

cabal test

The test suite includes:

  • Generic normalization tests (scheme, host, port, path, query, fragment)
  • Tracking parameter removal tests
  • Redirect resolution tests (including loop detection, timeout handling)
  • Edge case tests (empty URLs, special characters, encoding)
  • Domain-specific rule tests for each supported platform
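
Redirect behaviour such as loop detection and the hop limit can be exercised without network access by modeling redirects as a lookup table. The sketch below is a pure model written for illustration, not the library's resolver or its test code; `FollowError` and `follow` are hypothetical names.

```haskell
-- | Ways the simulated redirect chase can fail.
data FollowError
  = RedirectLoop String   -- ^ the URL that was visited twice
  | TooManyRedirects
  deriving (Eq, Show)

-- | Follow redirects in a table of (from, to) pairs.
-- Stops at the first URL with no entry, which is the final destination.
follow :: Int -> [(String, String)] -> String -> Either FollowError String
follow maxHops table = go 0 []
  where
    go hops seen url
      | url `elem` seen = Left (RedirectLoop url)
      | otherwise = case lookup url table of
          Nothing -> Right url
          Just next
            | hops >= maxHops -> Left TooManyRedirects
            | otherwise       -> go (hops + 1) (url : seen) next
```

With a two-entry cycle, `follow 10 [("a", "b"), ("b", "a")] "a"` evaluates to `Left (RedirectLoop "a")`, and with a chain longer than the limit it evaluates to `Left TooManyRedirects`.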

License

MIT License - see LICENSE for details.

Copyright 2025 Nadeem Bitar.

Metadata

Version

0.1.0.0

License

MIT

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows