Repair Malformed JSON Strings.

Description

Repairs malformed JSON strings, particularly those generated by Large Language Models. Handles missing quotes, trailing commas, unquoted keys, and other common JSON syntax errors.

llmjson

llmjson repairs malformed JSON strings, particularly those generated by Large Language Models (LLMs). It uses Rust for fast, reliable JSON repair based on a vendored and bug-fixed version of the llm_json crate.

Features

Repairs missing quotes around keys and values
Handles trailing commas
Fixes unquoted keys
Repairs incomplete arrays and objects
Converts single quotes to double quotes
Removes extra non-JSON characters
Auto-completes missing values with sensible defaults
Returns R objects directly with return_objects = TRUE
Schema validation and type conversion with intuitive schema builders
Control field presence with .required and use .default for missing required fields

Installation

You can install the development version of llmjson from GitHub:

# install.packages("remotes")
remotes::install_github("DyfanJones/llmjson")

Or from r-universe:

install.packages('llmjson', repos = c('https://dyfanjones.r-universe.dev', 'https://cloud.r-project.org'))

System Requirements

This package requires the Rust toolchain to be installed on your system. If you don't have Rust installed:

Install from https://rust-lang.org/tools/install/
Minimum required version: Rust 1.65.0
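
As a quick pre-install check, the sketch below looks for rustc on the PATH and compares its reported version against the 1.65.0 minimum stated above. The helper rust_version_ok() is hypothetical (it is not part of llmjson), and the sketch assumes the usual "rustc X.Y.Z (...)" output format of rustc --version.

```r
# Hypothetical helper (not part of llmjson): TRUE when a `rustc --version`
# line reports at least `minimum`.
rust_version_ok <- function(version_line, minimum = "1.65.0") {
  # Pull "1.75.0" out of e.g. "rustc 1.75.0 (82e1608df 2023-12-21)"
  version <- sub("^rustc ([0-9]+\\.[0-9]+\\.[0-9]+).*$", "\\1", version_line)
  numeric_version(version) >= minimum
}

# Sys.which() returns "" for tools that are not on the PATH
rustc_path <- Sys.which("rustc")
if (nzchar(rustc_path)) {
  version_line <- system2("rustc", "--version", stdout = TRUE)
  message(version_line, if (rust_version_ok(version_line)) " (OK)" else " (too old)")
} else {
  message("rustc not found on PATH; install Rust first")
}
```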

Usage

Basic JSON Repair

library(llmjson)

# Repair JSON with trailing comma
repair_json_str('{"key": "value",}')
#> [1] "{\"key\":\"value\"}"

# Repair JSON with unquoted keys
repair_json_str('{key: "value"}')
#> [1] "{\"key\":\"value\"}"

# Repair incomplete JSON
repair_json_str('{"name": "John", "age": 30')
#> [1] "{\"name\":\"John\",\"age\":30}"

# Repair JSON with single quotes
repair_json_str("{'name': 'John'}")
#> [1] "{\"name\":\"John\"}"

Return R Objects Directly

Instead of returning a JSON string, you can get R objects directly:

# Return as R list instead of JSON string
result <- repair_json_str('{"name": "Alice", "age": 30}', return_objects = TRUE)
result
#> $name
#> [1] "Alice"
#>
#> $age
#> [1] 30

# Works with all repair functions
result <- repair_json_file("data.json", return_objects = TRUE)

Handling Large Integers (64-bit)

JSON integers outside R's 32-bit integer range (-2,147,483,647 to 2,147,483,647; R reserves -2,147,483,648 for NA_integer_) need special handling. The int64 parameter controls how these large integers are converted:

json_str <- '{"id": 9007199254740993}'

# Option 1: "double" (default) - Convert to R numeric (may lose precision)
result <- repair_json_str(json_str, return_objects = TRUE, int64 = "double")
result$id
#> [1] 9.007199e+15  # Lost precision: stored as 9007199254740992, not 9007199254740993

# Option 2: "string" - Preserve exact value as character
result <- repair_json_str(json_str, return_objects = TRUE, int64 = "string")
result$id
#> [1] "9007199254740993"  # Exact value preserved

# Option 3: "bit64" - Use bit64 package for true 64-bit integers
# Requires: install.packages("bit64")
result <- repair_json_str(json_str, return_objects = TRUE, int64 = "bit64")
result$id
#> integer64
#> [1] 9007199254740993  # Exact value preserved with integer type

Which option should I use?

Use "double" (default) if your integers fit safely in double precision and you don't need exact integer arithmetic
Use "string" if you need to preserve exact values and plan to pass them to other systems
Use "bit64" if you need exact integer arithmetic on large integers in R
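
The precision limits behind these options can be seen in base R alone, without llmjson. The sketch below only illustrates the number types involved; it does not call the package.

```r
# R's native integer type is 32-bit
int_max <- .Machine$integer.max
int_max
#> [1] 2147483647

# Doubles represent every integer exactly only up to 2^53, so 2^53 + 1
# collapses onto 2^53
collapses <- (2^53 + 1) == 2^53
collapses
#> [1] TRUE

# The id from the example above, stored as a double, lands on its neighbour
as_double <- sprintf("%.0f", 9007199254740993)
as_double
#> [1] "9007199254740992"
```

If your identifiers stay below 2^53 (about 9.0e15), "double" stores them exactly; above that, "string" or "bit64" is the safer choice.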

Schema Validation and Type Conversion

Define schemas to validate JSON structure and ensure correct R types. The schema system is inspired by the structr package and provides an intuitive way to define expected JSON structures:

# Define a schema for a user object
schema <- json_object(
  name = json_string(),
  age = json_integer(),
  email = json_string()
)

# Repair and validate with schema
result <- repair_json_str(
  '{"name": "Alice", "age": "30", "email": "alice@example.com"}',
  schema = schema,
  return_objects = TRUE
)

# Note: age is coerced from string "30" to integer 30
str(result)
#> List of 3
#>  $ name : chr "Alice"
#>  $ age  : int 30
#>  $ email: chr "alice@example.com"

Required vs Optional Fields and Default Values

Control how missing fields are handled with .required and .default parameters:

Required fields (.required = TRUE):

Missing fields are added with their .default value (or their type's default if no explicit default)
Always appear in the output

Optional fields (.required = FALSE, the default):

Missing fields are omitted entirely from the output
Only appear if present in the input JSON

# Example 1: Required field with explicit default
schema <- json_object(
  name = json_string(.required = TRUE),
  age = json_integer(.default = 25L, .required = TRUE)  # required, will use default if missing
)

result <- repair_json_str('{"name": "Alice"}', schema = schema, return_objects = TRUE)
result
#> $name
#> [1] "Alice"
#>
#> $age
#> [1] 25

# Example 2: Optional field (omitted when missing)
schema <- json_object(
  name = json_string(.required = TRUE),
  nickname = json_string(.required = FALSE)  # optional, omitted if not in input
)

result <- repair_json_str('{"name": "Bob"}', schema = schema, return_objects = TRUE)
result
#> $name
#> [1] "Bob"
# Note: nickname is not present since it was optional and missing from input

# Example 3: Required field with type default
schema <- json_object(
  name = json_string(.required = TRUE),
  age = json_integer(.required = TRUE)  # required, will use type default (0L) if missing
)

result <- repair_json_str('{"name": "Charlie"}', schema = schema, return_objects = TRUE)
result
#> $name
#> [1] "Charlie"
#>
#> $age
#> [1] 0

Nested Schemas and Arrays

Build complex schemas with nested objects and arrays:

# Schema with nested object and array
schema <- json_object(
  name = json_string(),
  address = json_object(
    city = json_string(),
    zip = json_integer()
  ),
  scores = json_array(json_integer())
)

json_str <- '{
  "name": "Alice",
  "address": {"city": "NYC", "zip": "10001"},
  "scores": [90, 85, 95]
}'

result <- repair_json_str(json_str, schema = schema, return_objects = TRUE)
str(result)
#> List of 3
#>  $ name   : chr "Alice"
#>  $ address:List of 2
#>   ..$ city: chr "NYC"
#>   ..$ zip : int 10001
#>  $ scores : int [1:3] 90 85 95

Build Schemas for Better Performance

For repeated use with the same schema, use json_schema() to compile the schema once and reuse it many times.

# Define your schema
schema <- json_object(
  name = json_string(),
  age = json_integer(),
  email = json_string()
)

# Build it once - this creates an optimized internal representation
built_schema <- json_schema(schema)

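# Example input; in practice, a character vector of JSON strings to process
json_strings <- c('{"name": "Alice", "age": 30,}', "{'name': 'Bob', 'age': 25}")

# Reuse many times - much faster!
for (json_str in json_strings) {
  result <- repair_json_str(json_str, schema = built_schema, return_objects = TRUE)
  # Process result...
}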

Performance comparison (complex nested schema):

Without json_schema(): ~266µs per call
With json_schema(): ~51µs per call (5.2x faster)
No schema: ~44µs per call

The performance benefit is especially significant for:

Complex nested schemas with multiple levels
Batch processing of many JSON strings
Performance-critical applications
Real-time data processing pipelines
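
To check the difference on your own machine, a rough timing sketch along these lines may help. The helper time_per_call() is hypothetical (not part of llmjson), the schema is deliberately small, and the numbers you get will differ from the figures quoted above.

```r
# Hypothetical helper: average wall-clock seconds per call of f() over n runs
time_per_call <- function(f, n = 1000L) {
  elapsed <- system.time(for (i in seq_len(n)) f())[["elapsed"]]
  elapsed / n
}

# Only run the comparison when llmjson is installed
if (requireNamespace("llmjson", quietly = TRUE)) {
  library(llmjson)

  schema <- json_object(name = json_string(), age = json_integer())
  built_schema <- json_schema(schema)
  json_str <- '{"name": "Alice", "age": "30",}'

  timings <- c(
    unbuilt = time_per_call(function() {
      repair_json_str(json_str, schema = schema, return_objects = TRUE)
    }),
    built = time_per_call(function() {
      repair_json_str(json_str, schema = built_schema, return_objects = TRUE)
    })
  )
  print(timings)
}
```

For more reliable measurements, a dedicated benchmarking package would be a better fit than system.time().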

Repair JSON from Files

# Read and repair JSON from a file
repair_json_file("malformed.json")

# With schema validation
schema <- json_object(
  name = json_string(.required = TRUE),
  age = json_integer(.default = 25L, .required = TRUE)  # required field with default
)
result <- repair_json_file("data.json", schema = schema, return_objects = TRUE)

Repair JSON from Raw Bytes

# Repair JSON from raw byte vector
raw_data <- charToRaw('{"key": "value",}')
repair_json_raw(raw_data)
#> [1] "{\"key\":\"value\"}"

# With return_objects
result <- repair_json_raw(raw_data, return_objects = TRUE)

Repair JSON from Connections

Read and repair JSON from any R connection (files, URLs, pipes, compressed files, etc.):

# Read from a file connection
conn <- file("malformed.json", "r")
result <- repair_json_conn(conn)
close(conn)

# Read from a URL
conn <- url("https://api.example.com/data.json")
result <- repair_json_conn(conn, return_objects = TRUE)
close(conn)

# Read from a compressed file
conn <- gzfile("data.json.gz", "r")
result <- repair_json_conn(conn, return_objects = TRUE, int64 = "string")
close(conn)

# Use local() with on.exit() to ensure the connection is closed automatically
result <- local({
  conn <- file("malformed.json", "r")
  on.exit(close(conn))
  repair_json_conn(conn, return_objects = TRUE)
})
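
If you open connections in several places, the close-on-exit pattern can be wrapped in a small function. The sketch below uses a hypothetical helper, with_conn(), that is not part of llmjson. It demonstrates the pattern with base R's readLines() and then, if llmjson is installed, with repair_json_conn().

```r
# Hypothetical helper: apply f to an open connection and close it afterwards,
# even if f() raises an error
with_conn <- function(conn, f, ...) {
  on.exit(close(conn), add = TRUE)
  f(conn, ...)
}

# Write a small malformed JSON file to a temporary location
tmp <- tempfile(fileext = ".json")
writeLines('{"key": "value",}', tmp)

# Base R demonstration: read the file back through a connection
lines <- with_conn(file(tmp, "r"), readLines)

# With llmjson installed, the same wrapper can drive repair_json_conn()
if (requireNamespace("llmjson", quietly = TRUE)) {
  repaired <- with_conn(file(tmp, "r"), llmjson::repair_json_conn)
  print(repaired)
}

unlink(tmp)
```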

Use Case: Working with LLM Outputs

Large Language Models often generate JSON that is almost correct but has minor syntax errors. This package helps you handle those cases gracefully:

# LLM might output JSON with trailing commas and unquoted keys
llm_output <- '{
  users: [
    {name: "Alice", age: 30,},
    {name: "Bob", age: 25,},
  ],
}'

# Option 1: Repair and parse with your chosen JSON parser (e.g., jsonlite)
repaired <- repair_json_str(llm_output)
(parsed <- jsonlite::fromJSON(repaired))
#> $users
#>   age  name
#> 1  30 Alice
#> 2  25   Bob

# Option 2: Use schema with return_objects for type safety
schema <- json_object(
  users = json_array(json_object(
    name = json_string(),
    age = json_integer()
  ))
)

result <- repair_json_str(llm_output, schema = schema, return_objects = TRUE)
str(result)
#> List of 1
#>  $ users:List of 2
#>   ..$ :List of 2
#>   .. ..$ name: chr "Alice"
#>   .. ..$ age : int 30
#>   ..$ :List of 2
#>   .. ..$ name: chr "Bob"
#>   .. ..$ age : int 25

Available Functions

Repair Functions

All repair functions support the schema, return_objects, ensure_ascii, and int64 parameters:

repair_json_str(json_str, schema = NULL, return_objects = FALSE, ensure_ascii = TRUE, int64 = "double") - Repair a malformed JSON string
repair_json_file(path, schema = NULL, return_objects = FALSE, ensure_ascii = TRUE, int64 = "double") - Read and repair JSON from a file
repair_json_raw(raw_bytes, schema = NULL, return_objects = FALSE, ensure_ascii = TRUE, int64 = "double") - Repair JSON from a raw byte vector
repair_json_conn(conn, schema = NULL, return_objects = FALSE, ensure_ascii = TRUE, int64 = "double") - Read and repair JSON from an R connection (file, URL, pipe, etc.)

Parameters:

schema - Optional schema definition (R list from json_object(), etc.) or built schema (from json_schema())
return_objects - If TRUE, returns R objects instead of JSON strings
ensure_ascii - If TRUE (default), escape non-ASCII characters in the output JSON
int64 - Policy for handling 64-bit integers: "double" (default), "string", or "bit64"

Schema Functions

json_schema(schema) - Compile a schema definition for efficient reuse (about 5x faster per call in the benchmark above)
json_object(..., .required) - Define a JSON object with named fields
json_integer(.default, .required) - Integer field (default: 0L)
json_number(.default, .required) - Number/numeric field (default: 0.0)
json_string(.default, .required) - String field (default: "")
json_boolean(.default, .required) - Boolean field (default: FALSE)
json_enum(.values, .default, .required) - Enum field with allowed values (default: first value)
json_date(.default, .format, .required) - Date field with format specification
json_timestamp(.default, .format, .tz, .required) - POSIXct datetime field
json_array(items, .required) - Array with specified item type
json_any(.required) - Accept any JSON type
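
The sketch below combines several of these builders in one schema. The field names and input are made up for illustration, and no output is shown because the exact coercion results have not been verified here. Only the argument names listed above are used.

```r
# Malformed input: unquoted keys, single quotes, trailing comma
task_json <- "{title: 'Write docs', priority: 'high', score: 4.5, tags: ['r', 'json'],}"

if (requireNamespace("llmjson", quietly = TRUE)) {
  library(llmjson)

  # Illustrative schema mixing several field types from the list above
  task_schema <- json_object(
    title    = json_string(.required = TRUE),
    priority = json_enum(.values = c("low", "medium", "high"), .required = TRUE),
    score    = json_number(),
    done     = json_boolean(.default = FALSE, .required = TRUE),
    tags     = json_array(json_string()),
    extra    = json_any()
  )

  task <- repair_json_str(task_json, schema = task_schema, return_objects = TRUE)
  str(task)
}
```

Based on the required/optional rules described earlier, done should be filled with its default and extra should be omitted, since neither appears in the input.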

Comparison with Similar Packages

While R has several JSON parsing packages such as jsonlite, they typically fail on malformed JSON. llmjson is designed specifically to handle the errors LLMs commonly make when generating JSON, which makes it well suited for:

Processing LLM API responses
Parsing structured data from AI-generated text
Building robust data pipelines with LLM integrations
Working with JSON data from web scraping or unreliable sources
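
One way to combine the two approaches is to parse strictly first and repair only when that fails. The sketch below uses a hypothetical helper, parse_with_fallback(), that is not part of llmjson or jsonlite. The parser and the repair function are passed in as arguments, so any pair can be used.

```r
# Hypothetical helper: try strict parsing first, repair only when it fails
parse_with_fallback <- function(txt, strict, repair) {
  tryCatch(
    strict(txt),
    error = function(e) strict(repair(txt))
  )
}

# Example pairing, run only when both packages are installed
if (requireNamespace("llmjson", quietly = TRUE) &&
    requireNamespace("jsonlite", quietly = TRUE)) {
  parsed <- parse_with_fallback(
    '{"key": "value",}',
    strict = jsonlite::fromJSON,
    repair = llmjson::repair_json_str
  )
  str(parsed)
}
```

Well-formed input then takes the strict parser's path unchanged, and only malformed input goes through the repair step.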

Acknowledgments

This package includes a vendored and bug-fixed version of the llm_json Rust crate (v1.0.1) by Ribelo, which is itself a Rust port of the Python json_repair library by Stefano Baccianella (mangiucugna). Our vendored version includes critical bug fixes for array parsing not present in the upstream release.

The schema system was inspired by the structr package, which provides elegant patterns for defining and validating data structures in R.

Code of Conduct

Please note that the llmjson project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
