Repair Malformed JSON Strings.
llmjson
llmjson repairs malformed JSON strings, particularly those generated by Large Language Models (LLMs). It uses Rust for fast, reliable JSON repair based on a vendored and bug-fixed version of the llm_json crate.
Features
- Repairs missing quotes around keys and values
- Handles trailing commas
- Fixes unquoted keys
- Repairs incomplete arrays and objects
- Converts single quotes to double quotes
- Removes extra non-JSON characters
- Auto-completes missing values with sensible defaults
- Returns R objects directly with return_objects = TRUE
- Schema validation and type conversion with intuitive schema builders
- Control field presence with .required and use .default for missing required fields
Installation
You can install the development version of llmjson from GitHub:
# install.packages("remotes")
remotes::install_github("DyfanJones/llmjson")
Or from r-universe:
install.packages('llmjson', repos = c('https://dyfanjones.r-universe.dev', 'https://cloud.r-project.org'))
System Requirements
This package requires the Rust toolchain to be installed on your system. If you don't have Rust installed:
- Install from https://rust-lang.org/tools/install/
- Minimum required version: Rust 1.65.0
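To check whether a suitable toolchain is already on the PATH, something like the following base R sketch may help. The helper names parse_rustc_version() and has_required_rust() are hypothetical and not part of llmjson; the sketch assumes `rustc --version` prints a line of the form "rustc X.Y.Z (...)".

```r
# Hypothetical helper: pull "X.Y.Z" out of a `rustc --version` line.
parse_rustc_version <- function(line) {
  sub("^rustc ([0-9]+\\.[0-9]+\\.[0-9]+).*$", "\\1", line)
}

# Hypothetical helper: TRUE/FALSE when rustc is found, NA when it is not on PATH.
has_required_rust <- function(minimum = "1.65.0") {
  rustc <- Sys.which("rustc")
  if (!nzchar(rustc)) return(NA)
  version_line <- system2(rustc, "--version", stdout = TRUE)[1]
  numeric_version(parse_rustc_version(version_line)) >= numeric_version(minimum)
}

has_required_rust()
```

An NA result means rustc was not found, in which case installing Rust from the link above is the next step.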
Usage
Basic JSON Repair
library(llmjson)
# Repair JSON with trailing comma
repair_json_str('{"key": "value",}')
#> [1] "{\"key\":\"value\"}"
# Repair JSON with unquoted keys
repair_json_str('{key: "value"}')
#> [1] "{\"key\":\"value\"}"
# Repair incomplete JSON
repair_json_str('{"name": "John", "age": 30')
#> [1] "{\"name\":\"John\",\"age\":30}"
# Repair JSON with single quotes
repair_json_str("{'name': 'John'}")
#> [1] "{\"name\":\"John\"}"
Return R Objects Directly
Instead of returning a JSON string, you can get R objects directly:
# Return as R list instead of JSON string
result <- repair_json_str('{"name": "Alice", "age": 30}', return_objects = TRUE)
result
#> $name
#> [1] "Alice"
#>
#> $age
#> [1] 30
# Works with all repair functions
result <- repair_json_file("data.json", return_objects = TRUE)
Handling Large Integers (64-bit)
JSON numbers that fall outside R's 32-bit integer range (-2,147,483,647 to 2,147,483,647; R reserves -2,147,483,648 for NA_integer_) need special handling. The int64 parameter controls how these large integers are converted:
json_str <- '{"id": 9007199254740993}'
# Option 1: "double" (default) - Convert to R numeric (may lose precision)
result <- repair_json_str(json_str, return_objects = TRUE, int64 = "double")
result$id
#> [1] 9.007199e+15 # Lost precision: the stored value is 9007199254740992
# Option 2: "string" - Preserve exact value as character
result <- repair_json_str(json_str, return_objects = TRUE, int64 = "string")
result$id
#> [1] "9007199254740993" # Exact value preserved
# Option 3: "bit64" - Use bit64 package for true 64-bit integers
# Requires: install.packages("bit64")
result <- repair_json_str(json_str, return_objects = TRUE, int64 = "bit64")
result$id
#> integer64
#> [1] 9007199254740993 # Exact value preserved with integer type
Which option should I use?
- Use "double" (default) if your integers fit safely in double precision and you don't need exact integer arithmetic
- Use "string" if you need to preserve exact values and plan to pass them to other systems
- Use "bit64" if you need exact integer arithmetic on large integers in R
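The precision loss behind the "double" option can be seen in base R alone. The sketch below calls no llmjson functions; it only illustrates why integers above 2^53 cannot all be represented exactly as doubles.

```r
# Base R only: no llmjson functions are called here.
id <- 9007199254740993          # 2^53 + 1, parsed by R as a double

# The nearest representable double is 2^53, so the trailing 3 becomes a 2.
sprintf("%.0f", id)
#> [1] "9007199254740992"

# Two distinct integers collapse to the same double above 2^53.
(2^53 + 1) == 2^53
#> [1] TRUE

# R's native integer type tops out far lower.
.Machine$integer.max
#> [1] 2147483647
```

If the IDs in your data stay below 2^53, "double" is lossless for them; beyond that, "string" or "bit64" is the safer choice.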
Schema Validation and Type Conversion
Define schemas to validate JSON structure and ensure correct R types. The schema system is inspired by the structr package and provides an intuitive way to define expected JSON structures:
# Define a schema for a user object
schema <- json_object(
name = json_string(),
age = json_integer(),
email = json_string()
)
# Repair and validate with schema
result <- repair_json_str(
'{"name": "Alice", "age": "30", "email": "[email protected]"}',
schema = schema,
return_objects = TRUE
)
# Note: age is coerced from string "30" to integer 30
str(result)
#> List of 3
#> $ name : chr "Alice"
#> $ age : int 30
#> $ email: chr "[email protected]"
Required vs Optional Fields and Default Values
Control how missing fields are handled with .required and .default parameters:
Required fields (.required = TRUE):
- Missing fields are added with their .default value (or their type's default if no explicit default)
- Always appear in the output
Optional fields (.required = FALSE, the default):
- Missing fields are omitted entirely from the output
- Only appear if present in the input JSON
# Example 1: Required field with explicit default
schema <- json_object(
name = json_string(.required = TRUE),
age = json_integer(.default = 25L, .required = TRUE) # required, will use default if missing
)
result <- repair_json_str('{"name": "Alice"}', schema = schema, return_objects = TRUE)
result
#> $name
#> [1] "Alice"
#>
#> $age
#> [1] 25
# Example 2: Optional field (omitted when missing)
schema <- json_object(
name = json_string(.required = TRUE),
nickname = json_string(.required = FALSE) # optional, omitted if not in input
)
result <- repair_json_str('{"name": "Bob"}', schema = schema, return_objects = TRUE)
result
#> $name
#> [1] "Bob"
# Note: nickname is not present since it was optional and missing from input
# Example 3: Required field with type default
schema <- json_object(
name = json_string(.required = TRUE),
age = json_integer(.required = TRUE) # required, will use type default (0L) if missing
)
result <- repair_json_str('{"name": "Charlie"}', schema = schema, return_objects = TRUE)
result
#> $name
#> [1] "Charlie"
#>
#> $age
#> [1] 0
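Because optional fields can be absent from the result, downstream code may want to look them up defensively. The following is a minimal base R sketch; get_field() is a hypothetical helper, not part of llmjson, and the list is built by hand as a stand-in for the output of Example 2.

```r
# Stand-in for the output of Example 2 above (built by hand, not by llmjson).
result <- list(name = "Bob")

# Hypothetical helper: fetch a field by exact name, or a fallback when absent.
get_field <- function(x, field, fallback = NA) {
  if (field %in% names(x)) x[[field]] else fallback
}

get_field(result, "name")
#> [1] "Bob"
get_field(result, "nickname", fallback = NA_character_)
#> [1] NA
```

Matching on names() avoids the partial matching that `$` performs on lists, where result$na would silently return "Bob".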
Nested Schemas and Arrays
Build complex schemas with nested objects and arrays:
# Schema with nested object and array
schema <- json_object(
name = json_string(),
address = json_object(
city = json_string(),
zip = json_integer()
),
scores = json_array(json_integer())
)
json_str <- '{
"name": "Alice",
"address": {"city": "NYC", "zip": "10001"},
"scores": [90, 85, 95]
}'
result <- repair_json_str(json_str, schema = schema, return_objects = TRUE)
str(result)
#> List of 3
#> $ name : chr "Alice"
#> $ address:List of 2
#> ..$ city: chr "NYC"
#> ..$ zip : int 10001
#> $ scores : int [1:3] 90 85 95
Build Schemas for Better Performance
For repeated use with the same schema, use json_schema() to compile the schema once and reuse it many times.
# Define your schema
schema <- json_object(
name = json_string(),
age = json_integer(),
email = json_string()
)
# Build it once - this creates an optimized internal representation
built_schema <- json_schema(schema)
# Reuse many times - much faster!
for (json_str in json_strings) {
result <- repair_json_str(json_str, built_schema, return_objects = TRUE)
# Process result...
}
Performance comparison (complex nested schema):
- Without json_schema(): ~266µs per call
- With json_schema(): ~51µs per call (5.2x faster)
- No schema: ~44µs per call
The performance benefit is especially significant for:
- Complex nested schemas with multiple levels
- Batch processing of many JSON strings
- Performance-critical applications
- Real-time data processing pipelines
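For the batch-processing case, a self-contained version of the loop might look like the sketch below. The two input strings are made up for illustration and reuse malformations shown earlier (a trailing comma and unquoted keys); only functions documented in this README are called.

```r
library(llmjson)

# Compile the schema once, outside the loop.
schema <- json_object(
  name = json_string(),
  age = json_integer()
)
built_schema <- json_schema(schema)

# Illustrative inputs: a trailing comma and unquoted keys.
json_strings <- c(
  '{"name": "Alice", "age": 30,}',
  '{name: "Bob", age: 25}'
)

# Reuse the built schema for every string.
results <- lapply(json_strings, function(json_str) {
  repair_json_str(json_str, schema = built_schema, return_objects = TRUE)
})

length(results)
```

Actual timings depend on the schema and the inputs, so it is worth benchmarking on your own data before relying on the figures above.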
Repair JSON from Files
# Read and repair JSON from a file
repair_json_file("malformed.json")
# With schema validation
schema <- json_object(
name = json_string(.required = TRUE),
age = json_integer(.default = 25L, .required = TRUE) # required field with default
)
result <- repair_json_file("data.json", schema = schema, return_objects = TRUE)
Repair JSON from Raw Bytes
# Repair JSON from raw byte vector
raw_data <- charToRaw('{"key": "value",}')
repair_json_raw(raw_data)
#> [1] "{\"key\":\"value\"}"
# With return_objects
result <- repair_json_raw(raw_data, return_objects = TRUE)
Repair JSON from Connections
Read and repair JSON from any R connection (files, URLs, pipes, compressed files, etc.):
# Read from a file connection
conn <- file("malformed.json", "r")
result <- repair_json_conn(conn)
close(conn)
# Read from a URL
conn <- url("https://api.example.com/data.json")
result <- repair_json_conn(conn, return_objects = TRUE)
close(conn)
# Read from a compressed file
conn <- gzfile("data.json.gz", "r")
result <- repair_json_conn(conn, return_objects = TRUE, int64 = "string")
close(conn)
# Use local() with on.exit() to ensure the connection is closed automatically
result <- local({
conn <- file("malformed.json", "r")
on.exit(close(conn))
repair_json_conn(conn, return_objects = TRUE)
})
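The connection examples above assume the files already exist. A self-contained variant, sketched here with a temporary file, writes a malformed document first and uses tryCatch(finally = ) so the connection is closed even if the repair fails. The file contents are illustrative.

```r
library(llmjson)

# Write a malformed document (trailing comma) to a temporary file.
path <- tempfile(fileext = ".json")
writeLines('{"key": "value",}', path)

# Open, repair, and always close the connection.
conn <- file(path, "r")
result <- tryCatch(
  repair_json_conn(conn, return_objects = TRUE),
  finally = close(conn)
)

unlink(path)  # remove the temporary file
result$key
```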
Use Case: Working with LLM Outputs
Large Language Models often generate JSON that is almost correct but has minor syntax errors. This package helps you handle those cases gracefully:
# LLM might output JSON with trailing commas and unquoted keys
llm_output <- '{
users: [
{name: "Alice", age: 30,},
{name: "Bob", age: 25,},
],
}'
# Option 1: Repair and parse with your chosen JSON parser (e.g., jsonlite)
repaired <- repair_json_str(llm_output)
(parsed <- jsonlite::fromJSON(repaired))
#> $users
#> age name
#> 1 30 Alice
#> 2 25 Bob
# Option 2: Use schema with return_objects for type safety
schema <- json_object(
users = json_array(json_object(
name = json_string(),
age = json_integer()
))
)
result <- repair_json_str(llm_output, schema = schema, return_objects = TRUE)
str(result)
#> List of 1
#> $ users:List of 2
#> ..$ :List of 2
#> .. ..$ name: chr "Alice"
#> .. ..$ age : int 30
#> ..$ :List of 2
#> .. ..$ name: chr "Bob"
#> .. ..$ age : int 25
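The nested list from Option 2 can be flattened into a data frame with base R if a tabular form is more convenient. In this sketch the list is built by hand to mirror the str() output above, so no llmjson call is involved.

```r
# Stand-in for the Option 2 result shown above (built by hand).
result <- list(
  users = list(
    list(name = "Alice", age = 30L),
    list(name = "Bob", age = 25L)
  )
)

# Turn each user into a one-row data frame, then stack the rows.
users_df <- do.call(
  rbind,
  lapply(result$users, as.data.frame, stringsAsFactors = FALSE)
)

nrow(users_df)
#> [1] 2
```

This works when every element has the same fields; records with optional fields omitted would need to be padded first.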
Available Functions
Repair Functions
All repair functions support the schema, return_objects, ensure_ascii, and int64 parameters:
- repair_json_str(json_str, schema = NULL, return_objects = FALSE, ensure_ascii = TRUE, int64 = "double") - Repair a malformed JSON string
- repair_json_file(path, schema = NULL, return_objects = FALSE, ensure_ascii = TRUE, int64 = "double") - Read and repair JSON from a file
- repair_json_raw(raw_bytes, schema = NULL, return_objects = FALSE, ensure_ascii = TRUE, int64 = "double") - Repair JSON from a raw byte vector
- repair_json_conn(conn, schema = NULL, return_objects = FALSE, ensure_ascii = TRUE, int64 = "double") - Read and repair JSON from an R connection (file, URL, pipe, etc.)
Parameters:
- schema - Optional schema definition (R list from json_object(), etc.) or built schema (from json_schema())
- return_objects - If TRUE, returns R objects instead of JSON strings
- ensure_ascii - If TRUE (default), escape non-ASCII characters in the output JSON
- int64 - Policy for handling 64-bit integers: "double" (default), "string", or "bit64"
Schema Functions
- json_schema(schema) - Compile a schema definition for efficient reuse (5x performance improvement)
- json_object(..., .required) - Define a JSON object with named fields
- json_integer(.default, .required) - Integer field (default: 0L)
- json_number(.default, .required) - Number/numeric field (default: 0.0)
- json_string(.default, .required) - String field (default: "")
- json_boolean(.default, .required) - Boolean field (default: FALSE)
- json_enum(.values, .default, .required) - Enum field with allowed values (default: first value)
- json_date(.default, .format, .required) - Date field with format specification
- json_timestamp(.default, .format, .tz, .required) - POSIXct datetime field
- json_array(items, .required) - Array with specified item type
- json_any(.required) - Accept any JSON type
Comparison with Similar Packages
While R has several JSON parsing packages like jsonlite, they typically fail when encountering malformed JSON. llmjson is specifically designed to handle the common errors that LLMs make when generating JSON output, making it ideal for:
- Processing LLM API responses
- Parsing structured data from AI-generated text
- Building robust data pipelines with LLM integrations
- Working with JSON data from web scraping or unreliable sources
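One way to combine the two approaches is to try a strict parser first and fall back to repair only when it fails. The sketch below assumes jsonlite is installed; parse_json_lenient() is a hypothetical helper, not part of either package.

```r
library(llmjson)

# Hypothetical helper: strict parse first, repair only on failure.
parse_json_lenient <- function(txt) {
  tryCatch(
    jsonlite::fromJSON(txt, simplifyVector = FALSE),
    error = function(e) repair_json_str(txt, return_objects = TRUE)
  )
}

# Well-formed input: handled by the strict parser.
ok <- parse_json_lenient('{"key": "value"}')

# Trailing comma: the strict parser is expected to fail, so llmjson repairs it.
fixed <- parse_json_lenient('{"key": "value",}')

ok$key
fixed$key
```

The two parsers may shape arrays and nested values differently, so passing a schema to the fallback call is worth considering when the result type matters.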
Acknowledgments
This package includes a vendored and bug-fixed version of the llm_json Rust crate (v1.0.1) by Ribelo, which is itself a Rust port of the Python json_repair library by Stefano Baccianella (mangiucugna). Our vendored version includes critical bug fixes for array parsing not present in the upstream release.
The schema system was inspired by the structr package, which provides elegant patterns for defining and validating data structures in R.
Code of Conduct
Please note that the llmjson project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.