At Baz, we’re building an AI code review agent that combines static analysis with LLMs to reason deeply about code. Along the way, structured output has become essential, especially when chaining model outputs or integrating into larger systems. As part of that journey, we constantly evaluate and integrate new models to see how they perform in complex, real-world pipelines.
Anthropic’s recently released Claude Sonnet 3.7 has attracted a lot of attention for its enhanced reasoning and code understanding. But if you're building GenAI apps in the wild, there's a catch: structured output doesn't work the way you might expect when using its powerful “extended thinking” mode.
Structured output is critical in production-grade LLM systems - the ability to get consistent, parseable responses that adhere to a defined schema. Whether you’re routing LLM output to downstream systems or building modular agent pipelines, you need more than free-form text—you need structure you can rely on.
This post walks through what structured output is, why it matters in production, and three practical ways to achieve it with Sonnet 3.7. We include end-to-end examples using Langchain and AWS Bedrock to help others shipping GenAI apps navigate this tradeoff.
What is “structured output”?
Structured output refers to an LLM response that adheres to a predefined schema. Instead of receiving the LLM response as free text, it is returned in a JSON schema that is easy to parse and work with. For production environments, leveraging structured JSON responses is crucial as this is how systems interact today.
How does structured output work?
Generally speaking, there are three different ways to achieve structured output:
- Ask the model to adhere to a predefined schema.
- Use constrained generation.
- Utilize function (tool) calling.
Let's discuss each.
1. Prompt the model to do so
This is the simplest approach but also the least reliable. You simply instruct the model to return its response in a specific format, such as JSON. You can add a few-shot examples, of course, and then try to parse the response directly into JSON.
The drawback here is that the model might not perfectly follow your instructions, or the schema might be too complex, requiring additional parsing or error handling.
2. Constrained decoding
Also sometimes called "JSON mode." In this approach, there is a restriction on the tokens that the model is allowed to generate at a certain point in time. That is, given the desired output schema, it is translated into a context-free grammar that represents the schema. Then, when generating the n'th token, the model will choose only from the tokens it is allowed to produce according to the grammar.
For example, if the model has already produced:
{“key”: “val
It will not be allowed to generate the `}` token as it must close the quotes `"` on the val (or generate more “letters”) in order to be a valid JSON.
This approach is great, although some research shows it may harm model performance and quality. Also, Anthropics' Sonnet 3.7 does not support this option.
3. Function/tool calling
In this approach, models are fine-tuned to generate structured outputs that specify which function to call and with what arguments. Function calling is learned through fine-tuning, allowing the model to decide when and how to invoke external functions based on the context of the conversation.
For example, we can teach the model that it has a "web_search" function available, which takes a "query" as input, which should be a valid string. So the model is fine-tuned to produce a valid function call (tool use) of the following schema:
{“function”: “web_search”, “arguments”: {“query”: “when did Claude Sonnet 3.7 come out?” }`
This is the mode supported by Anthropics' models!
Now, if you think about it, if a model knows how to produce high-quality function calls by adhering to their schema, it can actually be utilized to produce any schema. Indeed, this is what Anthropic recommends:
“Tools do not necessarily need to be client-side functions — you can use tools anytime you want the model to return JSON output that follows a provided schema.“
By the way, if you're a Langchain user, when you provide structured output for Anthropic models, it is "translated" behind the scenes into "function calling."
Claude sonnet 3.7
Sonnet 3.7 is the latest model by Anthropic which includes enhanced reasoning capabilities. Actually, one can choose to operate with Sonnet 3.7 in one of two modes: with or without extended thinking. According to Anthropic, without thinking it'll act as an improved version of 3.5 which we know and love, and with thinking enabled it'll think and reflect (up to a threshold of tokens you can set) and just then will answer. This approach seems to prove itself as it scores very high on the Aider benchmark.
Ok... what's the catch?
When extended thinking is on - Sonnet 3.7 does not support some features we were used to, in particular forced tool calling. Or in other words (recall our intro on structured output) it does not support structured output.
Indeed, and citing from the Langchain wrapper to Anthropics’ sourcecode
“Anthropic structured output relies on forced tool calling, which is not supported when `thinking` is enabled”.
To be precise, as the Anthropic state, besides “forced tool use”, also temperature, top_p, or top_k are not supported.
What can we do?
I’ll introduce 3 different ways to achieve structured output with Sonnet 3.7.
In order to explore different ways I’ll be using Langchain with Anthropic models via AWS bedrock - but it’ll work the same with Anthropic as the provider.
Lets assume we want the model to be able to create a short story on some provided topic and also determine its genre.
from typing import Any
from dotenv import load_dotenv
from langchain_aws import ChatBedrockConverse
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from pydantic import BaseModel, Field
load_dotenv('.env')
class Story(BaseModel):
"""A short story with its genre"""
content: str = Field(description='A very short story')
genre: str = Field(description='The genre of the story')
1. “No thinking” mode
In this approach, we will not use the extended thinking mode of Sonnet 3.7. Without thinking enabled we can get reliable structured outputs through tool calling.
def no_thinking_mode() -> Story:
"""
Example of structured output without extended thinking mode.
This approach disables Claude's extended thinking capabilities but allows
for direct structured output via forced tool calling.
"""
prompt = PromptTemplate.from_template('Create a very short story about: {topic} and determine its genre')
llm = ChatBedrockConverse(
model_id='us.anthropic.claude-3-7-sonnet-20250219-v1:0',
region_name='us-east-2',
additional_model_request_fields={'thinking': {'type': 'disabled'}},
)
structured_llm = llm.with_structured_output(Story)
chain = prompt | structured_llm
res = chain.invoke({'topic': 'Harry Potter'})
assert isinstance(res, Story)
return res
2. “Hopefully structured” mode
Use thinking mode and "ask nicely" for structured output. This approach leverages the enhanced reasoning but relies on careful prompting to get structured results. Then, try to parse it into the required schema.
In the following example, Langchain will do that for us, but will raise `OutputParserException` if it fails to do so.
def hopefully_structured_mode() -> Story:
"""
Example of attempting structured output with extended thinking enabled.
It'll not use forced tool calling and will try to parse the response into the provided schema.
Will raise `OutputParserException` if it fails.
"""
prompt = PromptTemplate.from_template(
"""Create a very short story about: {topic} and determine its genre.
IMPORTANT: Your response must be formatted as valid JSON with two fields:
1. content: That is, the story content
2. genre: The genre of the story
"""
)
llm = ChatBedrockConverse(
model_id='us.anthropic.claude-3-7-sonnet-20250219-v1:0',
region_name='us-east-2',
additional_model_request_fields={'thinking': {'type': 'enabled', 'budget_tokens': 2000}},
)
structured_llm = llm.with_structured_output(Story) # will try to parse the result according to the provided schema
chain = prompt | structured_llm
res = chain.invoke({'topic': 'Harry Potter'})
assert isinstance(res, Story)
return res
3. “Reason and Structure” mode
Let Sonnet 3.7 do what it is good at, reasoning, and then use another LLM (say Haiku) to structure its output. This two-step process gives you the best of both worlds but at the cost of additional complexity and latency.
def reason_and_structure_mode(inputs: dict[str, Any] = None) -> Story:
"""
Example of a two-stage approach: reasoning with Sonnet-3.7 followed by structuring with Haiku.
This approach leverages Sonnet's extended thinking for content generation, then
uses Haiku to transform the output into a structured format.
"""
reasoning_prompt = PromptTemplate.from_template('Create a very short story about: {topic}')
reasoning_llm = ChatBedrockConverse(
model_id='us.anthropic.claude-3-7-sonnet-20250219-v1:0',
region_name='us-east-2',
additional_model_request_fields={'thinking': {'type': 'enabled', 'budget_tokens': 2000}},
)
reasoning_chain = reasoning_prompt | reasoning_llm
structuring_prompt = PromptTemplate.from_template(
'Structure the provided story into the requested schema and assign "genre" to be {genre}. Story: {reasoning_output}'
)
structuring_llm = ChatBedrockConverse(
model_id='us.anthropic.claude-3-5-haiku-20241022-v1:0',
region_name='us-east-2',
)
structuring_llm = structuring_llm.with_structured_output(Story)
structuring_chain = structuring_prompt | structuring_llm
# Sometimes, we'll want to pass some of the inputs params directly to the "structuring model", not only the output of the reasoning model.
# In order to support that, we'll create a "dummy" function, that just gets the inputs and returns them.
# Then, we can run both the reasoning chain and the dummy function in parallel, and feed the structuring llm both:
#
# /-> reasoning_chain -> reasoning_output \
# input_params -> merge_inputs -> structuring_llm
# \-> dummy_function -> original_params /
reason_then_structure_chain = (
RunnableParallel(
reasoning_output=reasoning_chain,
original_inputs=RunnablePassthrough(),
)
| RunnableLambda(lambda x: prepare_structuring_inputs(x['original_inputs'], x['reasoning_output']))
| structuring_chain
)
inputs = {'topic': 'Harry Potter', 'genre': 'fantasy'}
res = reason_then_structure_chain.invoke(inputs)
assert isinstance(res, Story)
return res
def prepare_structuring_inputs(original_inputs: dict[str, Any], reasoning_output: str) -> dict[str, Any]:
"""
Prepares inputs for the structuring model by combining original inputs with reasoning output.
"""
return {
**original_inputs, # Pass original inputs as-is
'reasoning_output': reasoning_output, # Add reasoning chain output
}
Full code can be found here: https://gist.github.com/omriel1/47e2c9a8afcb4863c8e16fc524c5ec4a
Conclusion
In summary, while Sonnet 3.7 offers strong reasoning capabilities, achieving reliable structured output in production can be tricky, especially with extended thinking mode. By using approaches like no thinking mode, careful prompting, or combining reasoning with a separate structuring model, you can effectively work around these limitations and integrate structured outputs into your GenAI applications. These methods provide a practical way to ensure seamless integration in real-world pipelines.