Simon Willison’s Weblog

Recent

Nov. 9, 2024

Visualizing local election results with Datasette, Observable and MapLibre GL

Alex Garcia and myself hosted the first Datasette Open Office Hours on Friday—a live-streamed video session where we hacked on a project together and took questions and tips from community members on Discord.

[... 3,390 words]

11:32 pm / geospatial, gis, mapping, politics, projects, datasette, datasette-cloud, alex-garcia

This is a very friendly and supportive place where you are surrounded by peers - we all want to help each other succeed. The golden rule of this server is:

Don't ever try to impress anyone here with your knowledge! Instead try to impress folks here with your desire to learn, and desire to help others learn.

— fast.ai Discord Server

# 6:59 am / discord, jeremy-howard, fastai, community

Nov. 8, 2024

uv 0.5.0. The first backwards-incompatible (in minor ways) release after 30 releases without a breaking change.

I found out about this release this morning when I filed an issue about a fiddly usability problem I had encountered with the combo of uv and conda... and learned that the exact problem had already been fixed in the brand new version!

# 11:54 pm / uv, packaging, python

ChainForge. I'm still on the hunt for good options for running evaluations against prompts. ChainForge offers an interesting approach, calling itself "an open-source visual programming environment for prompt engineering".

The interface is one of those boxes-and-lines visual programming tools, which reminds me of Yahoo Pipes.

It's open source (from a team at Harvard) and written in Python, which means you can run a local copy instantly via uvx like this:

uvx chainforge serve

You can then configure it with API keys to various providers (OpenAI worked for me, Anthropic models returned JSON parsing errors due to a 500 page from the ChainForge proxy) and start trying it out.

The "Add Node" menu shows the full list of capabilities.

The JavaScript and Python evaluation blocks are particularly interesting: the JavaScript one runs outside of a sandbox using plain eval(), while the Python one still runs in your browser but uses Pyodide in a Web Worker.

# 8:52 pm / pyodide, evals, uv, ai, llms, prompt-engineering, prompt-injection, python, javascript, generative-ai

Nov. 7, 2024

Datasette Public Office Hours, Friday Nov 8th at 2pm PT. Tomorrow afternoon (Friday 8th November) at 2pm PT we'll be hosting the first Datasette Public Office Hours - a livestream video session on Discord where Alex Garcia and myself will live code on some Datasette projects and hang out to chat about the project.

This is our first time trying this format. If it works out well I plan to turn it into a series.

Discord event card promoting Datasette Public Office Hours

# 7:10 pm / discord, open-source, datasette, alex-garcia

Project: VERDAD—tracking misinformation in radio broadcasts using Gemini 1.5

I’m starting a new interview series called Project. The idea is to interview people who are building interesting data projects and talk about what they’ve built, how they built it, and what they learned along the way.

[... 1,025 words]

6:41 pm / data-journalism, youtube, ai, prompt-engineering, generative-ai, llms, gemini

If you have worked in search, you know how freaking hard even getting started with something close to this with traditional methods. Now, you can zero-shot it.

System Instructions: As a query categorization expert, you try to break down the intent of a search query. First, provide your reasoning and then describe the intent using a single category (broad, detailed, comparision)

User: The query from the user is "nike versus adidas for terrain running". The user is a female, age 22.

Model: The user is clearly looking to compare two specific brands, Nike and Adidas, for a particular activity, terrain running. While the user's demographics might be helpful in some situations (e.g., recommending specific product lines), the core intent remains a comparison. Category: Comparison

There's a lot of hand-waving around query intent classification; it's always been like that. Now, it's straightforward (add a few examples to improve accuracy). But my point is that you could only dream about building something like this without having access to lots of interaction data.

— Jo Kristian Bergum

# 3:34 pm / prompt-engineering, generative-ai, search, ai, llms

Nov. 6, 2024

yet-another-applied-llm-benchmark. Nicholas Carlini introduced this personal LLM benchmark suite back in February as a collection of over 100 automated tests he runs against new LLM models to evaluate their performance against the kinds of tasks he uses them for.

There are two defining features of this benchmark that make it interesting. Most importantly, I've implemented a simple dataflow domain specific language to make it easy for me (or anyone else!) to add new tests that realistically evaluate model capabilities. This DSL allows for specifying both how the question should be asked and also how the answer should be evaluated. [...] And then, directly as a result of this, I've written nearly 100 tests for different situations I've actually encountered when working with LLMs as assistants

The DSL he's using is fascinating. Here's an example:

"Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> \
    VisionLLMRun("What flag is shown in this image?") >> \
    (SubstringEvaluator("United States") | SubstringEvaluator("USA")))

This triggers an LLM to execute the prompt asking for a C program that renders an American Flag, runs that through a C compiler and interpreter (executed in a Docker container), then passes the output of that to a vision model to guess the flag and checks that it returns a string containing "United States" or "USA".

The DSL itself is implemented entirely in Python, using the __rshift__ magic method for >> and __rrshift__ to enable strings to be piped into a custom object using "command to run" >> LLMRunNode.

# 8 pm / evals, llms, ai, generative-ai, dsl, python, nicholas-carlini

Nov. 5, 2024

Generating documentation from tests using files-to-prompt and LLM. I was experimenting with the wasmtime-py Python library today (for executing WebAssembly programs from inside CPython) and I found the existing API docs didn't quite show me what I wanted to know.

The project has a comprehensive test suite so I tried seeing if I could generate documentation using that:

cd /tmp
git clone https://github.com/bytecodealliance/wasmtime-py
files-to-prompt -e py wasmtime-py/tests -c | \
  llm -m claude-3.5-sonnet -s \
  'write detailed usage documentation including realistic examples'

More notes in my TIL. You can see the full Claude transcript here - I think this worked really well!

# 10:37 pm / llm, webassembly, generative-ai, ai, llms, claude, claude-3-5-sonnet, ai-assisted-programming, documentation

You already know Donald Trump. He is unfit to lead. Watch him. Listen to those who know him best. He tried to subvert an election and remains a threat to democracy. He helped overturn Roe, with terrible consequences. Mr. Trump's corruption and lawlessness go beyond elections: It's his whole ethos. He lies without limit. If he's re-elected, the G.O.P. won't restrain him. Mr. Trump will use the government to go after opponents. He will pursue a cruel policy of mass deportations. He will wreak havoc on the poor, the middle class and employers. Another Trump term will damage the climate, shatter alliances and strengthen autocrats. Americans should demand better. Vote.

— NY Times Editorial Board

# 1:33 am / new-york-times, politics

Nov. 4, 2024

New OpenAI feature: Predicted Outputs (via) Interesting new ability of the OpenAI API - the first time I've seen this from any vendor.

If you know your prompt is mostly going to return the same content - you're requesting an edit to some existing code, for example - you can now send that content as a "prediction" and have GPT-4o or GPT-4o mini use that to accelerate the returned result.

OpenAI's documentation says:

When providing a prediction, any tokens provided that are not part of the final completion are charged at completion token rates.

I initially misunderstood this as meaning you got a price reduction in addition to the latency improvement, but that's not the case: in the best possible case it will return faster and you won't be charged anything extra over the expected cost for the prompt, but the more it differs from your prediction the more extra tokens you'll be billed for.

I ran the example from the documentation both with and without the prediction and got these results. Without the prediction:

"usage": {
  "prompt_tokens": 150,
  "completion_tokens": 118,
  "total_tokens": 268,
  "completion_tokens_details": {
    "accepted_prediction_tokens": 0,
    "audio_tokens": null,
    "reasoning_tokens": 0,
    "rejected_prediction_tokens": 0
  }

That took 5.2 seconds and cost 0.1555 cents.

With the prediction:

"usage": {
  "prompt_tokens": 166,
  "completion_tokens": 226,
  "total_tokens": 392,
  "completion_tokens_details": {
    "accepted_prediction_tokens": 49,
    "audio_tokens": null,
    "reasoning_tokens": 0,
    "rejected_prediction_tokens": 107
  }

That took 3.3 seconds and cost 0.2675 cents.

Further details from OpenAI's Steve Coffey:

We are using the prediction to do speculative decoding during inference, which allows us to validate large batches of the input in parallel, instead of sampling token-by-token!

[...] If the prediction is 100% accurate, then you would see no cost difference. When the model diverges from your speculation, we do additional sampling to “discover” the net-new tokens, which is why we charge rejected tokens at completion time rates.

# 11:55 pm / openai, llms, ai, generative-ai, llm-pricing

Claude 3.5 Haiku

Anthropic released Claude 3.5 Haiku today, a few days later than expected (they said it would be out by the end of October).

[... 478 words]

7:34 pm / ai, openai, generative-ai, llms, llm, anthropic, claude, gemini, llm-pricing

Nous Hermes 3. The Nous Hermes family of fine-tuned models have a solid reputation. Their most recent release came out in August, based on Meta's Llama 3.1:

Our training data aggressively encourages the model to follow the system and instruction prompts exactly and in an adaptive manner. Hermes 3 was created by fine-tuning Llama 3.1 8B, 70B and 405B, and training on a dataset of primarily synthetically generated responses. The model boasts comparable and superior performance to Llama 3.1 while unlocking deeper capabilities in reasoning and creativity.

The model weights are on Hugging Face, including GGUF versions of the 70B and 8B models. Here's how to try the 8B model (a 4.58GB download) using the llm-gguf plugin:

llm install llm-gguf
llm gguf download-model 'https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF/resolve/main/Hermes-3-Llama-3.1-8B.Q4_K_M.gguf' -a Hermes-3-Llama-3.1-8B
llm -m Hermes-3-Llama-3.1-8B 'hello in spanish'

Nous Research partnered with Lambda Labs to provide inference APIs. It turns out Lambda host quite a few models now, currently providing free inference to users with an API key.

I just released the first alpha of a llm-lambda-labs plugin. You can use that to try the larger 405b model (very hard to run on a consumer device) like this:

llm install llm-lambda-labs
llm keys set lambdalabs
# Paste key here
llm -m lambdalabs/hermes3-405b 'short poem about a pelican with a twist'

Here's the source code for the new plugin, which I based on llm-mistral. The plugin uses httpx-sse to consume the stream of tokens from the API.

# 6:20 pm / llm, generative-ai, llama, ai, edge-llms, llms, meta, projects, nous-research

Nov. 3, 2024

Building technology in startups is all about having the right level of tech debt. If you have none, you’re probably going too slow and not prioritizing product-market fit and the important business stuff. If you get too much, everything grinds to a halt. Plus, tech debt is a “know it when you see it” kind of thing, and I know that my definition of “a bunch of tech debt” is, to other people, “very little tech debt.”

— Tom MacWright

# 4:36 pm / technical-debt, tom-macwright

California Clock Change. The clocks go back in California tonight and I finally built my dream application for helping me remember if I get an hour extra of sleep or not, using a Claude Artifact. Here's the transcript.

This is one of my favorite examples yet of the kind of tiny low stakes utilities I'm building with Claude Artifacts because the friction involved in churning out a working application has dropped almost to zero.

(I added another feature: it now includes a note of what time my Dog thinks it is if the clocks have recently changed.)

# 5:11 am / claude-artifacts, ai-assisted-programming, projects, ai, timezones, llms

Docling. MIT licensed document extraction Python library from the Deep Search team at IBM, who released Docling v2 on October 16th.

Here's the Docling Technical Report paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.

Those models are available on Hugging Face.

Here's how to try out the Docling CLI interface using uvx (avoiding the need to install it first - though since it downloads models it will take a while to run the first time):

uvx docling mydoc.pdf --to json --to md

This will output a mydoc.json file with complex layout information and a mydoc.md Markdown file which includes Markdown tables where appropriate.

The Python API is a lot more comprehensive. It can even extract tables as Pandas DataFrames:

from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)

I ran that inside uv run --with docling python. It took a little while to run, but it demonstrated that the library works.

# 4:57 am / ocr, ai, pdf, python, ibm, hugging-face, uv

Nov. 2, 2024

Claude Token Counter. Anthropic released a token counting API for Claude a few days ago.

I built this tool for running prompts, images and PDFs against that API to count the tokens in them.

The API is free (albeit rate limited), but you'll still need to provide your own API key in order to use it.

Screenshot of a Claude Token Counter interface showing: Title Claude Token Counter, system prompt this counts tokens, user message You can attach images and PDFs too, file upload area with llm-jq-card.jpg and dxweb.pdf attached (both with Remove buttons), a Count Tokens button, and JSON output showing input_tokens: 3320

Here's the source code. I built this using two sessions with Claude - one to build the initial tool and a second to add PDF and image support. That second one is a bit of a mess - it turns out if you drop an HTML file onto a Claude conversation it converts it to Markdown for you, but I wanted it to modify the original HTML source.

The API endpoint also allows you to specify a model, but as far as I can tell from running some experiments the token count was the same for Haiku, Opus and Sonnet 3.5.

# 6:52 pm / anthropic, claude, ai-assisted-programming, tools, generative-ai, ai, llms, claude-artifacts, llm-pricing

Please publish and share more. 💯 to all of this by Jeff Triplett:

Friends, I encourage you to publish more, indirectly meaning you should write more and then share it. [...]

You don’t have to change the world with every post. You might publish a quick thought or two that helps encourage someone else to try something new, listen to a new song, or binge-watch a new series.

Jeff shares my opinion on conclusions: giving myself permission to hit publish even when I haven't wrapped everything up neatly was a huge productivity boost for me:

Our posts are done when you say they are. You do not have to fret about sticking to landing and having a perfect conclusion. Your posts, like this post, are done after we stop writing.

And another 💯 to this footnote:

PS: Write and publish before you write your own static site generator or perfect blogging platform. We have lost billions of good writers to this side quest because they spend all their time working on the platform instead of writing.

# 3:17 pm / jeff-triplett, blogging

SmolLM2 (via) New from Loubna Ben Allal and her research team at Hugging Face:

SmolLM2 is a family of compact language models available in three size: 135M, 360M, and 1.7B parameters. They are capable of solving a wide range of tasks while being lightweight enough to run on-device. [...]

It was trained on 11 trillion tokens using a diverse dataset combination: FineWeb-Edu, DCLM, The Stack, along with new mathematics and coding datasets that we curated and will release soon.

The model weights are released under an Apache 2 license. I've been trying these out using my llm-gguf plugin for LLM and my first impressions are really positive.

Here's a recipe to run a 1.7GB Q8 quantized model from lmstudio-community:

llm install llm-gguf
llm gguf download-model https://huggingface.co/lmstudio-community/SmolLM2-1.7B-Instruct-GGUF/resolve/main/SmolLM2-1.7B-Instruct-Q8_0.gguf -a smol17
llm chat -m smol17

Or at the other end of the scale, here's how to run the 138MB Q8 quantized 135M model:

llm gguf download-model https://huggingface.co/lmstudio-community/SmolLM2-135M-Instruct-GGUF/resolve/main/SmolLM2-135M-Instruct-Q8_0.gguf' -a smol135m
llm chat -m smol135m

The blog entry to accompany SmolLM2 should be coming soon, but in the meantime here's the entry from July introducing the first version: SmolLM - blazingly fast and remarkably powerful .

# 5:27 am / llm, hugging-face, generative-ai, ai, llms, open-source, edge-llms

Nov. 1, 2024

From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code (via) Google's Project Zero security team used a system based around Gemini 1.5 Pro to find a previously unreported security vulnerability in SQLite (a stack buffer underflow), in time for it to be fixed prior to making it into a release.

A key insight here is that LLMs are well suited for checking for new variants of previously reported vulnerabilities:

A key motivating factor for Naptime and now for Big Sleep has been the continued in-the-wild discovery of exploits for variants of previously found and patched vulnerabilities. As this trend continues, it's clear that fuzzing is not succeeding at catching such variants, and that for attackers, manual variant analysis is a cost-effective approach.

We also feel that this variant-analysis task is a better fit for current LLMs than the more general open-ended vulnerability research problem. By providing a starting point – such as the details of a previously fixed vulnerability – we remove a lot of ambiguity from vulnerability research, and start from a concrete, well-founded theory: "This was a previous bug; there is probably another similar one somewhere".

LLMs are great at pattern matching. It turns out feeding in a pattern describing a prior vulnerability is a great way to identify potential new ones.

# 8:15 pm / gemini, security, sqlite, google, generative-ai, ai, llms, prompt-engineering

Claude API: PDF support (beta) (via) Claude 3.5 Sonnet now accepts PDFs as attachments:

The new Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) model now supports PDF input and understands both text and visual content within documents.

I just released llm-claude-3 0.7 with support for the new attachment type (attachments are a very new feature), so now you can do this:

llm install llm-claude-3 --upgrade
llm -m claude-3.5-sonnet 'extract text' -a mydoc.pdf

Visual PDF analysis can also be turned on for the Claude.ai application:

Also new today: Claude now offers a free (albeit rate-limited) token counting API. This addresses a complaint I've had for a while: previously it wasn't possible to accurately estimate the cost of a prompt before sending it to be executed.

# 6:55 pm / vision-llms, claude-3-5-sonnet, llm, anthropic, claude, ai, llms, pdf, generative-ai, projects

Lord Clement-Jones: To ask His Majesty's Government what assessment they have made of the cybersecurity risks posed by prompt injection attacks to the processing by generative artificial intelligence of material provided from outside government, and whether any such attacks have been detected thus far.

Lord Vallance of Balham: Security is central to HMG's Generative AI Framework, which was published in January this year and sets out principles for using generative AI safely and responsibly. The risks posed by prompt injection attacks, including from material provided outside of government, have been assessed as part of this framework and are continually reviewed. The published Generative AI Framework for HMG specifically includes Prompt Injection attacks, alongside other AI specific cyber risks.

— Question for Department for Science, Innovation and Technology, UIN HL1541, tabled on 14 Oct 2024

# 3:14 pm / politics, prompt-injection, security, generative-ai, ai, uk, llms

Control your smart home devices with the Gemini mobile app on Android (via) Google are adding smart home integration to their Gemini chatbot - so far on Android only.

Have they considered the risk of prompt injection? It looks like they have, at least a bit:

Important: Home controls are for convenience only, not safety- or security-critical purposes. Don't rely on Gemini for requests that could result in injury or harm if they fail to start or stop.

The Google Home extension can’t perform some actions on security devices, like gates, cameras, locks, doors, and garage doors. For unsupported actions, the Gemini app gives you a link to the Google Home app where you can control those devices.

It can control lights and power, climate control, window coverings, TVs and speakers and "other smart devices, like washers, coffee makers, and vacuums".

I imagine we will see some security researchers having a lot of fun with this shortly.

# 2:35 pm / gemini, prompt-injection, security, google, generative-ai, ai, llms, android

Oct. 31, 2024

Cerebras Coder (via) Val Town founder Steve Krouse has been building demos on top of the Cerebras API that runs Llama3.1-70b at 2,000 tokens/second.

Having a capable LLM with that kind of performance turns out to be really interesting. Cerebras Coder is a demo that implements Claude Artifact-style on-demand JavaScript apps, and having it run at that speed means changes you request are visible within less than a second:

Steve's implementation (created with the help of Townie, the Val Town code assistant) demonstrates the simplest possible version of an iframe sandbox:

<iframe
    srcDoc={code}
    sandbox="allow-scripts allow-modals allow-forms allow-popups allow-same-origin allow-top-navigation allow-downloads allow-presentation allow-pointer-lock"
/>

Where code is populated by a setCode(...) call inside a React component.

The most interesting applications of LLMs continue to be where they operate in a tight loop with a human - this can make those review loops potentially much faster and more productive.

# 10:39 pm / val-town, llms, react, iframes, ai-assisted-programming, generative-ai, sandboxing, ai, steve-krouse, llama, cerebras

Australia/Lord_Howe is the weirdest timezone (via) Lord Howe Island - part of Australia, population 382 - is unique in that the island's standard time zone is UTC+10:30 but is UTC+11 when daylight saving time applies. It's the only time zone where DST represents a 30 minute offset.

# 10:03 pm / timezones

Oct. 30, 2024

Creating a LLM-as-a-Judge that drives business results (via) Hamel Husain's sequel to Your AI product needs evals. This is packed with hard-won actionable advice.

Hamel warns against using scores on a 1-5 scale, instead promoting an alternative he calls "Critique Shadowing". Find a domain expert (one is better than many, because you want to keep their scores consistent) and have them answer the yes/no question "Did the AI achieve the desired outcome?" - providing a critique explaining their reasoning for each of their answers.

This gives you a reliable score to optimize against, and the critiques mean you can capture nuance and improve the system based on that captured knowledge.

Most importantly, the critique should be detailed enough so that you can use it in a few-shot prompt for a LLM judge. In other words, it should be detailed enough that a new employee could understand it.

Once you've gathered this expert data system you can switch to using an LLM-as-a-judge. You can then iterate on the prompt you use for it in order to converge its "opinions" with those of your domain expert.

Hamel concludes:

The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating a LLM judge is a nice “hack” I use to trick people into carefully looking at their data!

# 6:08 pm / evals, generative-ai, hamel-husain, ai, llms

docs.jina.ai—the Jina meta-prompt. From Jina AI on Twitter:

curl docs.jina.ai - This is our Meta-Prompt. It allows LLMs to understand our Reader, Embeddings, Reranker, and Classifier APIs for improved codegen. Using the meta-prompt is straightforward. Just copy the prompt into your preferred LLM interface like ChatGPT, Claude, or whatever works for you, add your instructions, and you're set.

The page is served using content negotiation. If you hit it with curl you get plain text, but a browser with text/html in the accept: header gets an explanation along with a convenient copy to clipboard button.

# 5:07 pm / llm, jina, generative-ai, ai, documentation, llms

W̶e̶e̶k̶n̶o̶t̶e̶s̶ Monthnotes for October

I try to publish weeknotes at least once every two weeks. It’s been four since the last entry, so I guess this one counts as monthnotes instead.

[... 797 words]

4:20 am / weeknotes, llms, llm

Bringing developer choice to Copilot with Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview. The big announcement from GitHub Universe: Copilot is growing support for alternative models.

GitHub Copilot predated the release of ChatGPT by more than year, and was the first widely used LLM-powered tool. This announcement includes a brief history lesson:

The first public version of Copilot was launched using Codex, an early version of OpenAI GPT-3, specifically fine-tuned for coding tasks. Copilot Chat was launched in 2023 with GPT-3.5 and later GPT-4. Since then, we have updated the base model versions multiple times, using a range from GPT 3.5-turbo to GPT 4o and 4o-mini models for different latency and quality requirements.

It's increasingly clear that any strategy that ties you to models from exclusively one provider is short-sighted. The best available model for a task can change every few months, and for something like AI code assistance model quality matters a lot. Getting stuck with a model that's no longer best in class could be a serious competitive disadvantage.

The other big announcement from the keynote was GitHub Spark, described like this:

Sparks are fully functional micro apps that can integrate AI features and external data sources without requiring any management of cloud resources.

I got to play with this at the event. It's effectively a cross between Claude Artifacts and GitHub Gists, with some very neat UI details. The features that really differentiate it from Artifacts is that Spark apps gain access to a server-side key/value store which they can use to persist JSON - and they can also access an API against which they can execute their own prompts.

The prompt integration is particularly neat because prompts used by the Spark apps are extracted into a separate UI so users can view and modify them without having to dig into the (editable) React JavaScript code.

# 1:23 am / gemini, anthropic, openai, ai, llms, ai-assisted-programming, github-copilot, github, claude-artifacts, react, javascript

Oct. 29, 2024

Generating Descriptive Weather Reports with LLMs. Drew Breunig produces the first example I've seen in the wild of the new LLM attachments Python API. Drew's Downtown San Francisco Weather Vibes project combines output from a JSON weather API with the latest image from a webcam pointed at downtown San Francisco to produce a weather report "with a style somewhere between Jack Kerouac and J. Peterman".

Here's the Python code that constructs and executes the prompt. The code runs in GitHub Actions.

# 11:12 pm / vision-llms, drew-breunig, llm, generative-ai, ai, llms, github-actions, prompt-engineering