A few months back I published on this blog two pieces (see here and here) in which I was able to break ChatGPT4 (hereafter just ChatGPT). Tonight I revisited some of the types of challenges that broke it back then. It handled them tonight with aplomb. The one challenge it still could not handle was finding patterns in numerical sequences, such as the next number after 2, 4, 6, 8, … (this one is super easy, but more difficult ones that humans handle readily still seem beyond its capabilities, at least for now). Whether that ability can be trained into it remains to be seen, though the improvement I saw tonight suggests that it may soon acquire it.
Self-Reference
First off, consider the question of self-reference, which raises the Gödelian challenge of a human moving outside a formal system to discover truths about it that seem unreachable from inside the system. To test this type of capability, I’ve given it challenges such as the following (my queries and ChatGPT’s responses in red print):
This is a sentence. The previous was the first sentence. This is the third sentence. Beethoven was a better composer than Mozart. The second and third sentences together have fewer words than three times the number of words in the first sentence. This sentence is false. If the fifth sentence were swapped with the third sentence, then the third sentence would be false. Which of the previous sentences has a truth value and which of these is true?
In the past, when I’ve posed this sort of question to it, it botched it badly. But not tonight. Here is its answer:
Let’s break down each sentence and determine their truth values:
- “This is a sentence.”
  Truth Value: True (It’s a factual statement).
- “The previous was the first sentence.”
  Truth Value: True (It refers correctly to the first sentence).
- “This is the third sentence.”
  Truth Value: True (It is, in fact, the third sentence).
- “Beethoven was a better composer than Mozart.”
  Truth Value: Subjective (It’s an opinion, and there’s no definitive answer to which composer is better).
- “The second and third sentences together have fewer words than three times the number of words in the first sentence.”
  Calculation:
  - First sentence: 4 words
  - Second sentence: 6 words
  - Third sentence: 5 words
  Three times the number of words in the first sentence: 3 * 4 = 12
  Total words in the second and third sentences: 6 + 5 = 11
  Truth Value: True (11 is fewer than 12).
- “This sentence is false.”
  Truth Value: Paradox (The liar’s paradox: if the sentence is true, then it’s false, but if it’s false, then it’s true).
- “If the fifth sentence were swapped with the third sentence, then the third sentence would be false.”
  Truth Value: True (If you swap them, the third sentence, originally stating “This is the third sentence”, will become “The second and third sentences together have fewer words than three times the number of words in the first sentence”, which would be false in that new position).
Given the breakdown, the sentences with a definite truth value are sentences 1, 2, 3, 5, and 7. All of these, except for sentence 4 (which is subjective), and sentence 6 (which is a paradox), are true.
It nailed it.
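As a quick aside, the word-count step in that breakdown is easy to verify independently. Here is a minimal Python sketch (my own check, not part of ChatGPT’s output), with the sentence strings transcribed from the puzzle above:

```python
# Independent check of the word-count step (sentence strings transcribed from the puzzle)
first = "This is a sentence."
second = "The previous was the first sentence."
third = "This is the third sentence."

counts = [len(s.split()) for s in (first, second, third)]
print(counts)                                 # [4, 6, 5]
print(counts[1] + counts[2] < 3 * counts[0])  # 11 < 12 -> True
```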
Resizing and Reasoning with Common Objects
Another type of problem I presented to it in the past, and with which it had difficulty, involves taking common objects with obvious size disparities, resizing them, and asking ChatGPT to reason about the resized objects. Here’s my query:
Consider a mouse, an alligator, and a whale. Imagine that the first item on this list is the size of the third item on this list and vice versa. Assume that they eat each other. If they eat each other in the style of a Russian doll, what eats what?
Here is its response:
Given your scenario:
- The mouse is now the size of a whale.
- The alligator remains its size.
- The whale is now the size of a mouse.
In a Russian doll style (where each item fits inside the next larger one):
The (now tiny) whale would be eaten by the alligator, and then the alligator would be eaten by the (now gigantic) mouse.
It nailed this too.
Inference to the Best Explanation / Abduction
In 2021, before ChatGPT hit the world, Erik Larson wrote a book with Harvard University Press titled The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do. I reviewed it on this blog. The argument in that book that I found most compelling is that artificial intelligence, while it can handle deductive and inductive inference, cannot handle abductive inference (also known as inference to the best explanation, or IBE). IBE is effect-to-cause reasoning in which an effect or state of affairs is noted and one of several candidate explanations is selected as best explaining it. A query I posed tonight to ChatGPT suggests that it has a handle on IBE as well. Here’s my query:
There is an operating room at a hospital. The hospital has a total of 10 operating rooms. In nine of those operating rooms, the rate at which surgeries commit catastrophic mistakes leading to the death of the patient is .1 percent. But in the operating room in question, the death rate by surgical mistakes is 5 percent, a 50-fold increase. Patients are randomly assigned to the surgical rooms, so nothing about the patients seems to account for this drastic difference. Consider now the following explanations of the difference: 1) Classical music is played in the nine operating rooms but smooth jazz in the operating room with the higher rate of mistakes. 2) A deranged hospital orderly is piping into the ventilation system for the problematic room an opioid soporific.
I hit enter too soon, so I didn’t technically get to ask it to find the best (or better) explanation. Nonetheless, it did just that. Here is ChatGPT’s reply:
Given the drastic difference in the rate of surgical mistakes leading to patient deaths between the problematic operating room and the other nine rooms, it’s crucial to approach the situation methodically.
Let’s consider the two explanations:
1) Music Difference: Classical vs. Smooth Jazz
The idea that music could influence the performance of surgical teams is not entirely out of the realm of possibility. Some studies suggest that music can have an impact on concentration, mood, and performance. For instance:
- Classical music might be associated with increased concentration, relaxation, and cognitive performance. It could also potentially mask distracting noises better than other genres.
- Smooth jazz, on the other hand, might have a different tempo and rhythm that could possibly influence the pacing or focus of surgical procedures. Some might find it too relaxing or potentially distracting.
However, while the type of music may have some influence, it’s somewhat unlikely that a mere change from classical to smooth jazz could account for such a drastic difference in surgical mistakes. A 50-fold increase in mortality due to a genre of music seems improbable unless there’s strong empirical evidence backing it.
2) Opioid Soporific in the Ventilation System
This explanation is far more alarming and, if true, criminal. The introduction of an opioid soporific into an operating room would:
- Impair cognitive functions of the medical staff, including decision-making, coordination, and response time.
- Increase the likelihood of mistakes, from minor oversights to major surgical errors, directly accounting for the higher rate of deaths.
- Present clear ethical and legal violations. It could endanger the lives of patients and put the entire medical facility at risk, both reputationally and legally.
Given the gravity of the implications, this theory should be investigated immediately. If proven true, this would be a severe breach of medical ethics and could lead to significant legal repercussions for the perpetrator and the hospital.
Conclusion:
While both theories present possible explanations, the opioid hypothesis seems far more plausible as a direct cause for such a drastic increase in surgical mistakes. However, both theories should be thoroughly investigated, with urgent priority given to the potential introduction of harmful substances into the OR. Immediate measures should be taken to ensure the safety of both patients and medical staff.
Again, it nailed it. Its reference to “far more plausible” (highlighted in yellow in the last paragraph) suggests that it is becoming quite comfortable with IBE reasoning.
Humor — Does It Catch a Joke?
Finally, I posed the following query to ChatGPT. It is a joke I first heard in 1984, though I didn’t frame it as a joke but left it open-ended:
Passengers are on a plane. A voice comes over the intercom. “Hi everyone. This is a fully automated flight. There are no human pilots. This flight is entirely robotically controlled. This may be a new experience for you. But there is absolutely nothing to worry about, worry about, worry about.” What do you make of this story?
Here is its reply:
The story starts by setting a tone of modernity and technological advancement with the announcement of a fully automated flight. The passengers are told there’s no human intervention, which may initially create a sense of unease for some due to the unfamiliarity and potential risks associated with technology. The reassurance provided is supposed to be comforting.
However, the repeated phrase “worry about” at the end introduces an element of suspense and irony. It implies a malfunction or glitch in the system, which is meant to be flawless. This repetition instantly shifts the mood from one of reassurance to one of apprehension and doubt.
The story plays on our society’s mixed feelings about automation and artificial intelligence. While we appreciate the conveniences and advancements technology brings, there’s also an inherent fear about its reliability and potential to go wrong, especially in critical situations like air travel.
In short, the narrative highlights the duality of our relationship with technology: trust vs. doubt, and comfort vs. fear.
Again, ChatGPT nailed it.
It seems that ChatGPT is making marked improvements in real time. It still has limits. The Chronicle of Higher Education reported just the other day on librarians being upset that ChatGPT makes up nonexistent citations to the scholarly literature. And I can still break it, as with its difficulty spotting patterns in numerical sequences. But it is becoming harder and harder for me to break it. Indeed, I’m growing more and more impressed with its abilities.
Interestingly, its abilities seem to depend on more than just LLMs (large language models). Some sort of preprocessing and postprocessing seems to be applied to my queries/prompts, expanding on them so that the answer takes a methodical form.
Growing Mathematical Abilities
Also, ChatGPT’s growing mathematical ability suggests that it is incorporating straight-up mathematical tools from software companies like Wolfram. Consider, for instance, the following query I posed:
A roulette wheel with 0 and 00 is spun 3800 times. What is the probability that the number 15 will be spun 50 or fewer times?
Here is the start of ChatGPT’s response:
To solve this problem, we can use the binomial probability distribution since the problem involves a fixed number of independent trials, and each trial has only two outcomes (the ball lands on 15 or it doesn’t).
What follows is a binomial analysis in which the probability of 15 on a single spin is given as 1/38, the complementary event as 37/38, and the probability P(X ≤ 50), where X is the random variable counting the number of 15s observed, is expressed with a precise and correct mathematical formula involving a summation from 0 to 50. The answer is right, though ChatGPT doesn’t actually perform the numerical calculation. When I pressed it on this point in another query, it punted, saying that the calculation was labor intensive (which it is). Note that I didn’t simply copy and paste ChatGPT’s full answer here because the WordPress editor can’t handle its mathematical notation.
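Since ChatGPT set up the summation but declined to evaluate it, here is a minimal Python sketch of that calculation (my own, not ChatGPT’s output), assuming the standard double-zero wheel so that the single-spin probability of 15 is 1/38:

```python
# Not ChatGPT's output: a direct evaluation of the binomial summation it set up.
# X ~ Binomial(n = 3800, p = 1/38); we want P(X <= 50), i.e. the sum from k = 0 to 50.
from math import comb

n, p = 3800, 1 / 38
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(51))
print(prob)  # expected count is n*p = 100, so 50 or fewer hits is a very small left-tail probability
```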
Conclusion
ChatGPT is a serious and disruptive tool. As it gets better, knowledge workers are likely to face greater and greater challenges. Ironically, traditional trade schools may thrive, since working with physical things with one’s hands is going to be harder to replicate with AI, though it will be interesting to see how far robotics can go.
I continue to maintain that human intelligence is qualitatively different from artificial intelligence. I regard consciousness as something inherently beyond the remit of machines (and thus don’t regard our intelligence as ultimately mechanical). But these systems are becoming impressive, and we do well not to underestimate them.
I was reading the other day about a professor who gave his final exam to ChatGPT, having ChatGPT’s exam graded by TAs along with the exams of the other students. The professor took solace in the fact that ChatGPT scored only a C on the exam and that the rest of the class performed better. It will be interesting to see how well ChatGPT does on such exams a year from now.