I've been prodded to disseminate this a little more widely during Week 1. For reference, the two approaches were as follows:
// Approach 1
function anagram(text) {
    var a = text.split("");
    for (var i = 0; i < a.length; i += 1) {
        var letter = a[i];
        var j = Math.floor(Math.random() * a.length);
        a[i] = a[j];
        a[j] = letter;
    }
    return a.join("");
}
and
// Approach 2
function anagram(text) {
    return text.split("").sort(function () { return 0.5 - Math.random(); }).join("");
}
A thing to note about these is that they don't achieve the same thing. I initially posted a demonstration of that here.
Firstly, note that both approaches are biased in terms of the distribution of permutations they create.
There's a simple counting argument here as to the bias in the two approaches. (Both might be suitable for a follow-up question during this hypothetical interview.)
For the first approach, we make n swaps, each picking a target from n elements. There are thus n^n potential paths through the algorithm. However, there are n! distinct permutations. Because the former value doesn't divide cleanly by the latter [example: n = 8, n! = 40320, n^n = 16777216, and 16777216 isn't evenly divisible by 40320], the argument is that there must be some bias in the outcome - some "anagrams" crop up more frequently than others. (This is evident even when we're producing anagrams of a simple string like "abc", which you can try by hand.)
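(As a quick empirical check - a sketch of mine rather than part of the original analysis - we can tally Approach 1's outputs over many runs. A uniform shuffle of "abc" would give each of the 6 permutations a sixth of the hits; the naive swap gives each permutation either 4/27 or 5/27 of them, which you can confirm by enumerating the 27 equally likely paths.)

// Tally how often each permutation of "abc" appears under Approach 1.
function anagram(text) {
    var a = text.split("");
    for (var i = 0; i < a.length; i += 1) {
        var letter = a[i];
        var j = Math.floor(Math.random() * a.length);
        a[i] = a[j];
        a[j] = letter;
    }
    return a.join("");
}

var counts = {};
for (var run = 0; run < 270000; run += 1) {
    var result = anagram("abc");
    counts[result] = (counts[result] || 0) + 1;
}
// Expect counts near 40000 (4/27) for some permutations and 50000 (5/27) for others,
// rather than the 45000 apiece a uniform shuffle would give.
console.log(counts);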
For the second approach, any analysis of behaviour needs to rely on the fact that the sorting algorithm is likely to have been picked with a runtime of n . log_2 n comparisons. [There are multiple ways to implement a sort, but that's a common behaviour. For n = 8, a sort may make 8 . (log_2 8) = 24 comparisons.] Each time the sort algorithm makes a comparison, it'll perform one of two operations: swap, or don't swap. Thus, there are 2 ^ (n . log_2 n) potential paths through the algorithm - or again, n^n, since 2 ^ (n . log_2 n) = (2 ^ (log_2 n))^n = n^n. [For the case n = 8, this comes out as 2^24 = 16777216 again.]
However, sorting algorithms try hard not to make comparisons when they "know" what the result would be; thus, this approach produces a much stronger bias in the sampled outcomes. Indeed, some potential anagrams of eight characters don't show up at all during the sample run.
[Assuming the sort function isn't broken, all permutations can turn up - we'd just require far more runs before we expected to see the extreme cases.]
On the question of which of those approaches is "stronger": the first approach produces outputs which are closer to uniform. Additionally, it requires a far smaller change to bring its output to a strictly uniform distribution. The second approach may be "clever", but it's by no means clear that it could be adjusted, should uniformity of output be desired.
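(For illustration - my sketch, not part of the original post - the "far smaller change" is a one-line tweak to Approach 1: draw the swap target j only from the not-yet-visited portion of the array, which turns the loop into a Fisher-Yates shuffle with a strictly uniform output:)

// In Approach 1's loop, replace the choice of j with:
var j = i + Math.floor(Math.random() * (a.length - i));  // j uniform over [i, a.length)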
If I had to interpret those two samples, I'd say it was much more likely that the second was a reproduction of a trick seen elsewhere, whereas the first seems more likely to be an on-the-spot invention.
Why desire a uniform distribution? There's a certain symmetric appeal to it. At the other end of the spectrum, sorting all input characters may suffice - or simply returning the input string. (Where Unicode strings are concerned, "returning the value you've been given" is about the safest thing you can do - see below.)
If the requirement in the problem statement ["An anagram contains all the same characters of the initial string or word in a different order"] is a strong one, then the last option doesn't work - and both of the suggested approaches also have a flaw, since a random shuffle can return the characters in their original order. However, under those circumstances, we should note that there are inputs where no algorithm can meet a strict interpretation of that requirement. [Examples: any single-character string, or "AAA", etc.]
There may be other, less specific versions of "better," when picking a distribution - for instance, finding letter-orderings which produce more challenging anagrams for a human solver, or ones that embed wry phrases.
Pedantry to one side, however...
In typical i18n-blindness, none of the approaches here deals correctly with combining characters and other tricky Unicode features.
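(To illustrate - a sketch of mine, not from the original post - split("") separates combining marks from their base letters, and splits astral-plane characters into their surrogate halves:)

var decomposed = "e\u0301";        // "é" written as: e + COMBINING ACUTE ACCENT
console.log(decomposed.split("")); // ["e", "\u0301"] - the accent becomes a free-floating "letter"

var emoji = "\uD83D\uDE00";        // 😀 (U+1F600), stored as a surrogate pair in JavaScript
console.log(emoji.split(""));      // ["\uD83D", "\uDE00"] - two meaningless half-characters

Shuffle those arrays and rejoin them, and the accent can land on the wrong letter, or the surrogates can be torn apart into invalid text.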
I'd also observe that there's a western bias - perhaps even an Anglophone one - to the problem statement. I'm not sure that the notion of an anagram even applies across all the languages that Unicode covers. But the problem there can be found closer to home: if you asked Herr Gauß how many anagrams his surname generated, the answer probably wouldn't be 24. The Unicode consortium offers some partial approaches to dealing with this, but they are by no means comprehensive (and a desire for a comprehensive approach is misplaced, usually arising from an underestimation of how complex the problem space is). Software that deals with human text is notoriously hard to get even close to correct in a cross-cultural fashion. (cf. the question of "how should software handle names?" which has been the subject of a number of writeups.)
Comments
Excellent commentary. Thanks for bringing up a few great facts.
Choosing a random permutation using the Math.random() - 0.5 comparator does not give a uniform distribution and should never be used in "real" code. It is BAD
https://news.ycombinator.com/item?id=2728914
The proper algorithm to use is the Fisher-Yates shuffle https://medium.com/@oldwestaction/randomness-is-hard-e085decbcbb2 https://en.wikipedia.org/wiki/Fisher–Yates_shuffle
Anything dealing with text is WAY HARDER than people understand. They have to know about normalization (the two characters LATIN CAPITAL LETTER E followed by COMBINING ACUTE ACCENT should be the same as the single character LATIN CAPITAL LETTER E WITH ACUTE), and then there are emojis and languages with all kinds of markers and things etc. etc. Most programming languages, including JavaScript and Java, do a terrible job with them.
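(To make that concrete - an illustration of mine, not the commenter's - using JavaScript's built-in String.prototype.normalize:)

var precomposed = "\u00C9";  // LATIN CAPITAL LETTER E WITH ACUTE
var decomposed  = "E\u0301"; // LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT
console.log(precomposed === decomposed);                                    // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC"));  // true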
I've been fighting the battle against "i18n blindness" so it's nice to see this. Characters are very, very hard. I've written about this here: https://cs.lmu.edu/~ray/notes/charenc/. I tell students:
Anagramming is indeed an example of a problem that makes sense only in certain cultural contexts. All text processing is culturally dependent.
Thanks for taking the time to post your analysis. When I read these two contrasting examples from the Introduction, I also thought there was an inconsistency about the assertion that “both code snippets do the same thing”, since — of course — the whole point of the discussion is that they don't. But on re-reading, I note @markcmarino was careful to write that the second example “performs the same operation” as the first (Marino 2020, 6). For my money, that's just the right amount of wiggle room to keep the (rhetorically effective) example intact, notwithstanding your analysis.
What makes your analysis nevertheless so useful is that it illustrates how running the same code over and over again (trivially cheap for such an example) can sometimes debunk the apparent “sameness” — let's call it functional equivalence — of two implementations, a difference that may only become obvious in the limit case of repeated execution.
Also, I really liked your observation that certain inputs (e.g. 'aaaa') can reveal deficiencies in the problem statement of the interview question. Adversarial inputs — e.g. “naughty strings” — are a useful tool for software testers. In the spirit of joint enterprise, CCS should avail itself of as many of these ideas as possible. Pedantry... with purpose!

@jang -- Re: your comment ...where is that implementation of random.shuffle()?

Also, I wasn't too familiar with ß and haven't thought about it in terms of anagrams. Is the idea here that valid anagrams of Gauß might include Gaßu or Gussa OR Gasus?
Interesting point on i18n, Unicode, and the cultural specificity of anagrams. It also points to the cultural context of the questioning -- interviewees might expect that they are being asked a question about performing unit operations on character sequences, rather than creating outputs that are semantically or alphabetically valid in a particular language. It also points to the metagame of the interview, as @rtoal suggests: when can (and should) an interviewee or a working programmer ask clarifying questions about the domain of application, assumptions, and goals for the code they produce, even (or especially) if it is viewed as a mere utility?
There are lots of reasons, but for certain kinds of problems -- in areas like cryptography, compression, search, or the financial or gambling industries -- it might be extremely important to understand distribution bias and efficiently mitigate it where needed (it might also be helpful to understand PRNGs in general when working on such problems). For example, knowing outlier values that come up more or less often could provide the crucial element that a gambler needs to "beat the house" or a codebreaker to "crack the code."
It's worth noting as well that while usually one might prefer more concise code, giving a sort algorithm an inconsistent comparator (one whose answers don't define a fixed ordering) is a violation of most sort APIs' contracts, and for me would be a red flag in an interview.
As a bit of a rabbit-hole: inconsistent comparison functions can increase the likelihood of worst-case runtime performance. In this particular case it seems that Mozilla prefers mergesort (which has the optimal time bound @jang assumed) but Chromium uses quicksort, which has a non-optimal worst-case complexity -- although that worst case is triggered by consistently unequal partition sizes, which this particular comparator is not biased towards producing.
In any case, the first approach has linear runtime, and at the heart of most interviews I've been in is establishing a basic understanding of the performance of code, and a preference for more efficient code absent a good reason to do otherwise (such as distributional bias).
While the performance of the anagram function itself may not be critical to the success of the hypothetical system using it, job interviews and educational assignments generally use the meaningless as a proxy for the meaningful, and expect one to treat tedium with reverence (which is perhaps a cheeky summary of computing as a whole, echoing @eamonnbell's 'pedantry with purpose').
@eamonnbell -- thanks for sharing the Big List of Naughty Strings. If you look at the .json file of the strings, you indeed find a big list -- presented without comment. But if you look at the blns.txt file, it is broken into labeled categories of input types, sometimes with short descriptions.
The categories of "naughty strings" are interesting -- some are specific to certain languages or platforms or even applications (like iOS iMessage), some are cultural practices (like emoticons), some are forms of hacking / known vulnerabilities (like SQL injection), and some are second-order effects, like innocuous strings which may be blocked by profanity filters (the Scunthorpe problem).
The document is also quite playful. One category, "Human injection: Strings which may cause human to reinterpret worldview" simply contains:
Even beyond the complexity of it all, it is impossible for me to browse through these categories without being constantly made aware how deeply culturally embedded and situated text processing is -- and in how many different ways.
The full list of categories:
(On the question of Python's random.shuffle:)
You'll find it here
There's a quick mention of its rationale here.
Something like this ought to suffice also:
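(A minimal sketch, assuming what's meant is an in-place Fisher-Yates shuffle:)

function anagram(text) {
    var a = text.split("");
    // walk backwards, swapping each position with a uniformly chosen
    // position at or below it - every permutation is then equally likely
    for (var i = a.length - 1; i > 0; i -= 1) {
        var j = Math.floor(Math.random() * (i + 1));
        var letter = a[i];
        a[i] = a[j];
        a[j] = letter;
    }
    return a.join("");
}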
or in Python:
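(Again a minimal sketch of the same Fisher-Yates idea:)

import random

def anagram(text):
    a = list(text)
    for i in range(len(a) - 1, 0, -1):
        # pick j uniformly from [0, i]; tuple assignment does the swap
        j = random.randint(0, i)
        a[i], a[j] = a[j], a[i]
    return "".join(a)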
@jang
s[i], s[j] = s[j], s[i]
is a fun way to do a swap.

@rtoal an equivalent result to using the Fisher-Yates shuffle to get a uniform distribution would be the use of factoradics (the factorial number system), which were also described by Knuth, though originating elsewhere.
The main difference is that we randomly select a permutation up front in the range [0, n! - 1]. The mixed-radix representation of the factoradic then allows us to map (or 'unrank', as it's sometimes termed, at least with the similar idea of combinadic numbers) this permutation ID to a specific permutation of the input. Runtime is also linear.
This approach could be preferable to minimize the amount of entropy consumed from the random source, or if we want a compact representation of anagrams to then store in bitmaps or dictionaries.
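(A sketch of the unranking idea in JavaScript - my own illustration of the comment above, not the commenter's code. Note that this naive version is quadratic because of splice, and floating-point precision on n! limits it to strings of roughly 18 characters:)

function anagram(text) {
    var a = text.split("");
    var n = a.length;

    // n! - the number of orderings (treating repeated characters as distinct)
    var factorial = 1;
    for (var k = 2; k <= n; k += 1) {
        factorial *= k;
    }

    // choose a permutation ID uniformly from [0, n! - 1]
    var rank = Math.floor(Math.random() * factorial);

    // peel off the mixed-radix (factoradic) digits of the rank;
    // each digit selects one of the remaining characters
    var result = [];
    for (var i = n; i > 0; i -= 1) {
        factorial /= i;                      // now (i - 1)!
        var digit = Math.floor(rank / factorial);
        rank %= factorial;
        result.push(a.splice(digit, 1)[0]);  // take the digit-th remaining character
    }
    return result.join("");
}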