Since last spring quarter, I’ve found myself in an awkward place as I try to work on my Qualifying Paper (QP). The ideal case, as one of my Advisors recently stated, is for students to decide on a QP topic during the spring of the 1st year, to read and conceptualize over the summer, to write a proposal in the fall, to collect data in the winter, and to write up the research in the form of a QP in the spring of the 2nd year. Very few students follow this idealized path, but my process is especially odd because there are a number of fundamental assumptions about the research process built not only into the pathway as a whole, but also into the particular (and particularly ordered) milestones along the way.
Over the last year, I’ve increasingly come to identify myself as a Learning Analytics (LA) researcher. LA is a field in its infancy, and as such it has very few established norms, seminal papers, or codified methods. My belief is that, as in other areas that have undergone data revolutions, education and education research will be totally upended by LA as it matures. In a world of easily available big data, in short, it’s increasingly difficult to justify research that ignores big data, especially if the primary reason for ignoring it is discomfort or a lack of expertise, and not its irrelevance (since it can be relevant to almost any research agenda).
The implications of a coming data revolution in education for an individual researcher who is still very much “in training” are many. I find myself torn in multiple directions, needing not only to develop a broad and deep understanding of existing literature in educational research – particularly on topics that interest me – but also needing urgently to develop computational and statistical competencies that go well beyond what is traditionally taught in schools of education. Regression models don’t cut it anymore.
This raises an essential point about the research process – particularly for an apprentice researcher – at this stage in the development of LA as a field. In traditional education research, tremendous emphasis is put on the development of good research questions. This is equally important in LA research, but with a key difference: in LA the most important and interesting questions are liable to arise well after data has been collected and analysis has already begun. In traditional research, collecting data before articulating, motivating, and conceptualizing key research questions and developing or appropriating instruments would be insanity. In LA – at least for a beginner who is still developing technical competencies, in addition to conceptual ones – it’s sometimes impossible to ask the question without first knowing what the data looks like, how it can be manipulated, and what kinds of questions are liable to yield interesting results.
Now, I’ve heard in my first year and a half as a PhD student the admonition that research should never be driven by methods. But research is always driven by methods. It’s a fine ideal to say that we should ask the most important and interesting questions, regardless of the methods necessary to answer them, but in practice that’s an impossible way to do research. Now, what I’m not talking about here is the narrow refusal of so-called “quantitative” and “qualitative” researchers to use each other’s methods to answer questions. It is patently absurd for a researcher to refuse to ask a certain type of research question, one whose answer would require conducting interviews, simply because his own professional expertise is in regression models. That’s what collaboration is for.
No, what I mean is that in the very definition of what makes a good research question in Education there are implicit methodological biases that go well beyond the quantitative / qualitative divide. There are a set of methods which, though vastly different from each other, generally constrain the entirety of modern educational research. These are effective methods, no doubt, in many cases. They are tried and true, tested and approved, beloved and practiced. They are what any first year PhD student in a school of Education will be trained in, or at least exposed to, as a part of the core curriculum. They are what Advisors teach their advisees. I speak, of course, of regression modeling, of survey design, of psychological 2-by-2s, of “think alouds” and semi-structured interviews, of observation protocols, of video capture, of discourse analysis, of… you get the idea: the bread and butter of research in education. This set of methods for data collection and analysis constrains research questions, not because researchers cannot ask questions that these methods cannot answer, but rather because researchers do not call such questions research questions.
If I want to know how students use language in a Massive Open Online Course, broadly, I cannot use any of those existing methodologies without significantly altering my question. Such was my first pass at developing a research question, in which I planned to do discourse analysis on a subset of threads in the forums, sampled based on certain key elements (namely, debates around a particular problem from one of the problem sets). To do that research, however, would have been arbitrarily limiting, driven by the methods of traditional educational research. But the data set is so much richer than that. The problem is that there is no existing literature from which to borrow research questions, and no set of accepted methods for this kind of data that can help scope a research agenda. In falling back on traditional educational research, my first efforts at crafting a study were arbitrarily limiting, driven by a lack of technical expertise.
So what is the best question to pose? The answer is: the first question that an LA researcher needs to ask is exactly, “what is the best question to pose?” The first pass at data analysis ought to be driven by theory, of course, and a sense of what questions might be interesting, but the wonder of computation and massive data sets is that it is relatively cheap to get data, relatively cheap to test hypotheses with it, and relatively easy to totally change one’s research questions. In contrast to traditional educational research, where a change in research question may totally invalidate all of the collected data, in LA research there is so much data that ignoring significant chunks of it is essential to making progress in the first place.
In my case, in particular, the question generation process is even more emergent because my computational and statistical abilities are a work in progress. I know I am interested in language use, in one form or another, but I am still a novice in the area of Natural Language Processing (NLP), and thus do not know what questions are answerable and what are not. Every time I learn to do something new computationally, the possible research questions I might be able to ask change. As research questions change with technical skill, so too do the needs and demands for conceptual literature reviews. I’ve been hesitant to commit to a single conceptual framing, therefore, and to do an in-depth literature review, because almost every week the interesting and important questions look so different that the literature I need to review changes.
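To make this concrete, here is a minimal sketch, in Python, of the kind of first-pass exploration I have in mind. The forum posts here are invented placeholders (not real MOOC data), and the tokenization is deliberately naive, but even something this crude starts to suggest questions, about vocabulary, about who echoes whom, that I could not have posed before writing it:

```python
import re
from collections import Counter

# Hypothetical forum posts; a real MOOC forum would have thousands of threads.
posts = [
    "I think the answer to problem 3 is wrong because the units don't match.",
    "Has anyone else tried the alternative approach to problem 3?",
    "The units do match if you convert to SI first. Check the lecture notes.",
]

def tokenize(text):
    """Naive tokenizer: lowercase alphabetic words only. Real NLP would do far more."""
    return re.findall(r"[a-z']+", text.lower())

# A first-pass question: what words dominate the discussion?
counts = Counter(tok for post in posts for tok in tokenize(post))
print(counts.most_common(5))
```

The point is not the word counts themselves but the loop they enable: each small script like this reshapes what the next research question looks like.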
What’s more, I can think of only one researcher / research team in LA so far that has done much NLP (Simon Buckingham Shum’s group at the Open University in the UK). Of course I should learn from Simon, but his research alone hardly constitutes a field from which I can draw on accepted norms, best practices, and a technical or conceptual training regimen. NLP itself is a fairly well-developed field, but as I begin to learn its methods and apply them to my big educational data sets I am faced with a conundrum: spend more time building technical expertise, or use my current expertise – limited though it is – to begin to do educational research. This is an important point: LA is not, in the end, a computer science field. It uses computer science, it uses NLP, it uses machine learning, it uses database management… but at the end of the day it’s a branch of educational research, with its roots in the Learning Sciences. I suspect many LA researchers – and particularly Educational Data Mining (EDM) researchers – would disagree with that statement, valuing the higher prestige ties to computer sciences and mathematics over the less sexy relationship with education, but I think it is vitally important that we don’t let computational power totally overwhelm the theoretical and scientific knowledge generated by the last century of research in education, lest we make the classic mistake of assuming that learning is simpler than it is.
So the question is unresolved: draw on expertise from NLP and other computational fields (data visualization, for example, and data mining for the purposes of clustering before performing NLP analyses) and thereby invest my time and energy into expanding my technical base, which in turn opens up richer and more interesting avenues for research, or begin to develop a stronger conceptual base for the kinds of research questions I feel I can ask now, and make progress on completing the QP. The answer, of course, is both. The problem, of course, is that each step in the former direction has so far forced a (totally unanticipated) redefinition of the problem in the latter. Add to this that the process of LA research is almost necessarily collaborative (in my case it is), and the problem becomes even more intractable. I cannot arbitrarily stop my collaborators from developing new and exciting questions simply because I have now invested time and energy into better conceptualizing a prior question. Ultimately, as I build my technical expertise, as the field adopts norms and expectations, and as training in LA becomes more formal and less ad hoc, this problem will begin to go away, both for me and for those who come after me. But for now there is a tension.
Such is the challenge of doing new research.