Which programming language for this?

Anything to do with the traditional world of get a degree, get a job as well as its alternatives
ertyu
Posts: 2914
Joined: Sun Nov 13, 2016 2:31 am

Which programming language for this?

Post by ertyu »

One of the jobs that seems to be available in my hometown, paying 500-600 euro per month, is media content analyst. As I understand it, you trawl sites for articles and classify them by topic and by whether they are positive, negative or ambivalent (assessment varies by project). My friend who works it says it's a lot of work, but he is happy with how he's been treated and with the promotions he's been given (I think he is now an analyst proper, as opposed to a junior analyst, and earns somewhere between 600 and 1000 euro per month). He is now being trained in search engine optimization.

Normally, I would not touch a job like this with a ten foot pole. But I've had the thought that this is probably a process that can be significantly automated if you know the right programming language. If I can earn 500 euro per month with an hour's worth of effort per day, I would be more than glad to work this, especially given you can work from home/long distance, which makes me location independent - at least within my home country. It would probably make me location independent in Southeast Asia, too, if I supplement with the stash.

This is still a very undeveloped idea, but I know there are a couple of tech-savvy people on the forum, so I thought I'd ask. To what extent is this job possible to automate? What programming language would I need to know to automate it? And how long would you estimate I would need to reach this level? Assume average intelligence - I am annoyingly mediocre at all areas of knowledge - I can become averagely proficient in all but not a particular whiz in any of them. So I could probably become a solid coder but hardly a genius programmer. No prior knowledge of programming.

Thank you for your thoughts
Last edited by ertyu on Wed Jun 17, 2020 3:24 am, edited 1 time in total.

Quadalupe
Posts: 268
Joined: Fri Jan 23, 2015 4:56 am
Location: the Netherlands

Re: Which programming language for this?

Post by Quadalupe »

I'd say Python. Python has a large number of libraries that can be used for web scraping (beautifulsoup, for example). The other two tasks are topic classification and sentiment analysis, both in the field of machine learning. Python has some good libraries for this as well: scikit-learn for traditional machine learning and tensorflow/pytorch/fastai/... for deep learning architectures. Nowadays, you can often pick a pre-trained model off the shelf, fine-tune it a little on your dataset and you're done(ish). This works surprisingly well for a lot of tasks.

If there isn't a pretrained model yet for your language, it's harder.

As for time: to hack something together with zero experience in programming?

web scraping: ~1-4 weeks
text-classification/sentiment analysis: ~1-4 weeks if language models are available, ? if you'd have to train it all from scratch.
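
To give a feel for the scraping half, here is a minimal sketch of pulling the readable text out of a page. To stay dependency-free it uses Python's built-in html.parser rather than beautifulsoup, and it works on a made-up HTML string instead of fetching a live URL; the names TextExtractor, visible_text and sample are invented for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up page standing in for a downloaded article.
sample = """<html><head><title>News</title><style>p {color: red}</style></head>
<body><h1>Factory reopens</h1><p>Locals are <b>happy</b> about new jobs.</p>
<script>var x = 1;</script></body></html>"""

print(visible_text(sample))
```

Real sites are messier (menus, cookie banners, comments), which is where most of the 1-4 weeks would likely go.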

JCD
Posts: 139
Joined: Sat Jul 20, 2019 9:12 am

Re: Which programming language for this?

Post by JCD »

It seems like there are two different answers, depending on the level of automation.

The first level of automation is to simply do some very basic analysis. For example, maybe you have 3 sets of words, grouped by your classification system (positive, negative or ambivalent), and you just want to count the number of times each word shows up. Assuming that your main goal is just to count words, Perl is often pointed to as the king of text parsing. However, it is an old language and a bit archaic. Since you asked the question without listing any knowledge of programming, I assume you don't have a lot of programming experience, so I'd avoid Perl. I'd pick any modern language, since this sort of solution is trivial. The biggest effort would be stripping out the HTML so that you only analyze the text, so maybe pick a language with an existing module that solves that problem for you. I'm mostly a C#/Java guy and find they make sense to me, but lots of people today are excited by Python and Go. Basically it is all taste, since they will all solve this problem equally well.

It would take me 1-2 hours to read in a list of URLs, have a program request those URLs and get the HTML back. Getting rid of the HTML would probably take several days, as I would look for a module and have to figure out how to make the magic actually work. I'd probably add some safeties, e.g. sites with less than 200 characters of content are marked as suspect and need manual review. Maybe any where the positive and negative sentiment scores are extremely high or low also get manually reviewed. Basically this is hand tuning you'd do over a few days/weeks.
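
A rough sketch of that first-level approach, assuming the HTML has already been stripped. The word lists are invented placeholders, and treating "no matching words at all" as a manual-review case is my own reading of the "extremely low scores" safety, not a fixed rule.

```python
import re
from collections import Counter

# Hypothetical word lists; a real project would build its own.
POSITIVE = {"good", "great", "happy", "success", "growth"}
NEGATIVE = {"bad", "poor", "angry", "failure", "scandal"}

MIN_CHARS = 200  # shorter texts are marked suspect


def score(text):
    """Return (positive hits, negative hits) for a plain-text article."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    pos = sum(words[w] for w in POSITIVE)
    neg = sum(words[w] for w in NEGATIVE)
    return pos, neg


def classify(text):
    if len(text) < MIN_CHARS:
        return "needs manual review"
    pos, neg = score(text)
    if pos == 0 and neg == 0:
        # No signal either way: assumption, send it to a human.
        return "needs manual review"
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "ambivalent"


article = ("The factory reopening was a great success. " * 5
           + "Locals are happy about the growth in jobs, "
           + "although some remain angry about the old scandal.")
print(score(article), classify(article))
print(classify("Great news!"))
```

The thresholds and word lists are exactly the parts you would hand tune over those few days/weeks.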

Then there is the second level of solution. This is trying to build a classifier system using AI/Machine Learning/Data Science. While not a moral judgement, if you have the skill to do work at this level you probably shouldn't be taking this job, since you could earn 5-10x working on bigger, more important problems. I guess maybe you want this job since it requires less time, although I'm not even sure that would be true given the data you would need to get the AI going. In general Python is the language for this, but frankly it isn't the programming code that matters.

I'm not an AI expert, but most modern AI is machine learning based. Machine learning is basically a fancy way of doing statistical analysis: you're teaching a machine to decide a classification by features. Those features are defined by you, which means you have to figure out what you want to measure. This could be word count, it could be word classification, it could be HTML complexity, it could be domain name, it could be time of day the content was posted, it could be the background colors--who knows what matters. The point is you gather features together, meaning you write lots of code to parse the site in order to create the features. Next you need another data set of known classified pages. You could do your regular job for a month and keep track of the results to build this. Then you take your features and your tracked data and mix them in a vat, calling a magic AI library to have it learn how to "decide" what a page should be classified as. Finally, using more data that is correctly classified but that the AI has never seen, you test the job the AI does at predicting to make sure it is tuned correctly. If it is not, play with the magic statistical engine until it does what you want. This is an art rather than a science.

It is difficult to predict how accurate you would get things and how good your predictions would be. Worse yet, if your boss ever asked you to explain how you decided site X deserved description Y, it would be impossible to say, since the whole thing would be a black box. I'm not sure I'd want to do all this work only to trust an AI that might be just 90% accurate with keeping me employed. Once you have to double check things, the first system seems a heck of a lot simpler and gets you 90% of what you'd need.
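
To make the train-then-test-on-unseen-data loop concrete, here is a toy sketch. It swaps in one specific technique, a naive Bayes classifier over word counts written in plain Python, where a real project would more likely call a library; the labelled sentences are made up, and word counts are just one possible choice of features.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """examples: list of (text, label). Returns the counts naive Bayes needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior plus log likelihood with add-one smoothing.
        log_prob = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no information
            count = word_counts[label][word]
            log_prob += math.log((count + 1) / (total_words + len(vocab)))
        if log_prob > best_score:
            best_label, best_score = label, log_prob
    return best_label


def accuracy(model, examples):
    hits = sum(predict(model, text) == label for text, label in examples)
    return hits / len(examples)


# Made-up "known classified pages" for training...
labelled = [
    ("profits rise and investors are happy", "positive"),
    ("great growth and a strong success story", "positive"),
    ("scandal and failure hit the company", "negative"),
    ("angry customers report poor service", "negative"),
]
# ...and correctly classified examples the model has never seen.
held_out = [
    ("investors happy with strong growth", "positive"),
    ("poor results and an angry scandal", "negative"),
]

model = train(labelled)
print(accuracy(model, held_out))
```

With a handful of sentences the score means nothing; the point is only the shape of the workflow: collect labelled data, train, then measure on data held back from training.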

My experience: 5 years of testing systems that classified fraud through automated processes.

ertyu
Posts: 2914
Joined: Sun Nov 13, 2016 2:31 am

Re: Which programming language for this?

Post by ertyu »

it really makes me wonder - if there is a solution a qualified professional could automate, why hire an army of dweeb "content analysts"? I assume they're probably for the manual review of the thrown-out sites.

Also, yes - I do not have prior knowledge of programming. Have updated original post to include that.

fiby41
Posts: 1614
Joined: Tue Jan 13, 2015 8:09 am
Location: India

Re: Which programming language for this?

Post by fiby41 »

I can't seem to find them, but there are Jupyter notebooks on GitHub by people who've done this for IMDb comments, among other things, with accompanying videos explaining how to go about it.

You'll need NLTK and WordNet in Python.

Most of your time will be spent on cleanup: stripping punctuation, removing stop words and reducing words to their lemmas.

Keywords to search for are sentiment analysis and natural language processing.
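
A small sketch of the cleanup step, using only the standard library so it runs anywhere. The stop-word list here is a tiny made-up one (NLTK provides a much fuller list, plus a WordNet-based lemmatizer), and lemmatizing is left out entirely.

```python
import string

# Tiny illustrative stop-word list, not NLTK's.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "and", "of", "to", "in"}


def clean(text):
    # Lowercase, turn punctuation into spaces, split, then drop stop words.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOP_WORDS]


print(clean("The movie was great, and the ending is a surprise!"))
```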
Last edited by fiby41 on Wed Jun 17, 2020 4:17 am, edited 1 time in total.

JCD
Posts: 139
Joined: Sat Jul 20, 2019 9:12 am

Re: Which programming language for this?

Post by JCD »

it really makes me wonder - if there is a solution a qualified professional could automate, why hire an army of dweeb "content analysts"? I assume they're probably for the manual review of the thrown-out sites.
Version 1: What you described.
Hire someone at: 500x12=~6k + taxes

Version 2: My first solution
Hire someone at: 3000x2= ~6k + taxes then hire someone to work on the failures at 250x12=~3k + taxes. You might have to have a contractor come fix the code every once in a while, so add another 3-30k a year on top of that, depending on how wrong things go.

Version 3: ML solution
Hire someone at 100,000-250,000 (maybe less in Eastern Europe?). Pay someone to re-evaluate the data after 6-12 months.
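
The arithmetic behind the three versions, as a quick sketch. It assumes every figure above is a first-year cost in euro with taxes left out, and it uses a crude break-even measure (how many 6k-a-year analysts a version would have to replace to cover its first-year cost); the function names are made up.

```python
import math

ANALYST_PER_YEAR = 500 * 12  # Version 1: one analyst, ~6k


def first_year_cost_ranges():
    build = 3000 * 2       # Version 2: building the script, as quoted
    reviewer = 250 * 12    # someone working on the failures
    scripted = (build + reviewer + 3_000, build + reviewer + 30_000)
    ml = (100_000, 250_000)  # Version 3
    return {"scripted": scripted, "ml": ml}


def analysts_to_break_even(cost):
    # Number of analyst salaries that add up to at least this cost.
    return math.ceil(cost / ANALYST_PER_YEAR)


for name, (low, high) in first_year_cost_ranges().items():
    print(name, low, high,
          analysts_to_break_even(low), analysts_to_break_even(high))
```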

Basically machine learning is awesome when you need lots of humans making decisions on a frequent basis. If you only need to have 1-20 humans do the work, then just hire humans. Computers are about scale. It is why Google in 2010 had something like 1/10th the employees that GM has today. I mix the time frames because Google has way more employees now that they are trying various moonshot programs to keep their growth rate going. Google had ~30k employees in 2010 and has 100k today. Creating automated cars is not cheap, but if you can fire 5-50 million workers worldwide, it is likely worth it. Creating code and experimenting up front is expensive, so you need to be willing to wait for the payoff.

Another possibility is it just isn't worth anyone's time to disrupt due to low cash flows and a lazy business. Or perhaps they are under contract to have humans do the reviews--possibly as a check on the algos. Or you might be the army creating the data for future ML systems. It is hard to know without the context.

2Birds1Stone
Posts: 1606
Joined: Thu Nov 19, 2015 11:20 am
Location: Earth

Re: Which programming language for this?

Post by 2Birds1Stone »

I was in the business of replacing swivel chair workers with software bots, and JCD hit the nail on the head with his last post. Creating a custom solution, maintaining it, auditing its results is not worth the capex/opex unless you have the scale to get a payoff.

The industry is rapidly evolving, but off-the-shelf automation solutions still leave a LOT to be desired. Then you have the whole security/compliance can of worms to deal with.

jacob
Site Admin
Posts: 15969
Joined: Fri Jun 28, 2013 8:38 pm
Location: USA, Zone 5b, Koppen Dfa, Elev. 620ft, Walkscore 77

Re: Which programming language for this?

Post by jacob »

ertyu wrote:
Wed Jun 17, 2020 3:22 am
it really makes me wonder - if there is a solution a qualified professional could automate, why hire an army of dweeb "content analysts"? I assume they're probably for the manual review of the thrown-out sites.
Alternatively, they're there to provide/generate the training set for the [existing] model.

biaggio
Posts: 35
Joined: Sun Apr 23, 2017 5:31 am

Re: Which programming language for this?

Post by biaggio »

jacob wrote:
Wed Jun 17, 2020 7:47 am
Alternatively, they're there to provide/generate the training set for the [existing] model.
My thinking exactly. I know places where they outsource efforts of a similar type to teams in other countries via custom-built tooling. Usually the tools will assign the same task to multiple different people (how many depends on the cost of the task, how critical the quality is, and so on). That way you can filter out those who don't perform well (e.g. those who disagree with the majority of the group assigned the same task on a much larger fraction of tasks than average). Just saying: since 500 EUR/month means about 2.8 EUR/h, they probably replicate the same task, so you could get flagged if you automate it.
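
Roughly how such a filter could work, as a sketch. The worker names, the labels and the 1.5x cut-off are all invented; real tooling would need many more tasks per worker and some rule for ties.

```python
from collections import Counter

# Hypothetical data: task id -> {worker: label}.
answers = {
    "article-1": {"ana": "positive", "ben": "positive", "bot": "negative"},
    "article-2": {"ana": "negative", "ben": "negative", "bot": "positive"},
    "article-3": {"ana": "ambivalent", "ben": "ambivalent", "bot": "ambivalent"},
    "article-4": {"ana": "positive", "ben": "negative", "bot": "negative"},
}


def disagreement_rates(answers):
    """Fraction of tasks on which each worker disagreed with the majority."""
    disagreed = Counter()
    seen = Counter()
    for labels in answers.values():
        majority, _ = Counter(labels.values()).most_common(1)[0]
        for worker, label in labels.items():
            seen[worker] += 1
            if label != majority:
                disagreed[worker] += 1
    return {w: disagreed[w] / seen[w] for w in seen}


def flag_outliers(rates, factor=1.5):
    # Flag workers whose rate is well above the group average.
    average = sum(rates.values()) / len(rates)
    return sorted(w for w, r in rates.items() if r > factor * average)


rates = disagreement_rates(answers)
print(rates)
print(flag_outliers(rates))
```

An automated worker whose mistakes differ from the human ones would stand out in exactly this kind of report.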

JCD
Posts: 139
Joined: Sat Jul 20, 2019 9:12 am

Re: Which programming language for this?

Post by JCD »

I just thought of another point that might completely kill this idea. It is very possible, even likely they will require you to install software on your box that allows them to spy on you. They might even be using your behavior metrics as an additional thing to sell--to show client sites how people use their site. If they do anything like that, any automation you have becomes worthless since you need to behave like you are not automated and that is genuinely hard to do. Having worked in fraud I will tell you many sites collect metrics around how you move your mouse, how quickly you key in text, etc. These sorts of metrics help detect bot-based fraud. It is pretty trivial to write spyware that records all of this and make it a plugin to the browser or something installed in the OS. They may even audit you by playing back your mouse and keyboard moves to make sure you don't alt-tab into other non-work based browsing. You'd certainly want to check in on this before working on automating anything.

ertyu
Posts: 2914
Joined: Sun Nov 13, 2016 2:31 am

Re: Which programming language for this?

Post by ertyu »

All very good points. I will ask my friend in detail about what his job entails exactly - it is quite likely that everyone is correct and this will not be a viable opportunity for all of the reasons discussed above. Thank you guys for letting me pick your brains.

ertyu
Posts: 2914
Joined: Sun Nov 13, 2016 2:31 am

Re: Which programming language for this?

Post by ertyu »

Update: I talked to my friend and asked him very detailed questions about his work. It turns out that they are indeed training the AI. While they haven't been directly told this, they do have to work in company-provided software where articles are fed to them - once you process one article, another appears. They have to code up articles by theme, often consulting very granular coding manuals provided by the project leaders. The main metric they're evaluated on, and that they seem to take great pride in, is their "accuracy rate" - the extent to which the way they have coded up a particular article by topics, subtopics, mood, etc., concurs with "quality control," of which there are two tiers: one at the central office in Bulgaria, and one in Austria. Friend recounted with considerable pride an instance where our quality control took issue with the way he had coded up an article, only for him to be vindicated by the next tier up. You start at 550 euro per month; my friend now makes 750 and is very proud of how his life is going.
